[jira] [Commented] (ARROW-10260) [Python] Missing MapType to Pandas dtype
[ https://issues.apache.org/jira/browse/ARROW-10260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211552#comment-17211552 ]

Derek Marsh commented on ARROW-10260:
-------------------------------------

I appreciate the opportunity to contribute. https://github.com/apache/arrow/pull/8422

> [Python] Missing MapType to Pandas dtype
> ----------------------------------------
>
>                 Key: ARROW-10260
>                 URL: https://issues.apache.org/jira/browse/ARROW-10260
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Bryan Cutler
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The Map type conversion to Pandas done in ARROW-10151 forgot to add a dtype
> mapping for {{to_pandas_dtype()}}
>
> {code:java}
> In [2]: d = pa.map_(pa.int64(), pa.float64())
>
> In [3]: d.to_pandas_dtype()
> ---------------------------------------------------------------------------
> NotImplementedError                       Traceback (most recent call last)
> <ipython-input-3-...> in <module>
> ----> 1 d.to_pandas_dtype()
>
> ~/miniconda2/envs/pyarrow-test/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.DataType.to_pandas_dtype()
>
> NotImplementedError: map<int64, double>
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
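The missing mapping described above is essentially a lookup-table miss. The sketch below is a hypothetical, pure-Python stand-in for the dispatch inside {{DataType.to_pandas_dtype()}} (the real code lives in Cython in types.pxi); the registry name and the type-id strings are illustrative only, not pyarrow's actual internals:

```python
# Hypothetical sketch of the dtype-lookup pattern behind to_pandas_dtype().
# The registry and type ids below are illustrative, not pyarrow's real code.

_PANDAS_DTYPE_BY_TYPE_ID = {
    "int64": "int64",
    "double": "float64",
    # Nested types have no native pandas dtype, so they fall back to object.
    "list": "object",
    "struct": "object",
}

def to_pandas_dtype(type_id: str) -> str:
    """Return the pandas dtype name for an Arrow type id, or raise."""
    try:
        return _PANDAS_DTYPE_BY_TYPE_ID[type_id]
    except KeyError:
        # This is the NotImplementedError seen in the traceback above.
        raise NotImplementedError(type_id)

# Before the fix, "map" was simply absent from the table; the fix amounts
# to registering it alongside the other nested types:
_PANDAS_DTYPE_BY_TYPE_ID["map"] = "object"
```

With the entry registered, the same call that previously raised now resolves to the fallback `object` dtype, which matches how the other nested types are handled.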
[jira] [Updated] (ARROW-10260) [Python] Missing MapType to Pandas dtype
[ https://issues.apache.org/jira/browse/ARROW-10260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-10260:
-----------------------------------
    Labels: pull-request-available  (was: )
[jira] [Commented] (ARROW-10261) [Rust] [BREAKING] Lists should take Field instead of DataType
[ https://issues.apache.org/jira/browse/ARROW-10261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211537#comment-17211537 ]

Jorge Leitão commented on ARROW-10261:
--------------------------------------

Makes sense to me. :)

> [Rust] [BREAKING] Lists should take Field instead of DataType
> -------------------------------------------------------------
>
>                 Key: ARROW-10261
>                 URL: https://issues.apache.org/jira/browse/ARROW-10261
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: Integration, Rust
>    Affects Versions: 1.0.1
>            Reporter: Neville Dipale
>            Priority: Major
>
> There is currently no way of tracking nested field metadata on lists. For
> example, if a list's children are nullable, there's no way of telling just by
> looking at the Field.
> This causes problems with integration testing, and also affects Parquet
> roundtrips.
> I propose the breaking change of [Large|FixedSize]List taking a Field instead
> of Box<DataType>, as this will overcome this issue and ensure that the Rust
> implementation passes integration tests.
> CC [~andygrove] [~jorgecarleitao] [~alamb] [~jhorstmann] ([~carols10cents]
> as this addresses some of the roundtrip failures).
> I'm leaning towards this landing in 3.0.0, as I'd love for us to have
> completed or made significant traction on the Arrow Parquet writer (and
> reader), and integration testing, by then.
[jira] [Commented] (ARROW-10260) [Python] Missing MapType to Pandas dtype
[ https://issues.apache.org/jira/browse/ARROW-10260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211521#comment-17211521 ]

Bryan Cutler commented on ARROW-10260:
--------------------------------------

Should be a quick fix, so marking this for 2.0.0
[jira] [Updated] (ARROW-10260) [Python] Missing MapType to Pandas dtype
[ https://issues.apache.org/jira/browse/ARROW-10260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Cutler updated ARROW-10260:
---------------------------------
    Fix Version/s: 2.0.0
[jira] [Resolved] (ARROW-8810) [R] Add documentation about Parquet format, appending to stream format
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson resolved ARROW-8810.
------------------------------------
    Resolution: Fixed

> [R] Add documentation about Parquet format, appending to stream format
> ----------------------------------------------------------------------
>
>                 Key: ARROW-8810
>                 URL: https://issues.apache.org/jira/browse/ARROW-8810
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Documentation, R
>            Reporter: Carl Boettiger
>            Assignee: Neal Richardson
>            Priority: Minor
>             Fix For: 2.0.0
>
> Is it possible to append new rows to an existing .parquet file using the R
> client's arrow::write_parquet(), in a manner similar to the `append=TRUE`
> argument in text-based output formats like write.table()?
>
> Apologies, as this is perhaps more a question of documentation or user
> interface, or maybe just my ignorance.
[jira] [Resolved] (ARROW-10257) [R] Prepare news/docs for 2.0 release
[ https://issues.apache.org/jira/browse/ARROW-10257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson resolved ARROW-10257.
-------------------------------------
    Resolution: Fixed

Issue resolved by pull request 8421
[https://github.com/apache/arrow/pull/8421]

> [R] Prepare news/docs for 2.0 release
> -------------------------------------
>
>                 Key: ARROW-10257
>                 URL: https://issues.apache.org/jira/browse/ARROW-10257
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Assignee: Neal Richardson
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
[jira] [Created] (ARROW-10261) [Rust] [BREAKING] Lists should take Field instead of DataType
Neville Dipale created ARROW-10261:
--------------------------------------

             Summary: [Rust] [BREAKING] Lists should take Field instead of DataType
                 Key: ARROW-10261
                 URL: https://issues.apache.org/jira/browse/ARROW-10261
             Project: Apache Arrow
          Issue Type: Sub-task
          Components: Integration, Rust
    Affects Versions: 1.0.1
            Reporter: Neville Dipale


There is currently no way of tracking nested field metadata on lists. For example, if a list's children are nullable, there's no way of telling just by looking at the Field.

This causes problems with integration testing, and also affects Parquet roundtrips.

I propose the breaking change of [Large|FixedSize]List taking a Field instead of Box<DataType>, as this will overcome this issue and ensure that the Rust implementation passes integration tests.

CC [~andygrove] [~jorgecarleitao] [~alamb] [~jhorstmann] ([~carols10cents] as this addresses some of the roundtrip failures).

I'm leaning towards this landing in 3.0.0, as I'd love for us to have completed or made significant traction on the Arrow Parquet writer (and reader), and integration testing, by then.
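To illustrate the proposal (modeled in Python rather than Rust, with entirely hypothetical class names, not anything from the arrow crate), compare a list type that wraps a bare data type with one that wraps a Field: only the latter can carry the child's name and nullability.

```python
# Hypothetical Python model of the proposed Rust change. None of these
# class names come from the arrow crate; they only mirror its shape.
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Field:
    name: str
    data_type: "DataType"
    nullable: bool

@dataclass(frozen=True)
class ListOfDataType:
    """Status quo, like List(Box<DataType>): child metadata is lost."""
    item: "DataType"

@dataclass(frozen=True)
class ListOfField:
    """Proposal, like List(Box<Field>): child name/nullability survive."""
    item: Field

# Primitive types are represented as plain strings in this toy model.
DataType = Union[str, ListOfDataType, ListOfField]

old_style = ListOfDataType(item="int64")
new_style = ListOfField(item=Field("item", "int64", nullable=False))

# Only the Field-based variant can answer "are the list's children nullable?"
assert new_style.item.nullable is False
assert not hasattr(old_style.item, "nullable")
```

This is exactly the information gap the issue describes: with the status-quo shape there is nowhere to record whether a list's children are nullable, so integration tests and Parquet roundtrips cannot recover it.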
[jira] [Updated] (ARROW-9812) [Python] Map data types doesn't work from Arrow to Parquet
[ https://issues.apache.org/jira/browse/ARROW-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Cutler updated ARROW-9812:
--------------------------------
    Description:

Hi,

I'm having problems using the 'map' data type in Arrow/Parquet/pandas.

I'm able to convert a pandas data frame to Arrow with a map data type.

When I write Arrow to Parquet, it seems to work, but I'm not sure if the data type is written correctly. When I read back Parquet to Arrow, it fails saying "reading list of structs" is not supported. It seems that map is stored as a list of structs.

There are two problems here:
# -Map data type doesn't work from Arrow -> Pandas-. Fixed in ARROW-10151
# Map data type doesn't get written to or read from Arrow -> Parquet.

Questions:
1. Am I doing something wrong? Is there a way to get these to work?
2. If these are unsupported features, will this be fixed in a future version? Do you have plans or an ETA?

The following code example (followed by output) should demonstrate the issues. I'm using Arrow 1.0.0 and Pandas 1.0.5.

Thanks!
Mayur

{code:java}
$ cat arrowtest.py
import pyarrow as pa
import pandas as pd
import pyarrow.parquet as pq
import traceback as tb
import io

print(f'PyArrow Version = {pa.__version__}')
print(f'Pandas Version = {pd.__version__}')

df1 = pd.DataFrame({'a': [[('b', '2')]]})
print(f'df1')
print(f'{df1}')

print(f'Pandas -> Arrow')
try:
    t1 = pa.Table.from_pandas(df1, schema=pa.schema([pa.field('a', pa.map_(pa.string(), pa.string()))]))
    print('PASSED')
    print(t1)
except:
    print(f'FAILED')
    tb.print_exc()

print(f'Arrow -> Pandas')
try:
    t1.to_pandas()
    print('PASSED')
except:
    print(f'FAILED')
    tb.print_exc()

print(f'Arrow -> Parquet')
fh = io.BytesIO()
try:
    pq.write_table(t1, fh)
    print('PASSED')
except:
    print('FAILED')
    tb.print_exc()

print(f'Parquet -> Arrow')
try:
    t2 = pq.read_table(source=fh)
    print('PASSED')
    print(t2)
except:
    print('FAILED')
    tb.print_exc()
{code}

{code:java}
$ python3.6 arrowtest.py
PyArrow Version = 1.0.0
Pandas Version = 1.0.5
df1
          a
0  [(b, 2)]

Pandas -> Arrow
PASSED
pyarrow.Table
a: map<string, string>
  child 0, entries: struct<key: string not null, value: string> not null
      child 0, key: string not null
      child 1, value: string

Arrow -> Pandas
FAILED
Traceback (most recent call last):
  File "arrowtest.py", line 26, in <module>
    t1.to_pandas()
  File "pyarrow/array.pxi", line 715, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas
  File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 779, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 1115, in _table_to_blocks
    list(extension_columns.keys()))
  File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks
  File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type map<string, string> is known.
Arrow -> Parquet
PASSED

Parquet -> Arrow
FAILED
Traceback (most recent call last):
  File "arrowtest.py", line 43, in <module>
    t2 = pq.read_table(source=fh)
  File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1586, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/parquet.py", line 1474, in read
    use_threads=use_threads
  File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 1994, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported: key_value: list<key_value: struct<key: string not null, value: string> not null> not null
{code}

Updated to indicate the conversion to Pandas is done, but not yet for Parquet.

    was:
Hi,

I'm having problems using the 'map' data type in Arrow/Parquet/pandas.

I'm able to convert a pandas data frame to Arrow with a map data type.

But, -Arrow to Pandas doesn't work.- Fixed in ARROW-10151

When I write Arrow to Parquet, it seems to work, but I'm not sure if the data type is written correctly. When I read back Parquet to Arrow, it fails saying "reading list of structs" is not supported. It seems that map is stored as a list of structs.

There are two problems here:
# -Map data type doesn't work from Arrow -> Pandas-. Fixed in ARROW-10151
# Map data type doesn't get written to or read from Arrow -> Parquet.

Questions:
1. Am I doing something wrong? Is there a way to get these to work?
2. If these are unsupported features, will this be fixed in a future version? Do you have plans or an ETA?

The following code example (followed by output) should demonstrate the issues. I'm using Arrow 1.0.0 and Pandas 1.0.5.

Thanks!

Mayur
[jira] [Updated] (ARROW-9812) [Python] Map data types doesn't work from Arrow to Parquet
[ https://issues.apache.org/jira/browse/ARROW-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Cutler updated ARROW-9812:
--------------------------------
    Summary: [Python] Map data types doesn't work from Arrow to Parquet  (was: [Python] Map data types doesn't work from Arrow to Pandas and Parquet)
[jira] [Comment Edited] (ARROW-9812) [Python] Map data types doesn't work from Arrow to Pandas and Parquet
[ https://issues.apache.org/jira/browse/ARROW-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211472#comment-17211472 ]

Bryan Cutler edited comment on ARROW-9812 at 10/9/20, 11:50 PM:
----------------------------------------------------------------

Hi [~admrsh], I implemented Map types to Pandas conversion recently in ARROW-10151, but it looks like I forgot the line you pointed out in {{types.pxi}}. That should be in for the upcoming release; if you are able to do a PR before it's cut - likely today or tomorrow - that would be great. Otherwise, I can go ahead and add it. I will update this Jira to reflect that the Pandas conversion is complete. I made ARROW-10260 to add {{to_pandas_dtype}}. Thanks!

    was (Author: bryanc):
Hi [~admrsh], I implemented Map types to Pandas conversion recently in ARROW-10151, but it looks like I forgot the line you pointed out in {{types.pxi}}. That should be in for the upcoming release; if you are able to do a PR before it's cut - likely today or tomorrow - that would be great. Otherwise, I can go ahead and add it. I will update this Jira to reflect that the Pandas conversion is complete. Thanks!
[jira] [Created] (ARROW-10260) [Python] Missing MapType to Pandas dtype
Bryan Cutler created ARROW-10260:
------------------------------------

             Summary: [Python] Missing MapType to Pandas dtype
                 Key: ARROW-10260
                 URL: https://issues.apache.org/jira/browse/ARROW-10260
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Bryan Cutler


The Map type conversion to Pandas done in ARROW-10151 forgot to add a dtype mapping for {{to_pandas_dtype()}}

{code:java}
In [2]: d = pa.map_(pa.int64(), pa.float64())

In [3]: d.to_pandas_dtype()
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-3-...> in <module>
----> 1 d.to_pandas_dtype()

~/miniconda2/envs/pyarrow-test/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.DataType.to_pandas_dtype()

NotImplementedError: map<int64, double>
{code}
[jira] [Commented] (ARROW-9812) [Python] Map data types doesn't work from Arrow to Pandas and Parquet
[ https://issues.apache.org/jira/browse/ARROW-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211472#comment-17211472 ]

Bryan Cutler commented on ARROW-9812:
-------------------------------------

Hi [~admrsh], I implemented Map types to Pandas conversion recently in ARROW-10151, but it looks like I forgot the line you pointed out in {{types.pxi}}. That should be in for the upcoming release; if you are able to do a PR before it's cut - likely today or tomorrow - that would be great. Otherwise, I can go ahead and add it. I will update this Jira to reflect that the Pandas conversion is complete. Thanks!
[jira] [Created] (ARROW-10259) [Rust] Support field metadata
Neville Dipale created ARROW-10259: -- Summary: [Rust] Support field metadata Key: ARROW-10259 URL: https://issues.apache.org/jira/browse/ARROW-10259 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Neville Dipale The biggest hurdle to adding field metadata is HashMap and HashSet not implementing Hash, Ord and PartialOrd. I was thinking of implementing the metadata as a Vec<(String, String)> to overcome this limitation, and then serializing correctly to JSON. -- This message was sent by Atlassian Jira (v8.3.4#803005)
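The hashability constraint behind the proposed Vec<(String, String)> has a direct analogue in Python, which may make the workaround easier to see: a plain mapping is unhashable, while a sorted sequence of key/value pairs is hashable and orderable, and still serializes naturally to a JSON object. An illustrative sketch, not the Rust implementation:

```python
# A dict (like Rust's HashMap) cannot be hashed or ordered directly.
metadata = {"encoding": "utf-8", "origin": "sensor-1"}

# Storing the same entries as a sorted tuple of (key, value) pairs restores
# Hash/Ord-style behavior while preserving the content.
pairs = tuple(sorted(metadata.items()))
print(hash(pairs))  # works; hash(metadata) would raise TypeError

# The pair form converts back losslessly, so JSON serialization can still
# emit an object.
print(dict(pairs) == metadata)
```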
[jira] [Created] (ARROW-10258) [Rust] Support extension arrays
Neville Dipale created ARROW-10258: -- Summary: [Rust] Support extension arrays Key: ARROW-10258 URL: https://issues.apache.org/jira/browse/ARROW-10258 Project: Apache Arrow Issue Type: New Feature Components: Integration, Rust Affects Versions: 1.0.1 Reporter: Neville Dipale This should include: * supporting the Arrow format * supporting field metadata We can optionally: * support recognising known extensions (like UUID) I'm mainly opening this up for wider visibility; I noticed that I was catching strays from metadata integration tests failing because Field doesn't support metadata :( -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10258) [Rust] Support extension arrays
[ https://issues.apache.org/jira/browse/ARROW-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale updated ARROW-10258: --- Fix Version/s: 3.0.0 > [Rust] Support extension arrays > --- > > Key: ARROW-10258 > URL: https://issues.apache.org/jira/browse/ARROW-10258 > Project: Apache Arrow > Issue Type: New Feature > Components: Integration, Rust >Affects Versions: 1.0.1 >Reporter: Neville Dipale >Priority: Major > Fix For: 3.0.0 > > > This should include: > * supporting the Arrow format > * supporting field metadata > We can optionally: > * support recognising known extensions (like UUID) > I'm mainly opening this up for wider visibility, I noticed that I was > catching strays from metadata integration tests failing because Field doesn't > support metadata :( -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10257) [R] Prepare news/docs for 2.0 release
[ https://issues.apache.org/jira/browse/ARROW-10257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10257: --- Labels: pull-request-available (was: ) > [R] Prepare news/docs for 2.0 release > - > > Key: ARROW-10257 > URL: https://issues.apache.org/jira/browse/ARROW-10257 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8810) [R] Add documentation about Parquet format, appending to stream format
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211468#comment-17211468 ] Neal Richardson commented on ARROW-8810: Doing in ARROW-10257 > [R] Add documentation about Parquet format, appending to stream format > -- > > Key: ARROW-8810 > URL: https://issues.apache.org/jira/browse/ARROW-8810 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, R >Reporter: Carl Boettiger >Assignee: Neal Richardson >Priority: Minor > Fix For: 2.0.0 > > > Is it possible to append new rows to an existing .parquet file using the R > client's arrow::write_parquet(), in a manner similar to the `append=TRUE` > argument in text-based output formats like write.table()? > > Apologies as this is perhaps more a question of documentation or user > interface, or maybe just my ignorance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10257) [R] Prepare news/docs for 2.0 release
Neal Richardson created ARROW-10257: --- Summary: [R] Prepare news/docs for 2.0 release Key: ARROW-10257 URL: https://issues.apache.org/jira/browse/ARROW-10257 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 2.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8296) [C++][Dataset] IpcFileFormat should support writing files with compressed buffers
[ https://issues.apache.org/jira/browse/ARROW-8296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-8296. Resolution: Fixed Issue resolved by pull request 8389 [https://github.com/apache/arrow/pull/8389] > [C++][Dataset] IpcFileFormat should support writing files with compressed > buffers > - > > Key: ARROW-8296 > URL: https://issues.apache.org/jira/browse/ARROW-8296 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.16.0 >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, pull-request-available > Fix For: 2.0.0 > > Time Spent: 4h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9870) [R] Friendly interface for filesystems (S3)
[ https://issues.apache.org/jira/browse/ARROW-9870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-9870. Resolution: Fixed Issue resolved by pull request 8351 [https://github.com/apache/arrow/pull/8351] > [R] Friendly interface for filesystems (S3) > --- > > Key: ARROW-9870 > URL: https://issues.apache.org/jira/browse/ARROW-9870 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > The Filesystem methods don't provide a human-friendly interface for basic > operations like ls, mkdir, etc. Since we provide access to S3 and potentially > other cloud storage, it would be nice to have simple methods for exploring it. > Additional ideas: > * S3Bucket class/constructor: it's basically a SubTreeFileSystem containing > S3FS and a path, except that we can auto-detect a bucket's region. > * Add a class like the FileLocator C++ struct list(fs, path). _also_ kinda > like a SubTreeFileSystem, but with different methods and intents. Aside from > use in ls/mkdir/cp, it could be used in file reader/writers instead of having > an extra {{filesystem}} argument added everywhere, e.g. > {{fs$path("path/to/file")}}. See > https://github.com/apache/arrow/pull/8197#discussion_r494325934 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10114) [R] Segfault in to_dataframe_parallel with deeply nested structs
[ https://issues.apache.org/jira/browse/ARROW-10114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-10114. - Fix Version/s: (was: 3.0.0) 2.0.0 Resolution: Fixed Issue resolved by pull request 8411 [https://github.com/apache/arrow/pull/8411] > [R] Segfault in to_dataframe_parallel with deeply nested structs > > > Key: ARROW-10114 > URL: https://issues.apache.org/jira/browse/ARROW-10114 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 1.0.1 > Environment: > sessionInfo() > R version 3.6.3 (2020-02-29) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Linux Mint 19.3 > Matrix products: default > BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1 > LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1 > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=sv_SE.UTF-8LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=sv_SE.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=sv_SE.UTF-8 LC_IDENTIFICATION=C > attached base packages: > [1] stats graphics grDevices utils datasets methods base > other attached packages: > [1] arrow_1.0.1 > loaded via a namespace (and not attached): > [1] tidyselect_1.1.0 bit_4.0.4compiler_3.6.3 magrittr_1.5 > [5] assertthat_0.2.1 R6_2.4.1 glue_1.4.1 Rcpp_1.0.5 > [9] bit64_4.0.2 vctrs_0.3.2 rlang_0.4.7 purrr_0.3.4 >Reporter: Markus Skyttner >Assignee: Romain Francois >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Attachments: Dockerfile, Makefile, reprex_10114.R > > Time Spent: 50m > Remaining Estimate: 0h > > A .jsonl file (newline separated JSON) created from open data available at > [ftp://ftp.libris.kb.se/pub/spa/swepub-deduplicated-2019-12-29.zip] is used > with the R package arrow (installed from CRAN) using the following statement: > > arrow::read_json_arrow("~/.config/swepub/head.jsonl") > It crashes RStudio with no error message. 
At the R prompt, the error message > is: > Error in Table__to_dataframe(x, use_threads = option_use_threads()) : > SET_VECTOR_ELT() can only be applied to a 'list', not a 'integer' > The file "head.jsonl" above was created from the extracted zip's .jsonl file > with the *nix "head -1 $BIG_JSONL_FILE" command. It can be parsed with > jsonlite and tidyjson. > Also got this error message at one point: > > arrow::read_json_arrow("head.jsonl", as_data_frame = TRUE) > *** caught segfault *** > address 0x8, cause 'memory not mapped' > Traceback: > 1: structure(x, extra_cols = colonnade[extra_cols], class = > "pillar_squeezed_colonnade") > 2: new_colonnade_sqeezed(out, colonnade = x, extra_cols = extra_cols) > 3: pillar::squeeze(x$mcf, width = width) > 4: format.trunc_mat(mat) > 5: format(mat) > 6: format.tbl(x, ..., n = n, width = width, n_extra = n_extra) > 7: format(x, ..., n = n, width = width, n_extra = n_extra) > 8: paste0(..., collapse = "\n") > 9: cli::cat_line(format(x, ..., n = n, width = width, n_extra = n_extra)) > 10: print.tbl(x) > 11: (function (x, ...) UseMethod("print"))(x) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10255) [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking
[ https://issues.apache.org/jira/browse/ARROW-10255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10255: --- Labels: pull-request-available (was: ) > [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking > --- > > Key: ARROW-10255 > URL: https://issues.apache.org/jira/browse/ARROW-10255 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Affects Versions: 0.17.1 >Reporter: Paul Taylor >Assignee: Paul Taylor >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Presently most of our public classes can't be easily > [tree-shaken|https://webpack.js.org/guides/tree-shaking/] by library > consumers. This is a problem for libraries that only need to use parts of > Arrow. > For example, the vis.gl projects have an integration test that imports three > of our simpler classes and tests the resulting bundle size: > {code:javascript} > import {Schema, Field, Float32} from 'apache-arrow'; > // | Bundle Size| Compressed > // | 202KB (207112) KB | 45KB (46618) KB > {code} > We can help solve this with the following changes: > * Add "sideEffects": false to our ESM package.json > * Reorganize our imports to only include what's needed > * Eliminate or move some static/member methods to standalone exported > functions > * Wrap the utf8 util's node Buffer detection in eval so Webpack doesn't > compile in its own Buffer shim > * Removing flatbuffers namespaces from generated TS because these defeat > Webpack's tree-shaking ability > Candidate functions for removal/moving to standalone functions: > * Schema.new, Schema.from, Schema.prototype.compareTo > * Field.prototype.compareTo > * Type.prototype.compareTo > * Table.new, Table.from > * Column.new > * Vector.new, Vector.from > * RecordBatchReader.from > After applying a few of the above changes to the Schema and flatbuffers > files, I was able to reduce the vis.gl's import size 90%: > 
{code:javascript} > // Bundle Size | Compressed > // 24KB (24942) KB | 6KB (6154) KB > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10256) [C++][Flight] Disable -Werror carefully
[ https://issues.apache.org/jira/browse/ARROW-10256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10256: --- Labels: pull-request-available (was: ) > [C++][Flight] Disable -Werror carefully > --- > > Key: ARROW-10256 > URL: https://issues.apache.org/jira/browse/ARROW-10256 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10256) [C++][Flight] Disable -Werror carefully
Kouhei Sutou created ARROW-10256: Summary: [C++][Flight] Disable -Werror carefully Key: ARROW-10256 URL: https://issues.apache.org/jira/browse/ARROW-10256 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10220) [JS] Cache javascript utf-8 dictionary keys?
[ https://issues.apache.org/jira/browse/ARROW-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-10220: - Summary: [JS] Cache javascript utf-8 dictionary keys? (was: Cache javascript utf-8 dictionary keys?) > [JS] Cache javascript utf-8 dictionary keys? > > > Key: ARROW-10220 > URL: https://issues.apache.org/jira/browse/ARROW-10220 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Affects Versions: 1.0.1 >Reporter: Ben Schmidt >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > String decoding from arrow tables is a major bottleneck in using arrow in > Javascript–it can take a second to decode a million rows. For utf-8 types, > I'm not sure what could be done; but some memoization would help utf-8 > dictionary types. > Currently, the javascript implementation decodes a utf-8 string every time > you request an item from a dictionary with utf-8 data. If arrow cached the > decoded strings to a native js Map, routine operations like looping over all > the entries in a text column might be on the order of 10x faster. Here's an > observable notebook [benchmarking that and a couple other > strategies|https://observablehq.com/@bmschmidt/faster-arrow-dictionary-unpacking]. > I would file a pull request, but 1) I would have to learn some typescript to > do so, and 2) this idea may be undesirable because it creates new objects > that will increase the memory footprint of a table, rather than just using > the typed arrays. > Some discussion of how the real-world issues here affect the arquero project > is [here|https://github.com/uwdata/arquero/issues/1]. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
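The memoization strategy proposed above generalizes beyond JavaScript: decode each dictionary slot at most once and reuse the string for every row that references it. A hedged sketch with hypothetical data, not the Arrow JS internals:

```python
from functools import lru_cache

# Hypothetical dictionary-encoded column: utf-8 bytes per dictionary slot,
# plus one integer key per row.
slots = [b"apple", b"banana", b"cherry"]
keys = [0, 1, 0, 2, 1, 0]

@lru_cache(maxsize=None)
def decode(slot: int) -> str:
    # Runs at most once per distinct slot; repeat lookups hit the cache.
    return slots[slot].decode("utf-8")

column = [decode(k) for k in keys]
print(column)
print(decode.cache_info().misses)  # one miss per distinct slot: 3
```

The memory-footprint concern raised in the issue is visible here too: the cache holds one decoded string per dictionary slot in addition to the raw bytes.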
[jira] [Created] (ARROW-10255) [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking
Paul Taylor created ARROW-10255: --- Summary: [JS] Reorganize imports and exports to be more friendly to ESM tree-shaking Key: ARROW-10255 URL: https://issues.apache.org/jira/browse/ARROW-10255 Project: Apache Arrow Issue Type: Improvement Components: JavaScript Affects Versions: 0.17.1 Reporter: Paul Taylor Assignee: Paul Taylor Presently most of our public classes can't be easily [tree-shaken|https://webpack.js.org/guides/tree-shaking/] by library consumers. This is a problem for libraries that only need to use parts of Arrow. For example, the vis.gl projects have an integration test that imports three of our simpler classes and tests the resulting bundle size: {code:javascript} import {Schema, Field, Float32} from 'apache-arrow'; // | Bundle Size| Compressed // | 202KB (207112) KB | 45KB (46618) KB {code} We can help solve this with the following changes: * Add "sideEffects": false to our ESM package.json * Reorganize our imports to only include what's needed * Eliminate or move some static/member methods to standalone exported functions * Wrap the utf8 util's node Buffer detection in eval so Webpack doesn't compile in its own Buffer shim * Removing flatbuffers namespaces from generated TS because these defeat Webpack's tree-shaking ability Candidate functions for removal/moving to standalone functions: * Schema.new, Schema.from, Schema.prototype.compareTo * Field.prototype.compareTo * Type.prototype.compareTo * Table.new, Table.from * Column.new * Vector.new, Vector.from * RecordBatchReader.from After applying a few of the above changes to the Schema and flatbuffers files, I was able to reduce the vis.gl's import size 90%: {code:javascript} // Bundle Size | Compressed // 24KB (24942) KB | 6KB (6154) KB {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10254) [R] Revisit (ab)use of SubTreeFileSystem
Neal Richardson created ARROW-10254: --- Summary: [R] Revisit (ab)use of SubTreeFileSystem Key: ARROW-10254 URL: https://issues.apache.org/jira/browse/ARROW-10254 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 3.0.0 Followup to ARROW-9870 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10252) [Python] Add option to skip inclusion of Arrow headers in Python installation
[ https://issues.apache.org/jira/browse/ARROW-10252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10252: --- Labels: pull-request-available (was: ) > [Python] Add option to skip inclusion of Arrow headers in Python installation > - > > Key: ARROW-10252 > URL: https://issues.apache.org/jira/browse/ARROW-10252 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging, Python >Reporter: Uwe Korn >Assignee: Uwe Korn >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > We don't want to have them as part of the conda package as the single source > should be {{arrow-cpp}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10253) [Python] Don't bundle plasma-store-server in pyarrow conda package
Uwe Korn created ARROW-10253: Summary: [Python] Don't bundle plasma-store-server in pyarrow conda package Key: ARROW-10253 URL: https://issues.apache.org/jira/browse/ARROW-10253 Project: Apache Arrow Issue Type: Improvement Components: Packaging, Python Reporter: Uwe Korn Assignee: Uwe Korn We currently have it in the {{arrow-cpp}} and the {{pyarrow}} conda package, we should only have it in {{arrow-cpp}} as this is always there and also the source of the binary. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10252) [Python] Add option to skip inclusion of Arrow headers in Python installation
Uwe Korn created ARROW-10252: Summary: [Python] Add option to skip inclusion of Arrow headers in Python installation Key: ARROW-10252 URL: https://issues.apache.org/jira/browse/ARROW-10252 Project: Apache Arrow Issue Type: Improvement Components: Packaging, Python Reporter: Uwe Korn Assignee: Uwe Korn We don't want to have them as part of the conda package as the single source should be {{arrow-cpp}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10251) [Rust] [DataFusion] MemTable::load() should load partitions in parallel
[ https://issues.apache.org/jira/browse/ARROW-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-10251: --- Description: MemTable::load() should load partitions in parallel using async tasks, rather than loading one partition at a time. Also, we should make batch size configurable. It is currently hard-coded to 1024*1024 which can be quite inefficient. was: MemTable::load() should load partitions in parallel using async tasks, rather than loading onw partition at a time. Also, we should make batch size configurable. It is currently hard-coded to 1024*1024 which can be quite inefficient. > [Rust] [DataFusion] MemTable::load() should load partitions in parallel > --- > > Key: ARROW-10251 > URL: https://issues.apache.org/jira/browse/ARROW-10251 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Priority: Major > Labels: beginner > Fix For: 3.0.0 > > > MemTable::load() should load partitions in parallel using async tasks, rather > than loading one partition at a time. > Also, we should make batch size configurable. It is currently hard-coded to > 1024*1024 which can be quite inefficient. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10251) [Rust] [DataFusion] MemTable::load() should load partitions in parallel
Andy Grove created ARROW-10251: -- Summary: [Rust] [DataFusion] MemTable::load() should load partitions in parallel Key: ARROW-10251 URL: https://issues.apache.org/jira/browse/ARROW-10251 Project: Apache Arrow Issue Type: New Feature Components: Rust, Rust - DataFusion Reporter: Andy Grove Fix For: 3.0.0 MemTable::load() should load partitions in parallel using async tasks, rather than loading onw partition at a time. Also, we should make batch size configurable. It is currently hard-coded to 1024*1024 which can be quite inefficient. -- This message was sent by Atlassian Jira (v8.3.4#803005)
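The proposed change can be sketched with a thread pool standing in for async tasks; `load_partition` and the batch contents below are hypothetical stand-ins, not the DataFusion API:

```python
from concurrent.futures import ThreadPoolExecutor

def load_partition(p: int) -> list:
    # Stand-in for scanning one partition into record batches.
    return [f"partition-{p}-batch-{i}" for i in range(2)]

partitions = list(range(4))

# All partitions are submitted at once instead of being loaded one at a
# time; pool.map still returns results in partition order.
with ThreadPoolExecutor() as pool:
    loaded = list(pool.map(load_partition, partitions))
print(loaded)
```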
[jira] [Comment Edited] (ARROW-9812) [Python] Map data types doesn't work from Arrow to Pandas and Parquet
[ https://issues.apache.org/jira/browse/ARROW-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211270#comment-17211270 ] Derek Marsh edited comment on ARROW-9812 at 10/9/20, 6:33 PM: -- Hi all, I've searched existing issues as best I can and this issue mentions "1. Map data type doesn't work from Arrow -> Pandas." I built the project from master at roughly 15:00 UTC today (October 9) and added one line before I built pyarrow: {code:java} _Type_MAP: np.object_,{code} after this line: [types.pxi|https://github.com/apache/arrow/blob/master/python/pyarrow/types.pxi#L49] This enables Table.to_pandas() to convert a MapType to List[Tuple[...]] {code:java} >>> import pyarrow as pa >>> d = pa.map_(pa.int64(), pa.float64()) >>> d.to_pandas_dtype() {code} {code:java} >>> tbl pyarrow.Table stored_on: double vals: map child 0, entries: struct not null child 0, key: int64 not null child 1, value: double >>> tbl.to_pydict() {'stored_on': [1585347700.204351], 'vals': [[(514, 12.0), (515, 1300.0), (519, 125.0), (2978, 126.0), (3236, 13107.0), (3237, 1.0), (3238, 1.0), (3239, 3.0), (3240, 3.0)]]} >>> df = tbl.to_pandas() >>> df.vals 0 [(514, 12.0), (515, 1300.0), (519, 125.0), (29... Name: vals, dtype: object >>> df.vals.iloc[0] [(514, 12.0), (515, 1300.0), (519, 125.0), (2978, 126.0), (3236, 13107.0), (3237, 1.0), (3238, 1.0), (3239, 3.0), (3240, 3.0)] >>> df.vals.iloc[0][0] (514, 12.0){code} I understand this is a very trivial working example, but am interested what any maintainers think about this solution and if it merits further testing/consideration. Thanks. was (Author: admrsh): Hi all, I've searched existing issues as best I can and this issue mentions "1. Map data type doesn't work from Arrow -> Pandas." 
I built the project from master at roughly 15:00 UTC today (October 9) and added one line before I built pyarrow: {code:java} _Type_MAP: np.object_,{code} after this line: [https://github.com/apache/arrow/blob/master/python/pyarrow/types.pxi#L49|types.pxi] This enables Table.to_pandas() to convert a MapType to List[Tuple[...]] {code:java} >>> import pyarrow as pa >>> d = pa.map_(pa.int64(), pa.float64()) >>> d.to_pandas_dtype() {code} {code:java} >>> tbl pyarrow.Table stored_on: double vals: map child 0, entries: struct not null child 0, key: int64 not null child 1, value: double >>> tbl.to_pydict() {'stored_on': [1585347700.204351], 'vals': [[(514, 12.0), (515, 1300.0), (519, 125.0), (2978, 126.0), (3236, 13107.0), (3237, 1.0), (3238, 1.0), (3239, 3.0), (3240, 3.0)]]} >>> df = tbl.to_pandas() >>> df.vals 0 [(514, 12.0), (515, 1300.0), (519, 125.0), (29... Name: vals, dtype: object >>> df.vals.iloc[0] [(514, 12.0), (515, 1300.0), (519, 125.0), (2978, 126.0), (3236, 13107.0), (3237, 1.0), (3238, 1.0), (3239, 3.0), (3240, 3.0)] >>> df.vals.iloc[0][0] (514, 12.0){code} I understand this is a very trivial working example, but am interested what any maintainers think about this solution and if it merits further testing/consideration. Thanks. > [Python] Map data types doesn't work from Arrow to Pandas and Parquet > - > > Key: ARROW-9812 > URL: https://issues.apache.org/jira/browse/ARROW-9812 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Mayur Srivastava >Priority: Major > > Hi, > I'm having problems using 'map' data type in Arrow/parquet/pandas. > I'm able to convert a pandas data frame to Arrow with a map data type. > But, Arrow to Pandas doesn't work. > When I write Arrow to Parquet, it seems to work, but I'm not sure if the data > type is written correctly. > When I read back Parquet to Arrow, it fails saying "reading list of structs" > is not supported. It seems that map is stored as list of structs. 
> There are two problems here: > # Map data type doesn't work from Arrow -> Pandas. > # Map data type doesn't get written to or read from Arrow -> Parquet. > Questions: > 1. Am I doing something wrong? Is there a way to get these to work? > 2. If these are unsupported features, will this be fixed in a future version? > Do you plans or ETA? > The following code example (followed by output) should demonstrate the issues: > I'm using Arrow 1.0.0 and Pandas 1.0.5. > Thanks! > Mayur > {code:java} > $ cat arrowtest.py > import pyarrow as pa > import pandas as pd > import pyarrow.parquet as pq > import traceback as tb > import io > print(f'PyArrow Version = {pa.__version__}') > print(f'Pandas Version = {pd.__version__}') > df1 = pd.DataFrame({'a': [[('b', '2')]]}) > print(f'df1') > print(f'{df1}') > print(f'Pandas -> Arrow') > try: > t1 = pa.Table.from_pandas(df1, schema=pa.schema([pa.field('a', > pa.map_(pa.string(), pa.string()))])) > print('PASSED') > print(t1) > except: >
[jira] [Commented] (ARROW-9812) [Python] Map data types doesn't work from Arrow to Pandas and Parquet
[ https://issues.apache.org/jira/browse/ARROW-9812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211270#comment-17211270 ] Derek Marsh commented on ARROW-9812: Hi all, I've searched existing issues as best I can and this issue mentions "1. Map data type doesn't work from Arrow -> Pandas." I built the project from master at roughly 15:00 UTC today (October 9) and added one line before I built pyarrow: {code:java} _Type_MAP: np.object_,{code} after this line: [https://github.com/apache/arrow/blob/master/python/pyarrow/types.pxi#L49|types.pxi] This enables Table.to_pandas() to convert a MapType to List[Tuple[...]] {code:java} >>> import pyarrow as pa >>> d = pa.map_(pa.int64(), pa.float64()) >>> d.to_pandas_dtype() {code} {code:java} >>> tbl pyarrow.Table stored_on: double vals: map child 0, entries: struct not null child 0, key: int64 not null child 1, value: double >>> tbl.to_pydict() {'stored_on': [1585347700.204351], 'vals': [[(514, 12.0), (515, 1300.0), (519, 125.0), (2978, 126.0), (3236, 13107.0), (3237, 1.0), (3238, 1.0), (3239, 3.0), (3240, 3.0)]]} >>> df = tbl.to_pandas() >>> df.vals 0 [(514, 12.0), (515, 1300.0), (519, 125.0), (29... Name: vals, dtype: object >>> df.vals.iloc[0] [(514, 12.0), (515, 1300.0), (519, 125.0), (2978, 126.0), (3236, 13107.0), (3237, 1.0), (3238, 1.0), (3239, 3.0), (3240, 3.0)] >>> df.vals.iloc[0][0] (514, 12.0){code} I understand this is a very trivial working example, but am interested what any maintainers think about this solution and if it merits further testing/consideration. Thanks. > [Python] Map data types doesn't work from Arrow to Pandas and Parquet > - > > Key: ARROW-9812 > URL: https://issues.apache.org/jira/browse/ARROW-9812 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Mayur Srivastava >Priority: Major > > Hi, > I'm having problems using 'map' data type in Arrow/parquet/pandas. > I'm able to convert a pandas data frame to Arrow with a map data type. 
> But, Arrow to Pandas doesn't work. > When I write Arrow to Parquet, it seems to work, but I'm not sure if the data > type is written correctly. > When I read back Parquet to Arrow, it fails saying "reading list of structs" > is not supported. It seems that map is stored as list of structs. > There are two problems here: > # Map data type doesn't work from Arrow -> Pandas. > # Map data type doesn't get written to or read from Arrow -> Parquet. > Questions: > 1. Am I doing something wrong? Is there a way to get these to work? > 2. If these are unsupported features, will this be fixed in a future version? > Do you plans or ETA? > The following code example (followed by output) should demonstrate the issues: > I'm using Arrow 1.0.0 and Pandas 1.0.5. > Thanks! > Mayur > {code:java} > $ cat arrowtest.py > import pyarrow as pa > import pandas as pd > import pyarrow.parquet as pq > import traceback as tb > import io > print(f'PyArrow Version = {pa.__version__}') > print(f'Pandas Version = {pd.__version__}') > df1 = pd.DataFrame({'a': [[('b', '2')]]}) > print(f'df1') > print(f'{df1}') > print(f'Pandas -> Arrow') > try: > t1 = pa.Table.from_pandas(df1, schema=pa.schema([pa.field('a', > pa.map_(pa.string(), pa.string()))])) > print('PASSED') > print(t1) > except: > print(f'FAILED') > tb.print_exc() > print(f'Arrow -> Pandas') > try: > t1.to_pandas() > print('PASSED') > except: > print(f'FAILED') > tb.print_exc()print(f'Arrow -> Parquet') > fh = io.BytesIO() > try: > pq.write_table(t1, fh) > print('PASSED') > except: > print('FAILED') > tb.print_exc() > > print(f'Parquet -> Arrow') > try: > t2 = pq.read_table(source=fh) > print('PASSED') > print(t2) > except: > print('FAILED') > tb.print_exc() > {code} > {code:java} > $ python3.6 arrowtest.py > PyArrow Version = 1.0.0 > Pandas Version = 1.0.5 > df1 > a 0 [(b, 2)] > > Pandas -> Arrow > PASSED > pyarrow.Table > a: map > child 0, entries: struct not null > child 0, key: string not null > child 1, value: string > > Arrow -> 
Pandas > FAILED > Traceback (most recent call last): > File "arrowtest.py", line 26, in t1.to_pandas() > File "pyarrow/array.pxi", line 715, in > pyarrow.lib._PandasConvertible.to_pandas > File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas File > "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line 779, in > table_to_blockmanager blocks = _table_to_blocks(options, table, categories, > ext_columns_dtypes) > File "XXX/pyarrow/1/0/x/dist/lib/python3.6/pyarrow/pandas_compat.py", line > 1115, in _table_to_blocks list(extension_columns.keys())) > File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks File > "pyarrow/error.pxi", line
[jira] [Updated] (ARROW-9956) [C++][Gandiva] Implement Binary string function in Gandiva
[ https://issues.apache.org/jira/browse/ARROW-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-9956: Summary: [C++][Gandiva] Implement Binary string function in Gandiva (was: Implement Binary string function in Gandiva) > [C++][Gandiva] Implement Binary string function in Gandiva > -- > > Key: ARROW-9956 > URL: https://issues.apache.org/jira/browse/ARROW-9956 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Gandiva >Reporter: Naman Udasi >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 4h 40m > Remaining Estimate: 0h > > Implementation of the new binary_string function in Gandiva. > The function takes in a normal string or a hexadecimal string > (_e.g. \x41\x20\x42\x20\x43_) and converts it to VARBINARY (a byte array). > It is generally used with CAST functions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10215) [Rust] [DataFusion] Rename "Source" typedef
[ https://issues.apache.org/jira/browse/ARROW-10215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Leitão resolved ARROW-10215. -- Fix Version/s: (was: 3.0.0) 2.0.0 Resolution: Fixed Issue resolved by pull request 8408 [https://github.com/apache/arrow/pull/8408] > [Rust] [DataFusion] Rename "Source" typedef > --- > > Key: ARROW-10215 > URL: https://issues.apache.org/jira/browse/ARROW-10215 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Jorge Leitão >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > The name "Source" for this type doesn't make sense to me. I would like to > discuss alternate names for it. > {code:java} > type Source = Box; {code} > My first thoughts are: > * RecordBatchIterator > * RecordBatchStream > * SendableRecordBatchReader -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10206) [Python][C++][FlightRPC] Add client option to disable server validation
[ https://issues.apache.org/jira/browse/ARROW-10206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-10206: - Component/s: C++ > [Python][C++][FlightRPC] Add client option to disable server validation > --- > > Key: ARROW-10206 > URL: https://issues.apache.org/jira/browse/ARROW-10206 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: James Duong >Assignee: James Duong >Priority: Major > Labels: pull-request-available > Time Spent: 6h 40m > Remaining Estimate: 0h > > Note that this requires using grpc-cpp version 1.25 or higher. > This requires using GRPC's TlsCredentials class, which is in a different > namespace for 1.25-1.31 vs. 1.32+ as well. > This class and its related options provide an option to disable server > certificate checks and require the caller to supply a callback to be used > instead. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10206) [Python][C++][FlightRPC] Add client option to disable server validation
[ https://issues.apache.org/jira/browse/ARROW-10206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-10206: - Component/s: Python > [Python][C++][FlightRPC] Add client option to disable server validation > --- > > Key: ARROW-10206 > URL: https://issues.apache.org/jira/browse/ARROW-10206 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++, Python >Reporter: James Duong >Assignee: James Duong >Priority: Major > Labels: pull-request-available > Time Spent: 6h 40m > Remaining Estimate: 0h > > Note that this requires using grpc-cpp version 1.25 or higher. > This requires using GRPC's TlsCredentials class, which is in a different > namespace for 1.25-1.31 vs. 1.32+ as well. > This class and its related options provide an option to disable server > certificate checks and require the caller to supply a callback to be used > instead. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10206) [Python][C++][FlightRPC] Add client option to disable server validation
[ https://issues.apache.org/jira/browse/ARROW-10206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-10206. -- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8325 [https://github.com/apache/arrow/pull/8325] > [Python][C++][FlightRPC] Add client option to disable server validation > --- > > Key: ARROW-10206 > URL: https://issues.apache.org/jira/browse/ARROW-10206 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++, Python >Reporter: James Duong >Assignee: James Duong >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 6h 40m > Remaining Estimate: 0h > > Note that this requires using grpc-cpp version 1.25 or higher. > This requires using GRPC's TlsCredentials class, which is in a different > namespace for 1.25-1.31 vs. 1.32+ as well. > This class and its related options provide an option to disable server > certificate checks and require the caller to supply a callback to be used > instead. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10250) [FlightRPC][C++] Remove default constructor for FlightClientOptions
David Li created ARROW-10250: Summary: [FlightRPC][C++] Remove default constructor for FlightClientOptions Key: ARROW-10250 URL: https://issues.apache.org/jira/browse/ARROW-10250 Project: Apache Arrow Issue Type: Improvement Components: C++, FlightRPC Reporter: David Li Fix For: 3.0.0 We should delete the default constructor for FlightClientOptions and require the struct to always be initialized with Defaults(). -- This message was sent by Atlassian Jira (v8.3.4#803005)
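The proposed C++ change (a deleted default constructor plus a {{Defaults()}} factory) can be sketched in Python for illustration; the {{tls_root_certs}} field below is hypothetical:

```python
class FlightClientOptions:
    """Sketch of an options struct that must be built via Defaults()."""
    _token = object()  # private token: only Defaults() can pass it

    def __init__(self, _auth=None, *, tls_root_certs=None):
        if _auth is not FlightClientOptions._token:
            raise TypeError("construct with FlightClientOptions.Defaults()")
        self.tls_root_certs = tls_root_certs  # hypothetical field

    @classmethod
    def Defaults(cls):
        """The only sanctioned way to obtain a fully initialized instance."""
        return cls(cls._token)

opts = FlightClientOptions.Defaults()   # works
# FlightClientOptions()                 # raises TypeError
```

In C++ the same intent is expressed directly with `FlightClientOptions() = delete;` plus a static `Defaults()` member, so the struct can never be left uninitialized.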
[jira] [Updated] (ARROW-10248) [C++][Dataset] Dataset writing does not write schema metadata
[ https://issues.apache.org/jira/browse/ARROW-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10248: --- Labels: pull-request-available (was: ) > [C++][Dataset] Dataset writing does not write schema metadata > - > > Key: ARROW-10248 > URL: https://issues.apache.org/jira/browse/ARROW-10248 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Ben Kietzman >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Not sure if this is related to the writing refactor that landed yesterday, > but `write_dataset` does not preserve the schema metadata (eg used for pandas > metadata): > {code} > In [20]: df = pd.DataFrame({'a': [1, 2, 3]}) > In [21]: table = pa.Table.from_pandas(df) > In [22]: table.schema > Out[22]: > a: int64 > -- schema metadata -- > pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + > 396 > In [23]: ds.write_dataset(table, "test_write_dataset_pandas", > format="parquet") > In [24]: pq.read_table("test_write_dataset_pandas/part-0.parquet").schema > Out[24]: > a: int64 > -- field metadata -- > PARQUET:field_id: '1' > {code} > I tagged it for 2.0.0 for a moment in case it's possible today, but I didn't > yet look into how easy it would be to fix. > cc [~bkietz] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10243) [Rust] [Datafusion] Optimize literal expression evaluation
[ https://issues.apache.org/jira/browse/ARROW-10243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211092#comment-17211092 ] Jorge Leitão commented on ARROW-10243: -- All great ideas. Yes! > [Rust] [Datafusion] Optimize literal expression evaluation > -- > > Key: ARROW-10243 > URL: https://issues.apache.org/jira/browse/ARROW-10243 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Jörn Horstmann >Priority: Major > Attachments: flamegraph.svg > > > While benchmarking the tpch query I noticed that the physical literal > expression takes up a sizable amount of time. I think the creation of the > corresponding array for numeric literals can be sped up by creating Buffer > and ArrayData directly, without going through a builder. That also allows > skipping the null bitmap for non-null literals. > I'm also considering whether it might be possible to cache the created array. > For queries without a WHERE clause, I'd expect all batches except the last to > have the same length. I'm not sure though where to store the cached value. > Another possible optimization could be to cast literals already on the > logical plan side. In the tpch query the literal `1` is of type `u64` in the > logical plan and then needs to be processed by a cast kernel to convert to > `f64` for usage in an arithmetic expression. > The attached flamegraph is of 10 runs of tpch, with the data being loaded > into memory before running the queries (See ARROW-10240). > {code} > flamegraph ./target/release/tpch --iterations 10 --path ../tpch-dbgen > --format tbl --query 1 --batch-size 4096 -c1 --load > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
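The caching idea can be sketched in pure Python (DataFusion itself is Rust; `literal_array` below is a stand-in, not a real DataFusion function): memoize the materialized literal array keyed by value and batch length, so equally sized batches reuse one allocation.

```python
from functools import lru_cache

@lru_cache(maxsize=16)
def literal_array(value, length):
    # Stand-in for materializing an Arrow array for a literal; a real
    # implementation would build Buffer/ArrayData directly and, for a
    # non-null literal, skip allocating the null bitmap entirely.
    return tuple([value] * length)

# Batches of a query without a WHERE clause usually share one length,
# so the cached array is reused instead of being rebuilt per batch.
a = literal_array(1.0, 4096)
b = literal_array(1.0, 4096)
print(a is b)  # True: one allocation served both batches
```

Where to hang the cache in DataFusion (on the expression node, or elsewhere) is exactly the open question in the ticket.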
[jira] [Resolved] (ARROW-10175) [CI] Nightly hdfs integration test job fails
[ https://issues.apache.org/jira/browse/ARROW-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-10175. - Resolution: Fixed Issue resolved by pull request 8413 [https://github.com/apache/arrow/pull/8413] > [CI] Nightly hdfs integration test job fails > > > Key: ARROW-10175 > URL: https://issues.apache.org/jira/browse/ARROW-10175 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, Python >Reporter: Neal Richardson >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > Two tests fail: > https://github.com/ursa-labs/crossbow/runs/1204680589 > [removed bogus investigation] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10203) [Doc] Capture guidance for endianness support in contributors guide.
[ https://issues.apache.org/jira/browse/ARROW-10203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-10203: Summary: [Doc] Capture guidance for endianness support in contributors guide. (was: Capture guidance for endianness support in contributors guide.) > [Doc] Capture guidance for endianness support in contributors guide. > > > Key: ARROW-10203 > URL: https://issues.apache.org/jira/browse/ARROW-10203 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Time Spent: 3h 10m > Remaining Estimate: 0h > > https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%3ccak7z5t--hhhr9dy43pyhd6m-xou4qogwqvlwzsg-koxxjpt...@mail.gmail.com%3e -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10231) [CI] Unable to download minio in arm32v7 docker image
[ https://issues.apache.org/jira/browse/ARROW-10231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-10231. - Resolution: Fixed Issue resolved by pull request 8396 [https://github.com/apache/arrow/pull/8396] > [CI] Unable to download minio in arm32v7 docker image > - > > Key: ARROW-10231 > URL: https://issues.apache.org/jira/browse/ARROW-10231 > Project: Apache Arrow > Issue Type: Improvement > Components: CI >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > See build log https://github.com/apache/arrow/runs/1224947766#step:5:2021 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8355) [Python] Reduce the number of pandas dependent test cases in test_feather
[ https://issues.apache.org/jira/browse/ARROW-8355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche resolved ARROW-8355. -- Fix Version/s: (was: 3.0.0) 2.0.0 Resolution: Fixed Issue resolved by pull request 8244 [https://github.com/apache/arrow/pull/8244] > [Python] Reduce the number of pandas dependent test cases in test_feather > - > > Key: ARROW-8355 > URL: https://issues.apache.org/jira/browse/ARROW-8355 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: Krisztian Szucs >Assignee: Andrew Wieteska >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > See comment https://github.com/apache/arrow/pull/6849#discussion_r404160096 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10248) [C++][Dataset] Dataset writing does not write schema metadata
[ https://issues.apache.org/jira/browse/ARROW-10248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman reassigned ARROW-10248: Assignee: Ben Kietzman > [C++][Dataset] Dataset writing does not write schema metadata > - > > Key: ARROW-10248 > URL: https://issues.apache.org/jira/browse/ARROW-10248 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Ben Kietzman >Priority: Major > Fix For: 2.0.0 > > > Not sure if this is related to the writing refactor that landed yesterday, > but `write_dataset` does not preserve the schema metadata (eg used for pandas > metadata): > {code} > In [20]: df = pd.DataFrame({'a': [1, 2, 3]}) > In [21]: table = pa.Table.from_pandas(df) > In [22]: table.schema > Out[22]: > a: int64 > -- schema metadata -- > pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + > 396 > In [23]: ds.write_dataset(table, "test_write_dataset_pandas", > format="parquet") > In [24]: pq.read_table("test_write_dataset_pandas/part-0.parquet").schema > Out[24]: > a: int64 > -- field metadata -- > PARQUET:field_id: '1' > {code} > I tagged it for 2.0.0 for a moment in case it's possible today, but I didn't > yet look into how easy it would be to fix. > cc [~bkietz] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7957) [Python] ParquetDataset cannot take HadoopFileSystem as filesystem
[ https://issues.apache.org/jira/browse/ARROW-7957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7957: -- Labels: pull-request-available (was: ) > [Python] ParquetDataset cannot take HadoopFileSystem as filesystem > -- > > Key: ARROW-7957 > URL: https://issues.apache.org/jira/browse/ARROW-7957 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.16.0 >Reporter: Catherine >Assignee: Joris Van den Bossche >Priority: Critical > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > {{from pyarrow.fs import HadoopFileSystem}} > {{import pyarrow.parquet as pq}} > > {{file_name = "hdfs://localhost:9000/test/file_name.pq"}} > {{hdfs, path = HadoopFileSystem.from_uri(file_name)}} > {{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}} > > raises the error: > {{OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>}} > > When I tried using the deprecated {{HadoopFileSystem}}: > {{import pyarrow}} > {{import pyarrow.parquet as pq}} > > {{file_name = "hdfs://localhost:9000/test/file_name.pq"}} > {{hdfs = pyarrow.hdfs.connect('localhost', 9000)}} > {{dataset = pq.ParquetDataset(file_names, filesystem=hdfs)}} > {{pa_schema = dataset.schema.to_arrow_schema()}} > {{pieces = dataset.pieces}} > {{for piece in pieces: }} > {{ print(piece.path)}} > > {{piece.path}} loses the {{hdfs://localhost:9000}} prefix. > > I think {{ParquetDataset}} should accept {{pyarrow.fs.HadoopFileSystem}} as > filesystem, and {{piece.path}} should keep the prefix. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7957) [Python] ParquetDataset cannot take HadoopFileSystem as filesystem
[ https://issues.apache.org/jira/browse/ARROW-7957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7957: - Fix Version/s: (was: 3.0.0) 2.0.0 > [Python] ParquetDataset cannot take HadoopFileSystem as filesystem > -- > > Key: ARROW-7957 > URL: https://issues.apache.org/jira/browse/ARROW-7957 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.16.0 >Reporter: Catherine >Assignee: Joris Van den Bossche >Priority: Critical > Fix For: 2.0.0 > > > {{from pyarrow.fs import HadoopFileSystem}} > {{import pyarrow.parquet as pq}} > > {{file_name = "hdfs://localhost:9000/test/file_name.pq"}} > {{hdfs, path = HadoopFileSystem.from_uri(file_name)}} > {{dataset = pq.ParquetDataset(file_name, filesystem=hdfs)}} > > raises the error: > {{OSError: Unrecognized filesystem: <class 'pyarrow._hdfs.HadoopFileSystem'>}} > > When I tried using the deprecated {{HadoopFileSystem}}: > {{import pyarrow}} > {{import pyarrow.parquet as pq}} > > {{file_name = "hdfs://localhost:9000/test/file_name.pq"}} > {{hdfs = pyarrow.hdfs.connect('localhost', 9000)}} > {{dataset = pq.ParquetDataset(file_names, filesystem=hdfs)}} > {{pa_schema = dataset.schema.to_arrow_schema()}} > {{pieces = dataset.pieces}} > {{for piece in pieces: }} > {{ print(piece.path)}} > > {{piece.path}} loses the {{hdfs://localhost:9000}} prefix. > > I think {{ParquetDataset}} should accept {{pyarrow.fs.HadoopFileSystem}} as > filesystem, and {{piece.path}} should keep the prefix. -- This message was sent by Atlassian Jira (v8.3.4#803005)
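The missing prefix is consistent with how a URI splits into a (filesystem, path) pair: the scheme and authority identify the filesystem, and only the remaining path is attached to each piece. A standard-library sketch (an illustration, not pyarrow's actual parsing code):

```python
from urllib.parse import urlparse

uri = "hdfs://localhost:9000/test/file_name.pq"
parsed = urlparse(uri)

# The scheme + authority belong to the filesystem object...
filesystem_part = f"{parsed.scheme}://{parsed.netloc}"
# ...and the piece path is relative to that filesystem,
# hence the "lost" hdfs://localhost:9000 prefix.
path_part = parsed.path

print(filesystem_part)  # hdfs://localhost:9000
print(path_part)        # /test/file_name.pq
```

Whether `piece.path` should re-attach the prefix is then a presentation question for the API, since the information lives in the filesystem object rather than the path.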
[jira] [Commented] (ARROW-10245) [CI] Update the conda docker images to use miniforge instead of miniconda
[ https://issues.apache.org/jira/browse/ARROW-10245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211045#comment-17211045 ] Uwe Korn commented on ARROW-10245: -- This adds ppc64le, aarch64 and osx-arm64 as supported architectures, and should only require changing the download URL. Be aware that miniforge doesn't include defaults as a default channel, and also that it is unavailable for Windows. > [CI] Update the conda docker images to use miniforge instead of miniconda > - > > Key: ARROW-10245 > URL: https://issues.apache.org/jira/browse/ARROW-10245 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Krisztian Szucs >Priority: Major > > So we could support more architectures > https://github.com/conda-forge/miniforge > cc [~uwe] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10175) [CI] Nightly hdfs integration test job fails
[ https://issues.apache.org/jira/browse/ARROW-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10175: --- Labels: pull-request-available (was: ) > [CI] Nightly hdfs integration test job fails > > > Key: ARROW-10175 > URL: https://issues.apache.org/jira/browse/ARROW-10175 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, Python >Reporter: Neal Richardson >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Two tests fail: > https://github.com/ursa-labs/crossbow/runs/1204680589 > [removed bogus investigation] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10175) [CI] Nightly hdfs integration test job fails
[ https://issues.apache.org/jira/browse/ARROW-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210982#comment-17210982 ] Joris Van den Bossche commented on ARROW-10175: --- OK, I needed to pass {{use_legacy_dataset=True}} in some places, because the default is now to use the dataset implementation, which of course doesn't work when passing legacy filesystems. Now, the first error, which is reading from an URI and _not_ passing a legacy HadoopFileSystem object, seems a legitimate bug (because passing an URI should "just" use the new implementation): {code} pyarrow.lib.ArrowInvalid: Path '/tmp/pyarrow-test-838/multi-parquet-uri-48569714efc74397816722c9c6723191/0.parquet' is not relative to '/user/root' {code} > [CI] Nightly hdfs integration test job fails > > > Key: ARROW-10175 > URL: https://issues.apache.org/jira/browse/ARROW-10175 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, Python >Reporter: Neal Richardson >Assignee: Joris Van den Bossche >Priority: Major > Fix For: 2.0.0 > > > Two tests fail: > https://github.com/ursa-labs/crossbow/runs/1204680589 > [removed bogus investigation] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10246) [Python] Incorrect conversion of Arrow dictionary to Parquet dictionary when duplicate values are present
[ https://issues.apache.org/jira/browse/ARROW-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210933#comment-17210933 ] Joris Van den Bossche commented on ARROW-10246: --- This was already reported as ARROW-10237 and fixed in the meantime. But we should have notified the mailing list about that, sorry about that! Thanks for looking into it anyway! > [Python] Incorrect conversion of Arrow dictionary to Parquet dictionary when > duplicate values are present > - > > Key: ARROW-10246 > URL: https://issues.apache.org/jira/browse/ARROW-10246 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Matt Jadczak >Priority: Major > > Copying this from [the mailing > list|https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E] > We can observe the following odd behaviour when round-tripping data via > parquet using pyarrow, when the data contains dictionary arrays with > duplicate values. 
> > {code:java} > import pyarrow as pa > import pyarrow.parquet as pq > my_table = pa.Table.from_batches( > [ > pa.RecordBatch.from_arrays( > [ > pa.array([0, 1, 2, 3, 4]), > pa.DictionaryArray.from_arrays( > pa.array([0, 1, 2, 3, 4]), > pa.array(['a', 'd', 'c', 'd', 'e']) > ) > ], > names=['foo', 'bar'] > ) > ] > ) > my_table.validate(full=True) > pq.write_table(my_table, "foo.parquet") > read_table = pq.ParquetFile("foo.parquet").read() > read_table.validate(full=True) > print(my_table.column(1).to_pylist()) > print(read_table.column(1).to_pylist()) > assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist() > {code} > Both tables pass full validation, yet the last three lines print: > {code:java} > ['a', 'd', 'c', 'd', 'e'] > ['a', 'd', 'c', 'e', 'a'] > Traceback (most recent call last): > File > "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line > 29, in > assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist() > AssertionError{code} > Which clearly doesn't look right! > > It seems to me that the reason this is happening is that when re-encoding an > Arrow dictionary as a Parquet one, the function at > [https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/cpp/src/parquet/encoding.cc#L773] > is called to create a Parquet DictEncoder out of the Arrow dictionary data. > This internally uses a map from value to index, and this map is constructed > by continually calling GetOrInsert on a memo table. When called with > duplicate values as in Al's example, the duplicates do not cause a new > dictionary index to be allocated, but instead return the existing one (which > is just ignored). However, the caller assumes that the resulting Parquet > dictionary uses the exact same indices as the Arrow one, and proceeds to just > copy the index data directly. 
In Al's example, this results in an invalid > dictionary index being written (that it is somehow wrapped around when > reading again, rather than crashing, is potentially a second bug). -- This message was sent by Atlassian Jira (v8.3.4#803005)
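The suspected logic can be modeled in a few lines of pure Python (this mirrors the description above, not the actual C++ memo table): deduplicating the dictionary while copying the original indices verbatim reproduces the exact corrupted output from the report.

```python
def get_or_insert(memo, value):
    """GetOrInsert-style helper: return value's index, inserting if unseen."""
    if value not in memo:
        memo[value] = len(memo)
    return memo[value]

arrow_dictionary = ['a', 'd', 'c', 'd', 'e']   # note the duplicate 'd'
arrow_indices = [0, 1, 2, 3, 4]

# Build the deduplicated Parquet dictionary the way the encoder would.
memo = {}
for value in arrow_dictionary:
    get_or_insert(memo, value)
parquet_dictionary = list(memo)   # ['a', 'd', 'c', 'e'] -- only 4 entries now

# Copying the Arrow indices unchanged leaves index 4 out of range; wrapping
# it (as the reader apparently does) reproduces the reported output.
decoded = [parquet_dictionary[i % len(parquet_dictionary)] for i in arrow_indices]
print(decoded)   # ['a', 'd', 'c', 'e', 'a'], not ['a', 'd', 'c', 'd', 'e']
```

The match with the observed `['a', 'd', 'c', 'e', 'a']` output supports the diagnosis: the encoder remaps values but never remaps the indices.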
[jira] [Closed] (ARROW-10246) [Python] Incorrect conversion of Arrow dictionary to Parquet dictionary when duplicate values are present
[ https://issues.apache.org/jira/browse/ARROW-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche closed ARROW-10246. - Resolution: Duplicate > [Python] Incorrect conversion of Arrow dictionary to Parquet dictionary when > duplicate values are present > - > > Key: ARROW-10246 > URL: https://issues.apache.org/jira/browse/ARROW-10246 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Matt Jadczak >Priority: Major > > Copying this from [the mailing > list|https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E] > We can observe the following odd behaviour when round-tripping data via > parquet using pyarrow, when the data contains dictionary arrays with > duplicate values. > > {code:java} > import pyarrow as pa > import pyarrow.parquet as pq > my_table = pa.Table.from_batches( > [ > pa.RecordBatch.from_arrays( > [ > pa.array([0, 1, 2, 3, 4]), > pa.DictionaryArray.from_arrays( > pa.array([0, 1, 2, 3, 4]), > pa.array(['a', 'd', 'c', 'd', 'e']) > ) > ], > names=['foo', 'bar'] > ) > ] > ) > my_table.validate(full=True) > pq.write_table(my_table, "foo.parquet") > read_table = pq.ParquetFile("foo.parquet").read() > read_table.validate(full=True) > print(my_table.column(1).to_pylist()) > print(read_table.column(1).to_pylist()) > assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist() > {code} > Both tables pass full validation, yet the last three lines print: > {code:java} > ['a', 'd', 'c', 'd', 'e'] > ['a', 'd', 'c', 'e', 'a'] > Traceback (most recent call last): > File > "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line > 29, in > assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist() > AssertionError{code} > Which clearly doesn't look right! 
> > It seems to me that the reason this is happening is that when re-encoding an > Arrow dictionary as a Parquet one, the function at > [https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/cpp/src/parquet/encoding.cc#L773] > is called to create a Parquet DictEncoder out of the Arrow dictionary data. > This internally uses a map from value to index, and this map is constructed > by continually calling GetOrInsert on a memo table. When called with > duplicate values as in Al's example, the duplicates do not cause a new > dictionary index to be allocated, but instead return the existing one (which > is just ignored). However, the caller assumes that the resulting Parquet > dictionary uses the exact same indices as the Arrow one, and proceeds to just > copy the index data directly. In Al's example, this results in an invalid > dictionary index being written (that it is somehow wrapped around when > reading again, rather than crashing, is potentially a second bug). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9952) [Python] Use pyarrow.dataset writing for pq.write_to_dataset
[ https://issues.apache.org/jira/browse/ARROW-9952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-9952: -- Labels: pull-request-available (was: ) > [Python] Use pyarrow.dataset writing for pq.write_to_dataset > > > Key: ARROW-9952 > URL: https://issues.apache.org/jira/browse/ARROW-9952 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Now ARROW-9658 and ARROW-9893 are in, we can explore using the > {{pyarrow.dataset}} writing capabilities in {{parquet.write_to_dataset}}. > Similarly as was done in {{pq.read_table}}, we could initially have a keyword > to switch between both implementations, eventually defaulting to the new > datasets one, and to deprecated the old (inefficient) python implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
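The migration described above might look like the following sketch (all names, defaults, and return values here are illustrative stand-ins, not the final pyarrow API): a keyword routes between the two implementations while the legacy path warns.

```python
import warnings

def write_to_dataset(table, root_path, use_legacy_dataset=True, **kwargs):
    """Illustrative only: route between the legacy writer and the new
    pyarrow.dataset-based writer, deprecating the former."""
    if use_legacy_dataset:
        warnings.warn(
            "the legacy write_to_dataset implementation is deprecated; "
            "pass use_legacy_dataset=False to use pyarrow.dataset",
            FutureWarning,
        )
        return "legacy-writer"    # stand-in for the old Python implementation
    return "dataset-writer"       # stand-in for the datasets-based writer

print(write_to_dataset(None, "some/root", use_legacy_dataset=False))  # dataset-writer
```

Callers opt in explicitly until the default flips, mirroring how {{pq.read_table}} was migrated.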
[jira] [Updated] (ARROW-10249) [Rust]: Support Dictionary types for ListArrays in arrow json reader
[ https://issues.apache.org/jira/browse/ARROW-10249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahmut Bulut updated ARROW-10249: - Summary: [Rust]: Support Dictionary types for ListArrays in arrow json reader (was: [Rust]: Support Dictionary types in arrow json reader) > [Rust]: Support Dictionary types for ListArrays in arrow json reader > > > Key: ARROW-10249 > URL: https://issues.apache.org/jira/browse/ARROW-10249 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Mahmut Bulut >Priority: Major > > Currently, dictionary types are not supported in Arrow JSON reader. It would > be nice to add dictionary type support. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10249) [Rust]: Support Dictionary types for ListArrays in arrow json reader
[ https://issues.apache.org/jira/browse/ARROW-10249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahmut Bulut updated ARROW-10249: - Description: Currently, dictionary types for ListArrays are not supported in Arrow JSON reader. It would be nice to add dictionary type support. (was: Currently, dictionary types are not supported in Arrow JSON reader. It would be nice to add dictionary type support.) > [Rust]: Support Dictionary types for ListArrays in arrow json reader > > > Key: ARROW-10249 > URL: https://issues.apache.org/jira/browse/ARROW-10249 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Mahmut Bulut >Priority: Major > > Currently, dictionary types for ListArrays are not supported in Arrow JSON > reader. It would be nice to add dictionary type support. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10249) [Rust]: Support Dictionary types in arrow json reader
Mahmut Bulut created ARROW-10249: Summary: [Rust]: Support Dictionary types in arrow json reader Key: ARROW-10249 URL: https://issues.apache.org/jira/browse/ARROW-10249 Project: Apache Arrow Issue Type: New Feature Components: Rust Reporter: Mahmut Bulut Currently, dictionary types are not supported in Arrow JSON reader. It would be nice to add dictionary type support. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10248) [C++][Dataset] Dataset writing does not write schema metadata
Joris Van den Bossche created ARROW-10248: - Summary: [C++][Dataset] Dataset writing does not write schema metadata Key: ARROW-10248 URL: https://issues.apache.org/jira/browse/ARROW-10248 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Fix For: 2.0.0 Not sure if this is related to the writing refactor that landed yesterday, but `write_dataset` does not preserve the schema metadata (eg used for pandas metadata): {code} In [20]: df = pd.DataFrame({'a': [1, 2, 3]}) In [21]: table = pa.Table.from_pandas(df) In [22]: table.schema Out[22]: a: int64 -- schema metadata -- pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 396 In [23]: ds.write_dataset(table, "test_write_dataset_pandas", format="parquet") In [24]: pq.read_table("test_write_dataset_pandas/part-0.parquet").schema Out[24]: a: int64 -- field metadata -- PARQUET:field_id: '1' {code} I tagged it for 2.0.0 for a moment in case it's possible today, but I didn't yet look into how easy it would be to fix. cc [~bkietz] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10247) [C++][Dataset] Cannot write dataset with dictionary column as partition field
[ https://issues.apache.org/jira/browse/ARROW-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210866#comment-17210866 ] Joris Van den Bossche commented on ARROW-10247: --- cc [~bkietz] > [C++][Dataset] Cannot write dataset with dictionary column as partition field > - > > Key: ARROW-10247 > URL: https://issues.apache.org/jira/browse/ARROW-10247 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Labels: dataset > Fix For: 2.0.0 > > > When the column to use for partitioning is dictionary encoded, we get this > error: > {code} > In [9]: import pyarrow.dataset as ds > In [10]: part = ["xxx"] * 3 + ["yyy"] * 3 > ...: table = pa.table([ > ...: pa.array(range(len(part))), > ...: pa.array(part).dictionary_encode(), > ...: ], names=['col', 'part']) > In [11]: part = ds.partitioning(table.select(["part"]).schema) > In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet", > partitioning=part) > --- > ArrowTypeErrorTraceback (most recent call last) > in > > 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet", > partitioning=part) > ~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data, > base_dir, basename_template, format, partitioning, schema, filesystem, > file_options, use_threads) > 773 _filesystemdataset_write( > 774 data, base_dir, basename_template, schema, > --> 775 filesystem, partitioning, file_options, use_threads, > 776 ) > ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in > pyarrow._dataset._filesystemdataset_write() > ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status() > ArrowTypeError: scalar xxx (of type string) is invalid for part: > dictionary > In ../src/arrow/dataset/filter.cc, line 1082, code: > VisitConjunctionMembers(*and_.left_operand(), visitor) > In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr, > [&](const std::string& name, const std::shared_ptr& value) { auto&& > 
_error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do { > ::arrow::Status __s = > ::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if > ((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s); > _st.AddContextLine("../src/arrow/dataset/partition.cc", 257, > "(_error_or_value28).status()"); return _st; } } while (0); } while (false); > auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const > auto& field = schema_->field(match[0]); if > (!value->type->Equals(field->type())) { return Status::TypeError("scalar ", > value->ToString(), " (of type ", *value->type, ") is invalid for ", > field->ToString()); } values[match[0]] = value.get(); } return Status::OK(); > }) > In ../src/arrow/dataset/file_base.cc, line 321, code: > (_error_or_value24).status() > In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish() > {code} > While this seems a quit normal use case, as this column will typically be > repeated many times (and we also support reading it as such with dictionary > type, so a roundtrip is currently not possible in that case) > I tagged it for 2.0.0 for a moment in case it's possible today, but I didn't > yet look into how easy it would be to fix. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10247) [C++][Dataset] Cannot write dataset with dictionary column as partition field
Joris Van den Bossche created ARROW-10247: - Summary: [C++][Dataset] Cannot write dataset with dictionary column as partition field Key: ARROW-10247 URL: https://issues.apache.org/jira/browse/ARROW-10247 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Fix For: 2.0.0 When the column to use for partitioning is dictionary encoded, we get this error: {code} In [9]: import pyarrow.dataset as ds In [10]: part = ["xxx"] * 3 + ["yyy"] * 3 ...: table = pa.table([ ...: pa.array(range(len(part))), ...: pa.array(part).dictionary_encode(), ...: ], names=['col', 'part']) In [11]: part = ds.partitioning(table.select(["part"]).schema) In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet", partitioning=part) --- ArrowTypeErrorTraceback (most recent call last) in > 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet", partitioning=part) ~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data, base_dir, basename_template, format, partitioning, schema, filesystem, file_options, use_threads) 773 _filesystemdataset_write( 774 data, base_dir, basename_template, schema, --> 775 filesystem, partitioning, file_options, use_threads, 776 ) ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset._filesystemdataset_write() ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status() ArrowTypeError: scalar xxx (of type string) is invalid for part: dictionary In ../src/arrow/dataset/filter.cc, line 1082, code: VisitConjunctionMembers(*and_.left_operand(), visitor) In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr, [&](const std::string& name, const std::shared_ptr& value) { auto&& _error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do { ::arrow::Status __s = ::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if ((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s); _st.AddContextLine("../src/arrow/dataset/partition.cc", 
257, "(_error_or_value28).status()"); return _st; } } while (0); } while (false); auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const auto& field = schema_->field(match[0]); if (!value->type->Equals(field->type())) { return Status::TypeError("scalar ", value->ToString(), " (of type ", *value->type, ") is invalid for ", field->ToString()); } values[match[0]] = value.get(); } return Status::OK(); }) In ../src/arrow/dataset/file_base.cc, line 321, code: (_error_or_value24).status() In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish() {code} While this seems a quit normal use case, as this column will typically be repeated many times (and we also support reading it as such with dictionary type, so a roundtrip is currently not possible in that case) I tagged it for 2.0.0 for a moment in case it's possible today, but I didn't yet look into how easy it would be to fix. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10114) [R] Segfault in to_dataframe_parallel with deeply nested structs
[ https://issues.apache.org/jira/browse/ARROW-10114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10114: --- Labels: pull-request-available (was: ) > [R] Segfault in to_dataframe_parallel with deeply nested structs > > > Key: ARROW-10114 > URL: https://issues.apache.org/jira/browse/ARROW-10114 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 1.0.1 > Environment: > sessionInfo() > R version 3.6.3 (2020-02-29) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Linux Mint 19.3 > Matrix products: default > BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1 > LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1 > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=sv_SE.UTF-8LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=sv_SE.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=sv_SE.UTF-8 LC_IDENTIFICATION=C > attached base packages: > [1] stats graphics grDevices utils datasets methods base > other attached packages: > [1] arrow_1.0.1 > loaded via a namespace (and not attached): > [1] tidyselect_1.1.0 bit_4.0.4compiler_3.6.3 magrittr_1.5 > [5] assertthat_0.2.1 R6_2.4.1 glue_1.4.1 Rcpp_1.0.5 > [9] bit64_4.0.2 vctrs_0.3.2 rlang_0.4.7 purrr_0.3.4 >Reporter: Markus Skyttner >Assignee: Romain Francois >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Attachments: Dockerfile, Makefile, reprex_10114.R > > Time Spent: 10m > Remaining Estimate: 0h > > A .jsonl file (newline separated JSON) created from open data available at > [ftp://ftp.libris.kb.se/pub/spa/swepub-deduplicated-2019-12-29.zip] is used > with the R package arrow (installed from CRAN) using the following statement: > > arrow::read_json_arrow("~/.config/swepub/head.jsonl") > It crashes RStudio with no error message. 
At the R prompt, the error message > is: > Error in Table__to_dataframe(x, use_threads = option_use_threads()) : > SET_VECTOR_ELT() can only be applied to a 'list', not a 'integer' > The file "head.jsonl" above was created from the extracted zip's .jsonl file > with the *nix "head -1 $BIG_JSONL_FILE" command. It can be parsed with > jsonlite and tidyjson. > Also got this error message at one point: > > arrow::read_json_arrow("head.jsonl", as_data_frame = TRUE) > *** caught segfault *** > address 0x8, cause 'memory not mapped' > Traceback: > 1: structure(x, extra_cols = colonnade[extra_cols], class = > "pillar_squeezed_colonnade") > 2: new_colonnade_sqeezed(out, colonnade = x, extra_cols = extra_cols) > 3: pillar::squeeze(x$mcf, width = width) > 4: format.trunc_mat(mat) > 5: format(mat) > 6: format.tbl(x, ..., n = n, width = width, n_extra = n_extra) > 7: format(x, ..., n = n, width = width, n_extra = n_extra) > 8: paste0(..., collapse = "\n") > 9: cli::cat_line(format(x, ..., n = n, width = width, n_extra = n_extra)) > 10: print.tbl(x) > 11: (function (x, ...) UseMethod("print"))(x) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10246) [Python] Incorrect conversion of Arrow dictionary to Parquet dictionary when duplicate values are present
[ https://issues.apache.org/jira/browse/ARROW-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Jadczak updated ARROW-10246: - Component/s: Python C++ > [Python] Incorrect conversion of Arrow dictionary to Parquet dictionary when > duplicate values are present > - > > Key: ARROW-10246 > URL: https://issues.apache.org/jira/browse/ARROW-10246 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Matt Jadczak >Priority: Major > > Copying this from [the mailing > list|https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E] > We can observe the following odd behaviour when round-tripping data via > parquet using pyarrow, when the data contains dictionary arrays with > duplicate values. > > {code:java} > import pyarrow as pa > import pyarrow.parquet as pq > my_table = pa.Table.from_batches( > [ > pa.RecordBatch.from_arrays( > [ > pa.array([0, 1, 2, 3, 4]), > pa.DictionaryArray.from_arrays( > pa.array([0, 1, 2, 3, 4]), > pa.array(['a', 'd', 'c', 'd', 'e']) > ) > ], > names=['foo', 'bar'] > ) > ] > ) > my_table.validate(full=True) > pq.write_table(my_table, "foo.parquet") > read_table = pq.ParquetFile("foo.parquet").read() > read_table.validate(full=True) > print(my_table.column(1).to_pylist()) > print(read_table.column(1).to_pylist()) > assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist() > {code} > Both tables pass full validation, yet the last three lines print: > {code:java} > ['a', 'd', 'c', 'd', 'e'] > ['a', 'd', 'c', 'e', 'a'] > Traceback (most recent call last): > File > "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line > 29, in > assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist() > AssertionError{code} > Which clearly doesn't look right! 
> > It seems to me that the reason this is happening is that when re-encoding an > Arrow dictionary as a Parquet one, the function at > [https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/cpp/src/parquet/encoding.cc#L773] > is called to create a Parquet DictEncoder out of the Arrow dictionary data. > This internally uses a map from value to index, and this map is constructed > by continually calling GetOrInsert on a memo table. When called with > duplicate values as in Al's example, the duplicates do not cause a new > dictionary index to be allocated, but instead return the existing one (which > is just ignored). However, the caller assumes that the resulting Parquet > dictionary uses the exact same indices as the Arrow one, and proceeds to just > copy the index data directly. In Al's example, this results in an invalid > dictionary index being written (that it is somehow wrapped around when > reading again, rather than crashing, is potentially a second bug). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10246) [Python] Incorrect conversion of Arrow dictionary to Parquet dictionary when duplicate values are present
Matt Jadczak created ARROW-10246: Summary: [Python] Incorrect conversion of Arrow dictionary to Parquet dictionary when duplicate values are present Key: ARROW-10246 URL: https://issues.apache.org/jira/browse/ARROW-10246 Project: Apache Arrow Issue Type: Bug Reporter: Matt Jadczak Copying this from [the mailing list|https://lists.apache.org/thread.html/r8afb5aed3855e35fe03bd3a27f2c7e2177ed2825c5ad5f455b2c9078%40%3Cdev.arrow.apache.org%3E] We can observe the following odd behaviour when round-tripping data via parquet using pyarrow, when the data contains dictionary arrays with duplicate values. {code:java} import pyarrow as pa import pyarrow.parquet as pq my_table = pa.Table.from_batches( [ pa.RecordBatch.from_arrays( [ pa.array([0, 1, 2, 3, 4]), pa.DictionaryArray.from_arrays( pa.array([0, 1, 2, 3, 4]), pa.array(['a', 'd', 'c', 'd', 'e']) ) ], names=['foo', 'bar'] ) ] ) my_table.validate(full=True) pq.write_table(my_table, "foo.parquet") read_table = pq.ParquetFile("foo.parquet").read() read_table.validate(full=True) print(my_table.column(1).to_pylist()) print(read_table.column(1).to_pylist()) assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist() {code} Both tables pass full validation, yet the last three lines print: {code:java} ['a', 'd', 'c', 'd', 'e'] ['a', 'd', 'c', 'e', 'a'] Traceback (most recent call last): File "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line 29, in assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist() AssertionError{code} Which clearly doesn't look right! It seems to me that the reason this is happening is that when re-encoding an Arrow dictionary as a Parquet one, the function at [https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/cpp/src/parquet/encoding.cc#L773] is called to create a Parquet DictEncoder out of the Arrow dictionary data. 
This internally uses a map from value to index, and this map is constructed by continually calling GetOrInsert on a memo table. When called with duplicate values as in Al's example, the duplicates do not cause a new dictionary index to be allocated, but instead return the existing one (which is just ignored). However, the caller assumes that the resulting Parquet dictionary uses the exact same indices as the Arrow one, and proceeds to just copy the index data directly. In Al's example, this results in an invalid dictionary index being written (that it is somehow wrapped around when reading again, rather than crashing, is potentially a second bug). -- This message was sent by Atlassian Jira (v8.3.4#803005)
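The failure mechanism described above can be reproduced in a few lines of plain Python (a model of the encoder, not the actual parquet-cpp code): inserting the dictionary values through a GetOrInsert-style memo table collapses duplicates, so the original Arrow indices are no longer valid against the new, smaller Parquet dictionary; they must be remapped, not copied verbatim.

```python
def memo_get_or_insert(memo, value):
    """GetOrInsert: duplicates return the existing index instead of a new one."""
    if value not in memo:
        memo[value] = len(memo)
    return memo[value]

arrow_dictionary = ['a', 'd', 'c', 'd', 'e']   # contains a duplicate 'd'
arrow_indices = [0, 1, 2, 3, 4]

memo = {}
# remap[old_index] -> index in the deduplicated Parquet dictionary
remap = [memo_get_or_insert(memo, v) for v in arrow_dictionary]
parquet_dictionary = list(memo)                 # ['a', 'd', 'c', 'e']

# Buggy path: copy the Arrow indices verbatim; index 4 is now out of range
# (the wrap-around observed in the report is modelled with a modulo here).
buggy = [parquet_dictionary[i % len(parquet_dictionary)] for i in arrow_indices]
# Correct path: translate each index through the remap table first.
fixed = [parquet_dictionary[remap[i]] for i in arrow_indices]

assert buggy == ['a', 'd', 'c', 'e', 'a']   # the corrupted values from the report
assert fixed == ['a', 'd', 'c', 'd', 'e']   # matches the original column
```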
[jira] [Assigned] (ARROW-9695) [Rust][DataFusion] Improve documentation on LogicalPlan variants
[ https://issues.apache.org/jira/browse/ARROW-9695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Leitão reassigned ARROW-9695: --- Assignee: Andrew Lamb > [Rust][DataFusion] Improve documentation on LogicalPlan variants > > > Key: ARROW-9695 > URL: https://issues.apache.org/jira/browse/ARROW-9695 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion >Reporter: Andrew Lamb >Assignee: Andrew Lamb >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > I think we could improve the documentation somewhat on LogicalPlan nodes. I > will submit a PR with a proposal. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9733) [Rust][DataFusion] Aggregates COUNT/MIN/MAX don't work on VARCHAR columns
[ https://issues.apache.org/jira/browse/ARROW-9733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Leitão reassigned ARROW-9733: --- Assignee: Jorge Leitão > [Rust][DataFusion] Aggregates COUNT/MIN/MAX don't work on VARCHAR columns > - > > Key: ARROW-9733 > URL: https://issues.apache.org/jira/browse/ARROW-9733 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Andrew Lamb >Assignee: Jorge Leitão >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Attachments: repro.csv > > Time Spent: 20m > Remaining Estimate: 0h > > h2. Reproducer: > Create a table with a string column: > Repro: > {code} > CREATE EXTERNAL TABLE repro(a INT, b VARCHAR) > STORED AS CSV > WITH HEADER ROW > LOCATION 'repro.csv'; > {code} > The contents of repro.csv are as follows (also attached): > {code} > a,b > 1,One > 1,Two > 2,One > 2,Two > 2,Two > {code} > Now, run a query that tries to aggregate that column: > {code} > select a, count(b) from repro group by a; > {code} > *Actual behavior*: > {code} > > select a, count(b) from repro group by a; > ArrowError(ExternalError(ExecutionError("Unsupported data type Utf8 for > result of aggregate expression"))) > {code} > *Expected Behavior*: > The query runs and produces results > {code} > a, count(b) > 1,2 > 2,3 > {code} > h2. Discussion > Using Min/Max aggregates on varchar also doesn't work (but should): > {code} > > select a, min(b) from repro group by a; > ArrowError(ExternalError(ExecutionError("Unsupported data type Utf8 for > result of aggregate expression"))) > > select a, max(b) from repro group by a; > ArrowError(ExternalError(ExecutionError("Unsupported data type Utf8 for > result of aggregate expression"))) > {code} > Fascinatingly these formulations work fine: > {code} > > select a, count(a) from repro group by a; > +---+--+ > | a | count(a) | > +---+--+ > | 2 | 3| > | 1 | 2| > +---+--+ > 2 row in set. Query took 0 seconds. 
> > select a, count(1) from repro group by a; > +---+-+ > | a | count(UInt8(1)) | > +---+-+ > | 2 | 3 | > | 1 | 2 | > +---+-+ > 2 row in set. Query took 0 seconds. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
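As a reference for the expected semantics, here is a plain-Python model of the failing query (not DataFusion code): COUNT/MIN/MAX over a Utf8 column only needs ordinary string comparison and counting, so there is no inherent reason the aggregate result type cannot be supported.

```python
from collections import defaultdict

rows = [(1, "One"), (1, "Two"), (2, "One"), (2, "Two"), (2, "Two")]  # repro.csv

# select a, count(b), min(b), max(b) from repro group by a
groups = defaultdict(list)
for a, b in rows:
    groups[a].append(b)

count_b = {a: len(bs) for a, bs in groups.items()}
min_b = {a: min(bs) for a, bs in groups.items()}
max_b = {a: max(bs) for a, bs in groups.items()}

assert count_b == {1: 2, 2: 3}           # the expected result from the report
assert min_b == {1: "One", 2: "One"}     # lexicographic min over Utf8
assert max_b == {1: "Two", 2: "Two"}
```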
[jira] [Assigned] (ARROW-9759) [Rust] [DataFusion] Implement DataFrame::sort
[ https://issues.apache.org/jira/browse/ARROW-9759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Leitão reassigned ARROW-9759: --- Assignee: Andy Grove > [Rust] [DataFusion] Implement DataFrame::sort > - > > Key: ARROW-9759 > URL: https://issues.apache.org/jira/browse/ARROW-9759 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Implement DataFrame::sort -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9742) [Rust] Create one standard DataFrame API
[ https://issues.apache.org/jira/browse/ARROW-9742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Leitão reassigned ARROW-9742: --- Assignee: Andy Grove > [Rust] Create one standard DataFrame API > > > Key: ARROW-9742 > URL: https://issues.apache.org/jira/browse/ARROW-9742 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > There was a discussion in last Arrow sync call about the fact that there are > numerous Rust DataFrame projects and it would be good to have one standard, > in the Arrow repo. > I do think it would be good to have a DataFrame trait in Arrow, with an > implementation in DataFusion, and making it possible for other projects to > extend/replace the implementation e.g. for distributed compute, or for GPU > compute, as two examples. > [~jhorstmann] Does this capture what you were suggesting in the call? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10231) [CI] Unable to download minio in arm32v7 docker image
[ https://issues.apache.org/jira/browse/ARROW-10231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Leitão updated ARROW-10231: - Component/s: CI > [CI] Unable to download minio in arm32v7 docker image > - > > Key: ARROW-10231 > URL: https://issues.apache.org/jira/browse/ARROW-10231 > Project: Apache Arrow > Issue Type: Improvement > Components: CI >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > See build log https://github.com/apache/arrow/runs/1224947766#step:5:2021 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9760) [Rust] [DataFusion] Implement DataFrame::explain
[ https://issues.apache.org/jira/browse/ARROW-9760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Leitão reassigned ARROW-9760: --- Assignee: Jorge Leitão > [Rust] [DataFusion] Implement DataFrame::explain > > > Key: ARROW-9760 > URL: https://issues.apache.org/jira/browse/ARROW-9760 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Affects Versions: 2.0.0 >Reporter: Andy Grove >Assignee: Jorge Leitão >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Implement DataFrame::explain - we already have explain implemented in the SQL > API -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9793) [Rust] [DataFusion] Tests failing in master
[ https://issues.apache.org/jira/browse/ARROW-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Leitão reassigned ARROW-9793: --- Assignee: Jorge Leitão > [Rust] [DataFusion] Tests failing in master > --- > > Key: ARROW-9793 > URL: https://issues.apache.org/jira/browse/ARROW-9793 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Jorge Leitão >Assignee: Jorge Leitão >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10245) [CI] Update the conda docker images to use miniforge instead of miniconda
Krisztian Szucs created ARROW-10245: --- Summary: [CI] Update the conda docker images to use miniforge instead of miniconda Key: ARROW-10245 URL: https://issues.apache.org/jira/browse/ARROW-10245 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Krisztian Szucs So we could support more architectures https://github.com/conda-forge/miniforge cc [~uwe] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9518) [Python] Deprecate pyarrow serialization
[ https://issues.apache.org/jira/browse/ARROW-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-9518. Resolution: Fixed Issue resolved by pull request 8255 [https://github.com/apache/arrow/pull/8255] > [Python] Deprecate pyarrow serialization > > > Key: ARROW-9518 > URL: https://issues.apache.org/jira/browse/ARROW-9518 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available, pyarrow-serialization > Fix For: 2.0.0 > > Time Spent: 4h 50m > Remaining Estimate: 0h > > Per mailing list discussion -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10216) [Rust] Simd implementation of min/max aggregation kernels for primitive types
[ https://issues.apache.org/jira/browse/ARROW-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jörn Horstmann reassigned ARROW-10216: -- Assignee: Jörn Horstmann > [Rust] Simd implementation of min/max aggregation kernels for primitive types > - > > Key: ARROW-10216 > URL: https://issues.apache.org/jira/browse/ARROW-10216 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Jörn Horstmann >Assignee: Jörn Horstmann >Priority: Major > > Using a similar approach as the sum kernel (ARROW-10015). Instead of > initializing the accumulator with 0 we'd need the largest/smallest possible > value for each ArrowNumericType (i.e. u64::MAX or +-Inf) > Pseudo code for min aggregation > {code} > // initialize accumulator > min_acc = +Inf > // aggregate each chunk > min_acc = min(min_acc, select(valid, value, +Inf)) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
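The pseudo code above translates directly into a scalar model (plain Python standing in for the Rust SIMD lanes): the accumulator starts at the identity element for min (+Inf), and invalid lanes are substituted with the identity so they cannot win the comparison.

```python
import math

def min_aggregate(values, valid):
    """Model of the proposed min kernel: select(valid, value, +Inf), then fold."""
    min_acc = math.inf                           # identity element for min
    any_valid = False
    for value, is_valid in zip(values, valid):
        any_valid |= is_valid
        lane = value if is_valid else math.inf   # masked lanes can't affect min
        min_acc = min(min_acc, lane)
    return min_acc if any_valid else None        # all-null input has no minimum

assert min_aggregate([3.0, -1.0, 7.0], [True, True, True]) == -1.0
assert min_aggregate([3.0, -1.0, 7.0], [True, False, True]) == 3.0
assert min_aggregate([], []) is None
```

A max kernel is symmetric with -Inf (or the type's minimum, e.g. u64 would use 0 for max and u64::MAX for min) as the identity.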
[jira] [Resolved] (ARROW-9956) Implement Binary string function in Gandiva
[ https://issues.apache.org/jira/browse/ARROW-9956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Praveen Kumar resolved ARROW-9956. -- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8201 [https://github.com/apache/arrow/pull/8201] > Implement Binary string function in Gandiva > --- > > Key: ARROW-9956 > URL: https://issues.apache.org/jira/browse/ARROW-9956 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Gandiva >Reporter: Naman Udasi >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 4h > Remaining Estimate: 0h > > Implementation for new binary_string function in gandiva. > Function take in a normal string or a hexadecimal string( > _Eg:\x41\x20\x42\x20\x43_) and converts it to VARBINARY (byte array). > Is generally used with CAST functions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
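The conversion binary_string performs can be sketched in plain Python (a model, not the Gandiva implementation): scan the input for \xNN escapes and emit the corresponding raw bytes, passing other characters through unchanged.

```python
def binary_string(s):
    """Convert a string with \\xNN escapes (e.g. '\\x41\\x20\\x42') to raw bytes."""
    out = bytearray()
    i = 0
    while i < len(s):
        if s[i] == "\\" and i + 3 < len(s) and s[i + 1] == "x":
            out.append(int(s[i + 2:i + 4], 16))   # two hex digits -> one byte
            i += 4
        else:
            out.append(ord(s[i]))                 # plain characters pass through
            i += 1
    return bytes(out)

assert binary_string(r"\x41\x20\x42\x20\x43") == b"A B C"  # example from the issue
assert binary_string("plain") == b"plain"                  # non-hex input unchanged
```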
[jira] [Updated] (ARROW-10244) [Python][Docs] Add docs on using pyarrow.dataset.parquet_dataset
[ https://issues.apache.org/jira/browse/ARROW-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10244: --- Labels: pull-request-available (was: ) > [Python][Docs] Add docs on using pyarrow.dataset.parquet_dataset > > > Key: ARROW-10244 > URL: https://issues.apache.org/jira/browse/ARROW-10244 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10244) [Python][Docs] Add docs on using pyarrow.dataset.parquet_dataset
[ https://issues.apache.org/jira/browse/ARROW-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-10244: - Assignee: Joris Van den Bossche > [Python][Docs] Add docs on using pyarrow.dataset.parquet_dataset > > > Key: ARROW-10244 > URL: https://issues.apache.org/jira/browse/ARROW-10244 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Fix For: 2.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10244) [Python][Docs] Add docs on using pyarrow.dataset.parquet_dataset
Joris Van den Bossche created ARROW-10244: - Summary: [Python][Docs] Add docs on using pyarrow.dataset.parquet_dataset Key: ARROW-10244 URL: https://issues.apache.org/jira/browse/ARROW-10244 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 2.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10243) [Rust] [Datafusion] Optimize literal expression evaluation
[ https://issues.apache.org/jira/browse/ARROW-10243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jörn Horstmann updated ARROW-10243: --- Description: While benchmarking the tpch query I noticed that the physical literal expression takes up a sizable amount of time. I think the creation of the corresponding array for numeric literals can be speed up by creating Buffer and ArrayData directly without going through a builder. That also allows to skip building a null bitmap for non-null literals. I'm also thinking whether it might be possible to cache the created array. For queries without a WHERE clause, I'd expect all batches except the last to have the same length. I'm not sure though where to store the cached value. Another possible optimization could be to cast literals already on the logical plan side. In the tpch query the literal `1` is of type `u64` in the logical plan and then needs to be processed by a cast kernel to convert to `f64` for usage in an arithmetic expression. The attached flamegraph is of 10 runs of tpch, with the data being loaded into memory before running the queries (See ARROW-10240). {code} flamegraph ./target/release/tpch --iterations 10 --path ../tpch-dbgen --format tbl --query 1 --batch-size 4096 -c1 --load {code} was: While benchmarking the tpch query I noticed that the physical literal expression takes up a sizable amount of time. I think the creation of the corresponding array for numeric literals can be speed up by creating Buffer and ArrayData directly without going through a builder. That also allows to skip building a null bitmap for non-null literals. I'm also thinking whether it might be possible to cache the created array. For queries without a WHERE clause, I'd expect all batches except the last to have the same length. I'm not sure though where to store the cached value. Another possible optimization could be to cast literals already on the logical plan side. 
In the tpch query the literal `1` is of type `u64` in the logical plan and then needs to be processed by a cast kernel to convert to `f64` for usage in an arithmetic expression. The attached flamegraph is of 10 runs of tpch, with the data being loaded into memory before running the queries (See ARROW-10240). > [Rust] [Datafusion] Optimize literal expression evaluation > -- > > Key: ARROW-10243 > URL: https://issues.apache.org/jira/browse/ARROW-10243 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Jörn Horstmann >Priority: Major > Attachments: flamegraph.svg > > > While benchmarking the tpch query I noticed that the physical literal > expression takes up a sizable amount of time. I think the creation of the > corresponding array for numeric literals can be speed up by creating Buffer > and ArrayData directly without going through a builder. That also allows to > skip building a null bitmap for non-null literals. > I'm also thinking whether it might be possible to cache the created array. > For queries without a WHERE clause, I'd expect all batches except the last to > have the same length. I'm not sure though where to store the cached value. > Another possible optimization could be to cast literals already on the > logical plan side. In the tpch query the literal `1` is of type `u64` in the > logical plan and then needs to be processed by a cast kernel to convert to > `f64` for usage in an arithmetic expression. > The attached flamegraph is of 10 runs of tpch, with the data being loaded > into memory before running the queries (See ARROW-10240). > {code} > flamegraph ./target/release/tpch --iterations 10 --path ../tpch-dbgen > --format tbl --query 1 --batch-size 4096 -c1 --load > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10243) [Rust] [Datafusion] Optimize literal expression evaluation
[ https://issues.apache.org/jira/browse/ARROW-10243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jörn Horstmann updated ARROW-10243: --- Description: While benchmarking the tpch query I noticed that the physical literal expression takes up a sizable amount of time. I think the creation of the corresponding array for numeric literals can be speed up by creating Buffer and ArrayData directly without going through a builder. That also allows to skip building a null bitmap for non-null literals. I'm also thinking whether it might be possible to cache the created array. For queries without a WHERE clause, I'd expect all batches except the last to have the same length. I'm not sure though where to store the cached value. Another possible optimization could be to cast literals already on the logical plan side. In the tpch query the literal `1` is of type `u64` in the logical plan and then needs to be processed by a cast kernel to convert to `f64` for usage in an arithmetic expression. The attached flamegraph is of 10 runs of tpch, with the data being loaded into memory before running the queries (See ARROW-10240). was: While benchmarking the tpch query I noticed that the physical literal expression takes up a sizable amount of time. I think the creation of the corresponding array for numeric literals can be speed up by creating Buffer and ArrayData directly without going through a builder. That also allows to skip building a null bitmap for non-null literals. I'm also thinking whether it might be possible to cache the created array. For queries without a WHERE clause, I'd expect all batches except the last to have the same length. I'm not sure though where to store the cached value. Another possible optimization could be to cast literals already on the logical plan side. In the tpch query the literal `1` is of type `u64` in the logical plan and then needs to be processed by a cast kernel to convert to `f64` for usage in an arithmetic expression. 
> [Rust] [Datafusion] Optimize literal expression evaluation > -- > > Key: ARROW-10243 > URL: https://issues.apache.org/jira/browse/ARROW-10243 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Jörn Horstmann >Priority: Major > Attachments: flamegraph.svg > > > While benchmarking the tpch query I noticed that the physical literal > expression takes up a sizable amount of time. I think the creation of the > corresponding array for numeric literals can be speed up by creating Buffer > and ArrayData directly without going through a builder. That also allows to > skip building a null bitmap for non-null literals. > I'm also thinking whether it might be possible to cache the created array. > For queries without a WHERE clause, I'd expect all batches except the last to > have the same length. I'm not sure though where to store the cached value. > Another possible optimization could be to cast literals already on the > logical plan side. In the tpch query the literal `1` is of type `u64` in the > logical plan and then needs to be processed by a cast kernel to convert to > `f64` for usage in an arithmetic expression. > The attached flamegraph is of 10 runs of tpch, with the data being loaded > into memory before running the queries (See ARROW-10240). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10243) [Rust] [Datafusion] Optimize literal expression evaluation
[ https://issues.apache.org/jira/browse/ARROW-10243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jörn Horstmann updated ARROW-10243: --- Attachment: flamegraph.svg > [Rust] [Datafusion] Optimize literal expression evaluation > -- > > Key: ARROW-10243 > URL: https://issues.apache.org/jira/browse/ARROW-10243 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Jörn Horstmann >Priority: Major > Attachments: flamegraph.svg > > > While benchmarking the tpch query I noticed that the physical literal > expression takes up a sizable amount of time. I think the creation of the > corresponding array for numeric literals can be sped up by creating Buffer > and ArrayData directly without going through a builder. That would also allow > skipping the null bitmap for non-null literals. > I'm also wondering whether it might be possible to cache the created array. > For queries without a WHERE clause, I'd expect all batches except the last to > have the same length. I'm not sure, though, where to store the cached value. > Another possible optimization could be to cast literals on the logical plan > side already. In the tpch query the literal `1` is of type `u64` in the > logical plan and then needs to be processed by a cast kernel to convert it to > `f64` for use in an arithmetic expression. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10243) [Rust] [Datafusion] Optimize literal expression evaluation
Jörn Horstmann created ARROW-10243: -- Summary: [Rust] [Datafusion] Optimize literal expression evaluation Key: ARROW-10243 URL: https://issues.apache.org/jira/browse/ARROW-10243 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Jörn Horstmann While benchmarking the tpch query I noticed that the physical literal expression takes up a sizable amount of time. I think the creation of the corresponding array for numeric literals can be sped up by creating Buffer and ArrayData directly without going through a builder. That would also allow skipping the null bitmap for non-null literals. I'm also wondering whether it might be possible to cache the created array. For queries without a WHERE clause, I'd expect all batches except the last to have the same length. I'm not sure, though, where to store the cached value. Another possible optimization could be to cast literals on the logical plan side already. In the tpch query the literal `1` is of type `u64` in the logical plan and then needs to be processed by a cast kernel to convert it to `f64` for use in an arithmetic expression. -- This message was sent by Atlassian Jira (v8.3.4#803005)
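The direct-allocation idea in the description can be sketched in plain Rust. The function names and the simplified `Vec`-based buffers below are illustrative stand-ins, not the actual arrow-rs `Buffer`/`ArrayData` API:

```rust
// Hypothetical sketch: materializing a non-null numeric literal as a values
// buffer. Names like `literal_values_builder` are invented for illustration.

// Builder-style: push one value per row and track validity, even though a
// non-null literal can never contain a null.
fn literal_values_builder(value: f64, len: usize) -> (Vec<f64>, Vec<bool>) {
    let mut values = Vec::with_capacity(len);
    let mut validity = Vec::with_capacity(len);
    for _ in 0..len {
        values.push(value);
        validity.push(true); // redundant bookkeeping for a non-null literal
    }
    (values, validity)
}

// Direct: allocate and fill the buffer in one step and skip the null bitmap
// entirely, which is the optimization the issue proposes.
fn literal_values_direct(value: f64, len: usize) -> Vec<f64> {
    vec![value; len] // single memset-like fill, no per-row branching
}

fn main() {
    let (built, validity) = literal_values_builder(1.0, 4);
    let direct = literal_values_direct(1.0, 4);
    assert_eq!(built, direct);
    assert!(validity.iter().all(|v| *v)); // the bitmap carried no information
    println!("direct buffer: {:?}", direct);
}
```

The same shape also suggests where the proposed caching could hook in: since `literal_values_direct` depends only on the value and the batch length, a batch of the same length as the previous one could reuse the allocation.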
[jira] [Commented] (ARROW-10240) [Rust] [Datafusion] Optionally load tpch data into memory before running benchmark query
[ https://issues.apache.org/jira/browse/ARROW-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210700#comment-17210700 ] Jörn Horstmann commented on ARROW-10240: Hi [~andygrove], I was already working on this and should have assigned the ticket to myself directly. > [Rust] [Datafusion] Optionally load tpch data into memory before running > benchmark query > > > Key: ARROW-10240 > URL: https://issues.apache.org/jira/browse/ARROW-10240 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Jörn Horstmann >Assignee: Jörn Horstmann >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > The tpch benchmark runtime seems to be dominated by CSV parsing code, and it > is really difficult to see any performance hotspots related to actual query > execution in a flamegraph. > With the data in memory and more iterations it should be easier to profile > and find bottlenecks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10240) [Rust] [Datafusion] Optionally load tpch data into memory before running benchmark query
[ https://issues.apache.org/jira/browse/ARROW-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10240: --- Labels: pull-request-available (was: ) > [Rust] [Datafusion] Optionally load tpch data into memory before running > benchmark query > > > Key: ARROW-10240 > URL: https://issues.apache.org/jira/browse/ARROW-10240 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Jörn Horstmann >Assignee: Jörn Horstmann >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The tpch benchmark runtime seems to be dominated by CSV parsing code, and it > is really difficult to see any performance hotspots related to actual query > execution in a flamegraph. > With the data in memory and more iterations it should be easier to profile > and find bottlenecks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9879) [Python] ChunkedArray.__getitem__ doesn't work with numpy scalars
[ https://issues.apache.org/jira/browse/ARROW-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-9879. Resolution: Fixed Issue resolved by pull request 8072 [https://github.com/apache/arrow/pull/8072] > [Python] ChunkedArray.__getitem__ doesn't work with numpy scalars > - > > Key: ARROW-9879 > URL: https://issues.apache.org/jira/browse/ARROW-9879 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.0, 1.0.1 >Reporter: Uwe Korn >Assignee: Uwe Korn >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > > {{import pyarrow as pa > import numpy as np > pa.chunked_array(pa.array([1,2]))[np.int32(0)]}} > fails with error {{TypeError: key must either be a slice or integer}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10240) [Rust] [Datafusion] Optionally load tpch data into memory before running benchmark query
[ https://issues.apache.org/jira/browse/ARROW-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jörn Horstmann reassigned ARROW-10240: -- Assignee: Jörn Horstmann > [Rust] [Datafusion] Optionally load tpch data into memory before running > benchmark query > > > Key: ARROW-10240 > URL: https://issues.apache.org/jira/browse/ARROW-10240 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Jörn Horstmann >Assignee: Jörn Horstmann >Priority: Minor > > The tpch benchmark runtime seems to be dominated by CSV parsing code, and it > is really difficult to see any performance hotspots related to actual query > execution in a flamegraph. > With the data in memory and more iterations it should be easier to profile > and find bottlenecks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
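The parse-once/iterate-many shape proposed for the benchmark can be sketched in plain Rust. The parsing and query functions here are hypothetical stand-ins for the benchmark's CSV reader and the DataFusion query, used only to show how keeping the data in memory moves parsing out of the measured loop:

```rust
// Stand-in for the CSV reader: parse one comma-separated line of integers.
fn parse_csv_line(line: &str) -> Vec<i64> {
    line.split(',').map(|s| s.trim().parse().unwrap()).collect()
}

// Parse the whole source once, up front, into in-memory rows.
fn load_into_memory(raw: &str) -> Vec<Vec<i64>> {
    raw.lines().map(parse_csv_line).collect()
}

// Stand-in for the query under benchmark: sum every value.
fn run_query(rows: &[Vec<i64>]) -> i64 {
    rows.iter().map(|r| r.iter().sum::<i64>()).sum()
}

fn main() {
    let raw = "1,2,3\n4,5,6";
    let rows = load_into_memory(raw); // parsing happens exactly once
    let mut last = 0;
    for _ in 0..10 {
        // only query execution runs per iteration, so a flamegraph of this
        // loop shows query hotspots instead of parsing code
        last = run_query(&rows);
    }
    assert_eq!(last, 21);
}
```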
[jira] [Commented] (ARROW-1614) [C++] Add a Tensor logical value type with constant dimensions, implemented using ExtensionType
[ https://issues.apache.org/jira/browse/ARROW-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210634#comment-17210634 ] Bryan Cutler commented on ARROW-1614: - [~rokm] for our purposes it wasn't necessary to use pyarrow.Tensor, but it currently has some limitations, so there may be some trade-offs. Please go ahead and start if you like, and I'd be happy to help review and discuss further. > [C++] Add a Tensor logical value type with constant dimensions, implemented > using ExtensionType > --- > > Key: ARROW-1614 > URL: https://issues.apache.org/jira/browse/ARROW-1614 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Format >Reporter: Wes McKinney >Priority: Major > > In an Arrow table, we would like to add support for a column whose value > cells each contain a tensor, with all tensors having the same > dimensions. These would be stored as a binary value, plus some metadata to > store type and shape/strides. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10215) [Rust] [DataFusion] Rename "Source" typedef
[ https://issues.apache.org/jira/browse/ARROW-10215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Leitão reassigned ARROW-10215: Assignee: Jorge Leitão > [Rust] [DataFusion] Rename "Source" typedef > --- > > Key: ARROW-10215 > URL: https://issues.apache.org/jira/browse/ARROW-10215 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Jorge Leitão >Priority: Minor > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The name "Source" for this type doesn't make sense to me. I would like to > discuss alternate names for it. > {code:java} > type Source = Box; {code} > My first thoughts are: > * RecordBatchIterator > * RecordBatchStream > * SendableRecordBatchReader -- This message was sent by Atlassian Jira (v8.3.4#803005)
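A minimal sketch of the kind of rename being discussed, using stand-in types rather than DataFusion's actual record batch and error types (the alias body in the quoted issue lost its generic parameters in transit, so the exact original signature is not reproduced here):

```rust
// Stand-in for an Arrow record batch; DataFusion's real type is richer.
struct RecordBatch {
    rows: usize,
}

// A boxed, sendable iterator of fallible record batches. A descriptive name
// such as this (one of the candidates listed in the issue) carries far more
// information than the opaque `Source`.
type SendableRecordBatchReader =
    Box<dyn Iterator<Item = Result<RecordBatch, String>> + Send>;

// Produce a reader over batches of the given row counts.
fn batches(sizes: Vec<usize>) -> SendableRecordBatchReader {
    Box::new(
        sizes
            .into_iter()
            .map(|rows| Ok::<RecordBatch, String>(RecordBatch { rows })),
    )
}

fn main() {
    let reader = batches(vec![2, 3]);
    let total: usize = reader.map(|b| b.unwrap().rows).sum();
    assert_eq!(total, 5);
}
```

Because a type alias is purely a name, the rename itself is mechanical; the value is entirely in readability at the call sites that consume the reader.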