[jira] [Commented] (ARROW-7538) Clarify actual and desired size in AllocationManager
[ https://issues.apache.org/jira/browse/ARROW-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024952#comment-17024952 ]

Igor Yastrebov commented on ARROW-7538:
---------------------------------------

[~lidavidm] [~emkornfi...@gmail.com] is this resolved?

> Clarify actual and desired size in AllocationManager
> ----------------------------------------------------
>
>                 Key: ARROW-7538
>                 URL: https://issues.apache.org/jira/browse/ARROW-7538
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Java
>            Reporter: David Li
>            Assignee: Rong Rong
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> As a follow up to the review of ARROW-7329, we should clarify the different
> sizes (desired vs actual size) in AllocationManager:
> https://github.com/apache/arrow/pull/5973#discussion_r354729754

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (ARROW-7162) [C++] Cleanup warnings in cmake_modules/SetupCxxFlags.cmake
[ https://issues.apache.org/jira/browse/ARROW-7162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979983#comment-16979983 ]

Igor Yastrebov commented on ARROW-7162:
---------------------------------------

[~apitrou] for whatever reason this Jira issue has unresolved resolution

> [C++] Cleanup warnings in cmake_modules/SetupCxxFlags.cmake
> -----------------------------------------------------------
>
>                 Key: ARROW-7162
>                 URL: https://issues.apache.org/jira/browse/ARROW-7162
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: C++, Developer Tools
>            Reporter: Antoine Pitrou
>            Assignee: Antoine Pitrou
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> For clang we currently disable a lot of warnings explicitly. This dates back
> to when we enabled {{-Weverything}}. We should probably remove most or all of
> these flags now.
[jira] [Commented] (ARROW-7163) [Doc] Fix double-and typos
[ https://issues.apache.org/jira/browse/ARROW-7163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979982#comment-16979982 ]

Igor Yastrebov commented on ARROW-7163:
---------------------------------------

[~npr] for whatever reason this Jira issue has unresolved resolution

> [Doc] Fix double-and typos
> --------------------------
>
>                 Key: ARROW-7163
>                 URL: https://issues.apache.org/jira/browse/ARROW-7163
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Documentation
>    Affects Versions: 1.0.0
>            Reporter: Neal Richardson
>            Assignee: Brian Wignall
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
[jira] [Commented] (ARROW-6578) [Python] Casting int64 to string columns
[ https://issues.apache.org/jira/browse/ARROW-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932161#comment-16932161 ]

Igor Yastrebov commented on ARROW-6578:
---------------------------------------

[~pitrou] When I benchmarked it on 16 csv files ~200 MB in size, using read+cast(safe=False) was >10% faster than read with ConvertOptions. This doesn't account for string->int64->string conversions since they aren't implemented :)

> [Python] Casting int64 to string columns
> ----------------------------------------
>
>                 Key: ARROW-6578
>                 URL: https://issues.apache.org/jira/browse/ARROW-6578
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>    Affects Versions: 0.14.1
>            Reporter: Igor Yastrebov
>            Priority: Major
>
> I wanted to cast a list of tables to the same schema so I could use
> concat_tables later. However, I encountered ArrowNotImplementedError:
> {code:java}
> ---------------------------------------------------------------------------
> ArrowNotImplementedError                  Traceback (most recent call last)
>  in 
> ----> 1 list_tb = [i.cast(mts_schema, safe = True) for i in list_tb]
>  in (.0)
> ----> 1 list_tb = [i.cast(mts_schema, safe = True) for i in list_tb]
> ~\AppData\Local\Continuum\miniconda3\envs\cyclone\lib\site-packages\pyarrow\table.pxi in itercolumns()
> ~\AppData\Local\Continuum\miniconda3\envs\cyclone\lib\site-packages\pyarrow\table.pxi in pyarrow.lib.Column.cast()
> ~\AppData\Local\Continuum\miniconda3\envs\cyclone\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
> ArrowNotImplementedError: No cast implemented from int64 to string
> {code}
> Some context: I want to read and concatenate a bunch of csv files that come
> from partitioning of the same table. Using cast after reading csv is usually
> significantly faster than specifying column_types in ConvertOptions. There
> are string columns that are mostly populated with integer-like values, so a
> particular file can have an integer-only column. This situation is rather
> common, so having an option to cast an int64 column to a string column would
> be helpful.
[jira] [Commented] (ARROW-6577) Dependency conflict in conda packages
[ https://issues.apache.org/jira/browse/ARROW-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931376#comment-16931376 ]

Igor Yastrebov commented on ARROW-6577:
---------------------------------------

[~suvayu] I had a problem with a conflict between boost and blas versions, which is probably not related, but the only thing that helped me was to update conda to version 4.7 - there was a significant rework of package resolution.

> Dependency conflict in conda packages
> -------------------------------------
>
>                 Key: ARROW-6577
>                 URL: https://issues.apache.org/jira/browse/ARROW-6577
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Packaging
>    Affects Versions: 0.14.1
>         Environment: kernel: 5.2.11-200.fc30.x86_64
>                      conda 4.6.13
>                      Python 3.7.3
>            Reporter: Suvayu Ali
>            Assignee: Uwe L. Korn
>            Priority: Major
>         Attachments: pa-conda.txt
>
> When I install pyarrow on a fresh environment, the latest version (0.14.1) is
> picked up. But installing certain packages downgrades pyarrow to 0.13.0 or
> 0.12.1. I think a common dependency is causing the downgrade; my guess is
> boost or protobuf. This is based on several instances of this issue I
> encountered over the last few weeks. It took me a while to find a somewhat
> reproducible recipe.
> {code:java}
> $ conda create -n test pyarrow pandas numpy
> ...
> Proceed ([y]/n)? y
> ...
> $ conda install -n test ipython
> ...
> Proceed ([y]/n)? n
>
> CondaSystemExit: Exiting.
> {code}
> I have attached a mildly edited (to remove progress bars and control
> characters) transcript of this session. Here {{ipython}} triggers the
> problem and downgrades {{pyarrow}} to 0.12.1, but I think there are other
> common packages that also conflict in this way. Please let me know if I can
> provide more info.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (ARROW-6577) Dependency conflict in conda packages
[ https://issues.apache.org/jira/browse/ARROW-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931349#comment-16931349 ]

Igor Yastrebov commented on ARROW-6577:
---------------------------------------

Does it downgrade boost to 1.68.0?

> Dependency conflict in conda packages
> -------------------------------------
>
>                 Key: ARROW-6577
>                 URL: https://issues.apache.org/jira/browse/ARROW-6577
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Packaging
>    Affects Versions: 0.14.1
>         Environment: kernel: 5.2.11-200.fc30.x86_64
>                      conda 4.6.13
>                      Python 3.7.3
>            Reporter: Suvayu Ali
>            Priority: Major
>         Attachments: pa-conda.txt
>
> When I install pyarrow on a fresh environment, the latest version (0.14.1) is
> picked up. But installing certain packages downgrades pyarrow to 0.13.0 or
> 0.12.1. I think a common dependency is causing the downgrade; my guess is
> boost or protobuf. This is based on several instances of this issue I
> encountered over the last few weeks. It took me a while to find a somewhat
> reproducible recipe.
> {code:java}
> $ conda create -n test pyarrow pandas numpy
> ...
> Proceed ([y]/n)? y
> ...
> $ conda install -n test ipython
> ...
> Proceed ([y]/n)? n
>
> CondaSystemExit: Exiting.
> {code}
> I have attached a mildly edited (to remove progress bars and control
> characters) transcript of this session. Here {{ipython}} triggers the
> problem and downgrades {{pyarrow}} to 0.12.1, but I think there are other
> common packages that also conflict in this way. Please let me know if I can
> provide more info.
[jira] [Created] (ARROW-6578) [Python] Casting int64 to string columns
Igor Yastrebov created ARROW-6578:
-------------------------------------

             Summary: [Python] Casting int64 to string columns
                 Key: ARROW-6578
                 URL: https://issues.apache.org/jira/browse/ARROW-6578
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 0.14.1
            Reporter: Igor Yastrebov


I wanted to cast a list of tables to the same schema so I could use concat_tables later. However, I encountered ArrowNotImplementedError:
{code:java}
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
 in 
----> 1 list_tb = [i.cast(mts_schema, safe = True) for i in list_tb]
 in (.0)
----> 1 list_tb = [i.cast(mts_schema, safe = True) for i in list_tb]
~\AppData\Local\Continuum\miniconda3\envs\cyclone\lib\site-packages\pyarrow\table.pxi in itercolumns()
~\AppData\Local\Continuum\miniconda3\envs\cyclone\lib\site-packages\pyarrow\table.pxi in pyarrow.lib.Column.cast()
~\AppData\Local\Continuum\miniconda3\envs\cyclone\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
ArrowNotImplementedError: No cast implemented from int64 to string
{code}
Some context: I want to read and concatenate a bunch of csv files that come from partitioning of the same table. Using cast after reading csv is usually significantly faster than specifying column_types in ConvertOptions. There are string columns that are mostly populated with integer-like values, so a particular file can have an integer-only column. This situation is rather common, so having an option to cast an int64 column to a string column would be helpful.
[jira] [Commented] (ARROW-6395) [pyarrow] Bug when using bool arrays with stride greater than 1
[ https://issues.apache.org/jira/browse/ARROW-6395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919377#comment-16919377 ]

Igor Yastrebov commented on ARROW-6395:
---------------------------------------

[~jorisvandenbossche] is this solved by [ARROW-6325|https://issues.apache.org/jira/browse/ARROW-6325]?

> [pyarrow] Bug when using bool arrays with stride greater than 1
> ---------------------------------------------------------------
>
>                 Key: ARROW-6395
>                 URL: https://issues.apache.org/jira/browse/ARROW-6395
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 0.14.0
>            Reporter: Philip Felton
>            Priority: Major
>
> Here's code to reproduce it:
> {code:python}
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> pa.__version__
> '0.14.0'
> >>> xs = np.array([True, False, False, True, True, False, True, True, True,
> >>>                False, False, False, False, False, True, False, True, True,
> >>>                True, True, True])
> >>> xs_sliced = xs[0::2]
> >>> xs_sliced
> array([ True, False,  True,  True,  True, False, False,  True,  True,
>         True,  True])
> >>> pa_xs = pa.array(xs_sliced, pa.bool_())
> >>> pa_xs
> [
>   true,
>   false,
>   false,
>   false,
>   false,
>   false,
>   false,
>   false,
>   false,
>   false,
>   false
> ]
> {code}
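A numpy-only sketch of the situation in the report above: slicing with a step yields a strided view, and at the time a contiguous copy was a workaround before handing the array to pa.array() (the pyarrow call itself is omitted here so the sketch stays independent of the fixed/unfixed pyarrow version):

```python
import numpy as np

xs = np.array([True, False, False, True, True, False, True, True, True,
               False, False, False, False, False, True, False, True, True,
               True, True, True])

# A step-2 slice is a view with a 2-byte stride over the 1-byte bools;
# this strided buffer is what triggered the bug in pyarrow 0.14.0.
xs_sliced = xs[0::2]

# np.ascontiguousarray copies into a stride-1 buffer, which pyarrow
# handled correctly even in the affected versions.
xs_contig = np.ascontiguousarray(xs_sliced)
```

The copy costs memory proportional to the slice, but preserves the values exactly, unlike the corrupted all-false output shown in the report.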
[jira] [Commented] (ARROW-6380) Method pyarrow.parquet.read_table has memory spikes from version 0.14
[ https://issues.apache.org/jira/browse/ARROW-6380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919370#comment-16919370 ]

Igor Yastrebov commented on ARROW-6380:
---------------------------------------

Is it a duplicate of [ARROW-6059|https://issues.apache.org/jira/browse/ARROW-6059]?

> Method pyarrow.parquet.read_table has memory spikes from version 0.14
> ---------------------------------------------------------------------
>
>                 Key: ARROW-6380
>                 URL: https://issues.apache.org/jira/browse/ARROW-6380
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 0.14.0, 0.14.1
>         Environment: ubuntu 18, 16GB ram, 4 cpus
>            Reporter: Renan Alves Fonseca
>            Priority: Major
>             Fix For: 0.13.0
>
> Method pyarrow.parquet.read_table is very slow and causes RAM spikes from
> version 0.14.0.
> Reading a 40MB parquet file takes less than 1 second in versions 0.11, 0.12
> and 0.13, whereas it takes from 6 to 30 seconds in versions 0.14.x.
> This impact on performance is easily measured. However, there is another
> problem that I could only detect on the htop screen. While opening a 40MB
> parquet file, the process occupies almost 16GB for some milliseconds. The
> pyarrow table will result in around 300MB in the python process (measured
> using memory-profiler). This does not happen in versions 0.13 and previous
> ones.
[jira] [Commented] (ARROW-6353) [Python] Allow user to select compression level in pyarrow.parquet.write_table
[ https://issues.apache.org/jira/browse/ARROW-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916694#comment-16916694 ]

Igor Yastrebov commented on ARROW-6353:
---------------------------------------

[~martinradev] You are free to work on it if you want. I'd love to see this feature in 0.15.0 but since I won't do it myself I'm in no position to ask for it. As far as I'm concerned, there are only two levels of priority - blocker and non-blocker - but Jira admins can correct it if it is a problem.

> [Python] Allow user to select compression level in pyarrow.parquet.write_table
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-6353
>                 URL: https://issues.apache.org/jira/browse/ARROW-6353
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Igor Yastrebov
>            Priority: Major
>
> This feature was introduced for C++ in
> [ARROW-6216|https://issues.apache.org/jira/browse/ARROW-6216].
[jira] [Created] (ARROW-6353) [Python] Allow user to select compression level in pyarrow.parquet.write_table
Igor Yastrebov created ARROW-6353:
-------------------------------------

             Summary: [Python] Allow user to select compression level in pyarrow.parquet.write_table
                 Key: ARROW-6353
                 URL: https://issues.apache.org/jira/browse/ARROW-6353
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Igor Yastrebov


This feature was introduced for C++ in [ARROW-6216|https://issues.apache.org/jira/browse/ARROW-6216].
[jira] [Commented] (ARROW-6153) [R] Address parquet deprecation warning
[ https://issues.apache.org/jira/browse/ARROW-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908864#comment-16908864 ]

Igor Yastrebov commented on ARROW-6153:
---------------------------------------

[~npr] is this fixed?

> [R] Address parquet deprecation warning
> ---------------------------------------
>
>                 Key: ARROW-6153
>                 URL: https://issues.apache.org/jira/browse/ARROW-6153
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Assignee: Wes McKinney
>            Priority: Major
>
> [~wesmckinn] has been refactoring the Parquet C++ library and there's now
> this deprecation warning appearing when I build the R package locally:
> {code:java}
> clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -DNDEBUG -I/usr/local/include -DARROW_R_WITH_ARROW -I"/Users/enpiar/R/Rcpp/include" -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -fPIC -Wall -g -O2 -c parquet.cpp -o parquet.o
> parquet.cpp:66:23: warning: 'OpenFile' is deprecated: Deprecated since 0.15.0. Use FileReaderBuilder [-Wdeprecated-declarations]
>   parquet::arrow::OpenFile(file, arrow::default_memory_pool(), *props, ));
>                       ^
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Commented] (ARROW-3762) [C++] Parquet arrow::Table reads error when overflowing capacity of BinaryArray
[ https://issues.apache.org/jira/browse/ARROW-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907129#comment-16907129 ]

Igor Yastrebov commented on ARROW-3762:
---------------------------------------

It seems the issue comes from pandas boolean indexing. CSV file with data (~4 lines, couldn't reduce without losing the error): [Google Drive|https://drive.google.com/file/d/1QMwxl4tgo8W-wOL1ih4nXV2vzRq2NaVW/view?usp=sharing]

Reproduction code:
{code:java}
>>> import pandas as pd
>>> tst = pd.read_csv('test.csv', dtype = {'col1': 'float32', 'col2': 'str'})
>>> tst = tst[~tst.col1.isnull()]
>>> tst.to_parquet('test.parquet', engine = 'pyarrow', index = False)
>>> tst = pd.read_parquet('test.parquet')
{code}
Conda environment reproduction:
{code:java}
conda install python=3.7 pandas=0.25.0 pyarrow=0.14.1 -c conda-forge
{code}

> [C++] Parquet arrow::Table reads error when overflowing capacity of
> BinaryArray
> -------------------------------------------------------------------
>
>                 Key: ARROW-3762
>                 URL: https://issues.apache.org/jira/browse/ARROW-3762
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Chris Ellison
>            Assignee: Benjamin Kietzman
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: 0.14.0, 0.15.0
>
>          Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError
> due to it not creating chunked arrays. Reading each row group individually
> and then concatenating the tables works, however.
>
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
>
> def scenario():
>     t = pa.Table.from_arrays([x], ['x'])
>     writer = pq.ParquetWriter(demo, t.schema)
>     for i in range(2):
>         writer.write_table(t)
>     writer.close()
>     pf = pq.ParquetFile(demo)
>     # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot
>     # contain more than 2147483646 bytes, have 2147483647
>     t2 = pf.read()
>     # Works, but note, there are 32 row groups, not 2 as suggested by:
>     # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
>     tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
>     t3 = pa.concat_tables(tables)
>
> scenario()
> {code}
[jira] [Created] (ARROW-6227) [Python] pyarrow.array() shouldn't coerce np.nan to string
Igor Yastrebov created ARROW-6227:
-------------------------------------

             Summary: [Python] pyarrow.array() shouldn't coerce np.nan to string
                 Key: ARROW-6227
                 URL: https://issues.apache.org/jira/browse/ARROW-6227
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.14.1
            Reporter: Igor Yastrebov


pa.array() by default regards np.nan as a float value and fails on pa.array([np.nan, 'string']). It should also fail on pa.array(['string', np.nan]) instead of coercing it to a null value.
[jira] [Commented] (ARROW-3762) [C++] Parquet arrow::Table reads error when overflowing capacity of BinaryArray
[ https://issues.apache.org/jira/browse/ARROW-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906096#comment-16906096 ]

Igor Yastrebov commented on ARROW-3762:
---------------------------------------

Interestingly enough, converting string columns to category in a pd.DataFrame results in the same error when calling to_parquet, rather than when reading the file afterwards.

> [C++] Parquet arrow::Table reads error when overflowing capacity of
> BinaryArray
> -------------------------------------------------------------------
>
>                 Key: ARROW-3762
>                 URL: https://issues.apache.org/jira/browse/ARROW-3762
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Chris Ellison
>            Assignee: Benjamin Kietzman
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: 0.14.0
>
>          Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError
> due to it not creating chunked arrays. Reading each row group individually
> and then concatenating the tables works, however.
>
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
>
> def scenario():
>     t = pa.Table.from_arrays([x], ['x'])
>     writer = pq.ParquetWriter(demo, t.schema)
>     for i in range(2):
>         writer.write_table(t)
>     writer.close()
>     pf = pq.ParquetFile(demo)
>     # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot
>     # contain more than 2147483646 bytes, have 2147483647
>     t2 = pf.read()
>     # Works, but note, there are 32 row groups, not 2 as suggested by:
>     # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
>     tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
>     t3 = pa.concat_tables(tables)
>
> scenario()
> {code}
[jira] [Commented] (ARROW-5566) [Python] Overhaul type unification from Python sequence in arrow::py::InferArrowType
[ https://issues.apache.org/jira/browse/ARROW-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906052#comment-16906052 ]

Igor Yastrebov commented on ARROW-5566:
---------------------------------------

[~jorisvandenbossche] should it fail on pa.array(['string', np.nan]) then? It seems wrong that conversion is dependent on the order of elements. It doesn't work this way for pa.array(['string', 1.1]) and pa.array([1.1, 'string']).

> [Python] Overhaul type unification from Python sequence in
> arrow::py::InferArrowType
> --------------------------------------------------------------------
>
>                 Key: ARROW-5566
>                 URL: https://issues.apache.org/jira/browse/ARROW-5566
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Priority: Major
>
> I'm working on ARROW-4324 and there's some technical debt lying in
> arrow/python/inference.cc because in the case where NumPy scalars are mixed
> with non-NumPy Python scalar values, all hell breaks loose. In particular,
> the innocuous {{numpy.nan}} is a Python float, not a NumPy float64, so the
> sequence {{[np.float16(1.5), np.nan]}} can be converted incorrectly.
> Part of what's messy is that NumPy dtype unification is split from general
> type unification. This should all be combined together, with the NumPy types
> mapping onto an intermediate value (for unification purposes) that then maps
> ultimately onto an Arrow type
[jira] [Commented] (ARROW-5566) [Python] Overhaul type unification from Python sequence in arrow::py::InferArrowType
[ https://issues.apache.org/jira/browse/ARROW-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905960#comment-16905960 ]

Igor Yastrebov commented on ARROW-5566:
---------------------------------------

[~wesmckinn] I have found another issue of this type: when you pass a list or an np.array of strings which starts with np.nan to pa.array(), it fails to convert because it expects the elements to be floats. Such an np.array is easy to obtain in practice if you use the values method on a pd.Series of strings with nulls.

> [Python] Overhaul type unification from Python sequence in
> arrow::py::InferArrowType
> --------------------------------------------------------------------
>
>                 Key: ARROW-5566
>                 URL: https://issues.apache.org/jira/browse/ARROW-5566
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Priority: Major
>
> I'm working on ARROW-4324 and there's some technical debt lying in
> arrow/python/inference.cc because in the case where NumPy scalars are mixed
> with non-NumPy Python scalar values, all hell breaks loose. In particular,
> the innocuous {{numpy.nan}} is a Python float, not a NumPy float64, so the
> sequence {{[np.float16(1.5), np.nan]}} can be converted incorrectly.
> Part of what's messy is that NumPy dtype unification is split from general
> type unification. This should all be combined together, with the NumPy types
> mapping onto an intermediate value (for unification purposes) that then maps
> ultimately onto an Arrow type
[jira] [Commented] (ARROW-3762) [C++] Parquet arrow::Table reads error when overflowing capacity of BinaryArray
[ https://issues.apache.org/jira/browse/ARROW-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905160#comment-16905160 ]

Igor Yastrebov commented on ARROW-3762:
---------------------------------------

[~wesmckinn] [~bkietz] I am seeing this issue again using arrow 0.14.1

> [C++] Parquet arrow::Table reads error when overflowing capacity of
> BinaryArray
> -------------------------------------------------------------------
>
>                 Key: ARROW-3762
>                 URL: https://issues.apache.org/jira/browse/ARROW-3762
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Chris Ellison
>            Assignee: Benjamin Kietzman
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: 0.14.0
>
>          Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError
> due to it not creating chunked arrays. Reading each row group individually
> and then concatenating the tables works, however.
>
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
>
> def scenario():
>     t = pa.Table.from_arrays([x], ['x'])
>     writer = pq.ParquetWriter(demo, t.schema)
>     for i in range(2):
>         writer.write_table(t)
>     writer.close()
>     pf = pq.ParquetFile(demo)
>     # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot
>     # contain more than 2147483646 bytes, have 2147483647
>     t2 = pf.read()
>     # Works, but note, there are 32 row groups, not 2 as suggested by:
>     # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
>     tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
>     t3 = pa.concat_tables(tables)
>
> scenario()
> {code}
[jira] [Resolved] (ARROW-6173) [Python] error loading csv submodule
[ https://issues.apache.org/jira/browse/ARROW-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Igor Yastrebov resolved ARROW-6173.
-----------------------------------
    Resolution: Not A Problem

> [Python] error loading csv submodule
> ------------------------------------
>
>                 Key: ARROW-6173
>                 URL: https://issues.apache.org/jira/browse/ARROW-6173
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.14.1
>         Environment: Windows 7, conda 4.7.11
>            Reporter: Igor Yastrebov
>            Priority: Major
>              Labels: csv
>
> When I create a new environment in conda:
> {code:java}
> conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
> {code}
> and try to read a csv file:
> {code:java}
> import pyarrow as pa
> pa.csv.read_csv('test.csv'){code}
> it fails with an error:
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AttributeError: module 'pyarrow' has no attribute 'csv'
> {code}
> However, loading it directly works:
> {code:java}
> import pyarrow.csv as pc
> table = pc.read_csv('test.csv')
> {code}
> and using pa.csv.read_csv() after loading it directly also works.
[jira] [Commented] (ARROW-6173) [Python] error loading csv submodule
[ https://issues.apache.org/jira/browse/ARROW-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903072#comment-16903072 ]

Igor Yastrebov commented on ARROW-6173:
---------------------------------------

Oh, I see. It is the default behaviour for all submodules - feather, json, plasma, orc (some of which aren't built for conda) - it makes sense to load only what you need.

> [Python] error loading csv submodule
> ------------------------------------
>
>                 Key: ARROW-6173
>                 URL: https://issues.apache.org/jira/browse/ARROW-6173
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.14.1
>         Environment: Windows 7, conda 4.7.11
>            Reporter: Igor Yastrebov
>            Priority: Major
>              Labels: csv
>
> When I create a new environment in conda:
> {code:java}
> conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
> {code}
> and try to read a csv file:
> {code:java}
> import pyarrow as pa
> pa.csv.read_csv('test.csv'){code}
> it fails with an error:
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AttributeError: module 'pyarrow' has no attribute 'csv'
> {code}
> However, loading it directly works:
> {code:java}
> import pyarrow.csv as pc
> table = pc.read_csv('test.csv')
> {code}
> and using pa.csv.read_csv() after loading it directly also works.
[jira] [Updated] (ARROW-6173) [Python] error loading csv submodule
[ https://issues.apache.org/jira/browse/ARROW-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Igor Yastrebov updated ARROW-6173:
----------------------------------
    Affects Version/s: 0.12.0
                       0.13.0
                       0.14.0

> [Python] error loading csv submodule
> ------------------------------------
>
>                 Key: ARROW-6173
>                 URL: https://issues.apache.org/jira/browse/ARROW-6173
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.14.1
>         Environment: Windows 7, conda 4.7.11
>            Reporter: Igor Yastrebov
>            Priority: Major
>              Labels: csv
>
> When I create a new environment in conda:
> {code:java}
> conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
> {code}
> and try to read a csv file:
> {code:java}
> import pyarrow as pa
> pa.csv.read_csv('test.csv'){code}
> it fails with an error:
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AttributeError: module 'pyarrow' has no attribute 'csv'
> {code}
> However, loading it directly works:
> {code:java}
> import pyarrow.csv as pc
> table = pc.read_csv('test.csv')
> {code}
> and using pa.csv.read_csv() after loading it directly also works.
[jira] [Updated] (ARROW-6173) [Python] error loading csv submodule
[ https://issues.apache.org/jira/browse/ARROW-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Yastrebov updated ARROW-6173: -- Description: When I create a new environment in conda: {code:java} conda create -n pyarrow-test python=3.7 pyarrow=0.14.1 {code} and try to read a csv file: {code:java} import pyarrow as pa pa.csv.read_csv('test.csv'){code} it fails with an error: {code:java} Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: module 'pyarrow' has no attribute 'csv' {code} However, loading it directly works: {code:java} import pyarrow.csv as pc table = pc.read_csv('test.csv') {code} and using pa.csv.read_csv() after loading it directly also works. was: When I create a new environment in conda: {code:java} conda create -n pyarrow-test python=3.7 pyarrow=0.14.1 {code} and try to read a csv file: {code:java} import pyarrow as pa pa.read_csv('test.csv'){code} it fails with an error: {code:java} Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: module 'pyarrow' has no attribute 'csv' {code} However, loading it directly works: {code:java} import pyarrow.csv as pc table = pc.read_csv('test.csv') {code} and using pa.csv.read_csv() after loading it directly also works.
> [Python] error loading csv submodule > > > Key: ARROW-6173 > URL: https://issues.apache.org/jira/browse/ARROW-6173 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 > Environment: Windows 7, conda 4.7.11 >Reporter: Igor Yastrebov >Priority: Major > Labels: csv > > When I create a new environment in conda: > {code:java} > conda create -n pyarrow-test python=3.7 pyarrow=0.14.1 > {code} > and try to read a csv file: > {code:java} > import pyarrow as pa > pa.csv.read_csv('test.csv'){code} > it fails with an error: > {code:java} > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > AttributeError: module 'pyarrow' has no attribute 'csv' > {code} > However, loading it directly works: > {code:java} > import pyarrow.csv as pc > table = pc.read_csv('test.csv') > {code} > and using pa.csv.read_csv() after loading it directly also works. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6173) [Python] error loading csv submodule
[ https://issues.apache.org/jira/browse/ARROW-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Yastrebov updated ARROW-6173: -- Description: When I create a new environment in conda: {code:java} conda create -n pyarrow-test python=3.7 pyarrow=0.14.1 {code} and try to read a csv file: {code:java} import pyarrow as pa pa.read_csv('test.csv'){code} it fails with an error: {code:java} Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: module 'pyarrow' has no attribute 'csv' {code} However, loading it directly works: {code:java} import pyarrow.csv as pc table = pc.read_csv('test.csv') {code} and using pa.csv.read_csv() after loading it directly also works. was: When I create a new environment in conda: {code:java} conda create -n pyarrow-test python=3.7 pyarrow=0.14.1 {code} and try to read a csv file: {code:java} import pyarrow as pa pa.read_csv('test.csv'){code} it fails with an error: {code:java} Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: module 'pyarrow' has no attribute 'csv' {code} However, loading it directly works: {code:java} import pyarrow.csv as pc table = pc.read_csv('test.csv') {code} and using pa.csv.read_csv() after loading it directly also works.
> [Python] error loading csv submodule > > > Key: ARROW-6173 > URL: https://issues.apache.org/jira/browse/ARROW-6173 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 > Environment: Windows 7, conda 4.7.11 >Reporter: Igor Yastrebov >Priority: Major > Labels: csv > > When I create a new environment in conda: > {code:java} > conda create -n pyarrow-test python=3.7 pyarrow=0.14.1 > {code} > and try to read a csv file: > {code:java} > import pyarrow as pa > pa.read_csv('test.csv'){code} > it fails with an error: > {code:java} > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > AttributeError: module 'pyarrow' has no attribute 'csv' > {code} > However, loading it directly works: > {code:java} > import pyarrow.csv as pc > table = pc.read_csv('test.csv') > {code} > and using pa.csv.read_csv() after loading it directly also works. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6173) [Python] error loading csv submodule
Igor Yastrebov created ARROW-6173: - Summary: [Python] error loading csv submodule Key: ARROW-6173 URL: https://issues.apache.org/jira/browse/ARROW-6173 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.14.1 Environment: Windows 7, conda 4.7.11 Reporter: Igor Yastrebov When I create a new environment in conda: {code:java} conda create -n pyarrow-test python=3.7 pyarrow=0.14.1 {code} and try to read a csv file: {code:java} import pyarrow as pa pa.read_csv('test.csv'){code} it fails with an error: {code:java} Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: module 'pyarrow' has no attribute 'csv' {code} However, loading it directly works: {code:java} import pyarrow.csv as pc table = pc.read_csv('test.csv') {code} and using pa.csv.read_csv() after loading it directly also works. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5966) [Python] Capacity error when converting large UTF32 numpy array to arrow array
[ https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887730#comment-16887730 ] Igor Yastrebov commented on ARROW-5966: --- Yes, using pa.array() on a list works and creating np.array(dtype='bytes_') also works. > [Python] Capacity error when converting large UTF32 numpy array to arrow array > -- > > Key: ARROW-5966 > URL: https://issues.apache.org/jira/browse/ARROW-5966 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0, 0.14.0 >Reporter: Igor Yastrebov >Priority: Major > > Trying to create a large string array fails with > ArrowCapacityError: Encoded string length exceeds maximum size (2GB) > instead of creating a chunked array. > > A reproducible example: > {code:java} > import uuid > import numpy as np > import pyarrow as pa > li = [] > for i in range(1): > li.append(uuid.uuid4().hex) > arr = np.array(li) > parr = pa.array(arr) > {code} > Is it a regression or was it never properly fixed: > [https://github.com/apache/arrow/issues/1855]? > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5966) [Python] Capacity error when converting large string numpy array to arrow array
[ https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887197#comment-16887197 ] Igor Yastrebov commented on ARROW-5966: --- I tried your example and it worked but uuid array fails. I have pyarrow 0.14.0 (from conda-forge) > [Python] Capacity error when converting large string numpy array to arrow > array > --- > > Key: ARROW-5966 > URL: https://issues.apache.org/jira/browse/ARROW-5966 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0, 0.14.0 >Reporter: Igor Yastrebov >Priority: Major > > Trying to create a large string array fails with > ArrowCapacityError: Encoded string length exceeds maximum size (2GB) > instead of creating a chunked array. > > A reproducible example: > {code:java} > import uuid > import numpy as np > import pyarrow as pa > li = [] > for i in range(1): > li.append(uuid.uuid4().hex) > arr = np.array(li) > parr = pa.array(arr) > {code} > Is it a regression or was it never properly fixed: > [https://github.com/apache/arrow/issues/1855]? > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5966) [Python] Capacity error when converting large string numpy array to arrow array
[ https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Yastrebov updated ARROW-5966: -- Description: Trying to create a large string array fails with ArrowCapacityError: Encoded string length exceeds maximum size (2GB) instead of creating a chunked array. A reproducible example: {code:java} import uuid import numpy as np import pyarrow as pa li = [] for i in range(1): li.append(uuid.uuid4().hex) arr = np.array(li) parr = pa.array(arr) {code} Is it a regression or was it never properly fixed: [https://github.com/apache/arrow/issues/1855]? was: Trying to create a large string array fails with ArrowCapacityError: Encoded string length exceeds maximum size (2GB) instead of creating a chunked array. A reproducible example: {code:java} import uuid import numpy as np import pyarrow as pa li = [] for i in range(1): li.append(uuid.uuid4().hex) arr = np.array(li) parr = pa.array(arr) {code} Is it a regression or was it never properly fixed: [link title|[https://github.com/apache/arrow/issues/1855]]? > [Python] Capacity error when converting large string numpy array to arrow > array > --- > > Key: ARROW-5966 > URL: https://issues.apache.org/jira/browse/ARROW-5966 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0, 0.14.0 >Reporter: Igor Yastrebov >Priority: Major > > Trying to create a large string array fails with > ArrowCapacityError: Encoded string length exceeds maximum size (2GB) > instead of creating a chunked array. > > A reproducible example: > {code:java} > import uuid > import numpy as np > import pyarrow as pa > li = [] > for i in range(1): > li.append(uuid.uuid4().hex) > arr = np.array(li) > parr = pa.array(arr) > {code} > Is it a regression or was it never properly fixed: > [https://github.com/apache/arrow/issues/1855]? > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5966) [Python] Capacity error when converting large string numpy array to arrow array
[ https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Yastrebov updated ARROW-5966: -- External issue URL: (was: https://github.com/apache/arrow/issues/1855) Description: Trying to create a large string array fails with ArrowCapacityError: Encoded string length exceeds maximum size (2GB) instead of creating a chunked array. A reproducible example: {code:java} import uuid import numpy as np import pyarrow as pa li = [] for i in range(1): li.append(uuid.uuid4().hex) arr = np.array(li) parr = pa.array(arr) {code} Is it a regression or was it never properly fixed: [link title|[https://github.com/apache/arrow/issues/1855]]? was: Trying to create a large string array fails with ArrowCapacityError: Encoded string length exceeds maximum size (2GB) instead of creating a chunked array. A reproducible example: {code:java} import uuid import numpy as np import pyarrow as pa li = [] for i in range(1): li.append(uuid.uuid4().hex) arr = np.array(li) parr = pa.array(arr) {code} Is it a regression or was it never properly fixed? > [Python] Capacity error when converting large string numpy array to arrow > array > --- > > Key: ARROW-5966 > URL: https://issues.apache.org/jira/browse/ARROW-5966 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0, 0.14.0 >Reporter: Igor Yastrebov >Priority: Major > > Trying to create a large string array fails with > ArrowCapacityError: Encoded string length exceeds maximum size (2GB) > instead of creating a chunked array. > > A reproducible example: > {code:java} > import uuid > import numpy as np > import pyarrow as pa > li = [] > for i in range(1): > li.append(uuid.uuid4().hex) > arr = np.array(li) > parr = pa.array(arr) > {code} > Is it a regression or was it never properly fixed: [link > title|[https://github.com/apache/arrow/issues/1855]]? > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
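The chunked array the reporter expects would require splitting the input so that each chunk's total encoded size stays under the 2 GiB limit imposed by the int32 offsets of a single (non-large) Arrow string array. A rough pure-Python sketch of that splitting logic (`chunk_strings` is a hypothetical helper written for illustration, not pyarrow API):

```python
# Split a sequence of strings into batches whose combined UTF-8 size
# stays under the 2 GiB offset limit of one Arrow string array chunk.
MAX_CHUNK_BYTES = 2**31 - 1  # int32 offsets cap a single chunk at ~2 GiB

def chunk_strings(strings, limit=MAX_CHUNK_BYTES):
    """Yield lists of strings, each with total encoded size <= limit."""
    chunk, size = [], 0
    for s in strings:
        n = len(s.encode("utf-8"))
        if chunk and size + n > limit:
            yield chunk          # current chunk would overflow; flush it
            chunk, size = [], 0
        chunk.append(s)
        size += n
    if chunk:
        yield chunk

# Tiny limit so the example runs instantly; the real limit is 2**31 - 1.
chunks = list(chunk_strings(["aa", "bb", "cc", "dd"], limit=5))
print(chunks)  # [['aa', 'bb'], ['cc', 'dd']]
```

Each resulting batch could then be converted to its own Arrow array and combined into a ChunkedArray, which is the behaviour the issue asks `pa.array()` to apply automatically instead of raising ArrowCapacityError.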
[jira] [Created] (ARROW-5413) [C++] CSV reader doesn't remove BOM
Igor Yastrebov created ARROW-5413: - Summary: [C++] CSV reader doesn't remove BOM Key: ARROW-5413 URL: https://issues.apache.org/jira/browse/ARROW-5413 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.13.0 Reporter: Igor Yastrebov If a CSV file starts with a byte-order mark, the CSV reader doesn't strip it but instead prepends it to the first column name. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
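Until the reader strips the BOM itself, one practical workaround on the Python side is to decode the file with the `utf-8-sig` codec, which removes a leading byte-order mark before the data reaches any parser. A small stdlib-only sketch of the symptom and the workaround:

```python
# Write a CSV file that starts with a UTF-8 byte-order mark, then show how
# plain 'utf-8' decoding leaves the BOM glued to the first column name
# while 'utf-8-sig' strips it.
import codecs
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + b"col1,col2\n1,2\n")

raw = open(path, encoding="utf-8").read()        # BOM survives as '\ufeff'
clean = open(path, encoding="utf-8-sig").read()  # BOM stripped

print(repr(raw.split(",")[0]))    # '\ufeffcol1'  <- the bug's symptom
print(repr(clean.split(",")[0]))  # 'col1'

os.remove(path)
```

The cleaned text (or a re-encoded temporary file) can then be handed to whatever CSV reader is in use; the issue itself asks the C++ reader to perform this stripping natively.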