[jira] [Commented] (ARROW-7538) Clarify actual and desired size in AllocationManager

2020-01-28 Thread Igor Yastrebov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024952#comment-17024952
 ] 

Igor Yastrebov commented on ARROW-7538:
---

[~lidavidm] [~emkornfi...@gmail.com] is this resolved?

> Clarify actual and desired size in AllocationManager
> 
>
> Key: ARROW-7538
> URL: https://issues.apache.org/jira/browse/ARROW-7538
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: David Li
>Assignee: Rong Rong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> As a follow up to the review of ARROW-7329, we should clarify the different 
> sizes (desired vs actual size) in AllocationManager: 
> https://github.com/apache/arrow/pull/5973#discussion_r354729754



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7162) [C++] Cleanup warnings in cmake_modules/SetupCxxFlags.cmake

2019-11-22 Thread Igor Yastrebov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979983#comment-16979983
 ] 

Igor Yastrebov commented on ARROW-7162:
---

[~apitrou] for whatever reason this Jira issue still has its resolution set to Unresolved

> [C++] Cleanup warnings in cmake_modules/SetupCxxFlags.cmake
> ---
>
> Key: ARROW-7162
> URL: https://issues.apache.org/jira/browse/ARROW-7162
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Developer Tools
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> For clang we currently disable a lot of warnings explicitly. This dates back 
> to when we enabled {{-Weverything}}. We should probably remove most or all of 
> these flags now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7163) [Doc] Fix double-and typos

2019-11-22 Thread Igor Yastrebov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979982#comment-16979982
 ] 

Igor Yastrebov commented on ARROW-7163:
---

[~npr] for whatever reason this Jira issue still has its resolution set to Unresolved

> [Doc] Fix double-and typos
> --
>
> Key: ARROW-7163
> URL: https://issues.apache.org/jira/browse/ARROW-7163
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.0.0
>Reporter: Neal Richardson
>Assignee: Brian Wignall
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6578) [Python] Casting int64 to string columns

2019-09-18 Thread Igor Yastrebov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932161#comment-16932161
 ] 

Igor Yastrebov commented on ARROW-6578:
---

[~pitrou] When I benchmarked it on 16 csv files ~200 MB in size, using 
read+cast(safe=False) was >10% faster than read with ConvertOptions.

This doesn't account for string->int64->string conversions since they aren't 
implemented :)
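
For reference, a minimal sketch of the two approaches being compared (the file 
name, column name and schema are made up for illustration; ConvertOptions and 
Table.cast are the pyarrow APIs in question):
{code:python}
import pyarrow as pa
import pyarrow.csv as pc

target_schema = pa.schema([('id', pa.string())])

# Approach 1: read with inferred types, then cast the whole table.
# safe=False permits lossy casts; note the int64 -> string cast itself
# was not implemented at the time of this comment.
table = pc.read_csv('part-0.csv').cast(target_schema, safe=False)

# Approach 2: force the column type up front while parsing.
opts = pc.ConvertOptions(column_types={'id': pa.string()})
table2 = pc.read_csv('part-0.csv', convert_options=opts)
{code}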

> [Python] Casting int64 to string columns
> 
>
> Key: ARROW-6578
> URL: https://issues.apache.org/jira/browse/ARROW-6578
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.14.1
>Reporter: Igor Yastrebov
>Priority: Major
>
> I wanted to cast a list of tables to the same schema so I could use 
> concat_tables later. However, I encountered ArrowNotImplementedError:
> {code:java}
> ---------------------------------------------------------------------------
> ArrowNotImplementedError  Traceback (most recent call last)
> <ipython-input-...> in <module>
> ----> 1 list_tb = [i.cast(mts_schema, safe = True) for i in list_tb]
> <ipython-input-...> in <listcomp>(.0)
> ----> 1 list_tb = [i.cast(mts_schema, safe = True) for i in list_tb]
> ~\AppData\Local\Continuum\miniconda3\envs\cyclone\lib\site-packages\pyarrow\table.pxi
>  in itercolumns()
> ~\AppData\Local\Continuum\miniconda3\envs\cyclone\lib\site-packages\pyarrow\table.pxi
>  in pyarrow.lib.Column.cast()
> ~\AppData\Local\Continuum\miniconda3\envs\cyclone\lib\site-packages\pyarrow\error.pxi
>  in pyarrow.lib.check_status()
> ArrowNotImplementedError: No cast implemented from int64 to string
> {code}
> Some context: I want to read and concatenate a bunch of csv files that come 
> from partitioning of the same table. Using cast after reading csv is usually 
> significantly faster than specifying column_types in ConvertOptions. There 
> are string columns that are mostly populated with integer-like values so a 
> particular file can have an integer-only column. This situation is rather 
> common so having an option to cast int64 column to string column would be 
> helpful.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6577) Dependency conflict in conda packages

2019-09-17 Thread Igor Yastrebov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931376#comment-16931376
 ] 

Igor Yastrebov commented on ARROW-6577:
---

[~suvayu] I had a problem with a conflict between boost and blas versions, 
which is probably not related, but the only thing that helped me was updating 
conda to version 4.7 - its package resolution was significantly reworked.

> Dependency conflict in conda packages
> -
>
> Key: ARROW-6577
> URL: https://issues.apache.org/jira/browse/ARROW-6577
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.1
> Environment: kernel: 5.2.11-200.fc30.x86_64
> conda 4.6.13
> Python 3.7.3
>Reporter: Suvayu Ali
>Assignee: Uwe L. Korn
>Priority: Major
> Attachments: pa-conda.txt
>
>
> When I install pyarrow on a fresh environment, the latest version (0.14.1) is 
> picked up. But installing certain packages downgrades pyarrow to 0.13.0 or 
> 0.12.1. I think a common dependency is causing the downgrade; my guess is 
> boost or protobuf. This is based on several instances of this issue I 
> encountered over the last few weeks. It took me a while to find a somewhat 
> reproducible recipe.
> {code:java}
> $ conda create -n test pyarrow pandas numpy
> ...
> Proceed ([y]/n)? y
> ...
> $ conda install -n test ipython
> ...
> Proceed ([y]/n)? n
> CondaSystemExit: Exiting.
> {code}
> I have attached a mildly edited (to remove progress bars, and control 
> characters) transcript of this session. Here {{ipython}} triggers the 
> problem, and downgrades {{pyarrow}} to 0.12.1, but I think there are other 
> common packages who also conflict in this way. Please let me know if I can 
> provide more info.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (ARROW-6577) Dependency conflict in conda packages

2019-09-17 Thread Igor Yastrebov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931349#comment-16931349
 ] 

Igor Yastrebov edited comment on ARROW-6577 at 9/17/19 11:48 AM:
-

Does it downgrade boost to 1.68.0


was (Author: igor yastrebov):
Does it downgrade boost to 1.68.0?

> Dependency conflict in conda packages
> -
>
> Key: ARROW-6577
> URL: https://issues.apache.org/jira/browse/ARROW-6577
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.1
> Environment: kernel: 5.2.11-200.fc30.x86_64
> conda 4.6.13
> Python 3.7.3
>Reporter: Suvayu Ali
>Priority: Major
> Attachments: pa-conda.txt
>
>
> When I install pyarrow on a fresh environment, the latest version (0.14.1) is 
> picked up. But installing certain packages downgrades pyarrow to 0.13.0 or 
> 0.12.1. I think a common dependency is causing the downgrade; my guess is 
> boost or protobuf. This is based on several instances of this issue I 
> encountered over the last few weeks. It took me a while to find a somewhat 
> reproducible recipe.
> {code:java}
> $ conda create -n test pyarrow pandas numpy
> ...
> Proceed ([y]/n)? y
> ...
> $ conda install -n test ipython
> ...
> Proceed ([y]/n)? n
> CondaSystemExit: Exiting.
> {code}
> I have attached a mildly edited (to remove progress bars, and control 
> characters) transcript of this session. Here {{ipython}} triggers the 
> problem, and downgrades {{pyarrow}} to 0.12.1, but I think there are other 
> common packages who also conflict in this way. Please let me know if I can 
> provide more info.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (ARROW-6577) Dependency conflict in conda packages

2019-09-17 Thread Igor Yastrebov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931349#comment-16931349
 ] 

Igor Yastrebov edited comment on ARROW-6577 at 9/17/19 11:49 AM:
-

Does it downgrade boost to 1.68.0?


was (Author: igor yastrebov):
Does it downgrade boost to 1.68.0

> Dependency conflict in conda packages
> -
>
> Key: ARROW-6577
> URL: https://issues.apache.org/jira/browse/ARROW-6577
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.1
> Environment: kernel: 5.2.11-200.fc30.x86_64
> conda 4.6.13
> Python 3.7.3
>Reporter: Suvayu Ali
>Priority: Major
> Attachments: pa-conda.txt
>
>
> When I install pyarrow on a fresh environment, the latest version (0.14.1) is 
> picked up. But installing certain packages downgrades pyarrow to 0.13.0 or 
> 0.12.1. I think a common dependency is causing the downgrade; my guess is 
> boost or protobuf. This is based on several instances of this issue I 
> encountered over the last few weeks. It took me a while to find a somewhat 
> reproducible recipe.
> {code:java}
> $ conda create -n test pyarrow pandas numpy
> ...
> Proceed ([y]/n)? y
> ...
> $ conda install -n test ipython
> ...
> Proceed ([y]/n)? n
> CondaSystemExit: Exiting.
> {code}
> I have attached a mildly edited (to remove progress bars, and control 
> characters) transcript of this session. Here {{ipython}} triggers the 
> problem, and downgrades {{pyarrow}} to 0.12.1, but I think there are other 
> common packages who also conflict in this way. Please let me know if I can 
> provide more info.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6577) Dependency conflict in conda packages

2019-09-17 Thread Igor Yastrebov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931349#comment-16931349
 ] 

Igor Yastrebov commented on ARROW-6577:
---

Does it downgrade boost to 1.68.0?

> Dependency conflict in conda packages
> -
>
> Key: ARROW-6577
> URL: https://issues.apache.org/jira/browse/ARROW-6577
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.1
> Environment: kernel: 5.2.11-200.fc30.x86_64
> conda 4.6.13
> Python 3.7.3
>Reporter: Suvayu Ali
>Priority: Major
> Attachments: pa-conda.txt
>
>
> When I install pyarrow on a fresh environment, the latest version (0.14.1) is 
> picked up. But installing certain packages downgrades pyarrow to 0.13.0 or 
> 0.12.1. I think a common dependency is causing the downgrade; my guess is 
> boost or protobuf. This is based on several instances of this issue I 
> encountered over the last few weeks. It took me a while to find a somewhat 
> reproducible recipe.
> {code:java}
> $ conda create -n test pyarrow pandas numpy
> ...
> Proceed ([y]/n)? y
> ...
> $ conda install -n test ipython
> ...
> Proceed ([y]/n)? n
> CondaSystemExit: Exiting.
> {code}
> I have attached a mildly edited (to remove progress bars, and control 
> characters) transcript of this session. Here {{ipython}} triggers the 
> problem, and downgrades {{pyarrow}} to 0.12.1, but I think there are other 
> common packages who also conflict in this way. Please let me know if I can 
> provide more info.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6578) [Python] Casting int64 to string columns

2019-09-17 Thread Igor Yastrebov (Jira)
Igor Yastrebov created ARROW-6578:
-

 Summary: [Python] Casting int64 to string columns
 Key: ARROW-6578
 URL: https://issues.apache.org/jira/browse/ARROW-6578
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.14.1
Reporter: Igor Yastrebov


I wanted to cast a list of tables to the same schema so I could use 
concat_tables later. However, I encountered ArrowNotImplementedError:
{code:java}
---------------------------------------------------------------------------
ArrowNotImplementedError  Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 list_tb = [i.cast(mts_schema, safe = True) for i in list_tb]

<ipython-input-...> in <listcomp>(.0)
----> 1 list_tb = [i.cast(mts_schema, safe = True) for i in list_tb]

~\AppData\Local\Continuum\miniconda3\envs\cyclone\lib\site-packages\pyarrow\table.pxi
 in itercolumns()

~\AppData\Local\Continuum\miniconda3\envs\cyclone\lib\site-packages\pyarrow\table.pxi
 in pyarrow.lib.Column.cast()

~\AppData\Local\Continuum\miniconda3\envs\cyclone\lib\site-packages\pyarrow\error.pxi
 in pyarrow.lib.check_status()

ArrowNotImplementedError: No cast implemented from int64 to string
{code}
Some context: I want to read and concatenate a bunch of csv files that come 
from partitioning of the same table. Using cast after reading csv is usually 
significantly faster than specifying column_types in ConvertOptions. There are 
string columns that are mostly populated with integer-like values, so a 
particular file can have an integer-only column. This situation is rather 
common, so having an option to cast an int64 column to a string column would 
be helpful.

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6395) [pyarrow] Bug when using bool arrays with stride greater than 1

2019-08-30 Thread Igor Yastrebov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919377#comment-16919377
 ] 

Igor Yastrebov commented on ARROW-6395:
---

[~jorisvandenbossche] is this solved by 
[ARROW-6325|https://issues.apache.org/jira/browse/ARROW-6325]?
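
A possible interim workaround, assuming the bug stems from the strided view 
(np.ascontiguousarray copies the slice into contiguous memory first):
{code:python}
import numpy as np
import pyarrow as pa

xs = np.array([True, False, False, True, True, False, True, True])
xs_sliced = xs[0::2]  # stride of 2, non-contiguous view

# Copy into a contiguous buffer before handing it to pyarrow,
# so the converter never sees the strided view.
pa_xs = pa.array(np.ascontiguousarray(xs_sliced), pa.bool_())
{code}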

> [pyarrow] Bug when using bool arrays with stride greater than 1
> ---
>
> Key: ARROW-6395
> URL: https://issues.apache.org/jira/browse/ARROW-6395
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.14.0
>Reporter: Philip Felton
>Priority: Major
>
> Here's code to reproduce it:
> {code:python}
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> pa.__version__
> '0.14.0'
> >>> xs = np.array([True, False, False, True, True, False, True, True, True, 
> >>> False, False, False, False, False, True, False, True, True, True, True, 
> >>> True])
> >>> xs_sliced = xs[0::2]
> >>> xs_sliced
> array([ True, False, True, True, True, False, False, True, True,
>  True, True])
> >>> pa_xs = pa.array(xs_sliced, pa.bool_())
> >>> pa_xs
> 
> [
>  true,
>  false,
>  false,
>  false,
>  false,
>  false,
>  false,
>  false,
>  false,
>  false,
>  false
> ]{code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6380) Method pyarrow.parquet.read_table has memory spikes from version 0.14

2019-08-30 Thread Igor Yastrebov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919370#comment-16919370
 ] 

Igor Yastrebov commented on ARROW-6380:
---

Is it a duplicate of 
[ARROW-6059|https://issues.apache.org/jira/browse/ARROW-6059]?

> Method pyarrow.parquet.read_table has memory spikes from version 0.14
> -
>
> Key: ARROW-6380
> URL: https://issues.apache.org/jira/browse/ARROW-6380
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.0, 0.14.1
> Environment: ubuntu 18, 16GB ram, 4 cpus
>Reporter: Renan Alves Fonseca
>Priority: Major
> Fix For: 0.13.0
>
>
> Method pyarrow.parquet.read_table is very slow and causes RAM spikes from 
> version 0.14.0.
> Reading a 40MB parquet file takes less than 1 second in versions 0.11, 0.12 
> and 0.13, whereas it takes from 6 to 30 seconds in versions 0.14.x.
> This performance impact is easily measured. However, there is another 
> problem that I could only detect on the htop screen. While opening a 40MB 
> parquet file, the process occupies almost 16GB for a few milliseconds. The 
> pyarrow table ends up around 300MB in the python process (measured with 
> memory-profiler). This does not happen in version 0.13 and earlier.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6353) [Python] Allow user to select compression level in pyarrow.parquet.write_table

2019-08-27 Thread Igor Yastrebov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916694#comment-16916694
 ] 

Igor Yastrebov commented on ARROW-6353:
---

[~martinradev]

You are free to work on it if you want. I'd love to see this feature in 0.15.0, 
but since I won't do it myself, I'm in no position to ask for it.

As far as I'm concerned, there are only two levels of priority - blocker and 
non-blocker - but Jira admins can correct it if it is a problem.

> [Python] Allow user to select compression level in pyarrow.parquet.write_table
> --
>
> Key: ARROW-6353
> URL: https://issues.apache.org/jira/browse/ARROW-6353
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Igor Yastrebov
>Priority: Major
>
> This feature was introduced for C++ in 
> [ARROW-6216|https://issues.apache.org/jira/browse/ARROW-6216].



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6353) [Python] Allow user to select compression level in pyarrow.parquet.write_table

2019-08-26 Thread Igor Yastrebov (Jira)
Igor Yastrebov created ARROW-6353:
-

 Summary: [Python] Allow user to select compression level in 
pyarrow.parquet.write_table
 Key: ARROW-6353
 URL: https://issues.apache.org/jira/browse/ARROW-6353
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Igor Yastrebov


This feature was introduced for C++ in 
[ARROW-6216|https://issues.apache.org/jira/browse/ARROW-6216].
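
A sketch of how the requested option might look on the Python side; the 
compression_level keyword mirrors the C++ option from ARROW-6216 and is an 
assumption here, not a shipped API:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pydict({'x': [1, 2, 3]})

# Hypothetical keyword: pass the codec level through to the C++ writer.
pq.write_table(table, 'data.parquet', compression='zstd', compression_level=10)
{code}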



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6153) [R] Address parquet deprecation warning

2019-08-16 Thread Igor Yastrebov (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908864#comment-16908864
 ] 

Igor Yastrebov commented on ARROW-6153:
---

[~npr] is this fixed?

> [R] Address parquet deprecation warning
> ---
>
> Key: ARROW-6153
> URL: https://issues.apache.org/jira/browse/ARROW-6153
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Wes McKinney
>Priority: Major
>
> [~wesmckinn] has been refactoring the Parquet C++ library and there's now 
> this deprecation warning appearing when I build the R package locally: 
> {code:java}
> clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" 
> -DNDEBUG -DNDEBUG -I/usr/local/include -DARROW_R_WITH_ARROW 
> -I"/Users/enpiar/R/Rcpp/include" -isysroot 
> /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include 
> -fPIC -Wall -g -O2 -c parquet.cpp -o parquet.o
> parquet.cpp:66:23: warning: 'OpenFile' is deprecated: Deprecated since 
> 0.15.0. Use FileReaderBuilder [-Wdeprecated-declarations]
>       parquet::arrow::OpenFile(file, arrow::default_memory_pool(), *props, &reader));
>                       ^
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-3762) [C++] Parquet arrow::Table reads error when overflowing capacity of BinaryArray

2019-08-14 Thread Igor Yastrebov (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907129#comment-16907129
 ] 

Igor Yastrebov commented on ARROW-3762:
---

It seems the issue comes from pandas boolean indexing. CSV file with data 
(~4 lines, couldn't reduce it further without losing the error): [Google 
Drive|https://drive.google.com/file/d/1QMwxl4tgo8W-wOL1ih4nXV2vzRq2NaVW/view?usp=sharing]

Reproduction code:
{code:java}
>>> import pandas as pd
>>> tst = pd.read_csv('test.csv', dtype = {'col1': 'float32', 'col2': 'str'})
>>> tst = tst[~tst.col1.isnull()]
>>> tst.to_parquet('test.parquet', engine = 'pyarrow', index = False)
>>> tst = pd.read_parquet('test.parquet')
{code}
Conda environment reproduction:
{code:java}
conda install python=3.7 pandas=0.25.0 pyarrow=0.14.1 -c conda-forge
{code}

> [C++] Parquet arrow::Table reads error when overflowing capacity of 
> BinaryArray
> ---
>
> Key: ARROW-3762
> URL: https://issues.apache.org/jira/browse/ARROW-3762
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Chris Ellison
>Assignee: Benjamin Kietzman
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0, 0.15.0
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError 
> due to it not creating chunked arrays. Reading each row group individually 
> and then concatenating the tables works, however.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
> def scenario():
> t = pa.Table.from_arrays([x], ['x'])
> writer = pq.ParquetWriter(demo, t.schema)
> for i in range(2):
> writer.write_table(t)
> writer.close()
> pf = pq.ParquetFile(demo)
> # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot 
> contain more than 2147483646 bytes, have 2147483647
> t2 = pf.read()
> # Works, but note, there are 32 row groups, not 2 as suggested by:
> # 
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
> t3 = pa.concat_tables(tables)
> scenario()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6227) [Python] pyarrow.array() shouldn't coerce np.nan to string

2019-08-13 Thread Igor Yastrebov (JIRA)
Igor Yastrebov created ARROW-6227:
-

 Summary: [Python] pyarrow.array() shouldn't coerce np.nan to string
 Key: ARROW-6227
 URL: https://issues.apache.org/jira/browse/ARROW-6227
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.1
Reporter: Igor Yastrebov


pa.array() by default regards np.nan as float value and fails on 
pa.array([np.nan, 'string']). It should also fail on pa.array(['string', 
np.nan]) instead of coercing it to null value.
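
A minimal illustration of the asymmetry described above (expected outcomes as 
reported here, not verified against every version):
{code:python}
import numpy as np
import pyarrow as pa

pa.array([np.nan, 'string'])  # fails: nan fixes the type as float first
pa.array(['string', np.nan])  # succeeds today, silently coercing nan to null
{code}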



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-3762) [C++] Parquet arrow::Table reads error when overflowing capacity of BinaryArray

2019-08-13 Thread Igor Yastrebov (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906096#comment-16906096
 ] 

Igor Yastrebov commented on ARROW-3762:
---

Interestingly enough, converting string columns to category in a pd.DataFrame 
triggers the same error from to_parquet itself, rather than when reading the 
file back afterwards.

> [C++] Parquet arrow::Table reads error when overflowing capacity of 
> BinaryArray
> ---
>
> Key: ARROW-3762
> URL: https://issues.apache.org/jira/browse/ARROW-3762
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Chris Ellison
>Assignee: Benjamin Kietzman
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError 
> due to it not creating chunked arrays. Reading each row group individually 
> and then concatenating the tables works, however.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
> def scenario():
> t = pa.Table.from_arrays([x], ['x'])
> writer = pq.ParquetWriter(demo, t.schema)
> for i in range(2):
> writer.write_table(t)
> writer.close()
> pf = pq.ParquetFile(demo)
> # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot 
> contain more than 2147483646 bytes, have 2147483647
> t2 = pf.read()
> # Works, but note, there are 32 row groups, not 2 as suggested by:
> # 
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
> t3 = pa.concat_tables(tables)
> scenario()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5566) [Python] Overhaul type unification from Python sequence in arrow::py::InferArrowType

2019-08-13 Thread Igor Yastrebov (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906052#comment-16906052
 ] 

Igor Yastrebov commented on ARROW-5566:
---

[~jorisvandenbossche] should it fail on pa.array(['string', np.nan]) then? It 
seems wrong that the conversion depends on the order of the elements. It 
doesn't work this way for pa.array(['string', 1.1]) and pa.array([1.1, 'string']).

> [Python] Overhaul type unification from Python sequence in 
> arrow::py::InferArrowType
> 
>
> Key: ARROW-5566
> URL: https://issues.apache.org/jira/browse/ARROW-5566
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> I'm working on ARROW-4324 and there's some technical debt lying in 
> arrow/python/inference.cc because the case where NumPy scalars are mixed with 
> non-NumPy Python scalar values, all hell breaks loose. In particular, the 
> innocuous {{numpy.nan}} is a Python float, not a NumPy float64, so the 
> sequence {{[np.float16(1.5), np.nan]}} can be converted incorrectly. 
> Part of what's messy is that NumPy dtype unification is split from general 
> type unification. This should all be combined together with the NumPy types 
> mapping onto an intermediate value (for unification purposes) that then maps 
> ultimately onto an Arrow type



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5566) [Python] Overhaul type unification from Python sequence in arrow::py::InferArrowType

2019-08-13 Thread Igor Yastrebov (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905960#comment-16905960
 ] 

Igor Yastrebov commented on ARROW-5566:
---

[~wesmckinn] I have found another issue of this type: when you pass a list or 
an np.array of strings that starts with np.nan to pa.array(), it fails to 
convert because it expects the elements to be floats. Such an np.array is easy 
to obtain in practice if you use the values method on a pd.Series of strings 
with nulls.
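
A short sketch of the scenario described (behaviour as reported above): pandas 
stores missing strings as float nan in object arrays, so a leading null derails 
type inference:
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa

s = pd.Series([np.nan, 'a', 'b'])  # strings with a leading null
arr = s.values                     # object ndarray: [nan, 'a', 'b']

pa.array(arr)  # fails: the leading nan makes inference expect floats
{code}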

> [Python] Overhaul type unification from Python sequence in 
> arrow::py::InferArrowType
> 
>
> Key: ARROW-5566
> URL: https://issues.apache.org/jira/browse/ARROW-5566
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> I'm working on ARROW-4324 and there's some technical debt lying in 
> arrow/python/inference.cc because the case where NumPy scalars are mixed with 
> non-NumPy Python scalar values, all hell breaks loose. In particular, the 
> innocuous {{numpy.nan}} is a Python float, not a NumPy float64, so the 
> sequence {{[np.float16(1.5), np.nan]}} can be converted incorrectly. 
> Part of what's messy is that NumPy dtype unification is split from general 
> type unification. This should all be combined together with the NumPy types 
> mapping onto an intermediate value (for unification purposes) that then maps 
> ultimately onto an Arrow type



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-3762) [C++] Parquet arrow::Table reads error when overflowing capacity of BinaryArray

2019-08-12 Thread Igor Yastrebov (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905160#comment-16905160
 ] 

Igor Yastrebov commented on ARROW-3762:
---

[~wesmckinn] [~bkietz] I am seeing this issue again using arrow 0.14.1

> [C++] Parquet arrow::Table reads error when overflowing capacity of 
> BinaryArray
> ---
>
> Key: ARROW-3762
> URL: https://issues.apache.org/jira/browse/ARROW-3762
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Chris Ellison
>Assignee: Benjamin Kietzman
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError 
> due to it not creating chunked arrays. Reading each row group individually 
> and then concatenating the tables works, however.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
> def scenario():
> t = pa.Table.from_arrays([x], ['x'])
> writer = pq.ParquetWriter(demo, t.schema)
> for i in range(2):
> writer.write_table(t)
> writer.close()
> pf = pq.ParquetFile(demo)
> # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot 
> contain more than 2147483646 bytes, have 2147483647
> t2 = pf.read()
> # Works, but note, there are 32 row groups, not 2 as suggested by:
> # 
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
> t3 = pa.concat_tables(tables)
> scenario()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (ARROW-6173) [Python] error loading csv submodule

2019-08-08 Thread Igor Yastrebov (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Yastrebov resolved ARROW-6173.
---
Resolution: Not A Problem

> [Python] error loading csv submodule
> 
>
> Key: ARROW-6173
> URL: https://issues.apache.org/jira/browse/ARROW-6173
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.14.1
> Environment: Windows 7, conda 4.7.11
>Reporter: Igor Yastrebov
>Priority: Major
>  Labels: csv
>
> When I create a new environment in conda:
> {code:java}
> conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
> {code}
> and try to read a csv file:
> {code:java}
> import pyarrow as pa
> pa.csv.read_csv('test.csv'){code}
> it fails with an error:
> {code:java}
> Traceback (most recent call last):
> File "", line 1, in 
> AttributeError: module 'pyarrow' has no attribute 'csv'
> {code}
> However, loading it directly works:
> {code:java}
> import pyarrow.csv as pc
> table = pc.read_csv('test.csv')
> {code}
> and using pa.csv.read_csv() after loading it directly also works.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6173) [Python] error loading csv submodule

2019-08-08 Thread Igor Yastrebov (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903072#comment-16903072
 ] 

Igor Yastrebov commented on ARROW-6173:
---

Oh, I see. It is the default behaviour for all submodules - feather, json, 
plasma, orc (some of which aren't built for conda) - it makes sense to load 
only what you need.
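
In other words, the working pattern is to import the submodule explicitly once; 
after that the attribute access works too:
{code:python}
import pyarrow as pa
import pyarrow.csv  # explicit import attaches the csv submodule

table = pa.csv.read_csv('test.csv')  # now resolves
{code}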

> [Python] error loading csv submodule
> 
>
> Key: ARROW-6173
> URL: https://issues.apache.org/jira/browse/ARROW-6173
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.14.1
> Environment: Windows 7, conda 4.7.11
>Reporter: Igor Yastrebov
>Priority: Major
>  Labels: csv
>
> When I create a new environment in conda:
> {code:java}
> conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
> {code}
> and try to read a csv file:
> {code:java}
> import pyarrow as pa
> pa.csv.read_csv('test.csv'){code}
> it fails with an error:
> {code:java}
> Traceback (most recent call last):
> File "", line 1, in 
> AttributeError: module 'pyarrow' has no attribute 'csv'
> {code}
> However, loading it directly works:
> {code:java}
> import pyarrow.csv as pc
> table = pc.read_csv('test.csv')
> {code}
> and using pa.csv.read_csv() after loading it directly also works.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6173) [Python] error loading csv submodule

2019-08-08 Thread Igor Yastrebov (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Yastrebov updated ARROW-6173:
--
Affects Version/s: 0.12.0
   0.13.0
   0.14.0

> [Python] error loading csv submodule
> 
>
> Key: ARROW-6173
> URL: https://issues.apache.org/jira/browse/ARROW-6173
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.14.1
> Environment: Windows 7, conda 4.7.11
>Reporter: Igor Yastrebov
>Priority: Major
>  Labels: csv
>
> When I create a new environment in conda:
> {code:java}
> conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
> {code}
> and try to read a csv file:
> {code:java}
> import pyarrow as pa
> pa.csv.read_csv('test.csv'){code}
> it fails with an error:
> {code:java}
> Traceback (most recent call last):
> File "", line 1, in 
> AttributeError: module 'pyarrow' has no attribute 'csv'
> {code}
> However, loading it directly works:
> {code:java}
> import pyarrow.csv as pc
> table = pc.read_csv('test.csv')
> {code}
> and using pa.csv.read_csv() after loading it directly also works.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6173) [Python] error loading csv submodule

2019-08-08 Thread Igor Yastrebov (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Yastrebov updated ARROW-6173:
--
Description: 
When I create a new environment in conda:
{code:java}
conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
{code}
and try to read a csv file:
{code:java}
import pyarrow as pa
pa.csv.read_csv('test.csv'){code}
it fails with an error:
{code:java}
Traceback (most recent call last):
File "", line 1, in 
AttributeError: module 'pyarrow' has no attribute 'csv'
{code}
However, loading it directly works:
{code:java}
import pyarrow.csv as pc
table = pc.read_csv('test.csv')
{code}
and using pa.csv.read_csv() after loading it directly also works.

  was:
When I create a new environment in conda:
{code:java}
conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
{code}
and try to read a csv file:
{code:java}
import pyarrow as pa
pa.read_csv('test.csv'){code}
it fails with an error:
{code:java}
Traceback (most recent call last):
File "", line 1, in 
AttributeError: module 'pyarrow' has no attribute 'csv'
{code}
However, loading it directly works:
{code:java}
import pyarrow.csv as pc
table = pc.read_csv('test.csv')
{code}
and using pa.csv.read_csv() after loading it directly also works.


> [Python] error loading csv submodule
> 
>
> Key: ARROW-6173
> URL: https://issues.apache.org/jira/browse/ARROW-6173
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
> Environment: Windows 7, conda 4.7.11
>Reporter: Igor Yastrebov
>Priority: Major
>  Labels: csv
>
> When I create a new environment in conda:
> {code:java}
> conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
> {code}
> and try to read a csv file:
> {code:java}
> import pyarrow as pa
> pa.csv.read_csv('test.csv'){code}
> it fails with an error:
> {code:java}
> Traceback (most recent call last):
> File "", line 1, in 
> AttributeError: module 'pyarrow' has no attribute 'csv'
> {code}
> However, loading it directly works:
> {code:java}
> import pyarrow.csv as pc
> table = pc.read_csv('test.csv')
> {code}
> and using pa.csv.read_csv() after loading it directly also works.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-6173) [Python] error loading csv submodule

2019-08-08 Thread Igor Yastrebov (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Yastrebov updated ARROW-6173:
--
Description: 
When I create a new environment in conda:
{code:java}
conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
{code}
and try to read a csv file:
{code:java}
import pyarrow as pa
pa.read_csv('test.csv'){code}
it fails with an error:
{code:java}
Traceback (most recent call last):
File "", line 1, in 
AttributeError: module 'pyarrow' has no attribute 'csv'
{code}
However, loading it directly works:
{code:java}
import pyarrow.csv as pc
table = pc.read_csv('test.csv')
{code}
and using pa.csv.read_csv() after loading it directly also works.

  was:
When I create a new environment in conda:

 
{code:java}
conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
{code}
 

and try to read a csv file:

 
{code:java}
import pyarrow as pa
pa.read_csv('test.csv'){code}
it fails with an error:

 

 
{code:java}
Traceback (most recent call last):
File "", line 1, in 
AttributeError: module 'pyarrow' has no attribute 'csv'
{code}
However, loading it directly works:

 

 
{code:java}
import pyarrow.csv as pc
table = pc.read_csv('test.csv')
{code}
 

and using pa.csv.read_csv() after loading it directly also works.


> [Python] error loading csv submodule
> 
>
> Key: ARROW-6173
> URL: https://issues.apache.org/jira/browse/ARROW-6173
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
> Environment: Windows 7, conda 4.7.11
>Reporter: Igor Yastrebov
>Priority: Major
>  Labels: csv
>
> When I create a new environment in conda:
> {code:java}
> conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
> {code}
> and try to read a csv file:
> {code:java}
> import pyarrow as pa
> pa.read_csv('test.csv'){code}
> it fails with an error:
> {code:java}
> Traceback (most recent call last):
> File "", line 1, in 
> AttributeError: module 'pyarrow' has no attribute 'csv'
> {code}
> However, loading it directly works:
> {code:java}
> import pyarrow.csv as pc
> table = pc.read_csv('test.csv')
> {code}
> and using pa.csv.read_csv() after loading it directly also works.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6173) [Python] error loading csv submodule

2019-08-08 Thread Igor Yastrebov (JIRA)
Igor Yastrebov created ARROW-6173:
-

 Summary: [Python] error loading csv submodule
 Key: ARROW-6173
 URL: https://issues.apache.org/jira/browse/ARROW-6173
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.1
 Environment: Windows 7, conda 4.7.11
Reporter: Igor Yastrebov


When I create a new environment in conda:
{code:java}
conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
{code}
and try to read a csv file:
{code:java}
import pyarrow as pa
pa.read_csv('test.csv'){code}
it fails with an error:
{code:java}
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'pyarrow' has no attribute 'csv'
{code}
However, loading it directly works:
{code:java}
import pyarrow.csv as pc
table = pc.read_csv('test.csv')
{code}
and using pa.csv.read_csv() after loading it directly also works.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5966) [Python] Capacity error when converting large UTF32 numpy array to arrow array

2019-07-18 Thread Igor Yastrebov (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887730#comment-16887730
 ] 

Igor Yastrebov commented on ARROW-5966:
---

Yes, using pa.array() on a list works, and creating the np.array with 
dtype='bytes_' also works.
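
For reference, a sketch of the two workarounds mentioned (sizes are kept small 
here; the original failure needs over 2GB of encoded strings):
{code:python}
import uuid
import numpy as np
import pyarrow as pa

li = [uuid.uuid4().hex for _ in range(1000)]

parr = pa.array(li)                  # converting from a Python list works

arr = np.array(li, dtype=np.bytes_)  # bytes dtype instead of UTF-32
parr2 = pa.array(arr)                # yields a binary array, also works
{code}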

> [Python] Capacity error when converting large UTF32 numpy array to arrow array
> --
>
> Key: ARROW-5966
> URL: https://issues.apache.org/jira/browse/ARROW-5966
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Igor Yastrebov
>Priority: Major
>
> Trying to create a large string array fails with 
> ArrowCapacityError: Encoded string length exceeds maximum size (2GB)
> instead of creating a chunked array.
>  
> A reproducible example:
> {code:java}
> import uuid
> import numpy as np
> import pyarrow as pa
> li = []
> for i in range(1):
> li.append(uuid.uuid4().hex)
> arr = np.array(li)
> parr = pa.array(arr)
> {code}
> Is it a regression or was it never properly fixed: 
> [https://github.com/apache/arrow/issues/1855]?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5966) [Python] Capacity error when converting large string numpy array to arrow array

2019-07-17 Thread Igor Yastrebov (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887197#comment-16887197
 ] 

Igor Yastrebov commented on ARROW-5966:
---

I tried your example and it worked, but the uuid array fails. I have pyarrow 
0.14.0 (from conda-forge).

> [Python] Capacity error when converting large string numpy array to arrow 
> array
> ---
>
> Key: ARROW-5966
> URL: https://issues.apache.org/jira/browse/ARROW-5966
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Igor Yastrebov
>Priority: Major
>
> Trying to create a large string array fails with 
> ArrowCapacityError: Encoded string length exceeds maximum size (2GB)
> instead of creating a chunked array.
>  
> A reproducible example:
> {code:java}
> import uuid
> import numpy as np
> import pyarrow as pa
> li = []
> for i in range(1):
> li.append(uuid.uuid4().hex)
> arr = np.array(li)
> parr = pa.array(arr)
> {code}
> Is it a regression or was it never properly fixed: 
> [https://github.com/apache/arrow/issues/1855]?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5966) [Python] Capacity error when converting large string numpy array to arrow array

2019-07-17 Thread Igor Yastrebov (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Yastrebov updated ARROW-5966:
--
Description: 
Trying to create a large string array fails with 

ArrowCapacityError: Encoded string length exceeds maximum size (2GB)

instead of creating a chunked array.

 

A reproducible example:
{code:java}
import uuid
import numpy as np
import pyarrow as pa

li = []
for i in range(1):
li.append(uuid.uuid4().hex)
arr = np.array(li)
parr = pa.array(arr)
{code}
Is it a regression or was it never properly fixed: 
[https://github.com/apache/arrow/issues/1855]?

 

 

  was:
Trying to create a large string array fails with 

ArrowCapacityError: Encoded string length exceeds maximum size (2GB)

instead of creating a chunked array.

 

A reproducible example:
{code:java}
import uuid
import numpy as np
import pyarrow as pa

li = []
for i in range(1):
li.append(uuid.uuid4().hex)
arr = np.array(li)
parr = pa.array(arr)
{code}
Is it a regression or was it never properly fixed: [link 
title|[https://github.com/apache/arrow/issues/1855]]?

 

 


> [Python] Capacity error when converting large string numpy array to arrow 
> array
> ---
>
> Key: ARROW-5966
> URL: https://issues.apache.org/jira/browse/ARROW-5966
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Igor Yastrebov
>Priority: Major
>
> Trying to create a large string array fails with 
> ArrowCapacityError: Encoded string length exceeds maximum size (2GB)
> instead of creating a chunked array.
>  
> A reproducible example:
> {code:java}
> import uuid
> import numpy as np
> import pyarrow as pa
> li = []
> for i in range(1):
> li.append(uuid.uuid4().hex)
> arr = np.array(li)
> parr = pa.array(arr)
> {code}
> Is it a regression or was it never properly fixed: 
> [https://github.com/apache/arrow/issues/1855]?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (ARROW-5966) [Python] Capacity error when converting large string numpy array to arrow array

2019-07-17 Thread Igor Yastrebov (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Yastrebov updated ARROW-5966:
--
External issue URL:   (was: https://github.com/apache/arrow/issues/1855)
   Description: 
Trying to create a large string array fails with 

ArrowCapacityError: Encoded string length exceeds maximum size (2GB)

instead of creating a chunked array.

 

A reproducible example:
{code:java}
import uuid
import numpy as np
import pyarrow as pa

li = []
for i in range(1):
li.append(uuid.uuid4().hex)
arr = np.array(li)
parr = pa.array(arr)
{code}
Is it a regression or was it never properly fixed: [link 
title|[https://github.com/apache/arrow/issues/1855]]?

 

 

  was:
Trying to create a large string array fails with 

ArrowCapacityError: Encoded string length exceeds maximum size (2GB)

instead of creating a chunked array.

 

A reproducible example:
{code:java}
import uuid
import numpy as np
import pyarrow as pa

li = []
for i in range(1):
li.append(uuid.uuid4().hex)
arr = np.array(li)
parr = pa.array(arr)
{code}
Is it a regression or was it never properly fixed?

 


> [Python] Capacity error when converting large string numpy array to arrow 
> array
> ---
>
> Key: ARROW-5966
> URL: https://issues.apache.org/jira/browse/ARROW-5966
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Igor Yastrebov
>Priority: Major
>
> Trying to create a large string array fails with 
> ArrowCapacityError: Encoded string length exceeds maximum size (2GB)
> instead of creating a chunked array.
>  
> A reproducible example:
> {code:java}
> import uuid
> import numpy as np
> import pyarrow as pa
> li = []
> for i in range(1):
> li.append(uuid.uuid4().hex)
> arr = np.array(li)
> parr = pa.array(arr)
> {code}
> Is it a regression or was it never properly fixed: [link 
> title|[https://github.com/apache/arrow/issues/1855]]?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5413) [C++] CSV reader doesn't remove BOM

2019-05-24 Thread Igor Yastrebov (JIRA)
Igor Yastrebov created ARROW-5413:
-

 Summary: [C++] CSV reader doesn't remove BOM
 Key: ARROW-5413
 URL: https://issues.apache.org/jira/browse/ARROW-5413
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.13.0
Reporter: Igor Yastrebov


If a CSV file starts with a byte-order mark, the CSV reader doesn't strip it 
but instead adds it to the first column name.
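
A small reproduction sketch (the file content is a made-up example; the printed 
result reflects the behaviour reported here):
{code:python}
import pyarrow.csv as pc

# Write a tiny CSV prefixed with a UTF-8 byte-order mark.
with open('bom.csv', 'wb') as f:
    f.write(b'\xef\xbb\xbfcol1,col2\n1,2\n')

table = pc.read_csv('bom.csv')
print(table.column_names)  # ['\ufeffcol1', 'col2'] - BOM glued to the first name
{code}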



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)