[jira] [Issue Comment Deleted] (ARROW-7305) [Python] High memory usage writing pyarrow.Table with large strings to parquet

2019-12-20 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Klichuk updated ARROW-7305: -- Comment: was deleted (was: I don't think memory_profile is registering memory usage here

[jira] [Commented] (ARROW-7305) [Python] High memory usage writing pyarrow.Table with large strings to parquet

2019-12-20 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000928#comment-17000928 ] Bogdan Klichuk commented on ARROW-7305: --- Looking at a bigger example {code:java} df =

[jira] [Commented] (ARROW-7305) [Python] High memory usage writing pyarrow.Table with large strings to parquet

2019-12-20 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000908#comment-17000908 ] Bogdan Klichuk commented on ARROW-7305: --- I don't think memory_profile is registering memory usage

[jira] [Commented] (ARROW-7305) [Python] High memory usage writing pyarrow.Table with large strings to parquet

2019-12-20 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000893#comment-17000893 ] Bogdan Klichuk commented on ARROW-7305: --- I have tried this in Ubuntu Docker and the results for 0.14.1

[jira] [Updated] (ARROW-7305) [Python] High memory usage writing pyarrow.Table with large strings to parquet

2019-12-17 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Klichuk updated ARROW-7305: -- Attachment: 50mb.csv.gz > [Python] High memory usage writing pyarrow.Table with large strings

[jira] [Commented] (ARROW-7305) [Python] High memory usage writing pyarrow.Table with large strings to parquet

2019-12-17 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998688#comment-16998688 ] Bogdan Klichuk commented on ARROW-7305: --- Sorry for the delay, attaching a gzipped 50 MB CSV file with

[jira] [Issue Comment Deleted] (ARROW-7305) [Python] High memory usage writing pyarrow.Table with large strings to parquet

2019-12-10 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Klichuk updated ARROW-7305: -- Comment: was deleted (was: Seems like it's the transformation of pandas to pyarrow.Table. If you

[jira] [Commented] (ARROW-7305) [Python] High memory usage writing pyarrow.Table with large strings to parquet

2019-12-09 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992058#comment-16992058 ] Bogdan Klichuk commented on ARROW-7305: --- Seems like it's the transformation of pandas to pyarrow.Table.

[jira] [Created] (ARROW-7305) High memory usage writing pyarrow.Table to parquet

2019-12-03 Thread Bogdan Klichuk (Jira)
Bogdan Klichuk created ARROW-7305: - Summary: High memory usage writing pyarrow.Table to parquet Key: ARROW-7305 URL: https://issues.apache.org/jira/browse/ARROW-7305 Project: Apache Arrow

[jira] [Commented] (ARROW-7150) [Python] Explain parquet file size growth

2019-11-18 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976379#comment-16976379 ] Bogdan Klichuk commented on ARROW-7150: --- Yeah, it's not that simple on my end. It's just one of json

[jira] [Commented] (ARROW-7150) [Python] Explain parquet file size growth

2019-11-14 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974110#comment-16974110 ] Bogdan Klichuk commented on ARROW-7150: --- I've always been thinking of Avro as an alternative since I pretty

[jira] [Commented] (ARROW-7150) [Python] Explain parquet file size growth

2019-11-14 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974108#comment-16974108 ] Bogdan Klichuk commented on ARROW-7150: --- [~emkornfi...@gmail.com] Yeah, I'm not too familiar with

[jira] [Updated] (ARROW-7150) [Python] Explain parquet file size growth

2019-11-13 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Klichuk updated ARROW-7150: -- Environment: Mac OS X (was: Mac OS X. Pyarrow==0.14.1) > [Python] Explain parquet file size

[jira] [Updated] (ARROW-7150) [Python] Explain parquet file size growth

2019-11-13 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Klichuk updated ARROW-7150: -- Affects Version/s: (was: 0.14.1) 0.15.1 > [Python] Explain parquet

[jira] [Updated] (ARROW-7150) [Python] Explain parquet file size growth

2019-11-13 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Klichuk updated ARROW-7150: -- Description: Having columnar storage format in mind, with gzip compression enabled, I can't

[jira] [Updated] (ARROW-7150) [Python] Explain parquet file size growth

2019-11-13 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Klichuk updated ARROW-7150: -- Attachment: 820.parquet > [Python] Explain parquet file size growth >

[jira] [Commented] (ARROW-7150) [Python] Explain parquet file size growth

2019-11-13 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973501#comment-16973501 ] Bogdan Klichuk commented on ARROW-7150: --- Hello. Tried 0.15.1, got the same results. I managed to

[jira] [Comment Edited] (ARROW-7150) [Python] Explain parquet file size growth

2019-11-13 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973501#comment-16973501 ] Bogdan Klichuk edited comment on ARROW-7150 at 11/13/19 4:31 PM: - Hello.

[jira] [Updated] (ARROW-7150) [Python] Explain parquet file size growth

2019-11-12 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Klichuk updated ARROW-7150: -- Environment: Mac OS X. Pyarrow==0.14.1 (was: Mac OS X. Pyarrow==0.15.1) > [Python] Explain

[jira] [Created] (ARROW-7150) [Python] Explain parquet file size growth

2019-11-12 Thread Bogdan Klichuk (Jira)
Bogdan Klichuk created ARROW-7150: - Summary: [Python] Explain parquet file size growth Key: ARROW-7150 URL: https://issues.apache.org/jira/browse/ARROW-7150 Project: Apache Arrow Issue Type:

[jira] [Updated] (ARROW-7150) [Python] Explain parquet file size growth

2019-11-12 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Klichuk updated ARROW-7150: -- Description: Having columnar storage format in mind, with gzip compression enabled, I can't

[jira] [Updated] (ARROW-6481) [Python] Bad performance of read_csv() with column_types

2019-09-09 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Klichuk updated ARROW-6481: -- Summary: [Python] Bad performance of read_csv() with column_types (was: Bad performance of

[jira] [Updated] (ARROW-6481) Bad performance of read_csv() with column_types

2019-09-09 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Klichuk updated ARROW-6481: -- Environment: ubuntu xenial, python2.7 (was: ubuntu xenial) > Bad performance of read_csv()

[jira] [Commented] (ARROW-6481) Bad performance of read_csv() with column_types

2019-09-07 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925007#comment-16925007 ] Bogdan Klichuk commented on ARROW-6481: --- I don't think hashtable lookup on each column has to make

[jira] [Updated] (ARROW-6481) Bad performance of read_csv() with column_types

2019-09-07 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Klichuk updated ARROW-6481: -- Description: Case: Dataset with 20k columns. The number of rows can be 0.

[jira] [Created] (ARROW-6481) Bad performance of read_csv() with column_types

2019-09-07 Thread Bogdan Klichuk (Jira)
Bogdan Klichuk created ARROW-6481: - Summary: Bad performance of read_csv() with column_types Key: ARROW-6481 URL: https://issues.apache.org/jira/browse/ARROW-6481 Project: Apache Arrow Issue

[jira] [Commented] (ARROW-6301) [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name arrow.py_extension_type found'

2019-08-27 Thread Bogdan Klichuk (Jira)
[ https://issues.apache.org/jira/browse/ARROW-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916798#comment-16916798 ] Bogdan Klichuk commented on ARROW-6301: --- Bumping this thread with a related segfault, that

[jira] [Commented] (ARROW-5791) [Python] pyarrow.csv.read_csv hangs + eats all RAM

2019-07-02 Thread Bogdan Klichuk (JIRA)
[ https://issues.apache.org/jira/browse/ARROW-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876769#comment-16876769 ] Bogdan Klichuk commented on ARROW-5791: --- Thanks a lot!  > [Python] pyarrow.csv.read_csv hangs +

[jira] [Updated] (ARROW-5811) [Python] pyarrow.csv.read_csv: Ability to not infer column types.

2019-06-30 Thread Bogdan Klichuk (JIRA)
[ https://issues.apache.org/jira/browse/ARROW-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Klichuk updated ARROW-5811: -- Summary: [Python] pyarrow.csv.read_csv: Ability to not infer column types. (was:

[jira] [Created] (ARROW-5811) pyarrow.csv.read_csv: Ability to not infer column types.

2019-06-30 Thread Bogdan Klichuk (JIRA)
Bogdan Klichuk created ARROW-5811: - Summary: pyarrow.csv.read_csv: Ability to not infer column types. Key: ARROW-5811 URL: https://issues.apache.org/jira/browse/ARROW-5811 Project: Apache Arrow

[jira] [Commented] (ARROW-5791) pyarrow.csv.read_csv hangs + eats all RAM

2019-06-29 Thread Bogdan Klichuk (JIRA)
[ https://issues.apache.org/jira/browse/ARROW-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875626#comment-16875626 ] Bogdan Klichuk commented on ARROW-5791: --- Just to point out, I can successfully convert a dataframe (if

[jira] [Commented] (ARROW-5791) pyarrow.csv.read_csv hangs + eats all RAM

2019-06-29 Thread Bogdan Klichuk (JIRA)
[ https://issues.apache.org/jira/browse/ARROW-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875624#comment-16875624 ] Bogdan Klichuk commented on ARROW-5791: --- [~bhulette] It's a shame I threw away the idea of "maybe

[jira] [Created] (ARROW-5791) pyarrow.csv.read_csv hangs + eats all RAM

2019-06-29 Thread Bogdan Klichuk (JIRA)
Bogdan Klichuk created ARROW-5791: - Summary: pyarrow.csv.read_csv hangs + eats all RAM Key: ARROW-5791 URL: https://issues.apache.org/jira/browse/ARROW-5791 Project: Apache Arrow Issue Type:

[jira] [Updated] (ARROW-5791) pyarrow.csv.read_csv hangs + eats all RAM

2019-06-29 Thread Bogdan Klichuk (JIRA)
[ https://issues.apache.org/jira/browse/ARROW-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bogdan Klichuk updated ARROW-5791: -- Description: I have quite a sparse dataset in CSV format. A wide table that has several rows