[jira] [Updated] (ARROW-3408) [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns
[ https://issues.apache.org/jira/browse/ARROW-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-3408:
----------------------------------
    Labels: csv dataset pull-request-available  (was: csv dataset)

> [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns
> ------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-3408
>                 URL: https://issues.apache.org/jira/browse/ARROW-3408
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: csv, dataset, pull-request-available
>             Fix For: 1.0.0
>
> For many datasets, dictionary encoding everything can result in drastically lower memory usage and, consequently, better performance when doing analytics.
>
> One difficulty of dictionary encoding in multithreaded conversions is that ideally you end up with one dictionary at the end, so you have two options (a sketch of the first follows below):
>
> * Implement a concurrent hashing scheme -- for low-cardinality dictionaries the overhead of mutex contention will not be meaningful; for high-cardinality ones it can be more of a problem.
> * Hash each chunk separately, then normalize at the end.
>
> My guess is that a crude concurrent hash table with a mutex to protect mutations and resizes is going to outperform the latter.

--
This message was sent by Atlassian Jira (v8.3.2#803003)
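As an illustration of the first option above (a concurrent hashing scheme), here is a minimal, self-contained C++ sketch of a crude dictionary memo table protected by a single mutex and shared by several chunk-conversion threads. This is not Arrow code: SharedDictionary, GetOrInsert, and EncodeChunk are made-up names for illustration only, and a real implementation would build on Arrow's own memo-table and array-builder machinery rather than std::unordered_map.

{code:cpp}
// Sketch only: a mutex-protected dictionary shared by chunk-conversion threads.
#include <cstdint>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

class SharedDictionary {
 public:
  // Returns the dictionary index for `value`, inserting it if unseen.
  // One mutex guards both inserting lookups and any resize of the map.
  int32_t GetOrInsert(const std::string& value) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = index_.find(value);
    if (it != index_.end()) return it->second;
    const int32_t id = static_cast<int32_t>(values_.size());
    index_.emplace(value, id);
    values_.push_back(value);
    return id;
  }

  std::vector<std::string> values() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return values_;  // id -> value, the single final dictionary
  }

 private:
  mutable std::mutex mutex_;
  std::unordered_map<std::string, int32_t> index_;
  std::vector<std::string> values_;
};

// Encode one chunk of string values into indices against the shared dictionary.
std::vector<int32_t> EncodeChunk(const std::vector<std::string>& chunk,
                                 SharedDictionary* dict) {
  std::vector<int32_t> indices;
  indices.reserve(chunk.size());
  for (const auto& v : chunk) indices.push_back(dict->GetOrInsert(v));
  return indices;
}

int main() {
  // Two chunks of the same column, as a multithreaded CSV read would produce them.
  std::vector<std::vector<std::string>> chunks = {
      {"red", "green", "red", "blue"},
      {"blue", "red", "yellow"}};

  SharedDictionary dict;
  std::vector<std::vector<int32_t>> encoded(chunks.size());

  std::vector<std::thread> workers;
  for (size_t i = 0; i < chunks.size(); ++i) {
    workers.emplace_back([&, i] { encoded[i] = EncodeChunk(chunks[i], &dict); });
  }
  for (auto& t : workers) t.join();

  // All chunks now index into one dictionary; no end-of-read normalization pass is needed.
  for (const auto& v : dict.values()) std::cout << v << " ";
  std::cout << "\n";
  return 0;
}
{code}

Under the second option, each thread would instead fill a private memo table and the per-chunk indices would be remapped against a merged dictionary at the end of the read; the guess above is that for low-cardinality columns the shared table wins because mutex contention is rare.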
[jira] [Updated] (ARROW-3408) [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns
[ https://issues.apache.org/jira/browse/ARROW-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-3408:
--------------------------------
    Labels: csv dataset  (was: csv dataset datasets)
[jira] [Updated] (ARROW-3408) [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns
[ https://issues.apache.org/jira/browse/ARROW-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francois Saint-Jacques updated ARROW-3408:
------------------------------------------
    Labels: csv dataset datasets  (was: csv datasets)
[jira] [Updated] (ARROW-3408) [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns
[ https://issues.apache.org/jira/browse/ARROW-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-3408:
-----------------------------------------
    Labels: csv datasets  (was: datasets)
[jira] [Updated] (ARROW-3408) [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns
[ https://issues.apache.org/jira/browse/ARROW-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-3408:
--------------------------------
    Labels: datasets  (was: )
[jira] [Updated] (ARROW-3408) [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns
[ https://issues.apache.org/jira/browse/ARROW-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-3408:
--------------------------------
    Fix Version/s: 0.15.0  (was: 0.14.0)
[jira] [Updated] (ARROW-3408) [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns
[ https://issues.apache.org/jira/browse/ARROW-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-3408:
--------------------------------
    Fix Version/s: 0.14.0  (was: 0.13.0)
[jira] [Updated] (ARROW-3408) [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns
[ https://issues.apache.org/jira/browse/ARROW-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-3408:
--------------------------------
    Fix Version/s: 0.13.0  (was: 0.12.0)