Wes McKinney created ARROW-9441:
-----------------------------------

Summary: [C++] Optimize RecordBatchReader::ReadAll
Key: ARROW-9441
URL: https://issues.apache.org/jira/browse/ARROW-9441
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Wes McKinney

Based on perf reports, more time is spent manipulating C++ data structures than reconstructing record batches from IPC messages, which strikes me as not what we want.

Here is the output of a perf report based on the Python code

{code}
import pyarrow as pa

for i in range(100):
    pa.ipc.open_stream('nyctaxi.arrow').read_all()
{code}

{code}
-   50.40%     0.06%  python  libarrow.so.100.0.0  [.] arrow::RecordBatchReader::ReadAll
   - 50.34% arrow::RecordBatchReader::ReadAll
      - 25.86% arrow::Table::FromRecordBatches
         - 18.41% arrow::SimpleRecordBatch::column
            - 16.00% arrow::MakeArray
               - 10.49% arrow::VisitTypeInline<arrow::internal::ArrayDataWrapper>
                    7.71% arrow::PrimitiveArray::SetData
                    1.87% arrow::StringArray::StringArray
              1.54% __pthread_mutex_lock
              0.88% __pthread_mutex_unlock
           0.67% std::_Hash_bytes
           0.60% arrow::ChunkedArray::ChunkedArray
      - 22.30% arrow::RecordBatchReader::ReadAll
         - 22.12% arrow::ipc::RecordBatchStreamReaderImpl::ReadNext
            - 15.91% arrow::ipc::ReadRecordBatchInternal
               - 15.15% arrow::ipc::LoadRecordBatch
                  - 14.45% arrow::ipc::ArrayLoader::Load
                     + 13.15% arrow::VisitTypeInline<arrow::ipc::ArrayLoader>
            + 5.53% arrow::ipc::InputStreamMessageReader::ReadNextMessage
        1.84% arrow::SimpleRecordBatch::~SimpleRecordBatch
{code}

Roughly 26% of the time goes to {{Table::FromRecordBatches}} (mostly boxing columns via {{MakeArray}}), compared with about 22% for actually reading the batches from the stream.

Perhaps {{ChunkedArray}} should be changed internally to contain a vector of {{ArrayData}} instead of boxed Arrays.
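
To make the {{ChunkedArray}} suggestion a bit more concrete, here is a rough sketch in plain Python, not Arrow code. The names {{ArrayData}}, {{Array}}, {{EagerChunkedArray}}, {{LazyChunkedArray}}, and {{count_boxes}} are hypothetical stand-ins chosen for illustration, and the counter only models the per-chunk cost that shows up under {{MakeArray}} in the profile. It assumes that most callers of {{ReadAll}} do not need every chunk boxed immediately, which may or may not hold in practice.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ArrayData:
    """Hypothetical stand-in for arrow::ArrayData: raw values, no wrapper."""
    values: List[int]


class Array:
    """Hypothetical stand-in for a boxed arrow::Array around an ArrayData."""

    boxes_created = 0  # models how often the MakeArray-style cost is paid

    def __init__(self, data: ArrayData) -> None:
        Array.boxes_created += 1
        self.data = data

    def __len__(self) -> int:
        return len(self.data.values)


class EagerChunkedArray:
    """Boxes every chunk at construction time."""

    def __init__(self, chunks: List[ArrayData]) -> None:
        self.chunks = [Array(c) for c in chunks]

    def __len__(self) -> int:
        return sum(len(c) for c in self.chunks)


class LazyChunkedArray:
    """Stores ArrayData and boxes a chunk only when it is requested."""

    def __init__(self, chunks: List[ArrayData]) -> None:
        self._data = list(chunks)

    def __len__(self) -> int:
        return sum(len(d.values) for d in self._data)

    def chunk(self, i: int) -> Array:
        return Array(self._data[i])


def count_boxes(factory: Callable, chunks: List[ArrayData]) -> int:
    """Return how many Array wrappers are built by constructing the container
    and asking for its total length, without touching any single chunk."""
    Array.boxes_created = 0
    container = factory(chunks)
    len(container)
    return Array.boxes_created
```

In this toy model, building the eager container over N chunks creates N wrappers, while the lazy one creates none until {{chunk(i)}} is called. Whether the real change would pay off depends on how often downstream code boxes the chunks anyway.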