Re: Small malloc at file open and metadata parsing
Sorry to jump in like this, but I was wondering whether parquet-rs can read the file correctly, or whether the issue also happens there.

Alex, could you give it a go and see if the file and its metadata can be read with parquet-rs (https://github.com/sunchao/parquet-rs)? You can run `cargo install parquet` to install the parquet tools.

Cheers,
Ivan

On Mon, 30 Jul 2018 at 21:49 ALeX Wang wrote:
> Thanks for the quick reply @Wes,
> [...]
Re: Small malloc at file open and metadata parsing
Thanks for the quick reply @Wes,

Too bad this is causing a lot of delays (due to page fault handling) for light queries (ones that query only a few rows/columns).

Will try jemalloc and see.

One more question: when I upgrade to 1.4.0 or later code and use the same cmake options and environment, OpenFile results in a segfault:

```
awake@ev003:/tmp$ cat tmpfile
(gdb) where
#0  0x7fc542eebc3c in free () from /lib64/libc.so.6
#1  0x00f13cb1 in arrow::DefaultMemoryPool::Free (this=0x16e71e0 , buffer=0x7fc52f425040 , size=616512) at /opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/memory_pool.cc:147
#2  0x00f117b6 in arrow::PoolBuffer::~PoolBuffer (this=0x34b5fb8, __in_chrg=) at /opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/buffer.cc:70
#3  0x00e364b7 in __gnu_cxx::new_allocator::destroy (this=0x34b5fb0, __p=0x34b5fb8) at /usr/include/c++/4.8.2/ext/new_allocator.h:124
#4  0x00e35e10 in std::allocator_traits >::_S_destroy (__a=..., __p=0x34b5fb8) at /usr/include/c++/4.8.2/bits/alloc_traits.h:281
#5  0x00e34ea3 in std::allocator_traits >::destroy (__a=..., __p=0x34b5fb8) at /usr/include/c++/4.8.2/bits/alloc_traits.h:405
#6  0x00e33f01 in std::_Sp_counted_ptr_inplace, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x34b5fa0) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:407
#7  0x00e27748 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x34b5fa0) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:144
#8  0x00e255bb in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7ffea5fffc88, __in_chrg=) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:546
#9  0x00e23eae in std::__shared_ptr::~__shared_ptr (this=0x7ffea5fffc80, __in_chrg=) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:781
#10 0x00e23ec8 in std::shared_ptr::~shared_ptr (this=0x7ffea5fffc80, __in_chrg=) at /usr/include/c++/4.8.2/bits/shared_ptr.h:93
#11 0x00e875a4 in parquet::SerializedFile::ParseMetaData (this=0x34b5f60) at /opt/parquet-cpp/src/parquet/file_reader.cc:213
#12 0x00e858d4 in parquet::ParquetFileReader::Contents::Open (source=std::unique_ptr containing 0x0, props=..., metadata=std::shared_ptr (empty) 0x0) at /opt/parquet-cpp/src/parquet/file_reader.cc:247
#13 0x00e85a6f in parquet::ParquetFileReader::Open (source=std::unique_ptr containing 0x0, props=..., metadata=std::shared_ptr (empty) 0x0) at /opt/parquet-cpp/src/parquet/file_reader.cc:265
#14 0x00e859ba in parquet::ParquetFileReader::Open (source=std::shared_ptr (count 2, weak 0) 0x34b5e50, props=..., metadata=std::shared_ptr (empty) 0x0) at /opt/parquet-cpp/src/parquet/file_reader.cc:259
#15 0x00e85df4 in parquet::ParquetFileReader::OpenFile (path="/data-slow/data0/test_parquet_file/seg0/0_1530129731023-1530136801030", memory_map=false, props=..., metadata=std::shared_ptr (empty) 0x0) at /opt/parquet-cpp/src/parquet/file_reader.cc:287
```

Is this a known issue?

Thanks,
Alex Wang,

On Mon, Jul 30, 2018, 11:22 AM Wes McKinney wrote:
> hi Alex,
> [...]
Re: Small malloc at file open and metadata parsing
Hi Alex,

It looks like the mallocs are coming from Thrift (parquet/parquet_types.cpp is generated by Thrift). I'm not sure if we can do much about this. I'm curious if it's possible to pass a custom STL allocator to Thrift so we could use a different allocation strategy than the default STL allocator.

- Wes

On Mon, Jul 30, 2018 at 1:54 PM, ALeX Wang wrote:
> Hi,
> [...]
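
To make the custom-allocator idea concrete, here is a minimal sketch of the mechanism in isolation. It does not touch Thrift, and whether Thrift-generated code can be made to accept an allocator is still the open question. `Arena`, `ArenaAllocator`, and `ServeSmallVectors` are hypothetical names invented for this illustration; the sketch only shows that many small `std::vector` allocations (like the `resize` to 3 elements in the trace) can be served from one pre-allocated block instead of one `malloc` each.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical bump arena: one upfront block, carved up by advancing an offset.
class Arena {
 public:
  explicit Arena(std::size_t capacity)
      : buffer_(new char[capacity]), capacity_(capacity), offset_(0), served_(0) {}
  ~Arena() { delete[] buffer_; }
  Arena(const Arena&) = delete;
  Arena& operator=(const Arena&) = delete;

  void* Allocate(std::size_t bytes, std::size_t align) {
    assert(align != 0);
    // Round the current offset up to the requested alignment.
    std::size_t start = (offset_ + align - 1) / align * align;
    if (start + bytes > capacity_) throw std::bad_alloc();
    offset_ = start + bytes;
    ++served_;
    return buffer_ + start;
  }

  std::size_t served() const { return served_; }

 private:
  char* buffer_;
  std::size_t capacity_;
  std::size_t offset_;
  std::size_t served_;  // number of requests handled without calling malloc
};

// Minimal C++11 allocator that forwards to an Arena (hypothetical).
template <typename T>
class ArenaAllocator {
 public:
  using value_type = T;

  explicit ArenaAllocator(Arena* arena) : arena_(arena) {}
  template <typename U>
  ArenaAllocator(const ArenaAllocator<U>& other) : arena_(other.arena()) {}

  T* allocate(std::size_t n) {
    return static_cast<T*>(arena_->Allocate(n * sizeof(T), alignof(T)));
  }
  // Individual frees are no-ops; memory is released when the arena dies.
  void deallocate(T*, std::size_t) {}

  Arena* arena() const { return arena_; }

 private:
  Arena* arena_;
};

template <typename T, typename U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) {
  return a.arena() == b.arena();
}
template <typename T, typename U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) {
  return !(a == b);
}

// Creates `columns` small vectors, resizes each to 3 elements, and returns
// how many allocation requests the arena served.
std::size_t ServeSmallVectors(std::size_t columns) {
  Arena arena(1 << 20);  // declared first so it outlives the vectors below
  ArenaAllocator<int> alloc(&arena);
  std::vector<std::vector<int, ArenaAllocator<int>>> per_column;
  per_column.reserve(columns);
  for (std::size_t i = 0; i < columns; ++i) {
    per_column.emplace_back(alloc);
    per_column.back().resize(3);
  }
  return arena.served();
}
```

Calling `ServeSmallVectors(400)` mirrors the 400-column schema from the original report. The trade-off of a bump arena is that memory is only reclaimed when the whole arena is destroyed, which would suit metadata that lives as long as the file reader but not long-lived, frequently resized containers.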
Small malloc at file open and metadata parsing
Hi,

I'm reading a parquet file (generated by the Java parquet library). Our schema has 400 columns (including non-array elements and 1-dimensional array elements).

I'm using release 1.3.1, gcc 4.8.5, and the boost static library 1.53.

I compile parquet-cpp with the following cmake options:

```
cmake3 -DCMAKE_BUILD_TYPE=Debug -DPARQUET_BUILD_EXAMPLES=OFF -DPARQUET_BUILD_TESTS=OFF -DPARQUET_ARROW_LINKAGE="static" -DPARQUET_BUILD_SHARED=OFF -DPARQUET_BOOST_USE_SHARED=OFF .
```

One thing we noticed is that the cpp library performs a lot of small mallocs during the open-file and metadata-reading phases, as shown below:

```
(gdb) where
#0  0x7fdf40594801 in malloc () from /lib64/libc.so.6
#1  0x7fdf40e52ecd in operator new(unsigned long) () from /lib64/libstdc++.so.6
#2  0x00ea16c0 in __gnu_cxx::new_allocator::allocate (this=0x33e6930, __n=3) at /usr/include/c++/4.8.2/ext/new_allocator.h:104
#3  0x00e9eabb in std::_Vector_base >::_M_allocate (this=0x33e6930, __n=3) at /usr/include/c++/4.8.2/bits/stl_vector.h:168
#4  0x00ecf512 in std::vector >::_M_default_append (this=0x33e6930, __n=3) at /usr/include/c++/4.8.2/bits/vector.tcc:549
#5  0x00eca887 in std::vector >::resize (this=0x33e6930, __new_size=3) at /usr/include/c++/4.8.2/bits/stl_vector.h:667
#6  0x00ebd589 in parquet::format::ColumnMetaData::read (this=0x33e6908, iprot=0x3337300) at /opt/parquet-cpp/src/parquet/parquet_types.cpp:3845
#7  0x00ebf9ed in parquet::format::ColumnChunk::read (this=0x33e68f0, iprot=0x3337300) at /opt/parquet-cpp/src/parquet/parquet_types.cpp:4246
#8  0x00ec0cd2 in parquet::format::RowGroup::read (this=0x33cf7c0, iprot=0x3337300) at /opt/parquet-cpp/src/parquet/parquet_types.cpp:4451
#9  0x00ec4e22 in parquet::format::FileMetaData::read (this=0x3337270, iprot=0x3337300) at /opt/parquet-cpp/src/parquet/parquet_types.cpp:5385
#10 0x00e9364d in parquet::DeserializeThriftMsg (buf=0x7fdf2cace040 "\025\002\031\374\313\004H\bsessions\025\374\005", len=0x7ffc8c96ff34, deserialized_msg=0x3337270) at /opt/parquet-cpp/src/parquet/thrift.h:119
#11 0x00e8fda5 in parquet::FileMetaData::FileMetaDataImpl::FileMetaDataImpl (this=0x3302fb0, metadata=0x7fdf2cace040 "\025\002\031\374\313\004H\bsessions\025\374\005", metadata_len=0x7ffc8c96ff34) at /opt/parquet-cpp/src/parquet/metadata.cc:303
#12 0x00e8bf4f in parquet::FileMetaData::FileMetaData (this=0x31a4ca0, metadata=0x7fdf2cace040 "\025\002\031\374\313\004H\bsessions\025\374\005", metadata_len=0x7ffc8c96ff34) at /opt/parquet-cpp/src/parquet/metadata.cc:403
#13 0x00e8bee3 in parquet::FileMetaData::Make (metadata=0x7fdf2cace040 "\025\002\031\374\313\004H\bsessions\025\374\005", metadata_len=0x7ffc8c96ff34) at /opt/parquet-cpp/src/parquet/metadata.cc:398
#14 0x00e87572 in parquet::SerializedFile::ParseMetaData (this=0x3241450) at /opt/parquet-cpp/src/parquet/file_reader.cc:213
#15 0x00e858d4 in parquet::ParquetFileReader::Contents::Open (source=std::unique_ptr containing 0x0, props=..., metadata=std::shared_ptr (empty) 0x0) at /opt/parquet-cpp/src/parquet/file_reader.cc:247
#16 0x00e85a6f in parquet::ParquetFileReader::Open (source=std::unique_ptr containing 0x0, props=..., metadata=std::shared_ptr (empty) 0x0) at /opt/parquet-cpp/src/parquet/file_reader.cc:265
#17 0x00e859ba in parquet::ParquetFileReader::Open (source=std::shared_ptr (count 2, weak 0) 0x32e2e80, props=..., metadata=std::shared_ptr (empty) 0x0) at /opt/parquet-cpp/src/parquet/file_reader.cc:259
#18 0x00e85df4 in parquet::ParquetFileReader::OpenFile (path="/data-slow/data0/test_parquet_file/seg0/0_1530129731023-1530136801030", memory_map=false, props=..., metadata=std::shared_ptr (empty) 0x0) at /opt/parquet-cpp/src/parquet/file_reader.cc:287

(gdb) info br
Num Type Disp Enb Address What
1 breakpoint keep y
  breakpoint already hit 2679 times
  ignore next 2321 hits
```

I set the breakpoint on `malloc`, above ^

This seems to be the case regardless of the mmap option.

Would really appreciate some pointers on how to avoid this.

Thanks,
Alex Wang,

--
Alex Wang,
Open vSwitch developer
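
As an alternative to counting hits on a gdb breakpoint, allocations made through `operator new` can be counted in-process by replacing the global operator. The following is a minimal sketch under stated assumptions: `g_new_calls` and `CountAllocations` are hypothetical names for this illustration, and the loop only imitates the pattern visible in the trace (one small `std::vector` resized to 3 elements per column); it does not call parquet-cpp or Thrift.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical counter, incremented on every call to the replaced operator new.
static std::atomic<std::size_t> g_new_calls{0};

// Replacement for the global allocation function; the default operator new[]
// forwards here as well.
void* operator new(std::size_t size) {
  g_new_calls.fetch_add(1, std::memory_order_relaxed);
  if (void* p = std::malloc(size ? size : 1)) return p;
  throw std::bad_alloc();
}

// Matching deallocation functions so memory obtained above is freed with free().
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Imitates per-column metadata: one small vector per column, each resized to
// 3 elements. Returns how many operator new calls happened in the process.
std::size_t CountAllocations(std::size_t columns) {
  const std::size_t before = g_new_calls.load();
  {
    std::vector<std::vector<int>> per_column(columns);
    for (auto& v : per_column) v.resize(3);
  }
  const std::size_t after = g_new_calls.load();
  assert(after >= before);
  return after - before;
}
```

With a 400-column schema, `CountAllocations(400)` reports at least one allocation per column for a single level of nesting. The real metadata has several such containers per column chunk and per row group, which is consistent with the thousands of hits shown by `info br`.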