> terminate called after throwing an instance of 'parquet::ParquetException'
>   what():  Column converted type mismatch. Column 'field_name' has
>   converted type[NONE] not 'INT_64'
I think this is probably a bug in the streaming library: it should also be
checking the LogicalType. It has been a while since I looked at the code.
Nanoseconds aren't supported by ConvertedType, which is a deprecated concept
in Parquet.

> Regarding nullopt compatibility with the ParquetStreamWriter, is that
> something that should work without a template parameter? The compiler
> error that gets thrown is:

I think you need to add an overload for nullopt specifically, which is a
different type in C++ than the empty optional<int64_t>.

On Wed, Sep 14, 2022 at 9:51 AM Arun Joseph <[email protected]> wrote:

> I've tried the following schema:
>
>     fields.push_back(
>         parquet::schema::PrimitiveNode::Make(
>             "field_name", parquet::Repetition::OPTIONAL,
>             parquet::LogicalType::Timestamp(true,
>                 parquet::LogicalType::TimeUnit::NANOS),
>             parquet::Type::INT64)
>     );
>
> But when I try to insert a value, I get the following exception:
>
>     terminate called after throwing an instance of 'parquet::ParquetException'
>       what():  Column converted type mismatch. Column 'field_name' has
>       converted type[NONE] not 'INT_64'
>
> I don't really understand how the ConvertedType vs. LogicalType stuff
> works w.r.t. the two different versions of Make. However, the Make call
> with ConvertedType does not seem like it would support Timestamp.
>
> Regarding nullopt compatibility with the ParquetStreamWriter, is that
> something that should work without a template parameter?
> The compiler error that gets thrown is:
>
>     ./include/writer.h:181:33: error: no match for ‘operator<<’ (operand types
>     are ‘parquet::StreamWriter’ and ‘const nonstd::optional_lite::nullopt_t’)
>       181 |     writer_.os_ << arrow::util::nullopt;
>           |     ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~
>
> with each of the candidates producing an error of the following format
> (one for each of the different types):
>
>     /home/ajoseph/local/arrow/include/parquet/stream_writer.h:110:17: note:
>     candidate: ‘parquet::StreamWriter&
>     parquet::StreamWriter::operator<<(int64_t)’
>       110 |   StreamWriter& operator<<(int64_t v);
>           |                 ^~~~~~~~
>     /home/ajoseph/local/arrow/include/parquet/stream_writer.h:110:36: note:
>     no known conversion for argument 1 from ‘const
>     nonstd::optional_lite::nullopt_t’ to ‘int64_t’ {aka ‘long int’}
>       110 |   StreamWriter& operator<<(int64_t v);
>           |                          ~~~~~~~~^
>
> I can try to contribute a solution, but I've never contributed to an
> Apache project before. I can take a peek this weekend or after work one of
> these days if this is an actual issue (since there seems to be a workaround
> with arrow::util::optional<int64_t>()).
>
> On Wed, Sep 14, 2022 at 12:38 PM Micah Kornfield <[email protected]> wrote:
>
>> I'm not sure how it works with null elements, but passing a LogicalType
>> of timestamp with isAdjustedToUtc=true and a nanoseconds unit when
>> creating the schema would be the most likely thing to work.
>>
>> The fact that nullopt doesn't work seems like an oversight that might be
>> nice to address if you would like to contribute to the project.
>>
>> On Wed, Sep 14, 2022 at 7:57 AM Arun Joseph <[email protected]> wrote:
>>
>>> Hi Micah,
>>>
>>> I couldn't find arrow::util::Optional::nullopt, but I did find
>>> arrow::util::nullopt, which also did not seem to work. However, I then
>>> found arrow::util::optional<T>() right after, which seems to output NaNs!
>>>
>>> I do see that the resulting dataframe, when loaded in pandas, has the
>>> column dtype float64.
>>> Do you know if there is a way to define the schema such that I can
>>> input a uint64_t (Linux epoch time in nanos) and have it output as
>>> datetime64[ns] in parquet-cpp?
>>>
>>> Thank You,
>>> Arun
>>>
>>> On Tue, Sep 13, 2022 at 10:49 PM Micah Kornfield <[email protected]> wrote:
>>>
>>>> Hi Arun,
>>>> The schema should be parquet::Repetition::OPTIONAL;
>>>> parquet::Repetition::REPEATED should be for repeated groups. IIRC you
>>>> can insert arrow::util::Optional::nullopt into the stream for a null
>>>> value.
>>>>
>>>> Hope this helps.
>>>>
>>>> Micah
>>>>
>>>> On Tue, Sep 13, 2022 at 8:58 AM Arun Joseph <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I've tried defining my field with the following:
>>>>>
>>>>>     fields.push_back(
>>>>>         parquet::schema::PrimitiveNode::Make(
>>>>>             "field_name",
>>>>>             parquet::Repetition::REQUIRED,
>>>>>             parquet::Type::INT64,
>>>>>             parquet::ConvertedType::INT_64)
>>>>>     );
>>>>>
>>>>> and I'm not sure if it's possible to specify a null value for an
>>>>> int64 column. I understand that C++ ints don't have a null value. I
>>>>> write to the field with the following:
>>>>>
>>>>>     os << std::numeric_limits<int64_t>::quiet_NaN();
>>>>>
>>>>> where os is:
>>>>>
>>>>>     parquet::WriterProperties::Builder builder_;
>>>>>     parquet::StreamWriter os{parquet::ParquetFileWriter::Open(
>>>>>         outfile_, schema_, builder_.build())};
>>>>>
>>>>> This (as expected) writes a 0 for the value. But is there a way to
>>>>> specify a null value? From my understanding, parquet::Repetition::OPTIONAL
>>>>> is meant for repeating groups.
>>>>>
>>>>> My actual use case is representing a null Linux epoch timestamp in
>>>>> nanos, e.g. NaN or NaT in the resulting pandas dataframe after reading
>>>>> the written parquet file. It seems like in pandas, int columns with
>>>>> nulls are implicitly cast to float, but I think Parquet is able to
>>>>> define a null value like this. Is converting the column to a float the
>>>>> only way to achieve this, or is there a way to specify that a value is
>>>>> null in parquet-cpp?
>>>>>
>>>>> Thank You,
>>>>> Arun Joseph
>>>
>>> --
>>> Arun Joseph
>
> --
> Arun Joseph
