Hi Arun,

The field should be declared with `parquet::Repetition::OPTIONAL`; `parquet::Repetition::REPEATED` is the one meant for repeated groups. IIRC you can then insert arrow::util::Optional::nullopt into the stream to write a null value.
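To illustrate why the `quiet_NaN()` approach ends up as 0 and what an optional field changes, here is a minimal standard-library sketch. It does not use the Parquet or Arrow API: `OptionalInt64Column`, `Append`, `Decode` and `RunExample` are hypothetical names made up for this example, and the struct is only a rough model of how a flat optional column keeps nulls apart from values (a per-row definition level plus the non-null values), not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Hypothetical model of a flat OPTIONAL INT64 column (not the Parquet API).
struct OptionalInt64Column {
  std::vector<int16_t> def_levels;  // one per row: 1 = present, 0 = null
  std::vector<int64_t> values;      // non-null values only

  void Append(const std::optional<int64_t>& v) {
    def_levels.push_back(v.has_value() ? 1 : 0);
    if (v.has_value()) values.push_back(*v);
  }

  // Rebuild the rows, turning definition level 0 back into "no value".
  std::vector<std::optional<int64_t>> Decode() const {
    std::vector<std::optional<int64_t>> rows;
    std::size_t next = 0;
    for (int16_t level : def_levels) {
      if (level == 1) {
        rows.push_back(values[next++]);
      } else {
        rows.push_back(std::nullopt);
      }
    }
    return rows;
  }
};

inline void RunExample() {
  // int64_t has no NaN, so quiet_NaN() is just a value-initialised 0.
  static_assert(!std::numeric_limits<int64_t>::has_quiet_NaN,
                "integers cannot represent NaN");
  assert(std::numeric_limits<int64_t>::quiet_NaN() == 0);

  OptionalInt64Column col;
  col.Append(1663084680000000000LL);  // an arbitrary epoch timestamp in nanos
  col.Append(std::nullopt);           // a real null, distinct from 0
  col.Append(0);                      // a real 0, distinct from null

  const auto rows = col.Decode();
  assert(rows.size() == 3);
  assert(rows[0].has_value() && !rows[1].has_value());
  assert(rows[2].has_value() && *rows[2] == 0);
  assert(col.values.size() == 2);     // the null takes no slot in the values
}
```

Calling `RunExample()` from a `main` runs the checks. The point is that a null is carried as the absence of a value rather than as a sentinel integer, so a genuine 0 and a missing timestamp stay distinguishable.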
Hope this helps.

Micah

On Tue, Sep 13, 2022 at 8:58 AM Arun Joseph <[email protected]> wrote:

> Hi all,
>
> I've tried defining my field with the following:
>
> fields.push_back(
>     parquet::schema::PrimitiveNode::Make(
>         "field_name",
>         parquet::Repetition::REQUIRED,
>         parquet::Type::INT64,
>         parquet::ConvertedType::INT_64)
> );
>
> and I'm not sure if it's possible to specify a null value for an int64
> column. I understand that C++ ints don't have a null value. I write to the
> field with the following:
>
> os << std::numeric_limits<int64_t>::quiet_NaN();
>
> where os is:
>
> parquet::WriterProperties::Builder builder_;
> parquet::StreamWriter os {parquet::ParquetFileWriter::Open(outfile_,
>     schema_, builder_.build())};
>
> This (as expected) writes a 0 for the value. But is there a way to specify
> a null value? From my understanding parquet::Repetition:OPTIONAL is meant
> for repeating groups.
>
> My actual usecase is trying to represent a null linux epoch timestamp in
> nanos e.g. NaN or NaT in the resulting pandas dataframe after reading the
> written parquet file. It seems like in Pandas, int columns with nulls are
> implicitly casted to float but I think parquet is able to define a null
> value like this. Is this the only way to achieve this to convert the
> column to a float or is there a way to specify value is null in parquet
> cpp?
>
> Thank You,
> Arun Joseph
