> terminate called after throwing an instance of 'parquet::ParquetException'
>   what():  Column converted type mismatch. Column 'field_name' has
>   converted type[NONE] not 'INT_64'
I think this is probably a bug in the streaming library: it should also be
checking the LogicalType. It has been a while since I looked at the code.
Nanoseconds aren't supported by ConvertedType, which is a deprecated concept
in Parquet.

> Regarding nullopt compatibility with the ParquetStreamWriter, is that
> something that should work without a template parameter? The compiler
> error that gets thrown is:

I think you need to add an overload for nullopt specifically, which is a
different type in C++ than the empty optional<int64_t>.

On Wed, Sep 14, 2022 at 9:51 AM Arun Joseph <[email protected]> wrote:

> I've tried the following schema:
>
>     fields.push_back(
>         parquet::schema::PrimitiveNode::Make(
>             "field_name", parquet::Repetition::OPTIONAL,
>             parquet::LogicalType::Timestamp(true,
>                 parquet::LogicalType::TimeUnit::NANOS),
>             parquet::Type::INT64)
>     );
>
> But when I try to insert a value, I get the following exception:
>
>     terminate called after throwing an instance of 'parquet::ParquetException'
>       what():  Column converted type mismatch. Column 'field_name' has
>       converted type[NONE] not 'INT_64'
>
> I don't really understand how the ConvertedType vs. LogicalType stuff
> works w.r.t. the two different versions of Make. However, the Make call
> with ConvertedType does not seem like it would support Timestamp.
>
> Regarding nullopt compatibility with the ParquetStreamWriter, is that
> something that should work without a template parameter?
> The compiler error that gets thrown is:
>
>     ./include/writer.h:181:33: error: no match for ‘operator<<’ (operand types
>     are ‘parquet::StreamWriter’ and ‘const nonstd::optional_lite::nullopt_t’)
>       181 |     writer_.os_ << arrow::util::nullopt;
>           |     ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~
>
> with each of the candidates producing an error of the following format
> (one for each of the different types):
>
>     /home/ajoseph/local/arrow/include/parquet/stream_writer.h:110:17: note:
>     candidate: ‘parquet::StreamWriter&
>     parquet::StreamWriter::operator<<(int64_t)’
>       110 |   StreamWriter& operator<<(int64_t v);
>           |                 ^~~~~~~~
>     /home/ajoseph/local/arrow/include/parquet/stream_writer.h:110:36: note:
>     no known conversion for argument 1 from ‘const
>     nonstd::optional_lite::nullopt_t’ to ‘int64_t’ {aka ‘long int’}
>       110 |   StreamWriter& operator<<(int64_t v);
>           |                          ~~~~~~~~^
>
> I can try to contribute a solution, but I've never contributed to an
> Apache project before. I can take a peek this weekend or after work one of
> these days if this is an actual issue (since there seems to be a workaround
> with arrow::util::optional<int64_t>()).
>
> On Wed, Sep 14, 2022 at 12:38 PM Micah Kornfield <[email protected]> wrote:
>
>> I'm not sure how it works with null elements, but passing a LogicalType
>> of timestamp with isAdjustedToUtc=true and a nanoseconds unit when
>> creating the schema would be the most likely thing to work.
>>
>> The fact that nullopt doesn't work seems like an oversight that might be
>> nice to address if you would like to contribute to the project.
>>
>> On Wed, Sep 14, 2022 at 7:57 AM Arun Joseph <[email protected]> wrote:
>>
>>> Hi Micah,
>>>
>>> I couldn't find arrow::util::Optional::nullopt, but I did find
>>> arrow::util::nullopt, which also did not seem to work. However, I then
>>> found arrow::util::optional<T>() right after, which seems to output NaNs!
>>>
>>> I do see that the resulting dataframe, when loaded in pandas, has the
>>> column dtype float64.
>>> Do you know if there is a way to define the schema such that I can
>>> input a uint64_t (Linux epoch time in nanos) and have it output as
>>> datetime64[ns] in parquet-cpp?
>>>
>>> Thank You,
>>> Arun
>>>
>>> On Tue, Sep 13, 2022 at 10:49 PM Micah Kornfield <[email protected]> wrote:
>>>
>>>> Hi Arun,
>>>> The schema should be parquet::Repetition::OPTIONAL;
>>>> parquet::Repetition::REPEATED should be for repeated groups. IIRC you
>>>> can insert arrow::util::Optional::nullopt into the stream for a null
>>>> value.
>>>>
>>>> Hope this helps.
>>>>
>>>> Micah
>>>>
>>>> On Tue, Sep 13, 2022 at 8:58 AM Arun Joseph <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I've tried defining my field with the following:
>>>>>
>>>>>     fields.push_back(
>>>>>         parquet::schema::PrimitiveNode::Make(
>>>>>             "field_name",
>>>>>             parquet::Repetition::REQUIRED,
>>>>>             parquet::Type::INT64,
>>>>>             parquet::ConvertedType::INT_64)
>>>>>     );
>>>>>
>>>>> and I'm not sure if it's possible to specify a null value for an
>>>>> int64 column. I understand that C++ ints don't have a null value. I
>>>>> write to the field with the following:
>>>>>
>>>>>     os << std::numeric_limits<int64_t>::quiet_NaN();
>>>>>
>>>>> where os is:
>>>>>
>>>>>     parquet::WriterProperties::Builder builder_;
>>>>>     parquet::StreamWriter os{parquet::ParquetFileWriter::Open(
>>>>>         outfile_, schema_, builder_.build())};
>>>>>
>>>>> This (as expected) writes a 0 for the value. But is there a way to
>>>>> specify a null value? From my understanding, parquet::Repetition::OPTIONAL
>>>>> is meant for repeating groups.
>>>>>
>>>>> My actual use case is representing a null Linux epoch timestamp in
>>>>> nanos, e.g. NaN or NaT in the resulting pandas dataframe after reading
>>>>> the written parquet file. It seems like in pandas, int columns with
>>>>> nulls are implicitly cast to float, but I think Parquet is able to
>>>>> define a null value like this. Is converting the column to a float the
>>>>> only way to achieve this, or is there a way to specify that a value is
>>>>> null in parquet-cpp?
>>>>>
>>>>> Thank You,
>>>>> Arun Joseph
>>>
>>> --
>>> Arun Joseph
>
> --
> Arun Joseph
