Re: [C++] How to write a null value to a int64 column with Parquet StreamWriter?

Arun Joseph Wed, 14 Sep 2022 09:51:11 -0700

I've tried the following schema:

            fields.push_back(
                parquet::schema::PrimitiveNode::Make(
                    "field_name", parquet::Repetition::OPTIONAL,
                    parquet::LogicalType::Timestamp(
                        true, parquet::LogicalType::TimeUnit::NANOS),
                    parquet::Type::INT64)
            );


But when I try to insert a value, I get the following exception:

terminate called after throwing an instance of 'parquet::ParquetException'
  what():  Column converted type mismatch.  Column 'field_name' has
converted type[NONE] not 'INT_64'

I don't really understand how ConvertedType and LogicalType relate w.r.t. the
two different overloads of Make. However, the Make overload that takes a
ConvertedType does not seem like it would support Timestamp.

Regarding nullopt compatibility with the ParquetStreamWriter, is that
something that should work without a template parameter? The compiler error
that gets thrown is:
./include/writer.h:181:33: error: no match for ‘operator<<’ (operand types
are ‘parquet::StreamWriter’ and ‘const nonstd::optional_lite::nullopt_t’)
  181 |                     writer_.os_ << arrow::util::nullopt;
      |                     ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~

Each of the candidate notes that follow has this format (one for each of the
different types):

/home/ajoseph/local/arrow/include/parquet/stream_writer.h:110:17: note:
candidate: ‘parquet::StreamWriter&
parquet::StreamWriter::operator<<(int64_t)’
  110 |   StreamWriter& operator<<(int64_t v);
      |                 ^~~~~~~~
/home/ajoseph/local/arrow/include/parquet/stream_writer.h:110:36: note:
no known conversion for argument 1 from ‘const
nonstd::optional_lite::nullopt_t’ to ‘int64_t’ {aka ‘long int’}
  110 |   StreamWriter& operator<<(int64_t v);
      |                            ~~~~~~~~^

I can try to contribute a solution, but I've never contributed to an Apache
project before. I can try to take a look this weekend or after work one of
these days if this is an actual issue (since there seems to be a workaround
with arrow::util::optional<int64_t>()).
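
A minimal sketch of why the bare nullopt insert is rejected while the typed
empty optional works, and of what an added overload might look like. It
assumes the writer offers an operator<< templated on optional<T>; in that
case T cannot be deduced from nullopt_t, and nullopt_t does not convert to
int64_t, which matches the candidate notes above. ToyWriter and write_sample
are hypothetical names for illustration only, not the parquet::StreamWriter
API, and std::optional (C++17) stands in for arrow::util::optional.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a stream writer; NOT the parquet::StreamWriter API.
class ToyWriter {
 public:
  // Required-value overload, analogous to operator<<(int64_t) in the notes.
  ToyWriter& operator<<(std::int64_t v) {
    cells_.push_back(v);
    return *this;
  }

  // Optional-value overload: an empty optional is recorded as a null cell.
  template <typename T>
  ToyWriter& operator<<(const std::optional<T>& v) {
    if (v.has_value()) {
      cells_.push_back(static_cast<std::int64_t>(*v));
    } else {
      cells_.push_back(std::nullopt);
    }
    return *this;
  }

  // Assumed fix: without this overload, `w << std::nullopt` does not compile,
  // because T in the template above cannot be deduced from std::nullopt_t.
  ToyWriter& operator<<(std::nullopt_t) {
    cells_.push_back(std::nullopt);
    return *this;
  }

  const std::vector<std::optional<std::int64_t>>& cells() const {
    return cells_;
  }

 private:
  std::vector<std::optional<std::int64_t>> cells_;
};

// Writes one value and two nulls, using both spellings of "null".
std::vector<std::optional<std::int64_t>> write_sample() {
  ToyWriter w;
  w << std::int64_t{1663174271000000000}  // an arbitrary epoch-nanos value
    << std::optional<std::int64_t>{}      // typed empty optional (workaround)
    << std::nullopt;                      // needs the nullopt_t overload
  return w.cells();
}
```

Calling write_sample() yields three cells: the timestamp followed by two
empty optionals. Removing the nullopt_t overload from the toy class should
make the last insert fail to compile in the same way as the error above.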

On Wed, Sep 14, 2022 at 12:38 PM Micah Kornfield <[email protected]>
wrote:

> I'm not sure how it works with null elements but pass LogicalType of
> timestamp with isAdjustedToUtc=true and nanoseconds unit when creating the
> schema would be the most likely thing to work.
>
> The fact that nullopt doesn't work, seems like an oversight that might be
> nice to address if you would like to contribute to the project.
>
> On Wed, Sep 14, 2022 at 7:57 AM Arun Joseph <[email protected]> wrote:
>
>> Hi Micah,
>>
>> I couldn't find arrow::util::Optional::nullopt but I did find
>> arrow::util::nullopt which also did not seem to work. However, I then
>> found arrow::util::optional<T>() right afterwards, which seems to output NaNs!
>>
>> I do see that the resulting dataframe when loaded in pandas has the
>> column dtype as float64. Do you know if there is a way to define the
>> schema such that I can input a uint64_t (Linux epoch time nanos) and
>> have it output as datetime64[ns] in parquet cpp?
>>
>> Thank You,
>> Arun
>>
>> On Tue, Sep 13, 2022 at 10:49 PM Micah Kornfield <[email protected]>
>> wrote:
>>
>>> Hi Arun,
>>> The schema should be `parquet::Repetition::OPTIONAL`;
>>> `parquet::Repetition::REPEATED`
>>> should be for repeated groups.  IIRC you can insert
>>> arrow::util::Optional::nullopt into the stream for a null value.
>>>
>>> Hope this helps.
>>>
>>> Micah
>>>
>>> On Tue, Sep 13, 2022 at 8:58 AM Arun Joseph <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I've tried defining my field with the following:
>>>>
>>>> fields.push_back(
>>>>   parquet::schema::PrimitiveNode::Make(
>>>>     "field_name",
>>>>     parquet::Repetition::REQUIRED,
>>>>     parquet::Type::INT64,
>>>>     parquet::ConvertedType::INT_64)
>>>> );
>>>>
>>>> and I'm not sure if it's possible to specify a null value for an int64
>>>> column. I understand that C++ ints don't have a null value. I write to the
>>>> field with the following:
>>>>
>>>> os << std::numeric_limits<int64_t>::quiet_NaN();
>>>>
>>>> where os is:
>>>>
>>>> parquet::WriterProperties::Builder builder_;
>>>> parquet::StreamWriter os {parquet::ParquetFileWriter::Open(outfile_,
>>>> schema_, builder_.build())};
>>>>
>>>> This (as expected) writes a 0 for the value. But is there a way to
>>>> specify a null value? From my understanding parquet::Repetition::OPTIONAL
>>>> is meant for repeating groups.
>>>>
>>>> My actual use case is trying to represent a null Linux epoch timestamp
>>>> in nanos, e.g. NaN or NaT in the resulting pandas dataframe after reading
>>>> the written parquet file. It seems like in pandas, int columns with
>>>> nulls are implicitly cast to float, but I think parquet is able to
>>>> define a null value like this. Is converting the column to a float the
>>>> only way to achieve this, or is there a way to specify that a value is
>>>> null in parquet cpp?
>>>>
>>>> Thank You,
>>>> Arun Joseph
>>>>
>>>>
>>
>> --
>> Arun Joseph
>>
>>

-- 
Arun Joseph
