That is what I am trying to do, i.e. replace all int64 fields with string fields. However, this causes issues in other parts of the pipeline where these rows are supposed to be mapped to other rows with compatible schemas, i.e. select certain fields from an input row and map them to the same-named fields of an output row. There I can no longer use Cast.castRow() and would have to copy-paste the code of castRow with additional if/else conditions. Is there a way I can modify/hack the behaviour of JsonFormat.printer to not serialize int64s as strings?
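As far as I can tell, JsonFormat.Printer exposes no switch for this, so a serialization-side workaround is to post-process the printed JSON and unquote the known int64 fields. Below is a minimal regex-based sketch; the class name `Int64JsonFixer` is hypothetical, and a production version should walk the JSON tree with a real parser driven by the proto descriptor rather than regexes:

```java
import java.util.Set;
import java.util.regex.Pattern;

public class Int64JsonFixer {
    /** Rewrites "field": "123" to "field": 123 for each named int64 field. */
    public static String unquoteInt64s(String json, Set<String> int64Fields) {
        for (String field : int64Fields) {
            // Match the quoted field name, a colon, then a quoted integer value.
            Pattern p = Pattern.compile(
                "\"" + Pattern.quote(field) + "\"\\s*:\\s*\"(-?\\d+)\"");
            json = p.matcher(json).replaceAll("\"" + field + "\": $1");
        }
        return json;
    }
}
```

This keeps PipelineB's schema untouched, at the cost of a fragile textual rewrite (it would mis-handle an int64 field name that also appears as a string field elsewhere in the document).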
On Fri, Oct 29, 2021 at 11:29 AM Brian Hulette <[email protected]> wrote:

> Also, a temporary workaround might be to just modify the schema to make it
> expect strings for the int64 fields, then immediately translate them.
>
> On Fri, Oct 29, 2021 at 11:27 AM Brian Hulette <[email protected]> wrote:
>
>> I was puzzled by this logic in JsonFormat at first as I don't see a
>> justification for it in the linked docs. Representing an int64 as a string
>> doesn't improve anything; it just consumes two more bytes on the wire for
>> the quotations.
>> Finally I realized this must be for compatibility with JavaScript, where
>> the full range of an int64 can't be represented with a primitive type.
>>
>> We could add an option on JsonToRow to make it expect INT64s to be
>> strings. There are already some options available to configure the
>> underlying RowJsonDeserializer [1]. Similar to the null behavior it could
>> have three options: 1) expect number, 2) expect string, 3) allow either.
>>
>> Brian
>>
>> [1] https://github.com/apache/beam/blob/2e448dee58f1ee60551cc47b9aa7df6bc832734a/sdks/java/core/src/main/java/org/apache/beam/sdk/util/RowJson.java#L252
>>
>> On Fri, Oct 29, 2021 at 8:45 AM Reuven Lax <[email protected]> wrote:
>>
>>> This is done on every record though, so parsing twice seems undesirable.
>>>
>>> On Thu, Oct 28, 2021 at 2:09 PM gaurav mishra <[email protected]> wrote:
>>>
>>>> The validator function will probably have to do a try/catch with
>>>> Long.parseLong(jsonNode.toString()) to test for valid inputs in case the
>>>> current check - jsonNode.isIntegralNumber() && jsonNode.canConvertToLong() -
>>>> is false.
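The try/catch fallback described above can be sketched in isolation, operating on the node's raw text; the class and method names here are hypothetical, not Beam API:

```java
public class Int64Validator {
    /**
     * Returns the long value if the token is a valid int64 (bare number or
     * quoted string), or null for invalid input. Unlike JsonNode.asLong(),
     * the NumberFormatException path cleanly separates bad input from a
     * legitimate value of 0.
     */
    public static Long tryExtractLong(String rawToken) {
        String s = rawToken;
        // An int64 printed by JsonFormat arrives as a quoted JSON string.
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            s = s.substring(1, s.length() - 1);
        }
        try {
            return Long.parseLong(s); // rejects non-numeric and out-of-range input
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

Note that for string-typed input this is a single parse, not two: Long.parseLong both validates and converts in one pass.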
>>>> On Fri, Oct 22, 2021 at 8:07 PM Reuven Lax <[email protected]> wrote:
>>>>
>>>>> If we want to support this behavior, we would have to change this code:
>>>>>
>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/RowJsonValueExtractors.java#L93
>>>>>
>>>>> I believe that instead of longValue, JsonNode::asLong would have to be
>>>>> used (that function parses strings). I'm not sure how to test for invalid
>>>>> input, since that function simply returns 0 in that case, which is also a
>>>>> valid value.
>>>>>
>>>>> Reuven
>>>>>
>>>>> On Fri, Oct 22, 2021 at 12:41 PM gaurav mishra <[email protected]> wrote:
>>>>>
>>>>>> I have two pipelines:
>>>>>> PipelineA -> output to pubsub (schema-bound json) -> PipelineB
>>>>>>
>>>>>> PipelineA is emitting proto models serialized as JSON strings.
>>>>>> Serialization is done using JsonFormat.printer:
>>>>>> ```
>>>>>> JsonFormat.printer().preservingProtoFieldNames()
>>>>>>     .print(model)
>>>>>> ```
>>>>>>
>>>>>> In PipelineB I am trying to read these JSONs as Rows tied to the
>>>>>> schema derived from the same proto which was used in PipelineA:
>>>>>> ```
>>>>>> Schema schema = new ProtoMessageSchema().schemaFor(TypeDescriptor.of(protoClass));
>>>>>> ....
>>>>>> input.apply(JsonToRow.withSchema(schema));
>>>>>> ...
>>>>>> ```
>>>>>>
>>>>>> The problem I am facing is with int64 type fields. When these fields
>>>>>> are serialized using `JsonFormat.printer` they come out as strings in
>>>>>> the final json, which is by design
>>>>>> (https://developers.google.com/protocol-buffers/docs/proto3#json).
>>>>>> In PipelineB, when the framework tries to deserialize these fields as
>>>>>> int64, it fails:
>>>>>> ```
>>>>>> Unable to get value from field 'site_id'. Schema type 'INT64'. JSON
>>>>>> node type STRING
>>>>>> org.apache.beam.sdk.util.RowJson$RowJsonDeserializer.extractJsonPrimitiveValue
>>>>>> ```
>>>>>>
>>>>>> Is there a way to work around this problem? Can I do something either
>>>>>> on the serialization side or the deserialization side to fix this?
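Brian's three-option proposal (expect number / expect string / allow either) could be modeled roughly as below. This is a pure-Java sketch over raw JSON tokens, not the actual Jackson/RowJson API, and every name in it is hypothetical:

```java
import java.util.Optional;

public class Int64Options {
    public enum Int64Behavior { EXPECT_NUMBER, EXPECT_STRING, ALLOW_EITHER }

    /** Extracts an int64 under the configured behavior; empty means invalid input. */
    public static Optional<Long> extract(String token, Int64Behavior behavior) {
        boolean quoted = token.length() >= 2
            && token.startsWith("\"") && token.endsWith("\"");
        // Reject the shapes this behavior was configured not to accept.
        if (quoted && behavior == Int64Behavior.EXPECT_NUMBER) return Optional.empty();
        if (!quoted && behavior == Int64Behavior.EXPECT_STRING) return Optional.empty();
        String digits = quoted ? token.substring(1, token.length() - 1) : token;
        try {
            return Optional.of(Long.parseLong(digits));
        } catch (NumberFormatException e) {
            return Optional.empty(); // invalid input, distinct from a valid 0
        }
    }
}
```

With EXPECT_STRING as the configured behavior, PipelineB would accept JsonFormat's quoted int64s directly, and the thread's copy-pasted castRow problem never arises because the Row schema stays INT64 end to end.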
