Parallel file processing would be enough; inner-file parallelism would be
awesome, but it's a plus.
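
Something like the following minimal sketch would already cover that use
case (WholeFileJsonInputFormat is a hypothetical name; it assumes Flink's
FileInputFormat plus Jackson for parsing). Setting unsplittable = true
yields exactly one split per file, so parallelism comes from distributing
whole files across subtasks and multi-line documents are never cut in half:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.flink.api.common.io.FileInputFormat;
    import org.apache.flink.core.fs.FileInputSplit;

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class WholeFileJsonInputFormat extends FileInputFormat<JsonNode> {

        private static final ObjectMapper MAPPER = new ObjectMapper();
        private boolean consumed;

        public WholeFileJsonInputFormat() {
            // Never create byte-range splits inside a file; each file
            // becomes exactly one split.
            this.unsplittable = true;
        }

        @Override
        public void open(FileInputSplit split) throws IOException {
            super.open(split);
            consumed = false;
        }

        @Override
        public boolean reachedEnd() {
            return consumed;
        }

        @Override
        public JsonNode nextRecord(JsonNode reuse) throws IOException {
            // Buffer the whole file and parse it as one JSON document.
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int read;
            while ((read = stream.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            consumed = true;
            return MAPPER.readTree(buffer.toByteArray());
        }
    }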

On Fri, Nov 29, 2019 at 3:46 PM Arvid Heise <[email protected]> wrote:

> A while ago, I implemented XML and Json input formats. However, having
> proper split support for structured formats without sync markers is not
> that easy. Any split that has a random start offset needs to figure out the
> start of the next record on its own, which is fragile by definition.
> That's why supporting jsonl files is much easier; you just need to look
> for the next newline. For the same reason, supporting json or xml in Kafka
> is fairly straightforward: records are already split.
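>
> To illustrate the newline trick, here is a minimal sketch (the names
> JsonlSplitAlignment and alignToNextLine are hypothetical, assuming a
> byte-range split over Flink's FSDataInputStream):
>
>     import java.io.IOException;
>
>     import org.apache.flink.core.fs.FSDataInputStream;
>
>     final class JsonlSplitAlignment {
>         // Position a jsonl reader at the first complete line of its
>         // split. Every split except the first skips the leading partial
>         // line; the previous split reads past its own end until the next
>         // newline, so no record is lost or read twice.
>         static void alignToNextLine(FSDataInputStream in, long splitStart)
>                 throws IOException {
>             in.seek(splitStart);
>             if (splitStart != 0) {
>                 int b;
>                 while ((b = in.read()) != -1 && b != '\n') {
>                     // skip bytes of the line begun in the previous split
>                 }
>             }
>         }
>     }
>
> After aligning, the reader parses each complete line as one JSON record
> and keeps reading past its split end until it finishes the line that
> straddles the boundary.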
>
> It would be easier to support XML and Json if we can get rid of splits.
> @Flavio, would you expect to get inner-file parallelism, or would you be fine
> with processing only the files in parallel?
>
> Best,
>
> Arvid
>
> On Fri, Nov 29, 2019 at 3:26 PM Chesnay Schepler <[email protected]>
> wrote:
>
>> I know that at least the Table API
>> <https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#csv-format>
>> can read json, but I don't know how well this translates into other APIs.
>>
>> On 29/11/2019 12:09, Flavio Pompermaier wrote:
>>
>> Hi to all,
>> is there any out-of-the-box option to read multiline JSON or XML like in
>> Spark?
>> It would be awesome to have something like
>>
>> spark.read
>>   .option("multiline", true)
>>   .json("/path/to/user.json")
>>
>> Best,
>> Flavio
>>
