Re: Read multiline JSON/XML

2019-12-01 Thread vino yang
Also, my apologies to Flavio!

Best,
Vino

On Mon, Dec 2, 2019 at 10:29 AM, vino yang wrote:

> Hi Chesnay,
>
> Sorry, yes, I missed the word "like". I mistakenly thought he wanted to
> ask how to use Spark to accomplish this job.
>
> Best,
> Vino
>
> On Fri, Nov 29, 2019 at 10:01 PM, Chesnay Schepler wrote:
>
>> Why vino?
>>
>> He's specifically asking whether Flink offers something _like_ spark.
>>
>> On 29/11/2019 14:39, vino yang wrote:
>>
>> Hi Flavio,
>>
>> IMO, it would be more effective to ask this question on the Spark user
>> mailing list.
>>
>> WDYT?
>>
>> Best,
>> Vino
>>
>> On Fri, Nov 29, 2019 at 7:09 PM, Flavio Pompermaier wrote:
>>
>>> Hi to all,
>>> is there any out-of-the-box option to read multiline JSON or XML like in
>>> Spark?
>>> It would be awesome to have something like
>>>
>>> spark.read.option("multiline", true).json("/path/to/user.json")
>>>
>>> Best,
>>> Flavio
>>>
>>
>>


Re: Read multiline JSON/XML

2019-11-29 Thread Flavio Pompermaier
Processing the files in parallel would be enough; inner-file parallelism
would be awesome, but it's just a plus.
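
A minimal sketch of what file-level parallelism means here, in plain Python
rather than Flink: each file is parsed whole as one (possibly multi-line)
JSON document, and only the set of files is processed in parallel. The
helper names load_json_file and load_json_files and the max_workers default
are made up for illustration; they are not part of any Flink or Spark API.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def load_json_file(path):
    """Parse one whole file as a single (possibly multi-line) JSON document."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def load_json_files(paths, max_workers=4):
    """Parse each file in its own task; results keep the order of `paths`.

    The unit of parallelism is the file: no file is ever cut into splits,
    so no task has to guess where a record starts.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_json_file, paths))
```

Because every task sees a complete document, this needs no sync markers;
the cost is that one very large file is still handled by a single task.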

On Fri, Nov 29, 2019 at 3:46 PM Arvid Heise  wrote:

> A while ago, I implemented XML and Json input formats. However, having
> proper split support for structured formats without sync markers is not
> that easy. Any split that has a random start offset needs to figure out the
> start of the next record on its own, which is fragile by definition.
> That's why supporting jsonl files is much easier; you just need to look
> for the next newline. For the same reason, supporting json or xml in Kafka
> is fairly straightforward: records are already split.
>
> It would be easier to support XML and Json if we can get rid of splits.
> @Flavio would you expect to get inner file parallelism or would you be fine
> with processing only the files in parallel?
>
> Best,
>
> Arvid
>
> On Fri, Nov 29, 2019 at 3:26 PM Chesnay Schepler 
> wrote:
>
>> I know that at least the Table API can read json, but I don't know how
>> well this translates into other APIs.
>>
>> On 29/11/2019 12:09, Flavio Pompermaier wrote:
>>
>> Hi to all,
>> is there any out-of-the-box option to read multiline JSON or XML like in
>> Spark?
>> It would be awesome to have something like
>>
>> spark.read.option("multiline", true).json("/path/to/user.json")
>>
>> Best,
>> Flavio
>>
>>
>>


Re: Read multiline JSON/XML

2019-11-29 Thread Arvid Heise
A while ago, I implemented XML and Json input formats. However, having
proper split support for structured formats without sync markers is not
that easy. Any split that has a random start offset needs to figure out the
start of the next record on its own, which is fragile by definition.
That's why supporting jsonl files is much easier; you just need to look for
the next newline. For the same reason, supporting json or xml in Kafka is
fairly straightforward: records are already split.
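
To illustrate why jsonl is the easy case, here is a minimal sketch in plain
Python (not Flink code) of a reader that is handed an arbitrary byte range
of a newline-delimited file. The function name read_jsonl_split and the
convention that a record belongs to the split containing its first byte are
assumptions made for this example.

```python
import io
import json


def read_jsonl_split(stream, start, end):
    """Yield the records whose first byte lies in [start, end).

    `stream` is a seekable binary stream of newline-delimited JSON.
    """
    if start > 0:
        # Step back one byte and discard through the next newline. If the
        # split begins exactly on a record boundary this consumes only the
        # preceding newline; otherwise it skips the partial record, which
        # belongs to the previous split.
        stream.seek(start - 1)
        stream.readline()
    else:
        stream.seek(0)
    while stream.tell() < end:
        line = stream.readline()
        if not line:
            break  # end of file
        if line.strip():
            # A record that starts before `end` is read in full, even if
            # it runs past the end of the split.
            yield json.loads(line)
```

Reading the same file as two adjacent ranges, for example
read_jsonl_split(stream, 0, cut) followed by read_jsonl_split(stream, cut, size),
yields every record exactly once for any cut. Multi-line JSON or XML has no
such cheap marker: a newline, brace, or tag at a random offset may sit inside
a string or a nested element, which is the fragility described above.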

It would be easier to support XML and Json if we can get rid of splits. @Flavio
would you expect to get inner file parallelism or would you be fine with
processing only the files in parallel?

Best,

Arvid

On Fri, Nov 29, 2019 at 3:26 PM Chesnay Schepler  wrote:

> I know that at least the Table API can read json, but I don't know how
> well this translates into other APIs.
>
> On 29/11/2019 12:09, Flavio Pompermaier wrote:
>
> Hi to all,
> is there any out-of-the-box option to read multiline JSON or XML like in
> Spark?
> It would be awesome to have something like
>
> spark.read.option("multiline", true).json("/path/to/user.json")
>
> Best,
> Flavio
>
>
>


Re: Read multiline JSON/XML

2019-11-29 Thread Chesnay Schepler
I know that at least the Table API can read json, but I don't know how
well this translates into other APIs.


On 29/11/2019 12:09, Flavio Pompermaier wrote:

Hi to all,
is there any out-of-the-box option to read multiline JSON or XML like 
in Spark?

It would be awesome to have something like

spark.read.option("multiline", true).json("/path/to/user.json")

Best,
Flavio





Re: Read multiline JSON/XML

2019-11-29 Thread Suneel Marthi
For XML, you could look at Mahout's XMLInputFormat (if you are using
HadoopInputFormat).

On Fri, Nov 29, 2019 at 9:01 AM Chesnay Schepler  wrote:

> Why vino?
>
> He's specifically asking whether Flink offers something _like_ spark.
>
> On 29/11/2019 14:39, vino yang wrote:
>
> Hi Flavio,
>
> IMO, it would be more effective to ask this question on the Spark user
> mailing list.
>
> WDYT?
>
> Best,
> Vino
>
> On Fri, Nov 29, 2019 at 7:09 PM, Flavio Pompermaier wrote:
>
>> Hi to all,
>> is there any out-of-the-box option to read multiline JSON or XML like in
>> Spark?
>> It would be awesome to have something like
>>
>> spark.read.option("multiline", true).json("/path/to/user.json")
>>
>> Best,
>> Flavio
>>
>
>


Re: Read multiline JSON/XML

2019-11-29 Thread Chesnay Schepler

Why vino?

He's specifically asking whether Flink offers something _like_ spark.

On 29/11/2019 14:39, vino yang wrote:

Hi Flavio,

IMO, it would be more effective to ask this question on the Spark user
mailing list.


WDYT?

Best,
Vino

On Fri, Nov 29, 2019 at 7:09 PM, Flavio Pompermaier wrote:


Hi to all,
is there any out-of-the-box option to read multiline JSON or XML
like in Spark?
It would be awesome to have something like

spark.read.option("multiline", true).json("/path/to/user.json")

Best,
Flavio





Re: Read multiline JSON/XML

2019-11-29 Thread vino yang
Hi Flavio,

IMO, it would be more effective to ask this question on the Spark user
mailing list.

WDYT?

Best,
Vino

On Fri, Nov 29, 2019 at 7:09 PM, Flavio Pompermaier wrote:

> Hi to all,
> is there any out-of-the-box option to read multiline JSON or XML like in
> Spark?
> It would be awesome to have something like
>
> spark.read.option("multiline", true).json("/path/to/user.json")
>
> Best,
> Flavio
>


Read multiline JSON/XML

2019-11-29 Thread Flavio Pompermaier
Hi to all,
is there any out-of-the-box option to read multiline JSON or XML like in
Spark?
It would be awesome to have something like

spark.read.option("multiline", true).json("/path/to/user.json")

Best,
Flavio