Re: Should I convert json into parquet?
For general data access of the pre-computed aggregates (group by) you're better off with Parquet. I'd only choose JSON if I needed interop with another app stack / language that has difficulty accessing Parquet (e.g. bulk load into a document DB…). On a strategic level, JSON and Parquet are similar in that neither gives you good random access, so you can't simply "update specific user IDs as new data comes in". Your strategy will probably be to re-process all the users by loading the new data and the current aggregates, joining, and writing a new version of the aggregates… If you're worried about update performance then you probably need to look at a DB that offers random write access (Cassandra, HBase…).

-adrian
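A rough sketch of that daily rebuild in Spark/Scala, assuming line-delimited JSON events with userId and event fields, aggregates stored as userId plus an events array, and a Spark 2.x-style DataFrame API; the paths and column names here are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("daily-aggregate-rebuild").getOrCreate()
import spark.implicits._

// New events for the day, one JSON object per line: {"userId": ..., "event": ...}
val newEvents = spark.read.json("/data/events/2015-10-19")

// Current aggregates: one row per user, with the events collected into an array column.
val current = spark.read.parquet("/data/aggregates/current")

// Flatten the current aggregates back into individual events, add the new ones,
// and regroup by user -- i.e. re-process all users rather than updating in place.
val currentEvents = current.select($"userId", explode($"events").as("event"))
val merged = currentEvents
  .union(newEvents.select($"userId", $"event"))
  .groupBy("userId")
  .agg(collect_list("event").as("events"))

// Write a new version of the aggregates, then point readers at it once the job succeeds.
merged.write.parquet("/data/aggregates/2015-10-19")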
RE: Should I convert json into parquet?
As Jörn says, Parquet and ORC will get you really good compression and can be much faster. There are also some nice additions around predicate pushdown, which can be great if you've got wide tables.

Parquet is obviously easier to use, since it's bundled into Spark. Using ORC is described here: http://hortonworks.com/blog/bringing-orc-support-into-apache-spark/

Thanks,
Ewan
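For illustration, a minimal sketch of the kind of query that benefits (column names and the path are hypothetical): with Parquet, the select prunes the scan of a wide table down to two columns, and the filter can be checked against row-group min/max statistics so non-matching row groups are skipped.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pushdown-example").getOrCreate()
import spark.implicits._

// A wide table with many columns; only two are touched below.
val events = spark.read.parquet("/data/events_parquet")

// Column pruning: only userId and ts are read from disk.
// Predicate pushdown: the ts filter is evaluated against Parquet row-group statistics.
val recent = events
  .select($"userId", $"ts")
  .filter($"ts" > "2015-10-01")

recent.show()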
Re: Should I convert json into parquet?
Good formats are Parquet or ORC. Both can be useful with compression, such as Snappy. They are much faster than JSON. However, the table structure is up to you and depends on your use case.
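As a sketch of what that looks like when writing from Spark (the compression option is the DataFrameWriter option in newer Spark releases; on 1.x the equivalent is the spark.sql.parquet.compression.codec setting, ORC may need Hive support as per the Hortonworks post linked above, and the paths are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("write-compressed").getOrCreate()

val df = spark.read.json("/data/events")

// Parquet with Snappy compression (Snappy is also the default codec in recent Spark).
df.write.option("compression", "snappy").parquet("/data/events_parquet")

// The same data as ORC, also Snappy-compressed.
df.write.option("compression", "snappy").format("orc").save("/data/events_orc")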
Should I convert json into parquet?
I have JSON files which contain timestamped events. Each event is associated with a user id.

Now I want to group by user id, i.e. convert from

Event1 -> UserIDA;
Event2 -> UserIDA;
Event3 -> UserIDB;

to intermediate storage:

UserIDA -> (Event1, Event2...)
UserIDB -> (Event3...)

Then I will label positives and featurize the event vectors in many different ways, and fit each of them with logistic regression.

I want to save the intermediate storage permanently since it will be used many times. And there will be new events coming every day, so I need to update this intermediate storage every day.

Right now I store the intermediate data as JSON files. Should I use Parquet instead? Or are there better solutions for this use case?

Thanks a lot!
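By way of illustration, the initial build of that intermediate storage might look like the following sketch, assuming line-delimited JSON with userId and event fields (the paths and column names are placeholders; the daily merge of new events into it is sketched after Adrian's reply above):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("build-user-aggregates").getOrCreate()

// Raw events: Event -> UserID, one JSON object per line.
val events = spark.read.json("/data/events")

// Regroup as UserID -> (Event1, Event2, ...) and persist as Parquet for reuse.
val byUser = events.groupBy("userId").agg(collect_list("event").as("events"))
byUser.write.parquet("/data/aggregates/current")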