Re: Should I convert json into parquet?
For general data access of the pre-computed aggregates (group by) you're better off with Parquet. I'd only choose JSON if I needed interop with another app stack / language that has difficulty accessing Parquet (e.g. bulk load into a document DB…). On a strategic level, JSON and Parquet are similar in that neither gives you good random access, so you can't simply "update specific user IDs as new data comes in". Your strategy will probably be to re-process all the users by loading the new data and the current aggregates, joining, and writing a new version of the aggregates… If you're worried about update performance then you probably need to look at a DB that offers random write access (Cassandra, HBase…).

-adrian
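A rough sketch of that daily rebuild in Spark/Scala, assuming line-delimited JSON events with userId and event fields, aggregates stored as userId plus an events array, and a Spark 2.x-style DataFrame API; the paths and column names here are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("daily-aggregate-rebuild").getOrCreate()
import spark.implicits._

// New events for the day, one JSON object per line: {"userId": ..., "event": ...}
val newEvents = spark.read.json("/data/events/2015-10-19")

// Current aggregates: one row per user, with the events collected into an array column.
val current = spark.read.parquet("/data/aggregates/current")

// Flatten the current aggregates back into individual events, add the new ones,
// and regroup by user -- i.e. re-process all users rather than updating in place.
val currentEvents = current.select($"userId", explode($"events").as("event"))
val merged = currentEvents
  .union(newEvents.select($"userId", $"event"))
  .groupBy("userId")
  .agg(collect_list("event").as("events"))

// Write a new version of the aggregates, then point readers at it once the job succeeds.
merged.write.parquet("/data/aggregates/2015-10-19")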
RE: Should I convert json into parquet?
As Jörn says, Parquet and ORC will get you really good compression and can be much faster. There are also some nice additions around predicate pushdown, which can be great if you've got wide tables.

Parquet is obviously easier to use, since it's bundled into Spark. Using ORC is described here: http://hortonworks.com/blog/bringing-orc-support-into-apache-spark/

Thanks,
Ewan
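For illustration, a minimal sketch of the kind of query that benefits (column names and the path are hypothetical): with Parquet, the select prunes the scan of a wide table down to two columns, and the filter can be checked against row-group min/max statistics so non-matching row groups are skipped.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pushdown-example").getOrCreate()
import spark.implicits._

// A wide table with many columns; only two are touched below.
val events = spark.read.parquet("/data/events_parquet")

// Column pruning: only userId and ts are read from disk.
// Predicate pushdown: the ts filter is evaluated against Parquet row-group statistics.
val recent = events
  .select($"userId", $"ts")
  .filter($"ts" > "2015-10-01")

recent.show()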
Re: Should I convert json into parquet?
Good formats are Parquet or ORC. Both can be useful with compression, such as Snappy. They are much faster than JSON. However, the table structure is up to you and depends on your use case.
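As a sketch of what that looks like when writing from Spark (the compression option is the DataFrameWriter option in newer Spark releases; on 1.x the equivalent is the spark.sql.parquet.compression.codec setting, ORC may need Hive support as per the Hortonworks post linked above, and the paths are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("write-compressed").getOrCreate()

val df = spark.read.json("/data/events")

// Parquet with Snappy compression (Snappy is also the default codec in recent Spark).
df.write.option("compression", "snappy").parquet("/data/events_parquet")

// The same data as ORC, also Snappy-compressed.
df.write.option("compression", "snappy").format("orc").save("/data/events_orc")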
Should I convert json into parquet?
I have JSON files which contain timestamped events. Each event is associated with a user id.

Now I want to group by user id, i.e. convert from

Event1 -> UserIDA;
Event2 -> UserIDA;
Event3 -> UserIDB;

to intermediate storage:

UserIDA -> (Event1, Event2...)
UserIDB -> (Event3...)

Then I will label positives and featurize the event vectors in many different ways, and fit each of them with logistic regression.

I want to save the intermediate storage permanently since it will be used many times. And there will be new events coming every day, so I need to update this intermediate storage every day.

Right now I store the intermediate data as JSON files. Should I use Parquet instead? Or are there better solutions for this use case?

Thanks a lot!
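By way of illustration, the initial build of that intermediate storage might look like the following sketch, assuming line-delimited JSON with userId and event fields (the paths and column names are placeholders; the daily merge of new events into it is sketched after Adrian's reply above):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("build-user-aggregates").getOrCreate()

// Raw events: Event -> UserID, one JSON object per line.
val events = spark.read.json("/data/events")

// Regroup as UserID -> (Event1, Event2, ...) and persist as Parquet for reuse.
val byUser = events.groupBy("userId").agg(collect_list("event").as("events"))
byUser.write.parquet("/data/aggregates/current")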