What do you want to do with your matrix data? How do you want to use it? Do you need random reads/writes or point queries? Do you need to fetch a whole row/record, or many columns at a time? If so, HBase is a good choice for you. Parquet is a good storage format for large scans and aggregations over a limited number of specific columns, i.e. analytical workloads.
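For illustration, here is a minimal sketch of the two access patterns in Scala (the Parquet path, HBase table name, row key, column family "d" and column names are all made up for this example; adjust them to your own layout):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.sql.SparkSession

    // Parquet + Spark: analytical scan/aggregation over a few of the ~600K columns.
    // Parquet is columnar, so only the selected columns are read from disk.
    val spark = SparkSession.builder().appName("wide-matrix").getOrCreate()
    val df = spark.read.parquet("hdfs:///data/matrix.parquet")   // made-up path
    df.select("col_00001", "col_00042", "col_31337")             // made-up column names
      .agg(Map("col_00001" -> "avg", "col_00042" -> "max"))
      .show()

    // HBase: point query -- fetch a few qualifiers from one very wide row.
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("matrix"))       // made-up table name
    val get   = new Get(Bytes.toBytes("row-12345"))              // made-up row key
    get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("col_00001"))
    get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("col_00042"))
    val result = table.get(get)
    val v = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("col_00001")))
    table.close(); conn.close()

The first pattern plays to Parquet's column pruning; the second plays to HBase's cheap random access to a handful of cells in an arbitrarily wide row.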
Jerry

On Thu, Jan 21, 2016 at 3:25 PM, Ted Yu <[email protected]> wrote:

> I have very limited knowledge on Parquet, so I can only answer from HBase
> point of view.
>
> Please see recent thread on number of columns in a row in HBase:
>
> http://search-hadoop.com/m/YGbb3NN3v1jeL1f
>
> There're a few Spark hbase connectors.
> See this thread:
>
> http://search-hadoop.com/m/q3RTt4cp9Z4p37s
>
> Sorry I cannot answer performance comparison question.
>
> Cheers
>
> On Thu, Jan 21, 2016 at 2:43 PM, Krishna <[email protected]> wrote:
>
> > We are evaluating Parquet and HBase for storing a dense & very, very wide
> > matrix (can have more than 600K columns).
> >
> > I've following questions:
> >
> >    - Is there is a limit on # of columns in Parquet or HFile? We expect to
> >      query [10-100] columns at a time using Spark - what are the performance
> >      implications in this scenario?
> >    - HBase can support millions of columns - anyone with prior experience
> >      that compares Parquet vs HFile performance for wide structured tables?
> >    - We want a schema-less solution since the matrix can get wider over a
> >      period of time
> >    - Is there a way to generate wide structured schema-less Parquet files
> >      using map-reduce (input files are in custom binary format)?
> >
> > What other solutions other than Parquet & HBase are useful for this
> > use-case?
