Re: HFile vs Parquet for very wide table

Krishna Fri, 22 Jan 2016 10:05:50 -0800

Thanks Ted, Jerry.

Computing pairwise similarity is the primary purpose of the matrix. This is
done by extracting all rows for a set of columns at each iteration.
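
A rough sketch of that access pattern, in case it helps frame the question. It is pure Python on a tiny in-memory matrix. The cosine measure and the helper names (`column_block`, `cosine`, `pairwise_similarity`) are assumptions for illustration only, since the actual similarity function and storage layer are not settled here:

```python
import math
from itertools import combinations

def column_block(matrix, col_ids):
    """Extract all rows for the given column ids (one iteration's read)."""
    return {c: [row[c] for row in matrix] for c in col_ids}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pairwise_similarity(matrix, col_ids):
    """Similarity for every unordered pair of columns in the block."""
    block = column_block(matrix, col_ids)
    return {(i, j): cosine(block[i], block[j])
            for i, j in combinations(col_ids, 2)}

# Toy dense matrix: 3 rows, 4 columns (the real one has 600K+ columns).
matrix = [
    [1.0, 2.0, 0.0, 1.0],
    [2.0, 4.0, 1.0, 0.0],
    [3.0, 6.0, 0.0, 1.0],
]

# One iteration: read columns 0, 1 and 2 across all rows, then compare them.
sims = pairwise_similarity(matrix, [0, 1, 2])
```

Each iteration touches every row but only a small set of columns, so the cost of the column extraction step is what matters most when choosing the storage format.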


On Thursday, January 21, 2016, Jerry He <[email protected]> wrote:

> What do you want to do with your matrix data? How do you want to use it?
> Do you need random read/write or point queries? Do you need to fetch a
> whole row/record, or many columns, at a time?
> If yes, HBase is a good choice for you.
> Parquet is good as a storage format for large scans and aggregations over
> a limited number of specific columns, i.e. analytical work.
>
> Jerry
>
>
>
>
> On Thu, Jan 21, 2016 at 3:25 PM, Ted Yu <[email protected]> wrote:
>
> > I have very limited knowledge of Parquet, so I can only answer from the
> > HBase point of view.
> >
> > Please see recent thread on number of columns in a row in HBase:
> >
> > http://search-hadoop.com/m/YGbb3NN3v1jeL1f
> >
> > There are a few Spark HBase connectors.
> > See this thread:
> >
> > http://search-hadoop.com/m/q3RTt4cp9Z4p37s
> >
> > Sorry, I cannot answer the performance comparison question.
> >
> > Cheers
> >
> > On Thu, Jan 21, 2016 at 2:43 PM, Krishna <[email protected]> wrote:
> >
> > > We are evaluating Parquet and HBase for storing a dense and very, very
> > > wide matrix (it can have more than 600K columns).
> > >
> > > I have the following questions:
> > >
> > >    - Is there a limit on the number of columns in Parquet or HFile? We
> > >    expect to query 10-100 columns at a time using Spark. What are the
> > >    performance implications in this scenario?
> > >    - HBase can support millions of columns. Does anyone have prior
> > >    experience comparing Parquet vs. HFile performance for wide
> > >    structured tables?
> > >    - We want a schema-less solution, since the matrix can get wider
> > >    over time.
> > >    - Is there a way to generate wide, structured, schema-less Parquet
> > >    files using map-reduce (input files are in a custom binary format)?
> > >
> > > What solutions other than Parquet & HBase are useful for this
> > > use-case?
> > >
> >
>
