What do you want to do with your matrix data? How do you want to use it? Do you need random reads/writes or point queries? Do you need to fetch a whole row/record, or many columns at a time? If so, HBase is a good choice for you. Parquet is a good storage format for large scans and aggregations over a limited number of specific columns, i.e. analytical workloads.
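For illustration, here is a minimal sketch of the two access patterns in Scala (the Parquet path, HBase table name, row key, column family "d" and column names are all made up for this example; adjust them to your own layout):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.sql.SparkSession

    // Parquet + Spark: analytical scan/aggregation over a few of the ~600K columns.
    // Parquet is columnar, so only the selected columns are read from disk.
    val spark = SparkSession.builder().appName("wide-matrix").getOrCreate()
    val df = spark.read.parquet("hdfs:///data/matrix.parquet")   // made-up path
    df.select("col_00001", "col_00042", "col_31337")             // made-up column names
      .agg(Map("col_00001" -> "avg", "col_00042" -> "max"))
      .show()

    // HBase: point query -- fetch a few qualifiers from one very wide row.
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("matrix"))       // made-up table name
    val get   = new Get(Bytes.toBytes("row-12345"))              // made-up row key
    get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("col_00001"))
    get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("col_00042"))
    val result = table.get(get)
    val v = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("col_00001")))
    table.close(); conn.close()

The first pattern plays to Parquet's column pruning; the second plays to HBase's cheap random access to a handful of cells in an arbitrarily wide row.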
Jerry

On Thu, Jan 21, 2016 at 3:25 PM, Ted Yu <[email protected]> wrote:

> I have very limited knowledge on Parquet, so I can only answer from HBase
> point of view.
>
> Please see recent thread on number of columns in a row in HBase:
>
> http://search-hadoop.com/m/YGbb3NN3v1jeL1f
>
> There're a few Spark hbase connectors.
> See this thread:
>
> http://search-hadoop.com/m/q3RTt4cp9Z4p37s
>
> Sorry I cannot answer performance comparison question.
>
> Cheers
>
> On Thu, Jan 21, 2016 at 2:43 PM, Krishna <[email protected]> wrote:
>
> > We are evaluating Parquet and HBase for storing a dense & very, very wide
> > matrix (can have more than 600K columns).
> >
> > I've following questions:
> >
> >    - Is there is a limit on # of columns in Parquet or HFile? We expect to
> >      query [10-100] columns at a time using Spark - what are the performance
> >      implications in this scenario?
> >    - HBase can support millions of columns - anyone with prior experience
> >      that compares Parquet vs HFile performance for wide structured tables?
> >    - We want a schema-less solution since the matrix can get wider over a
> >      period of time
> >    - Is there a way to generate wide structured schema-less Parquet files
> >      using map-reduce (input files are in custom binary format)?
> >
> > What other solutions other than Parquet & HBase are useful for this
> > use-case?
