Thanks Ted, Jerry. Computing pairwise similarity is the primary purpose of the matrix. This is done by extracting all rows for a set of columns at each iteration.
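To make the access pattern concrete, here is a rough sketch of one iteration: pull every row's values for a chosen set of columns, then compute the similarity of each column pair in that set. This is only an illustration. The helper names (`extract_columns`, `pairwise_similarity`), the in-memory list-of-rows layout, and the choice of cosine similarity are all assumptions made for the example, not a description of our actual pipeline or of any HBase/Parquet API.

```python
import math
from itertools import combinations


def extract_columns(matrix, cols):
    """Return {column index: values from every row} for the requested columns."""
    return {c: [row[c] for row in matrix] for c in cols}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_similarity(matrix, cols):
    """One iteration: read all rows for `cols`, then compare every column pair."""
    block = extract_columns(matrix, cols)
    return {(i, j): cosine(block[i], block[j]) for i, j in combinations(cols, 2)}


# Toy dense matrix: 3 rows x 4 columns (the real one has 600K+ columns).
matrix = [
    [1.0, 2.0, 0.0, 1.0],
    [0.0, 4.0, 1.0, 0.0],
    [2.0, 6.0, 0.0, 2.0],
]

# Hypothetical iteration over columns 0, 1 and 3.
sims = pairwise_similarity(matrix, [0, 1, 3])
for pair, value in sorted(sims.items()):
    print(pair, round(value, 4))
```

The point is that each iteration touches a small number of columns but every row, which is the read pattern the storage format has to serve well.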

On Thursday, January 21, 2016, Jerry He <[email protected]> wrote:

> What do you want to do with your matrix data? How do you want to use it?
> Do you need random read/write or point query? Do you need to get the
> row/record or many, many columns at a time?
> If yes, HBase is a good choice for you.
> Parquet is good as a storage format for large scans and aggregations on
> a limited number of specific columns. Analytical type of work.
>
> Jerry
>
> On Thu, Jan 21, 2016 at 3:25 PM, Ted Yu <[email protected]> wrote:
>
> > I have very limited knowledge of Parquet, so I can only answer from the
> > HBase point of view.
> >
> > Please see the recent thread on the number of columns in a row in HBase:
> >
> > http://search-hadoop.com/m/YGbb3NN3v1jeL1f
> >
> > There are a few Spark HBase connectors.
> > See this thread:
> >
> > http://search-hadoop.com/m/q3RTt4cp9Z4p37s
> >
> > Sorry, I cannot answer the performance comparison question.
> >
> > Cheers
> >
> > On Thu, Jan 21, 2016 at 2:43 PM, Krishna <[email protected]> wrote:
> >
> > > We are evaluating Parquet and HBase for storing a dense & very, very
> > > wide matrix (can have more than 600K columns).
> > >
> > > I have the following questions:
> > >
> > > - Is there a limit on # of columns in Parquet or HFile? We expect to
> > >   query [10-100] columns at a time using Spark - what are the
> > >   performance implications in this scenario?
> > > - HBase can support millions of columns - anyone with prior experience
> > >   that compares Parquet vs HFile performance for wide structured tables?
> > > - We want a schema-less solution since the matrix can get wider over a
> > >   period of time
> > > - Is there a way to generate wide structured schema-less Parquet files
> > >   using map-reduce (input files are in custom binary format)?
> > >
> > > What solutions other than Parquet & HBase are useful for this
> > > use-case?
