Hello,

If you want to use Lucene... why not use Lucene itself, or one of the fancy search servers built on top of it: Solr(Cloud), ElasticSearch, or SenseiDB? You can easily shard the index by time, look up documents by key, and run full-text searches with results sorted either by some key value or by relevance to the query.
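To make "shard the index by time" concrete, here is a minimal sketch of how a client might route a message to an hourly shard and work out which shards a time-range query has to touch. The `messages-` prefix, the one-shard-per-UTC-hour layout, and both function names are assumptions made up for illustration; none of them come from Solr, ElasticSearch, or SenseiDB.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical naming scheme: one index shard per UTC hour.
SHARD_PREFIX = "messages-"


def shard_for(ts: datetime) -> str:
    """Return the name of the hourly shard that would hold a message sent at ts.

    ts should be timezone-aware; it is normalized to UTC before formatting.
    """
    return SHARD_PREFIX + ts.astimezone(timezone.utc).strftime("%Y%m%d%H")


def shards_between(start: datetime, end: datetime) -> list:
    """List every hourly shard overlapping the half-open range [start, end)."""
    if end <= start:
        return []
    # Snap the cursor back to the top of the hour containing `start`.
    cursor = start.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    names = []
    while cursor < end:
        names.append(shard_for(cursor))
        cursor += timedelta(hours=1)
    return names
```

A query for "one hour of data" then touches one or two shards, while a lookup by key needs either the message timestamp or a fan-out across shards.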
Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html

On Wed, Dec 5, 2012 at 10:28 PM, tgh <[email protected]> wrote:

> Thank you for your reply.
>
> I want to access the data through the Lucene search engine, that is,
> use a key to retrieve any message. I also want to fetch one hour of
> data together, so I thought of splitting the data into one table per
> hour. Or, if I can store it all in one big table, is that better than
> storing it in 365 tables or in 365*24 tables? Which option is best
> for my data access pattern? I am also confused about how to build a
> secondary index in HBase if I use a keyword search engine such as
> Lucene or another one.
>
> Could you help me?
> Thank you
>
> -------------
> Tian Guanhua
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Ian Varley
> Sent: December 6, 2012 11:01
> To: [email protected]
> Subject: Re: how to store 100billion short text messages with hbase
>
> Tian,
>
> The best way to think about how to structure your data in HBase is to
> ask the question: "How will I access it?". Perhaps you could reply
> with the sorts of queries you expect to be able to do over this data?
> For example, retrieve any single conversation between two people in
> < 10 ms; or show all conversations that happened in a single hour,
> regardless of participants. HBase only gives you fast GET/SCAN access
> along a single "primary" key (the row key), so you must choose it
> carefully, or else duplicate & denormalize your data for fast access.
>
> Your data size seems reasonable (but not overwhelming) for HBase.
> 100B messages x 1K bytes per message on average comes out to 100TB.
> That, plus 3x replication in HDFS, means you need roughly 300TB of
> space. If you have 13 nodes (taking out 2 for redundant master
> services), that's a requirement for about 23TB of space per server.
> That's a lot, even these days. Did I get all that math right?
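For anyone who wants to re-run the sizing estimate quoted above, the arithmetic can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the figures from the thread (100 billion messages, about 1 KB each, 3x HDFS replication, 15 machines with 2 set aside for master services); it assumes decimal terabytes and ignores HBase's per-cell storage overhead and any compression, both of which would change the real number.

```python
TB = 10**12  # decimal terabyte, in bytes (an assumption; the thread does not say)

messages = 100 * 10**9   # 100 billion messages
avg_bytes = 1_000        # roughly 1 KB per message on average
replication = 3          # HDFS replication factor used in the estimate
data_nodes = 15 - 2      # 15 machines minus 2 reserved for master services

raw_tb = messages * avg_bytes / TB
replicated_tb = raw_tb * replication
per_node_tb = replicated_tb / data_nodes

print(f"raw: {raw_tb:.0f} TB, replicated: {replicated_tb:.0f} TB, "
      f"per node: {per_node_tb:.1f} TB")
```

The result agrees with the figures in the message: 100 TB raw, 300 TB replicated, and roughly 23 TB per data node.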
>
> On your question about multiple tables: a table in HBase is only a
> namespace for row keys and a container for a set of regions. If it's
> a homogeneous data set, there's no advantage to breaking the table
> into multiple tables; that's what regions within the table are for.
>
> Ian
>
> ps - Please don't cross-post to both dev@ and user@.
>
> On Dec 5, 2012, at 8:51 PM, tgh wrote:
>
> > Hi
> >
> > I am trying to use HBase to store 100 billion short text messages.
> > Each message has fewer than 1000 characters plus some other items;
> > that is, each message has fewer than 10 items.
> >
> > The whole data set is a stream covering about one year, and I want
> > to create multiple tables to store it. I have two ideas: one is to
> > store each hour of data in its own table, which gives 365*24 tables
> > for one year; the other is to store each day of data in its own
> > table, which gives 365 tables for one year.
> >
> > I have about 15 computer nodes to handle this data, and I want to
> > know how to deal with it: 365*24 tables, 365 tables, or some other,
> > better idea.
> >
> > I am really confused about HBase. It is powerful, yet a bit complex
> > for me. Could you give me some advice on the HBase data schema and
> > related matters?
> >
> > Could you help me?
> >
> > Thank you
> > ---------------------------------
> > Tian Guanhua
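As a rough illustration of the "one table, well-chosen row key" advice in the thread, the sketch below builds a composite row key that leads with the hour and then simulates a prefix scan over a sorted key list. It is plain Python and does not talk to HBase; the `<hour>|<message id>` layout and the `row_key` and `prefix_scan` helpers are hypothetical names chosen for this example. The only HBase property it relies on is that rows are stored sorted by row key bytes.

```python
from bisect import bisect_left
from datetime import datetime, timezone


def row_key(ts: datetime, message_id: str) -> bytes:
    """Hypothetical key layout: <UTC hour as YYYYMMDDHH>|<message id>."""
    hour = ts.astimezone(timezone.utc).strftime("%Y%m%d%H")
    return f"{hour}|{message_id}".encode("ascii")


def prefix_scan(sorted_keys: list, prefix: bytes) -> list:
    """Return every key that starts with prefix from a sorted key list.

    This mimics a range scan over a lexicographically sorted key space.
    """
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches


# Three messages, two of them in the same hour.
keys = sorted([
    row_key(datetime(2012, 12, 5, 22, 28, tzinfo=timezone.utc), "m1"),
    row_key(datetime(2012, 12, 5, 22, 59, tzinfo=timezone.utc), "m2"),
    row_key(datetime(2012, 12, 5, 23, 1, tzinfo=timezone.utc), "m3"),
])
one_hour = prefix_scan(keys, b"2012120522|")
print(one_hour)
```

With this layout a single table serves the "one hour of data together" query as one contiguous scan, so hourly or daily tables are not needed for that purpose. One caveat to weigh: a key that leads with the current time tends to send all new writes to the same region, so a salted or hashed prefix may be worth considering, at the cost of fanning each hourly scan out across the salt buckets.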
