Hi,

If your SQL-on-Hadoop solution supports secondary indexes, you can simply create them on the most frequently queried columns to speed up query time.
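For example, if your engine happens to be Apache Phoenix (one SQL-on-Hadoop option that supports secondary indexes), creating a covered index on a hot column over JDBC could look roughly like the sketch below. The table name, column names, and ZooKeeper host are made-up placeholders, not anything from your setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateSecondaryIndex {
    public static void main(String[] args) throws Exception {
        // Connect through the Phoenix JDBC driver (the ZooKeeper quorum host is a placeholder).
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             Statement stmt = conn.createStatement()) {
            // Build a covered secondary index on a frequently filtered column,
            // including the columns the web application usually selects.
            stmt.execute(
                "CREATE INDEX IF NOT EXISTS idx_events_country "
              + "ON events (country) INCLUDE (user_id, amount)");
        }
    }
}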
Dave

-----Original Message-----
From: Bin Wang [mailto:[email protected]]
Sent: Wednesday, May 11, 2016 2:22 PM
To: [email protected]
Subject: Able to search by all the columns and faster than impala

Hi there,

I have a use case where I have a table in the low billions of rows with fewer than 50 columns. It is a very popular data source, and there is huge demand internally to query it, with nothing more complex than "select * from ... where ... and ...". However, not everyone is able to write SQL, so we are thinking about building this into a web application. We usually use Hive and Impala, and neither of them is fast enough to be the backend of a web application (think about what happens if you click a button and the result comes back minutes later... annoying!).

We tried dumping the data into SolrCloud, and it worked really well when the number of returned results is small, but the performance is really bad once the row count reaches even a few thousand. We also tried HBase, where it is really hard to design a schema, since a full table scan takes a lot of time anyway.

Then I had an idea: we can write a MapReduce job to turn the data into a melted <http://www.statmethods.net/management/reshape.html> format, group the records by column_name and cell_value, and build a list of the row-ids where each value appears, sort of like the inverted index in Lucene, so we end up with two tables in HBase. When a user issues a query with, say, three WHERE conditions, we first go to the inverted index to find the row-ids that match each condition, then intersect those three lists. That gives us the row-ids that meet all three criteria. Finally, we look up those row-ids in the other HBase table to retrieve the full rows.

Can anyone shed some light on this?

Best regards,
Bin
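A minimal sketch of the build step described above: melt each wide row into (column, value, row-id) triples, then group by (column, value) into a posting list of row-ids. It assumes CSV input whose first field is the row id; the field layout and names are illustrative assumptions, not the actual schema.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BuildInvertedIndex {

    // Mapper: "melt" one wide CSV row into (column=value, rowId) pairs.
    public static class MeltMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            String rowId = fields[0];                      // assumption: field 0 is the row id
            for (int i = 1; i < fields.length; i++) {
                // key = "<column_name>=<cell_value>", value = a row id containing that cell
                ctx.write(new Text("col" + i + "=" + fields[i]), new Text(rowId));
            }
        }
    }

    // Reducer: collect every row id seen for one (column, value) key into a posting list.
    public static class PostingListReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> rowIds, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder postingList = new StringBuilder();
            for (Text id : rowIds) {
                if (postingList.length() > 0) {
                    postingList.append(',');
                }
                postingList.append(id);
            }
            // The output (key, comma-separated row ids) can then be loaded into the index table in HBase.
            ctx.write(key, new Text(postingList.toString()));
        }
    }
}

And the read path (look up each predicate's posting list in the inverted-index table, intersect the lists, then batch-GET the surviving row ids from the main table) might look roughly like this with the standard HBase Java client. The table names, the column family "f", the qualifier "row_ids", and the comma-separated id encoding are assumptions for illustration:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class InvertedIndexLookup {

    private static final byte[] CF = Bytes.toBytes("f");            // column family (assumed)
    private static final byte[] ROW_IDS = Bytes.toBytes("row_ids"); // qualifier holding the id list (assumed)

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table index = conn.getTable(TableName.valueOf("inverted_index"));
             Table data = conn.getTable(TableName.valueOf("wide_table"))) {

            // Three equality predicates coming from the web form (illustrative values).
            String[] predicates = {"country=US", "status=ACTIVE", "year=2016"};

            // 1. For each predicate, fetch its posting list from the inverted-index table.
            //    Index row key = "<column_name>=<cell_value>", value = comma-separated row ids.
            Set<String> matching = null;
            for (String predicate : predicates) {
                Result r = index.get(new Get(Bytes.toBytes(predicate)));
                byte[] raw = r.getValue(CF, ROW_IDS);
                Set<String> ids = raw == null
                        ? new LinkedHashSet<String>()
                        : new LinkedHashSet<>(Arrays.asList(Bytes.toString(raw).split(",")));

                // 2. Intersect the posting lists: only ids present for every predicate survive.
                if (matching == null) {
                    matching = ids;
                } else {
                    matching.retainAll(ids);
                }
                if (matching.isEmpty()) {
                    break; // no row can satisfy all predicates
                }
            }

            // 3. Batch-GET the surviving row ids from the main table to return the full rows.
            List<Get> gets = new ArrayList<>();
            for (String rowId : matching) {
                gets.add(new Get(Bytes.toBytes(rowId)));
            }
            Result[] rows = data.get(gets);
            System.out.println("Fetched " + rows.length + " matching rows");
        }
    }
}

One practical caveat with this layout: a very common value can produce a posting list too large for a single HBase cell, so hot values may need their lists split across several columns or rows, and the intersection is cheapest if it starts from the shortest list.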
