Re: HBase and Lucene for realtime search

2011-02-13 Thread Thomas Koch
Jason Rutherglen: Hello, I'm curious as to what a 'good' approach would be for implementing search in HBase (using Lucene) with the end goal being the integration of realtime search into HBase. I think the use case makes sense as HBase is realtime and has a write-ahead log, performs

Re: HBase and Lucene for realtime search

2011-02-13 Thread Sean Bigdatafun
On Fri, Feb 11, 2011 at 4:13 PM, Ted Dunning tdunn...@maprtech.com wrote: On Fri, Feb 11, 2011 at 3:50 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: I can't imagine that the speed achieved by using HBase would be even within orders of magnitude of what you can do in Lucene 4

Designing table with auto increment key

2011-02-13 Thread Something Something
Hello, Can you please tell me if this is the proper way of designing a table that's got an auto increment key? If there's a better way please let me know that as well. After reading the mail archives, I learned that the best way is to use the 'incrementColumnValue' method of HTable. So

Re: Using the Hadoop bundled in the lib directory of HBase

2011-02-13 Thread Ryan Rawson
On Sun, Feb 13, 2011 at 8:29 AM, Mike Spreitzer mspre...@us.ibm.com wrote: Yes, I simply took the Hadoop 0.20.2 release, deleted its hadoop-core.jar, and replaced it with the contents of lib/hadoop-core-0.20-append-r1056497.jar from hbase. I'm not sure what to do with this approach might

Re: HBase and Lucene for realtime search

2011-02-13 Thread Jason Rutherglen
Transactional consistency isn't going to happen if you even involve more than one HBase row. What does this mean? Or rather, can you elaborate? What they need is that documents can be found very shortly after they are inserted and that crashes won't compromise that. Right. I think HBase

Re: HBase and Lucene for realtime search

2011-02-13 Thread Jason Rutherglen
Google's Percolator paper. Can you post a link? Another issue is that maybe the scalability needs for search might be different. An HBase region is always only active in one region server; there are no active replicas, while often for search you need replicas to scale, since a search will

HBase bulk load spawn high number of reducer tasks - any workaround

2011-02-13 Thread manobal
HBase bulk load (using the configureIncrementalLoad helper method) configures the job to create as many reducer tasks as there are regions in the HBase table. So if there are a few hundred regions, the job will spawn a few hundred reducer tasks. This can get very slow on a small cluster. Is there any
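
For readers of the archive, a rough illustration of why the reducer count tracks the region count: the bulk-load job partitions row keys by region start key, so each reducer produces output for exactly one region. The sketch below is a plain-Python simulation, not HBase code; `partition_for`, the start keys, and the sample rows are all made up for illustration.

```python
from bisect import bisect_right

def partition_for(row_key, region_start_keys):
    """Return the index of the region (and thus the reducer) owning row_key.

    region_start_keys must be sorted, and the first region starts at b"".
    """
    return bisect_right(region_start_keys, row_key) - 1

# Hypothetical table with four regions.
region_start_keys = [b"", b"g", b"n", b"t"]
num_reducers = len(region_start_keys)  # one reducer per region

rows = [b"apple", b"grape", b"melon", b"nut", b"zebra"]
assignment = {row: partition_for(row, region_start_keys) for row in rows}

for row, reducer in assignment.items():
    print(row.decode(), "-> reducer", reducer)
```

With a few hundred regions the same mapping yields a few hundred partitions, which is the behaviour the question describes.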

Re: HBase and Lucene for realtime search

2011-02-13 Thread Jason Rutherglen
Do you want to do Term- or Document partitioning? It sounds like no one uses term partitioning, doc-partitioning seems to be the most logical default? serve the index shards from memory In Lucene-land this is a function of allocating enough RAM for the system IO cache. On Sun, Feb 13, 2011 at

Re: HBase and Lucene for realtime search

2011-02-13 Thread Jason Rutherglen
I think there's another way to look at this, and that is what types of queries do HBase users perform that search can enhance? E.g., given we can index extremely quickly with Lucene and with RT we can search with near-zero latency, perhaps there are new queries that would be of interest/useful to

Re: Designing table with auto increment key

2011-02-13 Thread Ryan Rawson
You can also stripe, e.g.:
c_1 starts at 1, skip=100
c_2 starts at 2, skip=100
c_$i starts at $i, skip=100, for 3..99
Now you have 100x speed/parallelism. If single regionserver assignment becomes a problem, use multiple tables. On Sun, Feb 13, 2011 at 10:12 PM, Lars George lars.geo...@gmail.com
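
A minimal in-memory sketch of the striping idea, for illustration only. It assumes stripes numbered 1 to 100 (the message lists c_1 through c_99), and `StripedCounter` and `next_id` are hypothetical names. In HBase each stripe would presumably be its own counter cell, bumped by the skip amount with `incrementColumnValue`; here `itertools.count` stands in for that cell.

```python
import itertools

class StripedCounter:
    """In-memory stand-in for N independent counter cells (c_1 .. c_N)."""

    def __init__(self, stripes=100):
        self.stripes = stripes
        # Stripe i hands out i, i + stripes, i + 2 * stripes, ...
        self._cells = {
            i: itertools.count(i, stripes) for i in range(1, stripes + 1)
        }

    def next_id(self, stripe):
        """Return the next id from the given stripe."""
        return next(self._cells[stripe])

counter = StripedCounter(stripes=100)
first = [counter.next_id(1) for _ in range(3)]   # ids drawn from c_1
second = [counter.next_id(2) for _ in range(3)]  # ids drawn from c_2
print(first, second)
```

Because every stripe stays in its own residue class modulo the skip, clients using different stripes never hand out the same id and never contend on the same cell. The trade-off is that ids are no longer issued in strictly increasing order across stripes.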

Re: HBase and Lucene for realtime search

2011-02-13 Thread Ted Dunning
Doc-partitioning has much better failure modes and is universal in my experience for serious applications. On Sun, Feb 13, 2011 at 6:01 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Do you want to do Term- or Document partitioning? It sounds like no one uses term partitioning,
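
To make the distinction concrete, here is a toy sketch of document partitioning: each shard holds a complete inverted index for its own subset of documents, and a query fans out to every shard. All names (`Shard`, `shard_for`, `search_all`) and the sample documents are invented for illustration; this is not Lucene or HBase code. The `down` parameter simulates a failed shard, which is one plausible reading of the "failure modes" remark: losing a shard drops some documents from the results instead of making whole terms unsearchable.

```python
from collections import defaultdict

class Shard:
    """A tiny inverted index over the documents assigned to this shard."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        return set(self.postings.get(term.lower(), ()))

def shard_for(doc_id, num_shards):
    """Assign a document to a shard by id (hypothetical scheme)."""
    return doc_id % num_shards

def search_all(shards, term, down=()):
    """Scatter the query to every live shard and merge the hits."""
    hits = set()
    for i, shard in enumerate(shards):
        if i in down:
            continue  # simulate an unavailable shard
        hits |= shard.search(term)
    return hits

shards = [Shard() for _ in range(3)]
docs = {
    0: "hbase realtime search",
    1: "lucene search",
    2: "hbase regions",
    3: "realtime lucene search",
}
for doc_id, text in docs.items():
    shards[shard_for(doc_id, len(shards))].add(doc_id, text)

all_hits = search_all(shards, "search")
degraded = search_all(shards, "search", down={1})
print(sorted(all_hits), sorted(degraded))
```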

Re: HBase and Lucene for realtime search

2011-02-13 Thread Ted Dunning
I would avoid this, personally. Serious transactions and complex queries are pretty much incompatible with simple implementation and large scale. Flow-based updates and write-behind are more the norm. On Sun, Feb 13, 2011 at 6:09 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: I

RE: Error recovery for block... failed because recovery from primary datanode failed 6 times

2011-02-13 Thread Jonathan Gray
The DFS errors are after the server aborts. What is in the log before the server abort? Doesn't seem to show any reason here which is unusual. Anything in the master? Did it time out this RS? You're running with replication = 1? -Original Message- From: Bradford Stephens

Re: Error recovery for block... failed because recovery from primary datanode failed 6 times

2011-02-13 Thread Bradford Stephens
We've got dfs.replication = 3 in hdfs-site.xml. Doing a grep for FATAL and the surrounding 50 lines yields this: Regionserver log: http://pastebin.com/3cYYNhct HMaster and DataNode logs seem pretty boring, no errors. Some sections of lots of scheduling/deleting blocks... Restarted the HBase