[jira] [Commented] (HBASE-3529) Add search to HBase

2018-02-27 Thread caolong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379712#comment-16379712
 ] 

caolong commented on HBASE-3529:


 this is a best idea,It's a pity it to be leater

> Add search to HBase
> ---
>
> Key: HBASE-3529
> URL: https://issues.apache.org/jira/browse/HBASE-3529
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 0.90.0
>Reporter: Jason Rutherglen
>Priority: Major
> Attachments: HBASE-3529.patch, HDFS-APPEND-0.20-LOCAL-FILE.patch
>
>
> Using the Apache Lucene library we can add freetext search to HBase.  The 
> advantages of this are:
> * HBase is highly scalable and distributed
> * HBase is realtime
> * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
> * Lucene offers many types of queries not currently available in HBase (eg, 
> AND, OR, NOT, phrase, etc)
> * It's easier to build scalable realtime systems on top of already 
> architecturally sound, scalable realtime data system, eg, HBase.
> * Scaling realtime search will be as simple as scaling HBase.
> Phase 1 - Indexing:
> * Integrate Lucene into HBase such that an index mirrors a given region.  
> This means cascading add, update, and deletes between a Lucene index and an 
> HBase region (and vice versa).
> * Define meta-data to mark a region as indexed, and use a Solr schema to 
> allow the user to define the fields and analyzers.
> * Integrate with the HLog to ensure that index recovery can occur properly 
> (eg, on region server failure)
> * Mirror region splits with indexes (use Lucene's IndexSplitter?)
> * When a region is written to HDFS, also write the corresponding Lucene index 
> to HDFS.
> * A row key will be the ID of a given Lucene document.  The Lucene docstore 
> will explicitly not be used because the document/row data is stored in HBase. 
>  We will need to solve what the best data structure for efficiently mapping a 
> docid -> row key is.  It could be a docstore, field cache, column stride 
> fields, or some other mechanism.
> * Write unit tests for the above
> Phase 2 - Queries:
> * Enable distributed Lucene queries
> * Regions that have Lucene indexes are inherently available and may be 
> searched on, meaning there's no need for a separate search related system in 
> Zookeeper.
> * Integrate search with HBase's RPC mechanis



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-3529) Add search to HBase

2016-06-27 Thread Dzmitry.Lahoda (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15350934#comment-15350934
 ] 

Dzmitry.Lahoda commented on HBASE-3529:
---

I am seeking to replace oracle db + elastic search in data intensive legal 
ediscovery application. Hbase seems suit, and internal integration with lucene 
would be interesting option. I know about external integrations of hbase with 
solr/elastcisearch integrations, but these have own problems as I understand 
(need to data storage and orchestration of cluster instead of one).

> Add search to HBase
> ---
>
> Key: HBASE-3529
> URL: https://issues.apache.org/jira/browse/HBASE-3529
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 0.90.0
>Reporter: Jason Rutherglen
> Attachments: HBASE-3529.patch, HDFS-APPEND-0.20-LOCAL-FILE.patch
>
>
> Using the Apache Lucene library we can add freetext search to HBase.  The 
> advantages of this are:
> * HBase is highly scalable and distributed
> * HBase is realtime
> * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
> * Lucene offers many types of queries not currently available in HBase (eg, 
> AND, OR, NOT, phrase, etc)
> * It's easier to build scalable realtime systems on top of already 
> architecturally sound, scalable realtime data system, eg, HBase.
> * Scaling realtime search will be as simple as scaling HBase.
> Phase 1 - Indexing:
> * Integrate Lucene into HBase such that an index mirrors a given region.  
> This means cascading add, update, and deletes between a Lucene index and an 
> HBase region (and vice versa).
> * Define meta-data to mark a region as indexed, and use a Solr schema to 
> allow the user to define the fields and analyzers.
> * Integrate with the HLog to ensure that index recovery can occur properly 
> (eg, on region server failure)
> * Mirror region splits with indexes (use Lucene's IndexSplitter?)
> * When a region is written to HDFS, also write the corresponding Lucene index 
> to HDFS.
> * A row key will be the ID of a given Lucene document.  The Lucene docstore 
> will explicitly not be used because the document/row data is stored in HBase. 
>  We will need to solve what the best data structure for efficiently mapping a 
> docid -> row key is.  It could be a docstore, field cache, column stride 
> fields, or some other mechanism.
> * Write unit tests for the above
> Phase 2 - Queries:
> * Enable distributed Lucene queries
> * Regions that have Lucene indexes are inherently available and may be 
> searched on, meaning there's no need for a separate search related system in 
> Zookeeper.
> * Integrate search with HBase's RPC mechanis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-3529) Add search to HBase

2013-08-02 Thread linwukang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13727536#comment-13727536
 ] 

linwukang commented on HBASE-3529:
--

I think the most difficut part to integrate Solr into hbase is How to maintain 
consistency between solr and hbase.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch, HDFS-APPEND-0.20-LOCAL-FILE.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanis

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2012-04-26 Thread Martin Alig (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13262494#comment-13262494
 ] 

Martin Alig commented on HBASE-3529:


@Json: Are you still working on this issue?

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch, HDFS-APPEND-0.20-LOCAL-FILE.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3529) Add search to HBase

2011-07-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13062266#comment-13062266
 ] 

Jason Rutherglen commented on HBASE-3529:
-

With some recent patches committed to Lucene, I can post a patch to HBase trunk 
that should work fine, that will only require the special HDFS-347 
modification/build.  Perhaps it's possible to Maven in the custom HDFS-347 so 
that no external libraries need to manually downloaded.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3529) Add search to HBase

2011-07-08 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13062269#comment-13062269
 ] 

Andrew Purtell commented on HBASE-3529:
---

bq. Perhaps it's possible to Maven in the custom HDFS-347 so that no external 
libraries need to manually downloaded.

Post 0.92 we plan to modularize the Maven build already for pluggable RPC and 
security-variant code. We can also conditionally build coprocessors set in 
their own packages. In this case, something like {{-D HDFS-347}} enables build 
of it, and pulls down a suitably patched Hadoop core jar?

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3529) Add search to HBase

2011-07-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13062270#comment-13062270
 ] 

Jason Rutherglen commented on HBASE-3529:
-

bq. We can also conditionally build coprocessors set in their own packages

Ok, that sounds interesting.  Currently I'm pretending like search will be a 
part of HBase core.  :)  If there is another directory to place it in, eg, a 
coprocessor or contrib directory, I will place it there.

bq. In this case, something like {{-D HDFS-347}} enables build of it, and pulls 
down a suitably patched Hadoop core jar?

Yeah I have no idea how to post the HDFS-347-LUCENE version to a Maven repo and 
get that working.  I can however probably figure it out.  

I like the idea of posting a patch, putting things on Github seems quite 
remote, even to me, and I admit to preferring the simplicity of SVN on this 
currently one man project.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3529) Add search to HBase

2011-07-08 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13062273#comment-13062273
 ] 

Andrew Purtell commented on HBASE-3529:
---

bq. Currently I'm pretending like search will be a part of HBase core.

Like security, I think there will be enough interest for this that core but 
conditional makes a lot of sense.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3529) Add search to HBase

2011-07-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13062286#comment-13062286
 ] 

Jason Rutherglen commented on HBASE-3529:
-

What's the best way to set custom attributes on the Coprocessor?  Eg, I want to 
tell the Lucene Coprocessor where to look for a configuration file in HDFS.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3529) Add search to HBase

2011-07-08 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13062292#comment-13062292
 ] 

Andrew Purtell commented on HBASE-3529:
---

bq. What's the best way to set custom attributes on the Coprocessor? Eg, I want 
to tell the Lucene Coprocessor where to look for a configuration file in HDFS.

See HBASE-4048 and HBase-3810. 3810 is still pending.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3529) Add search to HBase

2011-07-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13062302#comment-13062302
 ] 

Jason Rutherglen commented on HBASE-3529:
-

I opened a trivial issue LUCENE-3296 so that the custom IW config can be passed 
in.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3529) Add search to HBase

2011-07-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13062318#comment-13062318
 ] 

Jason Rutherglen commented on HBASE-3529:
-

I'm signing up to [1] for the HDFS-347 Maven hosting.

1. http://nexus.sonatype.org/oss-repository-hosting.html

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3529) Add search to HBase

2011-06-16 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13050658#comment-13050658
 ] 

Jason Rutherglen commented on HBASE-3529:
-

To implement distributed search with sort, we'll need to serialize the field 
values across the RPC channel.  This can be implemented by assuming the sort is 
by ord which yields BytesRef values, which are easy to sort.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-3529) Add search to HBase

2011-06-09 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046696#comment-13046696
 ] 

Jason Rutherglen commented on HBASE-3529:
-

bq. Does that mean that in order to implement distributed search you'll 
immediately convert this to HBase+Solr instead of HBase+Lucene

I think the distributed search capability has been removed from Lucene (I just 
sent an email to Lucene dev)?  We should add it back?  Hence the possible Solr 
integration.  

bq. If so, what about NRTness that will be lost until Solr gets NRT search?

There's a Solr issue to add this though one wouldn't want to implement NRT 
without LUCENE-3092 + SOLR-2565.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-06-08 Thread Alex Baranau (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046016#comment-13046016
 ] 

Alex Baranau commented on HBASE-3529:
-

Another problem we faced: looks like there's an issue in TestLuceneCoprocessor 
tests life-cycle or smth else:
* the testSearchRPC test fails if we run mvn clean 
-Dtest=TestLuceneCoprocessor test, other 2 pass (it fails on first assert: 
expected 20, but found 10)
* if I add @Ignore to other two tests, i.e. the maven command runs only 
testSearchRPC, it works well

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-06-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046023#comment-13046023
 ] 

Jason Rutherglen commented on HBASE-3529:
-

Hi Alex, I have new code I will commit to Github.  

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-06-08 Thread Alex Baranau (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046026#comment-13046026
 ] 

Alex Baranau commented on HBASE-3529:
-

Thank you! Berlin is waiting! (kidding, we are going to leave very soon)

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-06-08 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046258#comment-13046258
 ] 

Otis Gospodnetic commented on HBASE-3529:
-

A few more comments/questions for Jason:

* I see PKIndexSplitter usage for splitting the index when a region splits.  I 
see you split the index, open 2 IndexWriters for 2 new Lucene indices, but then 
either you are not adding documents to them, or I'm not seeing it?

* Are there issues around distributed search?  It looks like it wasn't in your 
github branch.

* What will happen when a region changes its location/regionserver for whatever 
reason?  I see HDFS-2004 got -1ed and you said without that search will be 
slow.  Do you have an alternative plan?

* What is the reason for storing those 2 extra row fields? (the UID one at the 
other one... I think it's called rowStr or something like that)

* What about storing the index in HBase itself? (a la Solandra, I suppose)  
Would this be doable?  Would it make things simpler in the sense that any 
splitting or moving around, etc. may be handled by HBase and we wouldn't have 
to make sure the Lucene index always mirrors what's in a region and make sure 
it follows the region wherever it goes?  Lars' idea/question, and I hope I 
didn't misunderstand or misrepresent his ideas.


 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-06-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046267#comment-13046267
 ] 

Jason Rutherglen commented on HBASE-3529:
-

Otis, I think many of your questions have been addressed in this issue, though 
indeed the comment trail is long at this point.  

bq. Do you have an alternative plan?

https://issues.apache.org/jira/browse/HBASE-3529?focusedCommentId=13040465page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13040465

bq. Are there issues around distributed search? It looks like it wasn't in your 
github branch

https://issues.apache.org/jira/browse/HBASE-3529?focusedCommentId=13042913page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13042913

bq. What about storing the index in HBase itself?

I think that's a great idea to test, though in a different Jira issue.

bq. PKIndexSplitter

That's LUCENE-2919.  Given it's not been committed I may need to bring it over 
into the HBase search source tree.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-06-08 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046274#comment-13046274
 ] 

Otis Gospodnetic commented on HBASE-3529:
-

Re 
https://issues.apache.org/jira/browse/HBASE-3529?focusedCommentId=13042913page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13042913

Does that mean that in order to implement distributed search you'll immediately 
convert this to HBase+Solr instead of HBase+Lucene, so that you don't have to 
do Lucene-level distributed search?  If so, what about NRTness that will be 
lost until Solr gets NRT search?


 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-06-02 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13042913#comment-13042913
 ] 

Jason Rutherglen commented on HBASE-3529:
-

SOLR-1431 is updated to trunk. I'm tempted to start trying to plug in
Solr. I think the way to do this is to use the HTable.coprocessorExec
method (for the distributed search), where the Solr shards are of the form
'shards=start:hexstartkey,end:hexendkey'. Then HBase will take care of the
rest from an RPC perspective. Eg, forwarding the request to the individual
HRegion's running the SolrCoprocessor.

I think we'll use a single Solr schema per region, though we can add a
special delimiter in the field name to indicate that the prefix is the
column family, then the column name. Something like 'headers:subject' may
work. The main caveat is that the fields marked stored in fact
will not be stored into Lucene (because they're in HBase).

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-05-27 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13040465#comment-13040465
 ] 

Jason Rutherglen commented on HBASE-3529:
-

In discussing with J-D (thanks!), we can place logic in the Lucene
Coprocessor preOpen method to find out if any of the blocks of the Lucene
files in HDFS are not local (by asking the NameNode), then we can:

1) Rewrite, partially optimize, or fully optimize the index, thereby
rewriting the index files which causes them to 'go local'.

2) Extend the default placement policy and balancer to skip 'balancing'
Lucene files, because we want them to stay local.

3) Use HDFS-2004 to manually move non-local blocks to the local DataNode.

Where #3 is more complex and will likely be much more time consuming.

This functionality is important as it could currently be considered the only
'blocker' on putting HBase search into a test/production environment.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-05-26 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13039882#comment-13039882
 ] 

Jason Rutherglen commented on HBASE-3529:
-

I opened HDFS-2004 to implement pinning HDFS files (in this case the Lucene 
index files) to the local DataNode.  I think this is necessary functionality 
for HBase search because all index files need to be local (we're MMap'ing).  I 
think the common use case is a region server goes down, when the new one is 
brought up, files will likely not be local?

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-05-13 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033281#comment-13033281
 ] 

Jason Rutherglen commented on HBASE-3529:
-

I updated the Lucene version to the latest from trunk which includes the new 
asynchronous flushing of the RAM buffer.  As expected, this has put the 
indexing creation using HDFS in line with Lucene (because the overhead from the 
DataNode does not delay further indexing).  Also it looks like the query times 
are in fact nearly the same as well.  

Lucene indexing duration: 57858 ms
Lucene query time #1: 14208 ms
Lucene query time #2: 7024 ms
Lucene query time #3: 6902 ms

HBase indexing duration: 50631 ms
HBase query time #1: 8625 ms
HBase query time #2: 7081 ms
HBase query time #3: 7139 ms

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-05-13 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033288#comment-13033288
 ] 

stack commented on HBASE-3529:
--

Nice!

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-05-13 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033311#comment-13033311
 ] 

Jason Rutherglen commented on HBASE-3529:
-

bq. Awesome stuff. These query times above are using the hacky (non-secure 
non-checksummed) implementation of HDFS-347?

It's hackier than that.  It's basically obtaining the java.io.File directly 
from the FSInputStream.  However it's a good baseline to benchmark against 
things like HADOOP-6311 + HDFS-347.  Those need to wait for HBase that works 
with Hadoop 0.22/trunk anyways?

{quote}
User defines some special property on a column family that they want to be 
searchable, this property would include a solr schema which specifies analyzers 
and fields
{quote}

Currently there's a DocumentTransformer class which needs to be implemented to 
transform column-family edits into a Lucene document.  That could use the Solr 
schema for example or any other separate system to tokenize the byte[]s into a 
Document.

{quote}User can now perform an arbitrary lucene search over the table, 
resulting in completely up-to-date results? (ie spans both memstore and flushed 
data)?{quote}

I think for now we need to offer an external commit on the index, as Lucene 
only has near realtime search (eg, small segments will be written out, which 
will overwhelm HDFS).  LUCENE-2312 will implement realtime search (eg, 
searching on the RAM buffer as it's being built).  The recent LUCENE-3092 could 
be used in the meantime to build segments in RAM, and only flush to HDFS when 
it's too RAM consuming, then we would not need to force the user to 'commit' 
the index.

To answer the question, yes, though today the indexing performance will not be 
as good as when LUCENE-2312 is implemented or the user will need to 'commit' 
the index to search on the latest data.

Getting all of Solr work work with this system is fairly doable.  Each Solr 
core would map to a region.  Things like replication would be disabled.  The 
config files would be stored in HDFS (instead of the local filesystem).  For 
distributed queries, we need SOLR-1431, and then to implement distributed 
networking using HBase RPC instead of Solr's HTTP RPC.  There are other smaller 
internal things that'd need to change in Solr.  I think HBase RPC is aware of 
where regions live etc so I don't think we need to worry about putting failover 
logic into the distributed search code?

I'm going to post additional benchmarks shortly, eg, for 100,000 and 1 mil 
documents.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-05-11 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13032015#comment-13032015
 ] 

Jason Rutherglen commented on HBASE-3529:
-

I think the next round of benchmarking could involve showing that we need to 
directly access the underlying block file in order to not lose performance when 
running Lucene on HDFS.  This is somewhat as per the comment on HDFS-347:

https://issues.apache.org/jira/browse/HDFS-347?focusedCommentId=13013719page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13013719

{quote}The next thing we wanted to look at was random I/O. There is a lot
more overhead on the datanode for this particular use case so this
could be a place where direct access could really excel{quote}

We can test using HDFS-941 vs. direct block file access using MMap (by 
obtaining the local file path and the unix domain sockets).  I think then we'll 
show that for the Lucene case, we're on the right track by using direct file 
access.



 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-05-11 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13032196#comment-13032196
 ] 

Jason Rutherglen commented on HBASE-3529:
-

HDFS-941 isn't applying to trunk, and we'll need a semi-unique build of the 
HDFSDirectory and benchmarking code updated to Hadoop trunk (as opposed to 
Hadoop 0.20-append).  Given Unix Domain Sockets HADOOP-6311 is for trunk 
(rather than 0.20-append) we may want to wait for a version of HBase that runs 
on Hadoop trunk, (eg, the current direct file access works fine, Unix Domain 
Sockets is only for security, not speed).  Then we can put off benchmarking 
HDFS-941 as well.  

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-04-17 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13020892#comment-13020892
 ] 

Jason Rutherglen commented on HBASE-3529:
-

I updated the HBase search branch at Github and created complete instructions 
for how to execute the benchmark. This should also help with examining the 
code. The HBASE-SEARCH project contains 10,000 bz2 compressed wiki-en documents 
which account for 100 MB of the download. The slightly modified Lucene 
libraries are located in the lib/ directory (so that you do not need to 
download the entire Lucene branch source). 

https://github.com/jasonrutherglen/HBASE-SEARCH/blob/trunk/BENCHMARK.txt 

The Lucene vs. HBase Search indexing and search times will be located in the 
file: 
target/surefire-reports/org.apache.hadoop.hbase.search.TestSearchBenchmark-output.txt
 

{noformat}
Benchmark Execution Instructions

Create a directory for the HBase Lucene installation.  Then run the following:

git clone git://github.com/jasonrutherglen/HDFS-347-HBASE.git HDFS-347-HBASE
cd HDFS-347-HBASE
ant mvn-install
cd ..

git clone git://github.com/jasonrutherglen/HBASE-SEARCH.git HBASE-SEARCH
cd HBASE-SEARCH
cd lib
./install-libs.sh
cd ..
cd wiki-en
tar -jxvf 1.bz2
cd ..
mvn test -Dtest=TestSearchBenchmark
{noformat} 

Feel free to let me know if there are problems or if you have questions.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-04-14 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13019879#comment-13019879
 ] 

Jason Rutherglen commented on HBASE-3529:
-

Here are some basic benchmark numbers.  The code is more or less pushed to 
Github.  I need to verify it all works for a clean download of the various 
parts, of which there are 3, Lucene, HDFS-347 Hadoop 0.20 append modified, and 
HBase with Search. 

The architecture is to write out a single block per Lucene file.  In this way 
we can simply obtain one underlying java.io.File directly from the DFSClient.  
The file is then MMap'ed using a modified version of the MMapDirectory called 
HDFSDirectory.

The benchmark shows that storing Lucene indexes into HDFS and reading directly 
from HDFS is viable (as opposed to copying the files out of HDFS first to the 
local filesystem).

Here are times in milliseconds, on the Wiki-EN corpus:

lucene indexing duration: 50202
lucene query time #1: 11780
lucene query time #2: 6211
lucene query time #3: 6181

hbase indexing duration: 70681
hbase query time #1: 8332
hbase query time #2: 6785
hbase query time #3: 6621

As you can see, the indexing is a little bit slower when writing to HDFS.  
However with the new changes going into Lucene (ie, LUCENE-2324), a pause when 
flushing (due to HDFS overhead) will not slow down indexing.  So expect 
indexing parity soon.

The main query times to look at are the #2 and #3, allowing for warmup of the 
system IO cache in #1.  HBase queries are somewhat slower because each new 
DFSInputStream created must contact the DataNode.  We can optimize this however 
I think for now we're good.

Here are the queries being run (50 times per round), they are non-trivial.

states
unit*
uni*
u*d
un*d
united~0.75
united~0.6
unit~0.7
unit~0.5, // 2
doctitle:/.*[Uu]nited.*/
united OR states
united AND states
nebraska AND states
\united states\
\united states\~3

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch, 
 lucene-analyzers-common-4.0-SNAPSHOT.jar, lucene-core-4.0-SNAPSHOT.jar, 
 lucene-misc-4.0-SNAPSHOT.jar


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-04-12 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13018835#comment-13018835
 ] 

Jason Rutherglen commented on HBASE-3529:
-

I'm working on profiling and optimizing the HDFS random access, so that the 
Lucene HDFS queries are the same as native file system access using 
NIOFSDirectory.  

I think one extremely direct approach is to set the max block size to something 
above all Lucene segments files (at runtime via the DFSClient.create method).  
This will guarantee that there is only one underlying java.io.File per HDFS 
file, and so random access will avoid navigating block structures (which 
require expensive network calls, a binary search, and object creation overhead).

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch, 
 lucene-analyzers-common-4.0-SNAPSHOT.jar, lucene-core-4.0-SNAPSHOT.jar, 
 lucene-misc-4.0-SNAPSHOT.jar


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-04-07 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017210#comment-13017210
 ] 

Jason Rutherglen commented on HBASE-3529:
-

I placed the HDFS-347 changes in a Github repository located at: 
https://github.com/jasonrutherglen/HDFS-347-HBASE

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch, 
 lucene-analyzers-common-4.0-SNAPSHOT.jar, lucene-core-4.0-SNAPSHOT.jar, 
 lucene-misc-4.0-SNAPSHOT.jar


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-04-05 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016055#comment-13016055
 ] 

Otis Gospodnetic commented on HBASE-3529:
-

Jason, what is the current state of this work?  Does it work with the trunk?  
Is there a list of issues/problems that need to be fixed before this can be 
called working? Thanks!

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch, 
 lucene-analyzers-common-4.0-SNAPSHOT.jar, lucene-core-4.0-SNAPSHOT.jar, 
 lucene-misc-4.0-SNAPSHOT.jar


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-04-05 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016079#comment-13016079
 ] 

Jason Rutherglen commented on HBASE-3529:
-

@Otis The next step is to benchmark the query performance which may be degraded 
due to the random positional read performance of HDFS.  For this maybe we 
should use: http://code.google.com/a/apache-extras.org/p/luceneutil/  Also, the 
blocking issues should [ideally] be resolved.  You can take a look at the Solr 
one SOLR-1431, and commit it.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch, 
 lucene-analyzers-common-4.0-SNAPSHOT.jar, lucene-core-4.0-SNAPSHOT.jar, 
 lucene-misc-4.0-SNAPSHOT.jar


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-04-05 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016085#comment-13016085
 ] 

Otis Gospodnetic commented on HBASE-3529:
-

Thanks Jason.  What's the Solr dependency about?  I thought your idea is to go 
with pure Lucene-level HBase + indexing integration, not Solr.  I do see you 
mention Solr's schema in the initial comments in this issue, but can't find any 
mentions of Solr in your patch.  Could you please clarify the approach?  Oh, 
and if the ML is a better medium, I can move my questions there.  Thanks.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch, 
 lucene-analyzers-common-4.0-SNAPSHOT.jar, lucene-core-4.0-SNAPSHOT.jar, 
 lucene-misc-4.0-SNAPSHOT.jar


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-04-05 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016088#comment-13016088
 ] 

Jason Rutherglen commented on HBASE-3529:
-

@Otis We can benchmark using Lucene in conjunction with HDFS-347, of which I 
have a more streamlined version of that'll be available in Github.  
Implementing Solr for benchmarking would create too much overhead.

I think we may want to integrate with Solr [in the future] for out of the box 
distributed queries, facets, and also to make use of the schema.  I'll likely 
open additional Solr related issues when we get there.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch, 
 lucene-analyzers-common-4.0-SNAPSHOT.jar, lucene-core-4.0-SNAPSHOT.jar, 
 lucene-misc-4.0-SNAPSHOT.jar


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-17 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13008006#comment-13008006
 ] 

Ted Yu commented on HBASE-3529:
---

postWALRestore would pass one WALEdit which is for one row.
postPut is for one row as well.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch, 
 lucene-analyzers-common-4.0-SNAPSHOT.jar, lucene-core-4.0-SNAPSHOT.jar, 
 lucene-misc-4.0-SNAPSHOT.jar


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-16 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13007761#comment-13007761
 ] 

Andrew Purtell commented on HBASE-3529:
---

@Todd Hosting subprojects sounds reasonable to me. We want to make a friendly 
home for cool new work but can also accommodate downstream packagers who don't 
want any kind of support implied.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch, 
 lucene-analyzers-common-4.0-SNAPSHOT.jar, lucene-core-4.0-SNAPSHOT.jar, 
 lucene-misc-4.0-SNAPSHOT.jar


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-16 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13007768#comment-13007768
 ] 

Jason Rutherglen commented on HBASE-3529:
-

bq. In DefaultDocumentTransformer, I think we should check whether row has 
changed

It's possible to modify multiple rows per postPut or postWALRestore?  Are the 
KeyValue(s) sorted by row, as we probably want to group row modifications 
together.  Also it seems that it's possible to only update a select few columns 
of a row?  So we may need to reload the entire row and index it again?

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch, 
 lucene-analyzers-common-4.0-SNAPSHOT.jar, lucene-core-4.0-SNAPSHOT.jar, 
 lucene-misc-4.0-SNAPSHOT.jar


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-16 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13007780#comment-13007780
 ] 

stack commented on HBASE-3529:
--

@Andrew I'm against taking on src/contribs given past experience where they 
tended to add friction to major core changes.  With hbase up in Apache git, I 
think its easier for projects that are not in our src tree to follow along 
(github makes it easy doc'ing, etc., the related external project).  Discussion 
of the add-on up on hbase is grand (and encouraged I'd say since it lets the 
rest of the hbase space know of the addition) but no src I'd say.  Any changes 
to core an external project requires to work we should take on too (if good 
justification).

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch, 
 lucene-analyzers-common-4.0-SNAPSHOT.jar, lucene-core-4.0-SNAPSHOT.jar, 
 lucene-misc-4.0-SNAPSHOT.jar


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-16 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13007782#comment-13007782
 ] 

Andrew Purtell commented on HBASE-3529:
---

@Stack I didn't say contrib, I said sub projects.



 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch, 
 lucene-analyzers-common-4.0-SNAPSHOT.jar, lucene-core-4.0-SNAPSHOT.jar, 
 lucene-misc-4.0-SNAPSHOT.jar


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-16 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13007796#comment-13007796
 ] 

stack commented on HBASE-3529:
--

@Andrew Pardon me for my misread but I'd be agin keeping up subprojects too 
because of the admin load.  We don't need it IMO.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
 Attachments: HBASE-3529.patch, 
 lucene-analyzers-common-4.0-SNAPSHOT.jar, lucene-core-4.0-SNAPSHOT.jar, 
 lucene-misc-4.0-SNAPSHOT.jar


 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-03 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002438#comment-13002438
 ] 

Jason Rutherglen commented on HBASE-3529:
-

To get Solr distributed queries working across the searchable HBase cluster, 
we'll need SOLR-1431 completed.  Then in this issue, we'll implement the 
underlying data transfer protocol using HBase RPC (instead of HTTP).  

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-01 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000866#comment-13000866
 ] 

Jason Rutherglen commented on HBASE-3529:
-

@Stack Thanks for the analysis.  I forgot to mention that each subquery would 
also require it's own FSInputStream, which would be too many file descriptors.  
The heap required for 25 bytes * 2 mil docs is 50MB, eg, that's too much?  

I think we can go ahead with the positional read which'd only require an 
FSInputStream per file, to be shared by all readers of that file (using 
FileChannel.read(ByteBuffer dst, long position) underneath.  Given the number 
of blocks per Lucene file will be  10 and the blocks are of a fixed size, we 
can divide the (offset / blocksize) to efficiently obtain the block index?  I 
think it'll be efficient to translate a file offset into a local block file, 
eg, I'm not sure why LocatedBlocks.findBlock uses a binary search because I'm 
not familiar enough with HDFS. Then we'd just need to cache the 
LocatedBlock(s), instead of looking them up from the DataNode on each small 
read byte[1024] call.

In summary:

* DFSClient.DFSInputStream.getBlockRange looks fast enough for many calls per 
second
* locatedBlocks.findBlock uses a binary search for some reason, that'll be a 
bottleneck, why can't we divide the number the offset by the number of blocks.  
Oh ok, that's because block sizes are variable.  I guess if the number of 
blocks is small the binary search will always be fast?  Or we can detect if the 
blocks are of the same size and divide to get the correct block?
* DFSClient.DFSInputStream.fetchBlockByteRange is a hotspot because it calls 
chooseDataNode, whose return value [DNAddrPair] can be cached inside of 
LocatedBlock?
* Later in fetchBlockByteRange we call 
DFSClient.createClientDatanodeProtocolProxy() and make a local RPC call, 
getBlockPathInfo.  I think the results of this [BlockPathInfo] can be cached 
into LocatedBlock as well?
* Then instead of instantiating a new BlockReader object, we can call 
FileChannel.read(ByteBuffer b, long pos) directly?
* With this solution in place we can safely store documents in the docstore 
without any worries, and in addition use the system that most efficient in 
Lucene today, all the while using the fewest file descriptors possible.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-01 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13001209#comment-13001209
 ] 

Jason Rutherglen commented on HBASE-3529:
-

We'll want to keep a single ConcurrentMergeScheduler per HRegionServer (rather 
than per HRegion) even though there'll be an IndexWriter per HRegion (eg, the 
default is to have a CMS per IW, which could potentially generate too many 
threads).  I'm wondering if there's a global attribute space to put the CMS so 
that it can be reused across HRegions?

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-01 Thread ryan rawson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13001269#comment-13001269
 ] 

ryan rawson commented on HBASE-3529:


can you submit this to the proper jira? This isn't hdfs :)


 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-01 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13001277#comment-13001277
 ] 

Jason Rutherglen commented on HBASE-3529:
-

@Ryan Sure, I just wanted to iterate here a little bit, and then test it out 
with the HDFSDirectory implementation, before submitting it to HDFS-347.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-01 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13001280#comment-13001280
 ] 

stack commented on HBASE-3529:
--

@Jason Do you need to hack on hdfs first?  Its critical to making the search 
work on hbase?

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-01 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13001285#comment-13001285
 ] 

Jason Rutherglen commented on HBASE-3529:
-

bq. Do you need to hack on hdfs first? Its critical to making the search work 
on hbase?

Yes, HDFS as it is would make queries execute extremely slowly (because of 
random small reads), also I don't know how to implement the HDFSDirectory (the 
Lucene interface to the filesystem) without knowing how HDFS works.  In this 
case, we need to use NIO positional read underneath.  I think the patch shows 
NIO pos is doable and hopefully it'll be completed shortly, enough to implement 
HDFSDirectory and then run a performance comparison of HDFSDirectory vs. 
NIOFSDirectory.  Eg, we'll build identical indexes in both dirs, run the same 
queries and examine the difference in query speed.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-01 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13001289#comment-13001289
 ] 

stack commented on HBASE-3529:
--

OK.

Why niopositional read?  How is that different than the pread that is already 
in the dfsclient api?  You don't like going via the Block API?  Above you say 
in parens '...(using FileChannel.read(ByteBuffer dst, long position)...'  What 
if the data is not local, usually it is ( 99% of the time), but is not always; 
e.g. in time of failure or perhaps after a rebalance.  You going to get the 
FileChannel off the socket (thats the nio bit)?

You do get the bit that hdfs-347 is a naughty hack as is.  A version that 
respects 'security', where the 'cleared' fd is passed via unix domain sockets, 
for the dfsclient to use going direct is probably what'll go in sometime soon 
hopefully.

You are messing down deep below hbase in dfs.  I'm a little worried that you'll 
do a bunch of custom work that may work for your lucene directory 
implementation but that it will be so particular, it won't be accepted back 
into hdfs.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-01 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13001292#comment-13001292
 ] 

Ted Yu commented on HBASE-3529:
---

In certain deployment, data node and region server are not on the same machine.
The above would pose performance issue.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-01 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13001299#comment-13001299
 ] 

Jason Rutherglen commented on HBASE-3529:
-

bq. Why niopositional read? How is that different than the pread that is 
already in the dfsclient api

I think the goal of HDFS-347 is it'll automatically switch between reading over 
the network and reading locally?  So the pread'll do one or the other?

bq. You going to get the FileChannel off the socket (thats the nio bit)?

That's just for the local file.  

bq. What if the data is not local, usually it is ( 99% of the time), but is 
not always; e.g. in time of failure or perhaps after a rebalance. 

If we read off a socket I think there's going to be be a serious degradation in 
performance.  I think that's an invariant of search?

{quote}A version that respects 'security', where the 'cleared' fd is passed via 
unix domain sockets, for the dfsclient to use going direct is probably what'll 
go in sometime soon hopefully.{quote}

That'll be good!  I think this initial version (of HDFS modifications) is 
simply to get things going, as these other [HDFS] improvements are added we can 
use them and the DFSInputStream methods used by HDFSDirectory'll be the same?

{quote}You are messing down deep below hbase in dfs. I'm a little worried that 
you'll do a bunch of custom work that may work for your lucene directory 
implementation but that it will be so particular, it won't be accepted back 
into hdfs.{quote}

If we need to pass the FD using Unix domain sockets then the HDFS work won't be 
useful.  If the UDS's enable positional read, then the [Lucene] HDFSDirectory 
will work well.  

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-28 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000283#comment-13000283
 ] 

Jason Rutherglen commented on HBASE-3529:
-

I started on the search part, which is nice as it can utilize HBase's 
Coprocessor RPC mechanism.  The design issue is if we need to store a unique 
[family, column, row, timestamp] per column/field into Lucene?  Or perhaps this 
only needs to be stored per column family?  This'll be used on iteration of the 
results from Lucene, which yields docids, we'll then lookup the values in the 
doc store, call Get for each doc, and add the Result to the search response.  I 
think this is how it should work?  

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-28 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000398#comment-13000398
 ] 

stack commented on HBASE-3529:
--

You'll have to include row, column family, and qualifier at least if you are to 
get from lucene back to the the latest version of the cell, won't you?  If you 
want to index more than just the current version of a cell, you'll have to 
include the hbase timestamp in the lucene index.

If your lucene indices are per column family, you could leave the column family 
out of the lucene document and it can be picked up from context; that would 
leave row, qualifier and timestamp in the lucene document.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-28 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000667#comment-13000667
 ] 

Jason Rutherglen commented on HBASE-3529:
-

In regards to HDFS-347 and the issues around fast local file access.  I started 
reimplementing HDFS-347, however I realized it'll be fruitless without an 
efficient [cached] way of finding the local file a given offset corresponds to. 
 Is there a way for the DFSClient to listen for changes to the DataNode and 
then keep a memory resident 'cache' for the purpose of quickly accessing which 
local file(s) a given positional read + length corresponds to?

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-28 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000693#comment-13000693
 ] 

stack commented on HBASE-3529:
--

@Jason Which offset are you talking off?  The storefile in hbase keeps offsets 
in a file index.  When we ask to read from a position in the hfile, dfsclient 
does a quick calc to figure which block and then relatively, the offset into 
the target block.  Are you talking of something more fine grained or something 
else?

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-28 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000697#comment-13000697
 ] 

Jason Rutherglen commented on HBASE-3529:
-

Sorry, I thought through the file access a little more.  I think we can use the 
block local reader as is, because Lucene reads the postings sequentially, we 
don't really need random file access (eg, the offset issue more or less goes 
away), we simply need to allow seek'ing forward, and most postings will live 
inside of a single (64 - 128MB block).  The issue with this system is we may 
need to maintain an FSInputStream per thread per file because we probably don't 
want to open a new FSInputStream per query given the overhead or creating and 
destroying them?  Will this cause issues with the max file descriptors?

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-28 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000706#comment-13000706
 ] 

stack commented on HBASE-3529:
--

@Jason Currently HBase keeps all files open all the time (Yeah, users have to 
up their ulimit if they have more than a smidgeon of data in hbase--requirement 
#4 or #5).

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-28 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000721#comment-13000721
 ] 

Jason Rutherglen commented on HBASE-3529:
-

Ah, going back to storing the row, qualifier and timestamp in a Lucene 
document/docstore, is that does require totally random reads.  I wonder if 
there's some efficient way to store row pointers in RAM (compression?) or a 
Hadooop data structure that can be used?  I think that storing this information 
in the Lucene field cache is going to cause OOMs.  It'd be great if we could 
simply store a long that points to the exact row and column family we'd like to 
reference, as that could easily be stored in RAM, and would possibly enable 
faster lookup?

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-28 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000730#comment-13000730
 ] 

stack commented on HBASE-3529:
--

Are you thinking you could exploit hbase scan somehow?  If so, how you think it 
would work?

Whats a lucene docid?  A long?  Or a double?  You could toBytes that and that'd 
be the hbase row (HBase rows are byte arrays).  The column family could be one 
byte -- that'd give you 256 maximum column family names.  Qualifier probably 
has to be lucene document field name.  You could try and keep these short. 
Timestamp is a long.  So thats two longs (docid + ts), one byte for cf, and 
say, 8 characters for field name.. thats about 25 bytes or so per lucene doc.  
Will that cause you to run out of mem?

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-26 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999789#comment-12999789
 ] 

Jason Rutherglen commented on HBASE-3529:
-

It looks simple to change HDFS-347 (the HDFS-347-branch-20-append.txt patch) to 
read using positional reads, I'm sure it's necessary as a block reader is 
instantiated per DFSInputStream? read(long position, byte[] buffer, int offset, 
int length) calls getBlockRange which is sync'd.  Then the read method calls 
fetchBlockByteRange which calls BlockReader.newBlockReader, eg, the blockreader 
is per thread and isn't reused?  So the contention would be in getBlockRange?  
Perhaps there's not an issue, or not much of one, if the 
HDFS-347-branch-20-append.txt patch (or something like it) is applied (using 
HADOOP-6311)?  

I guess the go ahead is to write a Lucene Directory that uses HDFS underneath, 
that gains concurrency by using DFSInputStream.read(long position, ...)?  Oh, 
the other issue would be all the overhead from simply loading a byte[1024] (eg, 
all the new object creation etc).  Hmm... That'll be a problem.  

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999528#comment-12999528
 ] 

Jason Rutherglen commented on HBASE-3529:
-

Where is a good 'temp' directory to place the Lucene indexes relative to other 
local HBase files?  

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread ryan rawson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999533#comment-12999533
 ] 

ryan rawson commented on HBASE-3529:


there are no local hbase files. You'll have to come up with something yourself 
i guess?

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999537#comment-12999537
 ] 

Jason Rutherglen commented on HBASE-3529:
-

Maybe something relative to HDFS then?

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999542#comment-12999542
 ] 

Andrew Purtell commented on HBASE-3529:
---

I mailed a comment back but it is not showing up fast enough.

We have internally been discussing the addition of a Coprocessor framework API 
for reading and writing streams from/to the region data directory in HDFS.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999545#comment-12999545
 ] 

Andrew Purtell commented on HBASE-3529:
---

We have internally been discussing the addition of a Coprocessor framework API 
for reading and writing streams from/to the region data directory in HDFS.



  


 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread ryan rawson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999546#comment-12999546
 ] 

ryan rawson commented on HBASE-3529:


it's going to be tricky, since with security some people may choose to run hdfs 
and hbase on different users. Futhermore most hadoop installs have multiple 
jbod-style disks, and places like /tmp won't have much room (my /tmp has  
2GB). If you can avoid local files as much as possible, I'd try to do that.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999552#comment-12999552
 ] 

Jason Rutherglen commented on HBASE-3529:
-

{quote}We have internally been discussing the addition of a Coprocessor 
framework API for reading and writing streams from/to the region data directory 
in HDFS.{quote}

This'd be good, however for Lucene we'll need to directly access the local 
filesystem for performance reasons, eg, HDFS sounds like it's going to be 
slower than going direct (at the moment).  Because the indexes will be local, 
we'll need to periodically sync the local index to HDFS.  This isn't as 
difficult as it sounds, because we can save off a Lucene commit point and write 
the checkpoint's index files to HDFS, while letting other Lucene operations 
proceed.  I'd say we can move to writing directly to HDFS when HBase no longer 
uses a heap based block store (and instead relies on the system IO cache).

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999559#comment-12999559
 ] 

Andrew Purtell commented on HBASE-3529:
---

Writing the indexes to HDFS is possible after LUCENE-2373? We get direct reads 
from HDFS via HDFS-347 and the OS block cache can help there?

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread Gary Helmling (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999562#comment-12999562
 ] 

Gary Helmling commented on HBASE-3529:
--

Yeah, as Ryan mentions, with security, writing to HDFS via a coprocessor 
extension will be easiest to enable.

I wonder if that plus HDFS-347 (which allows reading directly from the local FS 
if the block exists on the local DN) would allow for good enough performance?  
Of course, HDFS-347 itself is tricky from a security perspective.

If local disk writes are the only solution, then the best option may be to make 
the user plan for it and explicitly specify a Lucene index path in the 
coprocessor configuration.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999565#comment-12999565
 ] 

Jason Rutherglen commented on HBASE-3529:
-

bq. Writing the indexes to HDFS is possible after LUCENE-2373?

Right, that's implemented in trunk as the append codecs. 
https://hudson.apache.org/hudson/job/Lucene-trunk/javadoc//contrib-misc/org/apache/lucene/index/codecs/appending/AppendingCodec.html

bq. We get direct reads from HDFS via HDFS-347 and the OS block cache can help 
there?

BlockReaderLocal is sync'd on each method, that's something we've outgrown in 
Lucene a while back (and in it's place NIOFSDirectory is most used, with MMap 
second).  We'd likely have a couple of options here, write to HDFS and 
[probably] slow queries to some extent, or write directly to a local directory 
and have the mechanical overhead of copying index files in/out of HDFS.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999579#comment-12999579
 ] 

Jason Rutherglen commented on HBASE-3529:
-

Also, I'm curious about how the HLog works, eg, it's archived into HDFS, is 
there a difference between what's archived and what's live (and would 
interleaving be necessary?).  The reason the HLog needs to be replayed [I 
think] is deletes need to be executed.  If we simply iterate/scan from a given 
timestamp, we'd get the new rows however we'd miss executing deletes.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999595#comment-12999595
 ] 

Jason Rutherglen commented on HBASE-3529:
-

In the RegionObserver/Coprocessor I don't think there are methods to access the 
log replay (on server restart), is that something that's planned?  

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999598#comment-12999598
 ] 

Jason Rutherglen commented on HBASE-3529:
-

To answer the previous question there's this issue: HBASE-3257

And on memstore flush, we'll do a Lucene index commit to ensure that when we 
replay the HLog, we won't need to access [potentially] out of date HLog 
entries.  We can store the checkpoint meta-data into the Lucene commit, which 
obviates the need to implement terms dict last term access. 

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999676#comment-12999676
 ] 

Jason Rutherglen commented on HBASE-3529:
-

Is 
https://issues.apache.org/jira/secure/attachment/12470743/HDFS-347-branch-20-append.txt
 the patch applied to CDH?  If so, the readChunk method isn't implemented.  Is 
there a plan to implement that, perhaps with NIO positional read?  Implementing 
readChunk would make storing the indexes in HDFS entirely tenable.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread ryan rawson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999684#comment-12999684
 ] 

ryan rawson commented on HBASE-3529:


HDFS-347 is not in CDH nor in branch-20-append.

As for a plan to implement it, perhaps you should?


 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999687#comment-12999687
 ] 

Jason Rutherglen commented on HBASE-3529:
-

{quote}HDFS-347 is not in CDH nor in branch-20-append.

As for a plan to implement it, perhaps you should?{quote}

Really?  Ah, I guess I misread this: 
https://issues.apache.org/jira/browse/HBASE-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12997267#comment-12997267

Sure, I can give a go at an NIO positional read version, it'll be a good 
learning experience.  Are there any caveats to be aware of?

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread ryan rawson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999688#comment-12999688
 ] 

ryan rawson commented on HBASE-3529:


I do not know, the whole thing is pretty green field. There are a few different 
implementations of HDFS-347, and I haven't actually seen a credible attempt at 
really getting it into a shipping hadoop yet. The test patches are pretty 
great, but they are POC and won't actually be shipping (due to hadoop security).

You can give it a shot, but be warned you might not get much for your troubles 
in terms of committed code.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999718#comment-12999718
 ] 

stack commented on HBASE-3529:
--

@Jason We could get hdfs-347 applied to branch-0.20-append.  Us HBasers are 
going to talk it up, that folks should apply it to their hadoop since the 
benefit is so great.  CDH will have something like an hdfs-347 but probably not 
till CDH4 (Todd talks of a version of hdfs-347 but one that will work w/ 
security -- see his patch up in hdfs-237 as opposed to the amended Dhruba patch 
posted by Ryan).  A hdfs-347 probably won't show in apache hadoop till 
0.23/0.24 would be my guess.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999722#comment-12999722
 ] 

Jason Rutherglen commented on HBASE-3529:
-

@Stack I didn't see any patches at HDFS-237.  I'd be curious to learn what the 
security issues are, I guess they're articulated in HDFS-347 as solvable by 
transferring file descriptors, though I'm not sure why the user running the 
Hadoop Java process should not be accessing certain local files?  Also, maybe 
there are higher level synchronization issues to be aware of (eg, HDFS-1605)?  
I'm sure much of this can be changed, though it may require a separate call 
'path' and classes to avoid any extraneous synchronization.  I do like this 
approach of making core changes to HDFS which'll benefit HBase and this issue, 
then also streamlines the Lucene integration (ie, there'll be no need for 
replicating the index back into HDFS from local disk), which'll reduce the 
aggregate complexity and testing.  

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999727#comment-12999727
 ] 

stack commented on HBASE-3529:
--

@Jason Pardon me. The HDFS-237 above is a mistype on my part.  I meant HDFS-347 
(I was about to make jokes about your dyslexia but on review the affliction 
blew up in my face).  The hbase process can access local files as long as it 
gets the clearance via hdfs.

bq. do like this approach of making core changes to HDFS which'll benefit 
HBase

+1

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-20 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12997141#comment-12997141
 ] 

Jason Rutherglen commented on HBASE-3529:
-

I opened LUCENE-2930 to store the last/max term of a field in the Lucene terms 
dictionary.  We can use this to more efficiently know the index's last commit 
point, and start indexing from there.  The alternative is to iterate the 
*entire* terms dictionary, which for the unique timestamp, would be the length 
of the number of documents.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-18 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12996444#comment-12996444
 ] 

Jason Rutherglen commented on HBASE-3529:
-

1) It may be more expedient for now to store the index in a dedicated 
directory, and save it to HDFS periodically.  However I'm not sure 'when' the 
loading into HDFS would occur, eg, if HBase is always writing to HDFS then 
there's no way to sync with that mechanism.  Perhaps it'd need to be based on 
the iterative index size changes?  Ie, if the index has grown by 25% since the 
last save?

2) I'd like to design the recovery logic now.  It's simple to save the 
timestmap into Lucene, then on recovery get the max timestamp, and iterate from 
there over the HRegion for the remaining 'lost' rows/documents.  What's the 
most efficient way to scan over  timestamp key values?

3) We can create indexes for the entire HRegion or for the individual column 
families.  Perhaps this should be optional?  I wonder if there are 
dis/advantages from a user perspective?  If interleaving postings was efficient 
we could even design a system to enable parts of posting lists to be changed 
per column family, where duplicate docids would be written to intermediate 
in-memory indexes, and 'interleaved' during posting iteration.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-18 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12996461#comment-12996461
 ] 

Ted Yu commented on HBASE-3529:
---

From Scan.java:
 * To only retrieve columns within a specific range of version timestamps,
 * execute {@link #setTimeRange(long, long) setTimeRange}.


 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-16 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12995331#comment-12995331
 ] 

Jason Rutherglen commented on HBASE-3529:
-

There's one possible issue that's come to mind and that is the possible 
overhead associated with accessing the Lucene index if it's stored in HDFS.  
Meaning, in Lucene today we have implementations such as NIOFSDirectory which 
uses NIO's positional read underneath, and it's made highly concurrent search 
apps much faster (as before we were sync'ing per byte[1024] read call).  I'm 
curious if HDFS has effectively implemented something similar to NIOFSDir 
underneath?  I see pread mentioned in HFile however I think it's referring to 
the HDFS specific implementation?

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-16 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12995398#comment-12995398
 ] 

stack commented on HBASE-3529:
--

@Jason Yeah, the pread is for hdfs.  Its going to be slow though because for 
EVERY pread invocation, HDFS sets up socket, loads new block, seeks to pread 
location, then returns bytes and closes sockets.  This is to be fixed but thats 
how it currently works.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-15 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12994800#comment-12994800
 ] 

Jason Rutherglen commented on HBASE-3529:
-

I opened LUCENE-2919 to split indexes by the primary key, eg, the HBase keys.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-15 Thread Dan Harvey (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12994814#comment-12994814
 ] 

Dan Harvey commented on HBASE-3529:
---

How would you deal with the data types / serialisation, would you assume the 
cell data is just UTF8 bytes to start with?

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-15 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12994820#comment-12994820
 ] 

Jason Rutherglen commented on HBASE-3529:
-

I added a DocumentTransformer class that looks like this.

{code}
public abstract class DocumentTransformer {
  public abstract MapTerm,Document transform(Mapbyte[], ListKeyValue 
familyMap) throws Exception;
  public abstract Term[] getIDTerms(Mapbyte[], ListKeyValue familyMap) 
throws Exception;
}
{code}

The user can then define how they want to transform the underlying data to 
Lucene documents.  I'm trying to find a JSON library to build the unit 
tests/demo app with.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-15 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12994836#comment-12994836
 ] 

stack commented on HBASE-3529:
--

@Jason HBase ships with jersey-json (1.4).  See here for doc: 
http://jackson.codehaus.org/Tutorial (Should be easy enough updating 
jersey-json if needed).

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-14 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12994384#comment-12994384
 ] 

stack commented on HBASE-3529:
--

@Jason Sounds excellent.  Could you do this up in a coprocessor?  
http://hbaseblog.com/2010/11/30/hbase-coprocessors/

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-14 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12994395#comment-12994395
 ] 

Jason Rutherglen commented on HBASE-3529:
-

Thanks.  Right the coprocessor is the key to sync'ing HBase and Lucene.  This's 
where I'll probably start.

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-14 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12994406#comment-12994406
 ] 

Jason Rutherglen commented on HBASE-3529:
-

Another issue brought up is the size of a region vs. the size of the Lucene 
index.  If the region is compressed the resultant Lucene index may in fact be a 
reasonable size.  Typically a maximum Lucene index size of 1 - 2 GB is optimal? 
 If the default region size 256 MB, and the data's been compressed by (what 
ratio?), then 256 MB could be ideal?

 Add search to HBase
 ---

 Key: HBASE-3529
 URL: https://issues.apache.org/jira/browse/HBASE-3529
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen

 Using the Apache Lucene library we can add freetext search to HBase.  The 
 advantages of this are:
 * HBase is highly scalable and distributed
 * HBase is realtime
 * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
 * Lucene offers many types of queries not currently available in HBase (eg, 
 AND, OR, NOT, phrase, etc)
 * It's easier to build scalable realtime systems on top of already 
 architecturally sound, scalable realtime data system, eg, HBase.
 * Scaling realtime search will be as simple as scaling HBase.
 Phase 1 - Indexing:
 * Integrate Lucene into HBase such that an index mirrors a given region.  
 This means cascading add, update, and deletes between a Lucene index and an 
 HBase region (and vice versa).
 * Define meta-data to mark a region as indexed, and use a Solr schema to 
 allow the user to define the fields and analyzers.
 * Integrate with the HLog to ensure that index recovery can occur properly 
 (eg, on region server failure)
 * Mirror region splits with indexes (use Lucene's IndexSplitter?)
 * When a region is written to HDFS, also write the corresponding Lucene index 
 to HDFS.
 * A row key will be the ID of a given Lucene document.  The Lucene docstore 
 will explicitly not be used because the document/row data is stored in HBase. 
  We will need to solve what the best data structure for efficiently mapping a 
 docid - row key is.  It could be a docstore, field cache, column stride 
 fields, or some other mechanism.
 * Write unit tests for the above
 Phase 2 - Queries:
 * Enable distributed Lucene queries
 * Regions that have Lucene indexes are inherently available and may be 
 searched on, meaning there's no need for a separate search related system in 
 Zookeeper.
 * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira