[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379712#comment-16379712 ]

caolong commented on HBASE-3529:

This is a great idea. It's a pity it has to come later.

> Add search to HBase
> -------------------
>
>                 Key: HBASE-3529
>                 URL: https://issues.apache.org/jira/browse/HBASE-3529
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.0
>            Reporter: Jason Rutherglen
>            Priority: Major
>         Attachments: HBASE-3529.patch, HDFS-APPEND-0.20-LOCAL-FILE.patch
>
>
> Using the Apache Lucene library we can add freetext search to HBase. The advantages of this are:
> * HBase is highly scalable and distributed
> * HBase is realtime
> * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
> * Lucene offers many types of queries not currently available in HBase (eg, AND, OR, NOT, phrase, etc)
> * It's easier to build scalable realtime systems on top of an already architecturally sound, scalable realtime data system, eg, HBase.
> * Scaling realtime search will be as simple as scaling HBase.
>
> Phase 1 - Indexing:
> * Integrate Lucene into HBase such that an index mirrors a given region. This means cascading adds, updates, and deletes between a Lucene index and an HBase region (and vice versa).
> * Define meta-data to mark a region as indexed, and use a Solr schema to allow the user to define the fields and analyzers.
> * Integrate with the HLog to ensure that index recovery can occur properly (eg, on region server failure)
> * Mirror region splits with indexes (use Lucene's IndexSplitter?)
> * When a region is written to HDFS, also write the corresponding Lucene index to HDFS.
> * A row key will be the ID of a given Lucene document. The Lucene docstore will explicitly not be used because the document/row data is stored in HBase. We will need to work out the best data structure for efficiently mapping a docid -> row key. It could be a docstore, field cache, column stride fields, or some other mechanism.
> * Write unit tests for the above
>
> Phase 2 - Queries:
> * Enable distributed Lucene queries
> * Regions that have Lucene indexes are inherently available and may be searched on, meaning there's no need for a separate search related system in Zookeeper.
> * Integrate search with HBase's RPC mechanism

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
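As an illustration of the Phase 1 idea in the description above (an index that mirrors a region, with the row key as the document ID and a separate docid -> row key mapping), here is a minimal in-memory sketch. It is not taken from HBASE-3529.patch and uses neither the Lucene nor the HBase API: the class `RegionIndexSketch` and its methods are hypothetical names, the "index" is a plain map of terms to docids, and the docid -> row key mapping is simply a list, which is only one of the options the description leaves open.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy stand-in for a per-region index; hypothetical, not Lucene or HBase code. */
class RegionIndexSketch {
    // docid -> row key; a null slot marks a deleted document.
    private final List<String> docIdToRowKey = new ArrayList<>();
    // row key -> live docid, so an update can retire the previous document.
    private final Map<String, Integer> rowKeyToDocId = new HashMap<>();
    // term -> docids containing that term (the inverted index).
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Mirrors a region put: an update is modelled as delete followed by add. */
    void put(String rowKey, String text) {
        delete(rowKey);
        int docId = docIdToRowKey.size();
        docIdToRowKey.add(rowKey);
        rowKeyToDocId.put(rowKey, docId);
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Mirrors a region delete; does nothing if the row was never indexed. */
    void delete(String rowKey) {
        Integer docId = rowKeyToDocId.remove(rowKey);
        if (docId == null) {
            return;
        }
        docIdToRowKey.set(docId, null);
        for (Set<Integer> docs : postings.values()) {
            docs.remove(docId);
        }
    }

    /** AND query: returns the row keys of documents containing every term. */
    List<String> searchAll(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        List<String> rowKeys = new ArrayList<>();
        if (result != null) {
            for (int docId : result) {
                rowKeys.add(docIdToRowKey.get(docId)); // docid -> row key lookup
            }
        }
        return rowKeys;
    }

    public static void main(String[] args) {
        RegionIndexSketch index = new RegionIndexSketch();
        index.put("row1", "fast inverted index");
        index.put("row2", "fast realtime data");
        System.out.println(index.searchAll("fast"));
        index.put("row1", "slow index"); // update replaces the old document
        System.out.println(index.searchAll("fast"));
    }
}
```

The search returns row keys only; in the proposed design the row data itself would then be read from HBase, which is why the description says the Lucene docstore is not needed for document content.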
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15350934#comment-15350934 ]

Dzmitry.Lahoda commented on HBASE-3529:

I am looking to replace Oracle DB + Elasticsearch in a data-intensive legal e-discovery application. HBase seems suitable, and internal integration with Lucene would be an interesting option. I know about external integrations of HBase with Solr/Elasticsearch, but as I understand it these have their own problems (two data stores and two clusters to orchestrate instead of one).
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13727536#comment-13727536 ]

linwukang commented on HBASE-3529:

I think the most difficult part of integrating Solr into HBase is how to maintain consistency between Solr and HBase.
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262494#comment-13262494 ]

Martin Alig commented on HBASE-3529:

@Jason: Are you still working on this issue?
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062266#comment-13062266 ]

Jason Rutherglen commented on HBASE-3529:

With some recent patches committed to Lucene, I can post a patch to HBase trunk that should work fine and that will only require the special HDFS-347 modification/build. Perhaps it's possible to Maven in the custom HDFS-347 so that no external libraries need to be manually downloaded.
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062269#comment-13062269 ]

Andrew Purtell commented on HBASE-3529:

bq. Perhaps it's possible to Maven in the custom HDFS-347 so that no external libraries need to manually downloaded.

Post 0.92 we plan to modularize the Maven build already for pluggable RPC and security-variant code. We can also conditionally build coprocessors set in their own packages. In this case, something like {{-D HDFS-347}} enables build of it, and pulls down a suitably patched Hadoop core jar?
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062270#comment-13062270 ]

Jason Rutherglen commented on HBASE-3529:

bq. We can also conditionally build coprocessors set in their own packages

Ok, that sounds interesting. Currently I'm pretending that search will be a part of HBase core. :) If there is another directory to place it in, eg, a coprocessor or contrib directory, I will place it there.

bq. In this case, something like {{-D HDFS-347}} enables build of it, and pulls down a suitably patched Hadoop core jar?

Yeah, I have no idea how to post the HDFS-347-LUCENE version to a Maven repo and get that working. I can however probably figure it out. I like the idea of posting a patch; putting things on Github seems quite remote, even to me, and I admit to preferring the simplicity of SVN on this currently one-man project.
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062273#comment-13062273 ]

Andrew Purtell commented on HBASE-3529:

bq. Currently I'm pretending like search will be a part of HBase core.

Like security, I think there will be enough interest for this that core but conditional makes a lot of sense.
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062286#comment-13062286 ]

Jason Rutherglen commented on HBASE-3529:

What's the best way to set custom attributes on the Coprocessor? Eg, I want to tell the Lucene Coprocessor where to look for a configuration file in HDFS.
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062292#comment-13062292 ]

Andrew Purtell commented on HBASE-3529:

bq. What's the best way to set custom attributes on the Coprocessor? Eg, I want to tell the Lucene Coprocessor where to look for a configuration file in HDFS.

See HBASE-4048 and HBASE-3810. 3810 is still pending.
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062302#comment-13062302 ]

Jason Rutherglen commented on HBASE-3529:

I opened a trivial issue LUCENE-3296 so that the custom IW config can be passed in.
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062318#comment-13062318 ]

Jason Rutherglen commented on HBASE-3529:

I'm signing up to [1] for the HDFS-347 Maven hosting.

1. http://nexus.sonatype.org/oss-repository-hosting.html
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050658#comment-13050658 ]

Jason Rutherglen commented on HBASE-3529:
-----------------------------------------

To implement distributed search with sort, we'll need to serialize the field values across the RPC channel. This can be implemented by assuming the sort is by ord, which yields BytesRef values that are easy to sort.

Add search to HBase
-------------------

Key: HBASE-3529
URL: https://issues.apache.org/jira/browse/HBASE-3529
Project: HBase
Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
Attachments: HBASE-3529.patch

Using the Apache Lucene library we can add freetext search to HBase. The advantages of this are:
* HBase is highly scalable and distributed
* HBase is realtime
* Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
* Lucene offers many types of queries not currently available in HBase (eg, AND, OR, NOT, phrase, etc)
* It's easier to build scalable realtime systems on top of an already architecturally sound, scalable realtime data system, eg, HBase.
* Scaling realtime search will be as simple as scaling HBase.

Phase 1 - Indexing:
* Integrate Lucene into HBase such that an index mirrors a given region. This means cascading adds, updates, and deletes between a Lucene index and an HBase region (and vice versa).
* Define meta-data to mark a region as indexed, and use a Solr schema to allow the user to define the fields and analyzers.
* Integrate with the HLog to ensure that index recovery can occur properly (eg, on region server failure)
* Mirror region splits with indexes (use Lucene's IndexSplitter?)
* When a region is written to HDFS, also write the corresponding Lucene index to HDFS.
* A row key will be the ID of a given Lucene document. The Lucene docstore will explicitly not be used because the document/row data is stored in HBase. We will need to work out the best data structure for efficiently mapping a docid -> row key. It could be a docstore, field cache, column stride fields, or some other mechanism.
* Write unit tests for the above

Phase 2 - Queries:
* Enable distributed Lucene queries
* Regions that have Lucene indexes are inherently available and may be searched on, meaning there's no need for a separate search related system in Zookeeper.
* Integrate search with HBase's RPC mechanism
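The sort comment in this message (13050658) can be pictured with a small sketch. It is only an illustration, not code from the patch: the class `RegionHitMerger`, the `Hit` type and the `merge` method are hypothetical names, and it assumes each region returns its hits already sorted by a serialized byte[] sort value compared in unsigned lexicographic order. Ords are local to one index, so the sketch ships the bytes themselves rather than the ords.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: merge per-region hits that are sorted by a byte[] sort value. */
final class RegionHitMerger {

    /** One hit as it might cross the RPC channel: row key plus serialized sort value. */
    static final class Hit {
        final byte[] rowKey;
        final byte[] sortValue;

        Hit(byte[] rowKey, byte[] sortValue) {
            this.rowKey = rowKey;
            this.sortValue = sortValue;
        }
    }

    /** Convenience factory for the example; encodes both strings as UTF-8. */
    static Hit hit(String rowKey, String sortValue) {
        return new Hit(rowKey.getBytes(StandardCharsets.UTF_8),
                       sortValue.getBytes(StandardCharsets.UTF_8));
    }

    /** Unsigned lexicographic comparison of two byte arrays (assumed sort order). */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** K-way merge of already-sorted per-region lists, keeping the first topN hits. */
    static List<Hit> merge(List<List<Hit>> perRegion, int topN) {
        // Each queue entry is {regionIndex, positionInThatRegionsList}.
        PriorityQueue<int[]> queue = new PriorityQueue<>((x, y) ->
                compareBytes(perRegion.get(x[0]).get(x[1]).sortValue,
                             perRegion.get(y[0]).get(y[1]).sortValue));
        for (int r = 0; r < perRegion.size(); r++) {
            if (!perRegion.get(r).isEmpty()) {
                queue.add(new int[] {r, 0});
            }
        }
        List<Hit> merged = new ArrayList<>();
        while (!queue.isEmpty() && merged.size() < topN) {
            int[] head = queue.poll();
            List<Hit> hits = perRegion.get(head[0]);
            merged.add(hits.get(head[1]));
            if (head[1] + 1 < hits.size()) {
                queue.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Hit>> regions = Arrays.asList(
                Arrays.asList(hit("r1", "apple"), hit("r2", "cherry")),
                Arrays.asList(hit("r3", "banana"), hit("r4", "date")));
        for (Hit h : merge(regions, 3)) {
            System.out.println(new String(h.rowKey, StandardCharsets.UTF_8));
        }
    }
}
```

The client only needs the row keys of the merged hits, since the row data itself would then be fetched from HBase.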
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046696#comment-13046696 ]

Jason Rutherglen commented on HBASE-3529:
-----------------------------------------

bq. Does that mean that in order to implement distributed search you'll immediately convert this to HBase+Solr instead of HBase+Lucene

I think the distributed search capability has been removed from Lucene (I just sent an email to Lucene dev)? We should add it back? Hence the possible Solr integration.

bq. If so, what about NRTness that will be lost until Solr gets NRT search?

There's a Solr issue to add this, though one wouldn't want to implement NRT without LUCENE-3092 + SOLR-2565.
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046016#comment-13046016 ]

Alex Baranau commented on HBASE-3529:
-------------------------------------

Another problem we faced: it looks like there's an issue in the TestLuceneCoprocessor test life-cycle, or something else:
* the testSearchRPC test fails if we run mvn clean -Dtest=TestLuceneCoprocessor test, while the other 2 pass (it fails on the first assert: expected 20, but found 10)
* if I add @Ignore to the other two tests, i.e. the maven command runs only testSearchRPC, it works well
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046023#comment-13046023 ]

Jason Rutherglen commented on HBASE-3529:
-----------------------------------------

Hi Alex, I have new code I will commit to Github.
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046026#comment-13046026 ]

Alex Baranau commented on HBASE-3529:
-------------------------------------

Thank you! Berlin is waiting! (kidding, we are going to leave very soon)
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046258#comment-13046258 ]

Otis Gospodnetic commented on HBASE-3529:
-----------------------------------------

A few more comments/questions for Jason:
* I see PKIndexSplitter usage for splitting the index when a region splits. I see you split the index and open 2 IndexWriters for 2 new Lucene indices, but then either you are not adding documents to them, or I'm not seeing it?
* Are there issues around distributed search? It looks like it wasn't in your github branch.
* What will happen when a region changes its location/regionserver for whatever reason? I see HDFS-2004 got -1ed and you said without that search will be slow. Do you have an alternative plan?
* What is the reason for storing those 2 extra row fields? (the UID one and the other one... I think it's called rowStr or something like that)
* What about storing the index in HBase itself? (a la Solandra, I suppose) Would this be doable? Would it make things simpler, in the sense that any splitting, moving around, etc. may be handled by HBase, and we wouldn't have to make sure the Lucene index always mirrors what's in a region and follows the region wherever it goes? This is Lars' idea/question, and I hope I didn't misunderstand or misrepresent his ideas.
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046267#comment-13046267 ]

Jason Rutherglen commented on HBASE-3529:
-----------------------------------------

Otis, I think many of your questions have been addressed in this issue, though indeed the comment trail is long at this point.

bq. Do you have an alternative plan?

https://issues.apache.org/jira/browse/HBASE-3529?focusedCommentId=13040465&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13040465

bq. Are there issues around distributed search? It looks like it wasn't in your github branch

https://issues.apache.org/jira/browse/HBASE-3529?focusedCommentId=13042913&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13042913

bq. What about storing the index in HBase itself?

I think that's a great idea to test, though in a different Jira issue.

bq. PKIndexSplitter

That's LUCENE-2919. Given it's not been committed I may need to bring it over into the HBase search source tree.
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046274#comment-13046274 ]

Otis Gospodnetic commented on HBASE-3529:
-----------------------------------------

Re https://issues.apache.org/jira/browse/HBASE-3529?focusedCommentId=13042913&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13042913

Does that mean that in order to implement distributed search you'll immediately convert this to HBase+Solr instead of HBase+Lucene, so that you don't have to do Lucene-level distributed search? If so, what about NRTness that will be lost until Solr gets NRT search?
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13042913#comment-13042913 ]

Jason Rutherglen commented on HBASE-3529:
-----------------------------------------

SOLR-1431 is updated to trunk. I'm tempted to start trying to plug in Solr. I think the way to do this is to use the HTable.coprocessorExec method (for the distributed search), where the Solr shards are of the form 'shards=start:hexstartkey,end:hexendkey'. Then HBase will take care of the rest from an RPC perspective, eg, forwarding the request to the individual HRegions running the SolrCoprocessor.

I think we'll use a single Solr schema per region, though we can add a special delimiter in the field name to indicate that the prefix is the column family, followed by the column name. Something like 'headers:subject' may work. The main caveat is that the fields marked stored will in fact not be stored into Lucene (because they're in HBase).
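The two string conventions proposed in this comment (13042913) could be decoded roughly as follows. This is a sketch under assumptions, not the coprocessor code: the class `SearchParamSketch` and its methods are hypothetical, the shard value is assumed to be exactly 'start:<hex>,end:<hex>', and the field name is assumed to split at the first ':' into column family and column name.

```java
/** Hypothetical sketch: decode the shard spec and field-name conventions from the comment. */
final class SearchParamSketch {

    /** Row-key range decoded from a "start:<hex>,end:<hex>" shard value. */
    static final class KeyRange {
        final byte[] start;
        final byte[] end;

        KeyRange(byte[] start, byte[] end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Parse the value part of 'shards=start:hexstartkey,end:hexendkey'. */
    static KeyRange parseShard(String spec) {
        byte[] start = null;
        byte[] end = null;
        for (String part : spec.split(",")) {
            int colon = part.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("missing ':' in " + part);
            }
            String name = part.substring(0, colon);
            byte[] value = fromHex(part.substring(colon + 1));
            if (name.equals("start")) {
                start = value;
            } else if (name.equals("end")) {
                end = value;
            } else {
                throw new IllegalArgumentException("unknown key " + name);
            }
        }
        if (start == null || end == null) {
            throw new IllegalArgumentException("need both start and end: " + spec);
        }
        return new KeyRange(start, end);
    }

    /** Decode a hex string into bytes; an empty string yields an empty key. */
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd-length hex: " + hex);
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Split a field name such as 'headers:subject' into {columnFamily, columnName}. */
    static String[] splitField(String field) {
        int colon = field.indexOf(':');
        if (colon <= 0 || colon == field.length() - 1) {
            throw new IllegalArgumentException("expected family:column, got " + field);
        }
        return new String[] {field.substring(0, colon), field.substring(colon + 1)};
    }

    public static void main(String[] args) {
        KeyRange range = parseShard("start:6161,end:7a7a");
        System.out.println(range.start.length + " start bytes, " + range.end.length + " end bytes");
        String[] parts = splitField("headers:subject");
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```

Hex-encoding the keys keeps arbitrary row-key bytes safe inside a request parameter, which is presumably why the comment suggests it.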
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040465#comment-13040465 ]

Jason Rutherglen commented on HBASE-3529:
-----------------------------------------

In discussing with J-D (thanks!), we can place logic in the Lucene Coprocessor preOpen method to find out if any of the blocks of the Lucene files in HDFS are not local (by asking the NameNode), then we can:

1) Rewrite, partially optimize, or fully optimize the index, thereby rewriting the index files which causes them to 'go local'.
2) Extend the default placement policy and balancer to skip 'balancing' Lucene files, because we want them to stay local.
3) Use HDFS-2004 to manually move non-local blocks to the local DataNode.

Where #3 is more complex and will likely be much more time consuming. This functionality is important as it could currently be considered the only 'blocker' on putting HBase search into a test/production environment.
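The locality check described in this comment (13040465) might look something like the sketch below. It is illustrative only: `IndexLocalitySketch`, `countNonLocalBlocks` and `chooseAction` are hypothetical names, and the per-block host lists are passed in as plain strings. In a real coprocessor they would presumably come from Hadoop's `FileSystem.getFileBlockLocations` and `BlockLocation.getHosts()`, which is not shown here, and only option 1 (rewriting the index) is modelled.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: decide whether Lucene index files need to be made local. */
final class IndexLocalitySketch {

    /** What to do on region open; only the rewrite option from the comment is modelled. */
    enum Action { NONE, REWRITE_INDEX }

    /** Count blocks that have no replica on localHost; each entry lists one block's hosts. */
    static int countNonLocalBlocks(List<String[]> blockHosts, String localHost) {
        int nonLocal = 0;
        for (String[] hosts : blockHosts) {
            boolean local = false;
            for (String host : hosts) {
                if (host.equals(localHost)) {
                    local = true;
                    break;
                }
            }
            if (!local) {
                nonLocal++;
            }
        }
        return nonLocal;
    }

    /** Rewriting the index writes fresh files, whose first replica lands on the local DataNode. */
    static Action chooseAction(int nonLocalBlocks) {
        return nonLocalBlocks == 0 ? Action.NONE : Action.REWRITE_INDEX;
    }

    public static void main(String[] args) {
        List<String[]> blocks = Arrays.asList(
                new String[] {"rs1", "rs2", "rs3"},
                new String[] {"rs2", "rs4", "rs5"});
        int nonLocal = countNonLocalBlocks(blocks, "rs1");
        System.out.println(nonLocal + " non-local block(s): " + chooseAction(nonLocal));
    }
}
```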
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039882#comment-13039882 ]

Jason Rutherglen commented on HBASE-3529:
-----------------------------------------

I opened HDFS-2004 to implement pinning HDFS files (in this case the Lucene index files) to the local DataNode. I think this is necessary functionality for HBase search because all index files need to be local (we're MMap'ing). I think the common case is that a region server goes down and, when the new one is brought up, the files will likely not be local?
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033281#comment-13033281 ]

Jason Rutherglen commented on HBASE-3529:
-----------------------------------------

I updated the Lucene version to the latest from trunk, which includes the new asynchronous flushing of the RAM buffer. As expected, this has put index creation using HDFS in line with Lucene (because the overhead from the DataNode does not delay further indexing). Also, it looks like the query times are in fact nearly the same as well.

Lucene indexing duration: 57858 ms
Lucene query time #1: 14208 ms
Lucene query time #2: 7024 ms
Lucene query time #3: 6902 ms

HBase indexing duration: 50631 ms
HBase query time #1: 8625 ms
HBase query time #2: 7081 ms
HBase query time #3: 7139 ms
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033288#comment-13033288 ]

stack commented on HBASE-3529:

Nice!
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033311#comment-13033311 ]

Jason Rutherglen commented on HBASE-3529:

bq. Awesome stuff. These query times above are using the hacky (non-secure non-checksummed) implementation of HDFS-347?

It's hackier than that. It's basically obtaining the java.io.File directly from the FSInputStream. However, it's a good baseline to benchmark against things like HADOOP-6311 + HDFS-347. Those need to wait for an HBase that works with Hadoop 0.22/trunk anyway?

{quote}User defines some special property on a column family that they want to be searchable, this property would include a solr schema which specifies analyzers and fields{quote}

Currently there's a DocumentTransformer class which needs to be implemented to transform column-family edits into a Lucene document. That could use the Solr schema, for example, or any other separate system to tokenize the byte[]s into a Document.

{quote}User can now perform an arbitrary lucene search over the table, resulting in completely up-to-date results? (ie spans both memstore and flushed data)?{quote}

I think for now we need to offer an external commit on the index, as Lucene only has near realtime search (eg, small segments will be written out, which will overwhelm HDFS). LUCENE-2312 will implement realtime search (eg, searching on the RAM buffer as it's being built). The recent LUCENE-3092 could be used in the meantime to build segments in RAM and only flush to HDFS when they consume too much RAM; then we would not need to force the user to 'commit' the index.

To answer the question: yes, though today either the indexing performance will not be as good as it will be once LUCENE-2312 is implemented, or the user will need to 'commit' the index to search on the latest data.

Getting all of Solr to work with this system is fairly doable. Each Solr core would map to a region. Things like replication would be disabled. The config files would be stored in HDFS (instead of the local filesystem). For distributed queries, we need SOLR-1431, and then to implement distributed networking using HBase RPC instead of Solr's HTTP RPC. There are other smaller internal things that'd need to change in Solr. I think HBase RPC is aware of where regions live etc, so I don't think we need to worry about putting failover logic into the distributed search code?

I'm going to post additional benchmarks shortly, eg, for 100,000 and 1 mil documents.
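To make the transformer idea in the comment above concrete, here is a minimal sketch of what such a hook could look like. Everything in it is an assumption for illustration: the thread does not show the real DocumentTransformer signature, so the interface name `RowEditTransformer`, the method shape, and the use of a plain `Map` in place of a Lucene `Document` are all hypothetical, and the lowercase/whitespace tokenizer merely stands in for whatever analyzer a Solr schema would specify.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical interface: NOT the DocumentTransformer from the patch, whose
// signature is not shown in this thread. A Map of field name -> tokens stands
// in for a Lucene Document so the sketch needs no external libraries.
interface RowEditTransformer {
    Map<String, List<String>> transform(byte[] rowKey, Map<String, byte[]> columnEdits);
}

// Toy implementation: decodes each edited column value as UTF-8 text and
// splits it on whitespace after lowercasing.
final class WhitespaceTransformer implements RowEditTransformer {
    @Override
    public Map<String, List<String>> transform(byte[] rowKey, Map<String, byte[]> columnEdits) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        // The row key becomes the document ID and is kept as a single untokenized value.
        doc.put("id", Collections.singletonList(new String(rowKey, StandardCharsets.UTF_8)));
        for (Map.Entry<String, byte[]> edit : columnEdits.entrySet()) {
            String text = new String(edit.getValue(), StandardCharsets.UTF_8);
            List<String> tokens = new ArrayList<>();
            for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
            }
            doc.put(edit.getKey(), tokens);
        }
        return doc;
    }
}
```

Keeping the row key as the only stored identifier matches the issue description's plan of not using the Lucene docstore, since the row data itself already lives in HBase.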
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032015#comment-13032015 ]

Jason Rutherglen commented on HBASE-3529:

I think the next round of benchmarking could involve showing that we need to directly access the underlying block file in order to not lose performance when running Lucene on HDFS. This is somewhat as per the comment on HDFS-347:

https://issues.apache.org/jira/browse/HDFS-347?focusedCommentId=13013719&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13013719

{quote}The next thing we wanted to look at was random I/O. There is a lot more overhead on the datanode for this particular use case so this could be a place where direct access could really excel{quote}

We can test using HDFS-941 vs. direct block file access using MMap (by obtaining the local file path and the Unix domain sockets). I think then we'll show that for the Lucene case, we're on the right track by using direct file access.
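For readers unfamiliar with the "direct block file access using MMap" idea, the following is a minimal sketch using only the JDK. It assumes the local path of the block file has already been obtained somehow (the thread does this through a modified DFSClient, which is not reproduced here); the class name `MappedFileReader` is hypothetical and this is not the HDFSDirectory code from the patch.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: maps one local file read-only and serves random reads
// from the mapping, so no DataNode round trip is involved per read.
final class MappedFileReader {
    private final MappedByteBuffer buffer;

    MappedFileReader(Path localFile) throws IOException {
        try (FileChannel channel = FileChannel.open(localFile, StandardOpenOption.READ)) {
            // A single mapping is limited to Integer.MAX_VALUE bytes; larger files
            // would need several mappings. The mapping stays valid after the
            // channel is closed.
            buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    // Absolute reads: they do not move the buffer's position.
    byte readByte(int position) {
        return buffer.get(position);
    }

    // Reads eight bytes in big-endian order, the default for java.nio buffers.
    long readLong(int position) {
        return buffer.getLong(position);
    }
}
```

After the first touch, reads are served from the operating system's page cache, which is consistent with the warmup effect the benchmark comments attribute to query round #1.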
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032196#comment-13032196 ]

Jason Rutherglen commented on HBASE-3529:

HDFS-941 isn't applying to trunk, and we'll need a semi-unique build of the HDFSDirectory and benchmarking code updated to Hadoop trunk (as opposed to Hadoop 0.20-append). Given that Unix Domain Sockets (HADOOP-6311) is for trunk (rather than 0.20-append), we may want to wait for a version of HBase that runs on Hadoop trunk (eg, the current direct file access works fine; Unix Domain Sockets is only for security, not speed). Then we can put off benchmarking HDFS-941 as well.
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020892#comment-13020892 ]

Jason Rutherglen commented on HBASE-3529:

I updated the HBase search branch at Github and created complete instructions for how to execute the benchmark. This should also help with examining the code. The HBASE-SEARCH project contains 10,000 bz2 compressed wiki-en documents which account for 100 MB of the download. The slightly modified Lucene libraries are located in the lib/ directory (so that you do not need to download the entire Lucene branch source).

https://github.com/jasonrutherglen/HBASE-SEARCH/blob/trunk/BENCHMARK.txt

The Lucene vs. HBase Search indexing and search times will be located in the file:

target/surefire-reports/org.apache.hadoop.hbase.search.TestSearchBenchmark-output.txt

{noformat}
Benchmark Execution Instructions

Create a directory for the HBase Lucene installation. Then run the following:

git clone git://github.com/jasonrutherglen/HDFS-347-HBASE.git HDFS-347-HBASE
cd HDFS-347-HBASE
ant mvn-install
cd ..
git clone git://github.com/jasonrutherglen/HBASE-SEARCH.git HBASE-SEARCH
cd HBASE-SEARCH
cd lib
./install-libs.sh
cd ..
cd wiki-en
tar -jxvf 1.bz2
cd ..
mvn test -Dtest=TestSearchBenchmark
{noformat}

Feel free to let me know if there are problems or if you have questions.
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019879#comment-13019879 ]

Jason Rutherglen commented on HBASE-3529:

Here are some basic benchmark numbers. The code is more or less pushed to Github. I need to verify it all works for a clean download of the various parts, of which there are 3: Lucene, the HDFS-347-modified Hadoop 0.20-append, and HBase with Search.

The architecture is to write out a single block per Lucene file. In this way we can simply obtain one underlying java.io.File directly from the DFSClient. The file is then MMap'ed using a modified version of the MMapDirectory called HDFSDirectory.

The benchmark shows that storing Lucene indexes into HDFS and reading directly from HDFS is viable (as opposed to copying the files out of HDFS first to the local filesystem).

Here are times in milliseconds, on the Wiki-EN corpus:

lucene indexing duration: 50202
lucene query time #1: 11780
lucene query time #2: 6211
lucene query time #3: 6181

hbase indexing duration: 70681
hbase query time #1: 8332
hbase query time #2: 6785
hbase query time #3: 6621

As you can see, the indexing is a little bit slower when writing to HDFS. However, with the new changes going into Lucene (ie, LUCENE-2324), a pause when flushing (due to HDFS overhead) will not slow down indexing. So expect indexing parity soon.

The main query times to look at are #2 and #3, allowing for warmup of the system IO cache in #1. HBase queries are somewhat slower because each new DFSInputStream created must contact the DataNode. We can optimize this; however, I think for now we're good.

Here are the queries being run (50 times per round); they are non-trivial.

states
unit*
uni*
u*d
un*d
united~0.75
united~0.6
unit~0.7
unit~0.5
doctitle:/.*[Uu]nited.*/
united OR states
united AND states
nebraska AND states
"united states"
"united states"~3
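As a rough way to read the times quoted in the comment above, the sketch below averages the warm rounds (#2 and #3, skipping #1 as the comment suggests) and expresses the HDFS-backed run as a ratio of the plain Lucene run. The class and method names are hypothetical; only the millisecond values come from the comment.

```java
// Hypothetical helper for summarizing the benchmark rounds quoted above.
final class BenchmarkSummary {
    // Mean of all rounds except the first, which also pays for IO cache warmup.
    static double warmMean(long[] roundsMs) {
        if (roundsMs.length < 2) {
            throw new IllegalArgumentException("need at least one warm round after round #1");
        }
        long sum = 0;
        for (int i = 1; i < roundsMs.length; i++) {
            sum += roundsMs[i];
        }
        return (double) sum / (roundsMs.length - 1);
    }

    // Ratio above 1.0 means the candidate took longer than the baseline.
    static double ratio(double candidateMs, double baselineMs) {
        return candidateMs / baselineMs;
    }

    public static void main(String[] args) {
        long[] luceneQueryMs = {11780, 6211, 6181};
        long[] hbaseQueryMs = {8332, 6785, 6621};
        System.out.printf("warm query ratio (hbase/lucene): %.3f%n",
                ratio(warmMean(hbaseQueryMs), warmMean(luceneQueryMs)));
        System.out.printf("indexing ratio (hbase/lucene): %.3f%n",
                ratio(70681, 50202));
    }
}
```

With only two warm rounds per system this is an indication rather than a measurement, so differences of a few percent should not be over-interpreted.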
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018835#comment-13018835 ]

Jason Rutherglen commented on HBASE-3529:

I'm working on profiling and optimizing the HDFS random access, so that the Lucene HDFS query times are the same as with native file system access using NIOFSDirectory. I think one extremely direct approach is to set the max block size to something above the size of every Lucene segment file (at runtime via the DFSClient.create method). This will guarantee that there is only one underlying java.io.File per HDFS file, and so random access will avoid navigating block structures (which requires expensive network calls, a binary search, and object creation overhead).
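The block-size idea in the comment above comes down to simple arithmetic, sketched here without any Hadoop dependency. The names are hypothetical, and two things are assumptions rather than statements from the thread: that an upper bound on the segment file length is known when the file is created, and that the block size should be a whole multiple of the checksum chunk size (512 bytes is used in the example).

```java
// Hypothetical helper: picks a block size large enough that a file of up to
// maxFileLength bytes occupies exactly one block.
final class SingleBlockSizing {
    // Rounds maxFileLength up to a whole number of checksum chunks (at least one).
    static long blockSizeFor(long maxFileLength, int bytesPerChecksum) {
        if (maxFileLength < 0 || bytesPerChecksum <= 0) {
            throw new IllegalArgumentException("lengths must be non-negative and chunk size positive");
        }
        long padded = Math.addExact(maxFileLength, (long) bytesPerChecksum - 1);
        long chunks = Math.max(1L, padded / bytesPerChecksum);
        return Math.multiplyExact(chunks, (long) bytesPerChecksum);
    }

    // Number of blocks a file of fileLength bytes would span at the given block size.
    static long blockCount(long fileLength, long blockSize) {
        if (fileLength < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileLength must be non-negative and blockSize positive");
        }
        return fileLength == 0 ? 0 : (fileLength - 1) / blockSize + 1;
    }
}
```

The resulting value would then be passed as the block size when creating the file, so that a read at any offset resolves to the same single local block file.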
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017210#comment-13017210 ]

Jason Rutherglen commented on HBASE-3529:

I placed the HDFS-347 changes in a Github repository located at:

https://github.com/jasonrutherglen/HDFS-347-HBASE
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016055#comment-13016055 ]

Otis Gospodnetic commented on HBASE-3529:

Jason, what is the current state of this work? Does it work with the trunk? Is there a list of issues/problems that need to be fixed before this can be called working?

Thanks!
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016079#comment-13016079 ]

Jason Rutherglen commented on HBASE-3529:

@Otis The next step is to benchmark the query performance, which may be degraded due to the random positional read performance of HDFS. For this maybe we should use:

http://code.google.com/a/apache-extras.org/p/luceneutil/

Also, the blocking issues should [ideally] be resolved. You can take a look at the Solr one, SOLR-1431, and commit it.
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016085#comment-13016085 ]

Otis Gospodnetic commented on HBASE-3529:

Thanks Jason. What's the Solr dependency about? I thought your idea is to go with pure Lucene-level HBase + indexing integration, not Solr. I do see you mention Solr's schema in the initial comments in this issue, but can't find any mentions of Solr in your patch. Could you please clarify the approach?

Oh, and if the ML is a better medium, I can move my questions there. Thanks.
[jira] [Commented] (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016088#comment-13016088 ]
Jason Rutherglen commented on HBASE-3529:
@Otis We can benchmark using Lucene in conjunction with HDFS-347, of which I have a more streamlined version that'll be available on GitHub. Implementing Solr for benchmarking would create too much overhead. I think we may want to integrate with Solr [in the future] for out of the box distributed queries, facets, and also to make use of the schema. I'll likely open additional Solr related issues when we get there.
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008006#comment-13008006 ]
Ted Yu commented on HBASE-3529:
postWALRestore would pass one WALEdit which is for one row. postPut is for one row as well.
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007761#comment-13007761 ]
Andrew Purtell commented on HBASE-3529:
@Todd Hosting subprojects sounds reasonable to me. We want to make a friendly home for cool new work but can also accommodate downstream packagers who don't want any kind of support implied.
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007768#comment-13007768 ]
Jason Rutherglen commented on HBASE-3529:
> In DefaultDocumentTransformer, I think we should check whether row has changed
Is it possible to modify multiple rows per postPut or postWALRestore? Are the KeyValue(s) sorted by row? We probably want to group row modifications together. Also, it seems that it's possible to update only a select few columns of a row? So we may need to reload the entire row and index it again?
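The idea of grouping modifications by row before re-indexing, raised in the comment above, could look roughly like the sketch below. It is only an illustration under stated assumptions: `RowCell` is a hypothetical stand-in for an HBase KeyValue, `RowGrouper` is a made-up helper, and nothing here comes from the attached patch. Row keys are ordered with unsigned lexicographic byte comparison, which requires Java 9 or later for `Arrays.compareUnsigned`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for an HBase KeyValue: only a row key and a column name. */
final class RowCell {
    final byte[] row;
    final String column;

    RowCell(byte[] row, String column) {
        this.row = row;
        this.column = column;
    }
}

/** Hypothetical helper that batches cells so each row can be re-indexed once. */
final class RowGrouper {
    private RowGrouper() {}

    /** Groups cells by row key; rows are ordered by unsigned lexicographic byte order. */
    static Map<byte[], List<RowCell>> groupByRow(List<RowCell> cells) {
        Comparator<byte[]> rowOrder = Arrays::compareUnsigned;
        Map<byte[], List<RowCell>> byRow = new TreeMap<>(rowOrder);
        for (RowCell cell : cells) {
            // byte[] has identity-based equals, so a comparator-backed map is needed here.
            byRow.computeIfAbsent(cell.row, k -> new ArrayList<>()).add(cell);
        }
        return byRow;
    }
}
```

With the cells grouped this way, an indexer could reload and re-index each affected row once, even when the incoming edit touches only a few of its columns.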
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007780#comment-13007780 ]
stack commented on HBASE-3529:
@Andrew I'm against taking on src/contribs given past experience where they tended to add friction to major core changes. With hbase up in Apache git, I think it's easier for projects that are not in our src tree to follow along (GitHub makes it easy to document, etc., the related external project). Discussion of the add-on up on hbase is grand (and encouraged I'd say since it lets the rest of the hbase space know of the addition) but no src I'd say. Any changes to core that an external project requires in order to work we should take on too (given good justification).
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007782#comment-13007782 ]
Andrew Purtell commented on HBASE-3529:
@Stack I didn't say contrib, I said sub projects.
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007796#comment-13007796 ]
stack commented on HBASE-3529:
@Andrew Pardon me for my misread, but I'd be against keeping up subprojects too because of the admin load. We don't need it IMO.
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002438#comment-13002438 ]
Jason Rutherglen commented on HBASE-3529:
To get Solr distributed queries working across the searchable HBase cluster, we'll need SOLR-1431 completed. Then in this issue, we'll implement the underlying data transfer protocol using HBase RPC (instead of HTTP).
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000866#comment-13000866 ]
Jason Rutherglen commented on HBASE-3529:
@Stack Thanks for the analysis. I forgot to mention that each subquery would also require its own FSInputStream, which would be too many file descriptors. The heap required for 25 bytes * 2 mil docs is 50MB, eg, that's too much?
I think we can go ahead with the positional read, which would only require one FSInputStream per file, shared by all readers of that file (using FileChannel.read(ByteBuffer dst, long position) underneath). Given the number of blocks per Lucene file will be 10 and the blocks are of a fixed size, we can compute (offset / blocksize) to efficiently obtain the block index? I think it'll be efficient to translate a file offset into a local block file, eg, I'm not sure why LocatedBlocks.findBlock uses a binary search because I'm not familiar enough with HDFS. Then we'd just need to cache the LocatedBlock(s), instead of looking them up from the DataNode on each small read byte[1024] call.
In summary:
* DFSClient.DFSInputStream.getBlockRange looks fast enough for many calls per second.
* LocatedBlocks.findBlock uses a binary search for some reason, and that'll be a bottleneck. Why can't we divide the offset by the block size? Oh ok, that's because block sizes are variable. I guess if the number of blocks is small the binary search will always be fast? Or we can detect if the blocks are of the same size and divide to get the correct block?
* DFSClient.DFSInputStream.fetchBlockByteRange is a hotspot because it calls chooseDataNode, whose return value [DNAddrPair] can be cached inside of LocatedBlock?
* Later in fetchBlockByteRange we call DFSClient.createClientDatanodeProtocolProxy() and make a local RPC call, getBlockPathInfo. I think the results of this [BlockPathInfo] can be cached into LocatedBlock as well?
* Then instead of instantiating a new BlockReader object, we can call FileChannel.read(ByteBuffer b, long pos) directly?
* With this solution in place we can safely store documents in the docstore without any worries, and in addition use the system that is most efficient in Lucene today, all the while using the fewest file descriptors possible.
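The block lookup and positional read discussed in the comment above can be sketched roughly as follows. This is a minimal illustration, not HDFS code: `BlockLookup` and all of its method names are hypothetical, and block layouts are modelled simply as an array of sorted start offsets. The only real APIs used are the JDK's `Arrays.binarySearch` and `FileChannel.read(ByteBuffer, long)`.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Arrays;

/** Hypothetical helper; not part of HDFS or of the patch attached to this issue. */
final class BlockLookup {
    private BlockLookup() {}

    /** Fixed-size blocks: the block index is a single integer division. */
    static int indexForFixedSize(long offset, long blockSize) {
        return (int) (offset / blockSize);
    }

    /**
     * Variable-size blocks: binary search over sorted block start offsets.
     * Returns the index of the last block starting at or before the offset.
     * Assumes offset >= blockStarts[0].
     */
    static int indexForVariableSize(long[] blockStarts, long offset) {
        int pos = Arrays.binarySearch(blockStarts, offset);
        // A miss returns -(insertionPoint) - 1; the containing block is insertionPoint - 1.
        return pos >= 0 ? pos : -pos - 2;
    }

    /** True when consecutive block starts are evenly spaced, so division can replace the search. */
    static boolean uniformBlocks(long[] blockStarts) {
        if (blockStarts.length < 2) {
            return true;
        }
        long size = blockStarts[1] - blockStarts[0];
        for (int i = 2; i < blockStarts.length; i++) {
            if (blockStarts[i] - blockStarts[i - 1] != size) {
                return false;
            }
        }
        return true;
    }

    /**
     * Positional read: fills dst starting at the given file position without
     * moving the channel's own position, so one channel can serve many readers.
     * Returns the number of bytes read, which is short only at end of file.
     */
    static int readAt(FileChannel channel, long position, byte[] dst) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(dst);
        int total = 0;
        while (buf.hasRemaining()) {
            int n = channel.read(buf, position + total);
            if (n < 0) {
                break; // end of file
            }
            total += n;
        }
        return total;
    }
}
```

With only a handful of blocks per file, the binary search costs a few comparisons at most, so the division shortcut mainly saves code paths rather than time; the larger saving suggested in the comment is in caching the per-block lookups.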
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001209#comment-13001209 ]
Jason Rutherglen commented on HBASE-3529:
We'll want to keep a single ConcurrentMergeScheduler per HRegionServer (rather than per HRegion) even though there'll be an IndexWriter per HRegion (eg, the default is to have a CMS per IW, which could potentially generate too many threads). I'm wondering if there's a global attribute space to put the CMS so that it can be reused across HRegions?
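The concern in the comment above is thread count: one merge scheduler per index writer means the number of merge threads grows with the number of regions. The general shape of a per-process shared pool is sketched below as an analogy only. It uses plain JDK executors rather than Lucene's ConcurrentMergeScheduler, and `SharedMergePool`, `RegionIndexer`, and the thread cap of 3 are all hypothetical; whether a Lucene merge scheduler can safely be shared across writers is a separate question this sketch does not answer.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical per-process holder, standing in for "one scheduler per HRegionServer". */
final class SharedMergePool {
    // Assumed cap for illustration; not a Lucene or HBase default.
    private static final int MAX_MERGE_THREADS = 3;

    private static final ExecutorService POOL =
        Executors.newFixedThreadPool(MAX_MERGE_THREADS, runnable -> {
            Thread t = new Thread(runnable, "shared-merge");
            t.setDaemon(true); // do not keep the JVM alive for background merges
            return t;
        });

    private SharedMergePool() {}

    static ExecutorService get() {
        return POOL;
    }
}

/** Hypothetical per-region indexer that borrows the shared pool instead of owning threads. */
final class RegionIndexer {
    private final String regionName;
    private final ExecutorService mergePool;

    RegionIndexer(String regionName) {
        this.regionName = regionName;
        this.mergePool = SharedMergePool.get();
    }

    ExecutorService mergePool() {
        return mergePool;
    }

    /** Queues a placeholder "merge" on the shared pool and returns a handle to its result. */
    Future<String> scheduleMerge() {
        return mergePool.submit(() -> "merged " + regionName);
    }
}
```

However many `RegionIndexer` instances exist, at most `MAX_MERGE_THREADS` background threads run, which is the property the comment is after.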
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001269#comment-13001269 ]
ryan rawson commented on HBASE-3529:
can you submit this to the proper jira? This isn't hdfs :)
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001277#comment-13001277 ]
Jason Rutherglen commented on HBASE-3529:
@Ryan Sure, I just wanted to iterate here a little bit, and then test it out with the HDFSDirectory implementation, before submitting it to HDFS-347.
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001280#comment-13001280 ]

stack commented on HBASE-3529:

@Jason Do you need to hack on hdfs first? Is it critical to making the search work on hbase?
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001285#comment-13001285 ]

Jason Rutherglen commented on HBASE-3529:

bq. Do you need to hack on hdfs first? Its critical to making the search work on hbase?

Yes. HDFS as it is would make queries execute extremely slowly (because of random small reads). Also, I don't know how to implement the HDFSDirectory (the Lucene interface to the filesystem) without knowing how HDFS works. In this case, we need to use NIO positional read underneath. I think the patch shows NIO positional read is doable, and hopefully it'll be completed shortly, enough to implement HDFSDirectory and then run a performance comparison of HDFSDirectory vs. NIOFSDirectory. Eg, we'll build identical indexes in both directories, run the same queries, and examine the difference in query speed.
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001289#comment-13001289 ]

stack commented on HBASE-3529:

OK. Why NIO positional read? How is that different from the pread that is already in the dfsclient api? You don't like going via the Block API? Above you say in parens '...(using FileChannel.read(ByteBuffer dst, long position)...'. What if the data is not local? Usually it is (99% of the time), but not always; e.g. in time of failure or perhaps after a rebalance. Are you going to get the FileChannel off the socket (that's the nio bit)?

You do get that hdfs-347 is a naughty hack as is? A version that respects 'security', where the 'cleared' fd is passed via unix domain sockets for the dfsclient to use going direct, is probably what'll go in sometime soon, hopefully.

You are messing down deep below hbase in dfs. I'm a little worried that you'll do a bunch of custom work that may work for your lucene directory implementation but that will be so particular it won't be accepted back into hdfs.
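For readers unfamiliar with the call quoted above, here is a minimal sketch of a positional read using the standard `java.nio` `FileChannel.read(ByteBuffer dst, long position)` method against an ordinary local file. It is not the HDFS-347 patch or any DFSClient code; the class name `PositionalReadSketch`, the helper `readAt`, and the temporary file are assumptions made purely for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only; not part of HDFS, HBase, or Lucene.
final class PositionalReadSketch {

    // Reads up to `length` bytes starting at `position`. The positional form of
    // FileChannel.read does not modify the channel's own position.
    static byte[] readAt(FileChannel channel, long position, int length) throws IOException {
        ByteBuffer dst = ByteBuffer.allocate(length);
        long pos = position;
        while (dst.hasRemaining()) {
            int n = channel.read(dst, pos);
            if (n < 0) {
                break; // reached end of file before filling the buffer
            }
            pos += n;
        }
        dst.flip();
        byte[] out = new byte[dst.remaining()];
        dst.get(out);
        return out;
    }

    // Writes a tiny local file, then reads three bytes starting at offset 4.
    static String demo() {
        try {
            Path tmp = Files.createTempFile("pread-sketch", ".bin");
            try {
                Files.write(tmp, "0123456789".getBytes(StandardCharsets.US_ASCII));
                try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                    return new String(readAt(ch, 4, 3), StandardCharsets.US_ASCII);
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 456
    }
}
```

Because the read carries its own offset, one open channel can serve many small reads at different positions, which is the access pattern the thread is worried about. This only covers the local-file case; it says nothing about what happens when the block lives on another node.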
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001292#comment-13001292 ]

Ted Yu commented on HBASE-3529:

In certain deployments, the data node and region server are not on the same machine. The above would pose a performance issue.
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001299#comment-13001299 ]

Jason Rutherglen commented on HBASE-3529:

bq. Why niopositional read? How is that different than the pread that is already in the dfsclient api

I think the goal of HDFS-347 is that it'll automatically switch between reading over the network and reading locally? So the pread'll do one or the other?

bq. You going to get the FileChannel off the socket (thats the nio bit)?

That's just for the local file.

bq. What if the data is not local, usually it is ( 99% of the time), but is not always; e.g. in time of failure or perhaps after a rebalance.

If we read off a socket I think there's going to be a serious degradation in performance. I think that's an invariant of search?

{quote}A version that respects 'security', where the 'cleared' fd is passed via unix domain sockets, for the dfsclient to use going direct is probably what'll go in sometime soon hopefully.{quote}

That'll be good! I think this initial version (of HDFS modifications) is simply to get things going. As these other [HDFS] improvements are added we can use them, and the DFSInputStream methods used by HDFSDirectory'll be the same?

{quote}You are messing down deep below hbase in dfs. I'm a little worried that you'll do a bunch of custom work that may work for your lucene directory implementation but that it will be so particular, it won't be accepted back into hdfs.{quote}

If we need to pass the FD using Unix domain sockets then the HDFS work won't be useful. If the UDS's enable positional read, then the [Lucene] HDFSDirectory will work well.
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000283#comment-13000283 ]

Jason Rutherglen commented on HBASE-3529:

I started on the search part, which is nice as it can utilize HBase's Coprocessor RPC mechanism. The design question is whether we need to store a unique [family, column, row, timestamp] per column/field in Lucene, or whether this only needs to be stored per column family. This'll be used when iterating over the results from Lucene, which yields docids; we'll then look up the values in the doc store, call Get for each doc, and add the Result to the search response. I think this is how it should work?
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000398#comment-13000398 ]

stack commented on HBASE-3529:

You'll have to include row, column family, and qualifier at least if you are to get from lucene back to the latest version of the cell, won't you? If you want to index more than just the current version of a cell, you'll have to include the hbase timestamp in the lucene index. If your lucene indices are per column family, you could leave the column family out of the lucene document and it can be picked up from context; that would leave row, qualifier and timestamp in the lucene document.
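The per-column-family layout described above (row, qualifier and timestamp kept with the Lucene document, family implied by which index is being searched) can be sketched as a small value type with a byte encoding. This is a hypothetical illustration, not code from the attached patch: the class `CellRef` and its length-prefixed layout are assumptions, and it uses only the JDK rather than any HBase or Lucene API.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; the column family is assumed to be known
// from context because each index covers exactly one family.
final class CellRef {
    final byte[] row;
    final byte[] qualifier;
    final long timestamp;

    CellRef(byte[] row, byte[] qualifier, long timestamp) {
        this.row = row;
        this.qualifier = qualifier;
        this.timestamp = timestamp;
    }

    // Layout: [row length][row bytes][qualifier length][qualifier bytes][timestamp]
    byte[] encode() {
        ByteBuffer buf = ByteBuffer.allocate(4 + row.length + 4 + qualifier.length + 8);
        buf.putInt(row.length).put(row);
        buf.putInt(qualifier.length).put(qualifier);
        buf.putLong(timestamp);
        return buf.array();
    }

    // Reverses encode(); assumes the input was produced by encode().
    static CellRef decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        byte[] row = new byte[buf.getInt()];
        buf.get(row);
        byte[] qualifier = new byte[buf.getInt()];
        buf.get(qualifier);
        return new CellRef(row, qualifier, buf.getLong());
    }
}
```

If only the latest version of a cell is indexed, the timestamp field could be dropped, which matches the first case in the comment above.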
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000667#comment-13000667 ]

Jason Rutherglen commented on HBASE-3529:

Regarding HDFS-347 and the issues around fast local file access: I started reimplementing HDFS-347, but I realized it'll be fruitless without an efficient [cached] way of finding the local file a given offset corresponds to. Is there a way for the DFSClient to listen for changes to the DataNode and then keep a memory resident 'cache' for the purpose of quickly finding which local file(s) a given positional read + length corresponds to?
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000693#comment-13000693 ]

stack commented on HBASE-3529:

@Jason Which offset are you talking of? The storefile in hbase keeps offsets in a file index. When we ask to read from a position in the hfile, dfsclient does a quick calc to figure out which block, and then the relative offset into the target block. Are you talking of something more fine grained or something else?
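The "quick calc" mentioned above amounts to integer division and remainder when blocks have a fixed size. The following is a sketch under that assumption (every block except possibly the last is exactly `blockSize` bytes); it is not the actual DFSClient code, and the class and method names are made up for illustration.

```java
// Hypothetical illustration only; assumes fixed-size blocks.
final class BlockOffsetSketch {

    // Index of the block that contains the given file position.
    static long blockIndex(long filePosition, long blockSize) {
        return filePosition / blockSize;
    }

    // Offset of the given file position relative to the start of its block.
    static long offsetInBlock(long filePosition, long blockSize) {
        return filePosition % blockSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;  // a 64 MB block
        long position = 150 * mb;  // read starting 150 MB into the file
        System.out.println(blockIndex(position, blockSize));         // prints 2
        System.out.println(offsetInBlock(position, blockSize) / mb); // prints 22
    }
}
```

With 64 MB blocks, a read at 150 MB lands in the third block (index 2), 22 MB past that block's start.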
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000697#comment-13000697 ]

Jason Rutherglen commented on HBASE-3529:

Sorry, I thought through the file access a little more. I think we can use the block local reader as is. Because Lucene reads the postings sequentially, we don't really need random file access (eg, the offset issue more or less goes away); we simply need to allow seek'ing forward, and most postings will live inside a single (64 - 128 MB) block. The issue with this system is that we may need to maintain an FSInputStream per thread per file, because we probably don't want to open a new FSInputStream per query given the overhead of creating and destroying them? Will this cause issues with the max file descriptors?
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000706#comment-13000706 ]

stack commented on HBASE-3529:

@Jason Currently HBase keeps all files open all the time (Yeah, users have to up their ulimit if they have more than a smidgeon of data in hbase--requirement #4 or #5).
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000721#comment-13000721 ]

Jason Rutherglen commented on HBASE-3529:

Ah, going back to storing the row, qualifier and timestamp in a Lucene document/docstore: the problem is that it requires totally random reads. I wonder if there's some efficient way to store row pointers in RAM (compression?), or a Hadoop data structure that can be used? I think that storing this information in the Lucene field cache is going to cause OOMs. It'd be great if we could simply store a long that points to the exact row and column family we'd like to reference, as that could easily be stored in RAM and would possibly enable faster lookup.
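As a rough illustration of the "one long per document" idea, the sketch below keeps a `long[]` indexed by Lucene docid, so a lookup is a single array read and the cost is 8 bytes per document. The packing (a 1-byte column family id in the top byte, a 56-bit row ordinal below it) and the names `RowPointerTable`, `pack`, `familyOf` and `ordinalOf` are assumptions made for illustration; nothing in this thread defines how a long would actually address a row.

```java
// Hypothetical sketch: one packed long per Lucene docid, held entirely in RAM.
final class RowPointerTable {
    // Low 56 bits hold an assumed row ordinal; the top 8 bits hold a family id.
    private static final long ORDINAL_MASK = (1L << 56) - 1;

    private final long[] pointers; // array index = Lucene docid

    RowPointerTable(int maxDoc) {
        this.pointers = new long[maxDoc];
    }

    static long pack(int familyId, long rowOrdinal) {
        if (familyId < 0 || familyId > 255) {
            throw new IllegalArgumentException("familyId must fit in one byte");
        }
        if (rowOrdinal < 0 || rowOrdinal > ORDINAL_MASK) {
            throw new IllegalArgumentException("rowOrdinal must fit in 56 bits");
        }
        return ((long) familyId << 56) | rowOrdinal;
    }

    static int familyOf(long pointer) {
        return (int) (pointer >>> 56); // unsigned shift keeps ids 128..255 positive
    }

    static long ordinalOf(long pointer) {
        return pointer & ORDINAL_MASK;
    }

    void set(int docId, int familyId, long rowOrdinal) {
        pointers[docId] = pack(familyId, rowOrdinal);
    }

    long get(int docId) {
        return pointers[docId];
    }

    // Heap used by the pointer array itself, ignoring the array header.
    long heapBytes() {
        return 8L * pointers.length;
    }
}
```

At 8 bytes per document, 100 million documents would need about 800 MB for the array, which is the kind of figure to weigh against the field cache concern raised above.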
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000730#comment-13000730 ]

stack commented on HBASE-3529:

Are you thinking you could exploit an HBase scan somehow? If so, how do you think it would work? What's a Lucene docid? A long? Or a double? You could toBytes that and that'd be the HBase row (HBase rows are byte arrays). The column family could be one byte; that'd give you a maximum of 256 column family names. The qualifier probably has to be the Lucene document field name. You could try to keep these short. Timestamp is a long. So that's two longs (docid + ts), one byte for cf, and, say, 8 characters for the field name: that's about 25 bytes per Lucene doc. Will that cause you to run out of mem?
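The 25-byte estimate can be checked with a small sketch. It packs the four parts named in the comment into one buffer: 8 bytes for the docid, 8 for the timestamp, 1 for the column family and 8 for the field name. The class and method names are made up for illustration, the field name is assumed to be ASCII and is truncated or zero-padded to 8 bytes, and the docid is treated as a long as the comment supposes (Lucene itself hands out int docids, which would save 4 bytes).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical layout for one docid -> HBase cell pointer, following the
// byte counts in the comment above.
final class DocPointerLayout {
    static final int FIELD_NAME_BYTES = 8;
    // docid (8) + timestamp (8) + column family (1) + field name (8) = 25
    static final int ENTRY_BYTES = Long.BYTES + Long.BYTES + 1 + FIELD_NAME_BYTES;

    static byte[] encode(long docId, long timestamp, byte family, String fieldName) {
        byte[] name = fieldName.getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.allocate(ENTRY_BYTES);
        buf.putLong(docId);
        buf.putLong(timestamp);
        buf.put(family);
        // Copy at most 8 bytes of the name; unused trailing bytes stay zero.
        buf.put(name, 0, Math.min(name.length, FIELD_NAME_BYTES));
        return buf.array();
    }

    // Raw payload size for a given number of documents, ignoring object overhead.
    static long estimateBytes(long docCount) {
        return docCount * ENTRY_BYTES;
    }
}
```

For example, `DocPointerLayout.estimateBytes(10_000_000L)` gives 250,000,000 bytes of raw payload for ten million documents, before any per-object overhead if each entry were held as a separate array.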
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999789#comment-12999789 ]

Jason Rutherglen commented on HBASE-3529:

It looks simple to change HDFS-347 (the HDFS-347-branch-20-append.txt patch) to read using positional reads. I'm sure it's necessary, as a block reader is instantiated per DFSInputStream? read(long position, byte[] buffer, int offset, int length) calls getBlockRange, which is sync'd. Then the read method calls fetchBlockByteRange, which calls BlockReader.newBlockReader, eg, the blockreader is per thread and isn't reused? So the contention would be in getBlockRange? Perhaps there's not an issue, or not much of one, if the HDFS-347-branch-20-append.txt patch (or something like it) is applied (using HADOOP-6311)?

I guess the go-ahead is to write a Lucene Directory that uses HDFS underneath and gains concurrency by using DFSInputStream.read(long position, ...)?

Oh, the other issue would be all the overhead from simply loading a byte[1024] (eg, all the new object creation etc). Hmm... That'll be a problem.
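To make the positional-read point concrete, here is a minimal sketch that uses `java.nio.channels.FileChannel` on a local file as a stand-in for `DFSInputStream`; it does not touch HDFS or the HDFS-347 patch. `FileChannel.read(ByteBuffer, long)` reads at an absolute offset without moving the channel's own position, so several threads can share one open channel without a seek-then-read lock, which is the property a Lucene Directory over HDFS would want from `read(long position, ...)`. The class name and the `readAt` helper are hypothetical.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Local-file stand-in for positional reads; not HDFS code.
final class PositionalReadSketch {

    // Read exactly `length` bytes starting at `position`, leaving the
    // channel's own position untouched.
    static byte[] readAt(FileChannel ch, long position, int length) throws IOException {
        ByteBuffer dst = ByteBuffer.allocate(length);
        long pos = position;
        while (dst.hasRemaining()) {
            int n = ch.read(dst, pos);
            if (n < 0) {
                throw new EOFException("read past end of file");
            }
            pos += n;
        }
        return dst.array();
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("positional", ".bin");
        try {
            byte[] data = new byte[4096];
            for (int i = 0; i < data.length; i++) {
                data[i] = (byte) i;
            }
            Files.write(file, data);

            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                // Four threads read different 1 KB blocks through the same channel.
                Thread[] readers = new Thread[4];
                for (int t = 0; t < readers.length; t++) {
                    final long offset = t * 1024L;
                    readers[t] = new Thread(() -> {
                        try {
                            byte[] block = readAt(ch, offset, 1024);
                            if (block[0] != (byte) offset) {
                                throw new IllegalStateException("unexpected data");
                            }
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    });
                    readers[t].start();
                }
                for (Thread r : readers) {
                    r.join();
                }
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Note that `readAt` allocates a fresh buffer per call, which is exactly the per-read object overhead the comment worries about; a real Directory implementation would presumably reuse buffers.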
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999528#comment-12999528 ]

Jason Rutherglen commented on HBASE-3529:

Where is a good 'temp' directory to place the Lucene indexes relative to other local HBase files?
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999533#comment-12999533 ]

ryan rawson commented on HBASE-3529:

There are no local HBase files. You'll have to come up with something yourself, I guess?
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999537#comment-12999537 ]

Jason Rutherglen commented on HBASE-3529:

Maybe something relative to HDFS then?
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999542#comment-12999542 ]

Andrew Purtell commented on HBASE-3529:

I mailed a comment back but it is not showing up fast enough. We have internally been discussing the addition of a Coprocessor framework API for reading and writing streams from/to the region data directory in HDFS.
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999545#comment-12999545 ]

Andrew Purtell commented on HBASE-3529:

We have internally been discussing the addition of a Coprocessor framework API for reading and writing streams from/to the region data directory in HDFS.
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999546#comment-12999546 ]

ryan rawson commented on HBASE-3529:

It's going to be tricky, since with security some people may choose to run HDFS and HBase as different users. Furthermore, most Hadoop installs have multiple JBOD-style disks, and places like /tmp won't have much room (my /tmp has 2GB). If you can avoid local files as much as possible, I'd try to do that.
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999552#comment-12999552 ]

Jason Rutherglen commented on HBASE-3529:

{quote}We have internally been discussing the addition of a Coprocessor framework API for reading and writing streams from/to the region data directory in HDFS.{quote}

This'd be good. However, for Lucene we'll need to directly access the local filesystem for performance reasons, eg, HDFS sounds like it's going to be slower than going direct (at the moment). Because the indexes will be local, we'll need to periodically sync the local index to HDFS. This isn't as difficult as it sounds, because we can save off a Lucene commit point and write the checkpoint's index files to HDFS while letting other Lucene operations proceed.

I'd say we can move to writing directly to HDFS when HBase no longer uses a heap-based block store (and instead relies on the system IO cache).
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999559#comment-12999559 ]

Andrew Purtell commented on HBASE-3529:

Writing the indexes to HDFS is possible after LUCENE-2373? We get direct reads from HDFS via HDFS-347 and the OS block cache can help there?
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999562#comment-12999562 ]

Gary Helmling commented on HBASE-3529:

Yeah, as Ryan mentions, with security, writing to HDFS via a coprocessor extension will be easiest to enable. I wonder if that plus HDFS-347 (which allows reading directly from the local FS if the block exists on the local DN) would allow for good enough performance? Of course, HDFS-347 itself is tricky from a security perspective.

If local disk writes are the only solution, then the best option may be to make the user plan for it and explicitly specify a Lucene index path in the coprocessor configuration.
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999565#comment-12999565 ]

Jason Rutherglen commented on HBASE-3529:

bq. Writing the indexes to HDFS is possible after LUCENE-2373?

Right, that's implemented in trunk as the append codecs. https://hudson.apache.org/hudson/job/Lucene-trunk/javadoc//contrib-misc/org/apache/lucene/index/codecs/appending/AppendingCodec.html

bq. We get direct reads from HDFS via HDFS-347 and the OS block cache can help there?

BlockReaderLocal is sync'd on each method; that's something we outgrew in Lucene a while back (in its place NIOFSDirectory is most used, with MMap second). We'd likely have a couple of options here: write to HDFS and [probably] slow queries to some extent, or write directly to a local directory and have the mechanical overhead of copying index files in/out of HDFS.
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999579#comment-12999579 ]
Jason Rutherglen commented on HBASE-3529:
Also, I'm curious about how the HLog works, eg, it's archived into HDFS, is there a difference between what's archived and what's live (and would interleaving be necessary?).
The reason the HLog needs to be replayed [I think] is that deletes need to be executed. If we simply iterate/scan from a given timestamp, we'd get the new rows, however we'd miss executing deletes.
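The point about deletes can be sketched with plain JDK collections. This is only an illustration, not HBase or Lucene code: the `Edit` class, the `replay` and `scanFrom` methods, and the map of live rows are all hypothetical stand-ins for the HLog, the index, and the region.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ReplayVsScan {
    /** Hypothetical log entry: a put or a delete of one row at a timestamp. */
    static final class Edit {
        final long ts;
        final String rowKey;
        final boolean isDelete;

        Edit(long ts, String rowKey, boolean isDelete) {
            this.ts = ts;
            this.rowKey = rowKey;
            this.isDelete = isDelete;
        }
    }

    /** Replays log entries newer than fromTs, applying puts and deletes in order. */
    static Set<String> replay(Set<String> indexed, List<Edit> log, long fromTs) {
        Set<String> result = new TreeSet<>(indexed);
        for (Edit e : log) {
            if (e.ts <= fromTs) {
                continue; // already reflected in the index
            }
            if (e.isDelete) {
                result.remove(e.rowKey);
            } else {
                result.add(e.rowKey);
            }
        }
        return result;
    }

    /** Scans live rows newer than fromTs; deleted rows are simply absent, so nothing is removed. */
    static Set<String> scanFrom(Set<String> indexed, Map<String, Long> liveRows, long fromTs) {
        Set<String> result = new TreeSet<>(indexed);
        for (Map.Entry<String, Long> row : liveRows.entrySet()) {
            if (row.getValue() > fromTs) {
                result.add(row.getKey());
            }
        }
        return result;
    }
}
```

With rows a and b indexed up to timestamp 10, followed by a put of c at 11 and a delete of a at 12, `replay` ends with b and c, while `scanFrom` ends with a, b and c, leaving a stale document for a in the index.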
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999595#comment-12999595 ]
Jason Rutherglen commented on HBASE-3529:
In the RegionObserver/Coprocessor I don't think there are methods to access the log replay (on server restart), is that something that's planned?
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999598#comment-12999598 ]
Jason Rutherglen commented on HBASE-3529:
To answer the previous question, there's this issue: HBASE-3257
And on memstore flush, we'll do a Lucene index commit to ensure that when we replay the HLog, we won't need to access [potentially] out of date HLog entries. We can store the checkpoint meta-data in the Lucene commit, which obviates the need to implement terms dict last term access.
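A rough sketch of the commit-with-checkpoint idea follows. It uses plain JDK collections as a stand-in for the index, because the way Lucene attaches user data to a commit has changed between versions. The class name, the `hlog.checkpoint` key, and the use of a log sequence id as the checkpoint are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CheckpointedIndex {
    static final String CHECKPOINT_KEY = "hlog.checkpoint"; // hypothetical key name

    private final List<String> pending = new ArrayList<>();       // added but not yet committed
    private final List<String> committedDocs = new ArrayList<>(); // durable after commit
    private Map<String, String> commitData = new HashMap<>();     // metadata stored with the commit

    void add(String rowKey) {
        pending.add(rowKey);
    }

    /** Called on memstore flush: make pending docs durable together with the checkpoint. */
    void commit(long lastSequenceId) {
        committedDocs.addAll(pending);
        pending.clear();
        Map<String, String> data = new HashMap<>();
        data.put(CHECKPOINT_KEY, Long.toString(lastSequenceId));
        commitData = data;
    }

    /** Simulates a region server failure: uncommitted docs are lost. */
    void crash() {
        pending.clear();
    }

    /** Returns the checkpoint stored with the last commit, or -1 if there is none. */
    long checkpoint() {
        String value = commitData.get(CHECKPOINT_KEY);
        return value == null ? -1L : Long.parseLong(value);
    }

    /** Re-adds only the log entries newer than the checkpoint; returns how many were replayed. */
    int recover(long[] sequenceIds, String[] rowKeys) {
        long checkpoint = checkpoint();
        int replayed = 0;
        for (int i = 0; i < sequenceIds.length; i++) {
            if (sequenceIds[i] > checkpoint) {
                add(rowKeys[i]);
                replayed++;
            }
        }
        return replayed;
    }

    int committedCount() {
        return committedDocs.size();
    }
}
```

Because the checkpoint travels with the commit itself, recovery only needs to read the commit metadata rather than walk the terms dictionary for the newest timestamp.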
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999676#comment-12999676 ]
Jason Rutherglen commented on HBASE-3529:
Is https://issues.apache.org/jira/secure/attachment/12470743/HDFS-347-branch-20-append.txt the patch applied to CDH? If so, the readChunk method isn't implemented. Is there a plan to implement that, perhaps with NIO positional read? Implementing readChunk would make storing the indexes in HDFS entirely tenable.
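For reference, an NIO positional read looks roughly like the sketch below. It is not the HDFS-347 patch and does not implement readChunk; `PositionalReader` and `readAt` are hypothetical names. The only real API it relies on is the JDK's `FileChannel.read(ByteBuffer, long)`, which reads at an absolute offset without moving the channel's own position, so concurrent readers need no shared lock.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class PositionalReader implements AutoCloseable {
    private final FileChannel channel;

    PositionalReader(Path file) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
    }

    /** Reads up to len bytes starting at offset; returns fewer bytes if the file ends first. */
    byte[] readAt(long offset, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        long pos = offset;
        while (buf.hasRemaining()) {
            int n = channel.read(buf, pos); // absolute read, channel position is untouched
            if (n < 0) {
                break; // end of file
            }
            pos += n;
        }
        buf.flip();
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

Several threads can call `readAt` on one instance at the same time, which is the property that makes this style attractive compared with a reader that is synchronized on each method.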
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999684#comment-12999684 ]
ryan rawson commented on HBASE-3529:
HDFS-347 is not in CDH nor in branch-20-append. As for a plan to implement it, perhaps you should?
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999687#comment-12999687 ]
Jason Rutherglen commented on HBASE-3529:
{quote}HDFS-347 is not in CDH nor in branch-20-append. As for a plan to implement it, perhaps you should?{quote}
Really? Ah, I guess I misread this: https://issues.apache.org/jira/browse/HBASE-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12997267#comment-12997267
Sure, I can give a go at an NIO positional read version, it'll be a good learning experience. Are there any caveats to be aware of?
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999688#comment-12999688 ]
ryan rawson commented on HBASE-3529:
I do not know, the whole thing is pretty green field. There are a few different implementations of HDFS-347, and I haven't actually seen a credible attempt at really getting it into a shipping hadoop yet. The test patches are pretty great, but they are POC and won't actually be shipping (due to hadoop security).
You can give it a shot, but be warned you might not get much for your troubles in terms of committed code.
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999718#comment-12999718 ]
stack commented on HBASE-3529:
@Jason We could get hdfs-347 applied to branch-0.20-append. Us HBasers are going to talk it up, that folks should apply it to their hadoop since the benefit is so great. CDH will have something like an hdfs-347 but probably not till CDH4 (Todd talks of a version of hdfs-347 but one that will work w/ security -- see his patch up in hdfs-237 as opposed to the amended Dhruba patch posted by Ryan). A hdfs-347 probably won't show in apache hadoop till 0.23/0.24 would be my guess.
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999722#comment-12999722 ]
Jason Rutherglen commented on HBASE-3529:
@Stack I didn't see any patches at HDFS-237. I'd be curious to learn what the security issues are, I guess they're articulated in HDFS-347 as solvable by transferring file descriptors, though I'm not sure why the user running the Hadoop Java process should not be accessing certain local files? Also, maybe there are higher level synchronization issues to be aware of (eg, HDFS-1605)? I'm sure much of this can be changed, though it may require a separate call 'path' and classes to avoid any extraneous synchronization.
I do like this approach of making core changes to HDFS, which'll benefit both HBase and this issue and also streamline the Lucene integration (ie, there'll be no need for replicating the index back into HDFS from local disk), which'll reduce the aggregate complexity and testing.
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999727#comment-12999727 ]
stack commented on HBASE-3529:
@Jason Pardon me. The HDFS-237 above is a mistype on my part. I meant HDFS-347 (I was about to make jokes about your dyslexia but on review the affliction blew up in my face). The hbase process can access local files as long as it gets the clearance via hdfs.
bq. do like this approach of making core changes to HDFS which'll benefit HBase
+1
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12997141#comment-12997141 ]
Jason Rutherglen commented on HBASE-3529:
I opened LUCENE-2930 to store the last/max term of a field in the Lucene terms dictionary. We can use this to more efficiently know the index's last commit point, and start indexing from there. The alternative is to iterate the *entire* terms dictionary, which, for a unique timestamp field, would have as many terms as there are documents.
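The cost difference behind LUCENE-2930 can be shown with a toy sorted set standing in for the terms dictionary. This is not Lucene's terms dictionary API; `TimestampTerms` and its methods are hypothetical, and the example only contrasts walking every term with reading a maximum recorded at write time.

```java
import java.util.TreeSet;

final class TimestampTerms {
    private final TreeSet<Long> terms = new TreeSet<>(); // sorted, like a terms dictionary
    private long maxTerm = Long.MIN_VALUE;               // recorded alongside the dictionary

    void add(long timestamp) {
        terms.add(timestamp);
        if (timestamp > maxTerm) {
            maxTerm = timestamp;
        }
    }

    /** Without a stored max: visit every term in order and keep the last one seen. */
    long lastByIteration() {
        long last = Long.MIN_VALUE;
        for (long t : terms) {
            last = t;
        }
        return last;
    }

    /** With a stored max: a single field read, independent of the number of terms. */
    long lastStored() {
        return maxTerm;
    }
}
```

Both methods return the same value; the first touches one term per document when timestamps are unique, the second does not depend on index size.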
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996444#comment-12996444 ]

Jason Rutherglen commented on HBASE-3529:

1) It may be more expedient for now to store the index in a dedicated directory, and save it to HDFS periodically. However, I'm not sure 'when' the save to HDFS would occur, eg, if HBase is always writing to HDFS then there's no way to sync with that mechanism. Perhaps it'd need to be based on the incremental change in index size? Ie, save if the index has grown by 25% since the last save?

2) I'd like to design the recovery logic now. It's simple to save the timestamp into Lucene, then on recovery get the max timestamp, and iterate from there over the HRegion for the remaining 'lost' rows/documents. What's the most efficient way to scan over timestamp key values?

3) We can create indexes for the entire HRegion or for the individual column families. Perhaps this should be optional? I wonder if there are dis/advantages from a user perspective? If interleaving postings were efficient we could even design a system that lets parts of posting lists be changed per column family, where duplicate docids would be written to intermediate in-memory indexes and 'interleaved' during posting iteration.
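The size-based trigger floated in point 1 of the comment above (save once the index has grown by 25% since the last save) could be sketched roughly as follows. This is only an illustration in plain Java: the class name GrowthCheckpointPolicy and its methods are hypothetical and are not part of HBase, Lucene, or the attached patch.

```java
/**
 * Hypothetical policy object: decides when a locally stored index should be
 * saved to HDFS, based on relative growth since the last save.
 * Nothing here comes from HBase or Lucene; it is a sketch of the idea only.
 */
final class GrowthCheckpointPolicy {
    private final double growthThreshold; // e.g. 0.25 means "grown by 25%"
    private long sizeAtLastSave;          // index size in bytes at the last save

    GrowthCheckpointPolicy(double growthThreshold, long initialSizeBytes) {
        if (growthThreshold <= 0 || initialSizeBytes < 0) {
            throw new IllegalArgumentException("threshold must be > 0 and size >= 0");
        }
        this.growthThreshold = growthThreshold;
        this.sizeAtLastSave = initialSizeBytes;
    }

    /** True when the index has grown by at least the threshold since the last save. */
    boolean shouldSave(long currentSizeBytes) {
        if (sizeAtLastSave == 0) {
            // Nothing has been saved yet, so any content is worth saving.
            return currentSizeBytes > 0;
        }
        double growth = (double) (currentSizeBytes - sizeAtLastSave) / sizeAtLastSave;
        return growth >= growthThreshold;
    }

    /** Records that the index was saved at the given size. */
    void markSaved(long currentSizeBytes) {
        this.sizeAtLastSave = currentSizeBytes;
    }
}
```

A caller would check shouldSave(...) after each batch of index updates and call markSaved(...) once the copy to HDFS has completed. A relative threshold makes saves rarer as the index grows, which may or may not be the desired behaviour.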
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996461#comment-12996461 ]

Ted Yu commented on HBASE-3529:

From Scan.java:

 * To only retrieve columns within a specific range of version timestamps,
 * execute {@link #setTimeRange(long, long) setTimeRange}.
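To make the recovery idea concrete: Scan.setTimeRange(minStamp, maxStamp) selects versions in the half-open range [minStamp, maxStamp), so replaying everything newer than the largest timestamp stored in the index means starting at that timestamp plus one. The sketch below models this in plain Java without the HBase client; the Cell class and the rowsToReindex helper are stand-ins invented for illustration, and it assumes cell timestamps are a reliable ordering for recovery, which may not hold if clients supply their own timestamps.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java model of "replay rows newer than the index"; not HBase code. */
final class TimeRangeReplay {

    /** Stand-in for an HBase cell: only the row key and version timestamp. */
    static final class Cell {
        final String row;
        final long timestamp;

        Cell(String row, long timestamp) {
            this.row = row;
            this.timestamp = timestamp;
        }
    }

    /** Half-open check [minStamp, maxStamp), the convention setTimeRange uses. */
    static boolean inRange(long ts, long minStamp, long maxStamp) {
        return ts >= minStamp && ts < maxStamp;
    }

    /**
     * Returns the distinct rows, in encounter order, that have at least one
     * cell strictly newer than the largest timestamp found in the index.
     */
    static List<String> rowsToReindex(List<Cell> regionCells, long maxIndexedTimestamp) {
        long minStamp = maxIndexedTimestamp + 1; // strictly newer than what was indexed
        Set<String> rows = new LinkedHashSet<>();
        for (Cell c : regionCells) {
            if (inRange(c.timestamp, minStamp, Long.MAX_VALUE)) {
                rows.add(c.row);
            }
        }
        return new ArrayList<>(rows);
    }
}
```

With the real client, the equivalent would presumably be a Scan with setTimeRange(maxIndexedTimestamp + 1, Long.MAX_VALUE) run against the region.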
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995331#comment-12995331 ]

Jason Rutherglen commented on HBASE-3529:

One possible issue that's come to mind is the overhead of accessing the Lucene index if it's stored in HDFS. In Lucene today we have implementations such as NIOFSDirectory, which uses NIO's positional read underneath, and it's made highly concurrent search apps much faster (before that we were sync'ing on every byte[1024] read call). I'm curious whether HDFS has effectively implemented something similar to NIOFSDirectory underneath? I see pread mentioned in HFile, however I think it's referring to the HDFS-specific implementation?
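For readers unfamiliar with the positional read mentioned above: NIO's FileChannel.read(ByteBuffer, long) reads at an explicit offset and leaves the channel's own position untouched, so several threads can share one channel without synchronizing around a seek-then-read pair. A minimal local-file sketch follows; the class and method names are made up for illustration, and it says nothing about how HDFS implements pread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative positional reads on a local file using java.nio only. */
final class PositionalReadDemo {

    /** Reads up to length bytes at the given offset without moving the channel position. */
    static byte[] pread(FileChannel channel, long position, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        long pos = position;
        while (buf.hasRemaining()) {
            int n = channel.read(buf, pos); // positional read: no seek, no shared cursor
            if (n < 0) {
                break; // end of file
            }
            pos += n;
        }
        buf.flip();
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    /** Writes ten digits to a temp file, reads two slices out of order, reports the channel position. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("pread-demo", ".bin");
            try {
                Files.write(tmp, "0123456789".getBytes(StandardCharsets.US_ASCII));
                try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                    String late = new String(pread(ch, 6, 3), StandardCharsets.US_ASCII);
                    String early = new String(pread(ch, 0, 2), StandardCharsets.US_ASCII);
                    return late + early + ":" + ch.position();
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Here demo() returns "67801:0": both slices are read correctly out of order and the channel position is still 0 afterwards, which is the property that removes the need for a lock around each read.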
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995398#comment-12995398 ]

stack commented on HBASE-3529:

@Jason Yeah, the pread is for HDFS. It's going to be slow though, because for EVERY pread invocation HDFS sets up a socket, loads a new block, seeks to the pread location, then returns the bytes and closes the socket. This is to be fixed, but that's how it currently works.
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994800#comment-12994800 ]

Jason Rutherglen commented on HBASE-3529:

I opened LUCENE-2919 to split indexes by the primary key, eg, the HBase keys.
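As a rough picture of what splitting by primary key means for a region split: HBase orders row keys as unsigned bytes, lexicographically, and a split sends keys below the split key to one daughter and keys at or above it to the other. The plain-Java sketch below partitions document keys the same way. It is not the LUCENE-2919 implementation, and the KeySplitSketch name and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative partitioning of document keys around a region split key. */
final class KeySplitSketch {

    /** Lexicographic comparison treating each byte as unsigned, as HBase does for row keys. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length; // a shorter key sorts before its extensions
    }

    /**
     * Returns two lists: index 0 holds keys below splitKey,
     * index 1 holds keys at or above splitKey.
     */
    static List<List<byte[]>> split(List<byte[]> docKeys, byte[] splitKey) {
        List<byte[]> lower = new ArrayList<>();
        List<byte[]> upper = new ArrayList<>();
        for (byte[] key : docKeys) {
            if (compareUnsigned(key, splitKey) < 0) {
                lower.add(key);
            } else {
                upper.add(key);
            }
        }
        List<List<byte[]>> parts = new ArrayList<>();
        parts.add(lower);
        parts.add(upper);
        return parts;
    }
}
```

The unsigned comparison matters: a key starting with byte 0xE9 must land in the upper half when the split key is "m", whereas a signed comparison would wrongly sort it first.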
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994814#comment-12994814 ]

Dan Harvey commented on HBASE-3529:

How would you deal with the data types / serialisation? Would you assume the cell data is just UTF-8 bytes to start with?
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994820#comment-12994820 ]

Jason Rutherglen commented on HBASE-3529:

I added a DocumentTransformer class that looks like this.

{code}
public abstract class DocumentTransformer {
  // Term and Document are Lucene types; KeyValue is the HBase cell type.
  public abstract Map<Term,Document> transform(Map<byte[], List<KeyValue>> familyMap) throws Exception;

  public abstract Term[] getIDTerms(Map<byte[], List<KeyValue>> familyMap) throws Exception;
}
{code}

The user can then define how they want to transform the underlying data into Lucene documents. I'm trying to find a JSON library to build the unit tests/demo app with.
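One way a user-defined transform might look, taking the simplest answer to the serialisation question raised earlier in the thread (treat every cell value as UTF-8 text), is sketched below. To stay self-contained it does not use Lucene or HBase classes: Cell stands in for KeyValue, the family map is keyed by String rather than byte[], and the result is a plain field-name-to-text map rather than a Lucene Document. All names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Stand-in sketch of a UTF-8 transform; not the DocumentTransformer from the patch. */
final class Utf8TransformSketch {

    /** Stand-in for HBase's KeyValue: only the column qualifier and the value bytes. */
    static final class Cell {
        final byte[] qualifier;
        final byte[] value;

        Cell(byte[] qualifier, byte[] value) {
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    /**
     * Builds a field map for one row: each field is named "family:qualifier"
     * and holds the cell value decoded as UTF-8. If a column appears more than
     * once, the last cell seen wins.
     */
    static Map<String, String> toFields(Map<String, List<Cell>> familyMap) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, List<Cell>> family : familyMap.entrySet()) {
            for (Cell cell : family.getValue()) {
                String name = family.getKey() + ":"
                        + new String(cell.qualifier, StandardCharsets.UTF_8);
                fields.put(name, new String(cell.value, StandardCharsets.UTF_8));
            }
        }
        return fields;
    }
}
```

A real implementation would build Lucene fields from these name/text pairs and would need a different decoding for non-text cells such as serialized longs.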
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994836#comment-12994836 ]

stack commented on HBASE-3529:

@Jason HBase ships with jersey-json (1.4). See here for doc: http://jackson.codehaus.org/Tutorial (It should be easy enough to update jersey-json if needed.)
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994384#comment-12994384 ]

stack commented on HBASE-3529:

@Jason Sounds excellent. Could you do this up in a coprocessor? http://hbaseblog.com/2010/11/30/hbase-coprocessors/
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994395#comment-12994395 ]

Jason Rutherglen commented on HBASE-3529:

Thanks. Right, the coprocessor is the key to sync'ing HBase and Lucene. That's where I'll probably start.
[jira] Commented: (HBASE-3529) Add search to HBase
[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994406#comment-12994406 ]

Jason Rutherglen commented on HBASE-3529:

Another issue brought up is the size of a region vs. the size of the Lucene index. If the region is compressed, the resultant Lucene index may in fact be a reasonable size. Typically a maximum Lucene index size of 1 - 2 GB is optimal? If the default region size is 256 MB, and the data's been compressed by (what ratio?), then 256 MB could be ideal?