[jira] [Commented] (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062266#comment-13062266
]

Jason Rutherglen commented on HBASE-3529:
-

With some recent patches committed to Lucene, I can post a patch to HBase trunk
that should work fine and will only require the special HDFS-347
modification/build. Perhaps it's possible to Maven in the custom HDFS-347 so
that no external libraries need to be manually downloaded.

Add search to HBase
---

Key: HBASE-3529
URL: https://issues.apache.org/jira/browse/HBASE-3529
Project: HBase
Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
Attachments: HBASE-3529.patch

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3529) Add search to HBase

2011-07-08 Thread Andrew Purtell (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062269#comment-13062269
]

Andrew Purtell commented on HBASE-3529:
---

bq. Perhaps it's possible to Maven in the custom HDFS-347 so that no external
libraries need to be manually downloaded.

Post 0.92 we plan to modularize the Maven build already for pluggable RPC and
security-variant code. We can also conditionally build coprocessors set in
their own packages. In this case, something like {{-D HDFS-347}} enables build
of it, and pulls down a suitably patched Hadoop core jar?


[jira] [Commented] (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062270#comment-13062270
]

Jason Rutherglen commented on HBASE-3529:
-

bq. We can also conditionally build coprocessors set in their own packages

Ok, that sounds interesting. Currently I'm pretending like search will be a
part of HBase core. :) If there is another directory to place it in, eg, a
coprocessor or contrib directory, I will place it there.

bq. In this case, something like {{-D HDFS-347}} enables build of it, and pulls
down a suitably patched Hadoop core jar?

Yeah I have no idea how to post the HDFS-347-LUCENE version to a Maven repo and
get that working. I can however probably figure it out.

I like the idea of posting a patch, putting things on Github seems quite
remote, even to me, and I admit to preferring the simplicity of SVN on this
currently one man project.


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-07-08 Thread Andrew Purtell (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062273#comment-13062273
]

Andrew Purtell commented on HBASE-3529:
---

bq. Currently I'm pretending like search will be a part of HBase core.

As with security, I think there will be enough interest in this that making it
part of core, but conditionally built, makes a lot of sense.


[jira] [Commented] (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062286#comment-13062286
]

Jason Rutherglen commented on HBASE-3529:
-

What's the best way to set custom attributes on the Coprocessor? Eg, I want to
tell the Lucene Coprocessor where to look for a configuration file in HDFS.


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-07-08 Thread Andrew Purtell (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062292#comment-13062292
]

Andrew Purtell commented on HBASE-3529:
---

bq. What's the best way to set custom attributes on the Coprocessor? Eg, I want
to tell the Lucene Coprocessor where to look for a configuration file in HDFS.

See HBASE-4048 and HBASE-3810. HBASE-3810 is still pending.


[jira] [Commented] (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062302#comment-13062302
]

Jason Rutherglen commented on HBASE-3529:
-

I opened a trivial issue LUCENE-3296 so that the custom IW config can be passed
in.


[jira] [Commented] (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062318#comment-13062318
]

Jason Rutherglen commented on HBASE-3529:
-

I'm signing up to [1] for the HDFS-347 Maven hosting.

1. http://nexus.sonatype.org/oss-repository-hosting.html


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-06-16 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050658#comment-13050658
]

Jason Rutherglen commented on HBASE-3529:
-

To implement distributed search with sort, we'll need to serialize the field
values across the RPC channel. This can be implemented by assuming the sort is
by ord which yields BytesRef values, which are easy to sort.
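
The merge step on the client could look roughly like the sketch below. It is
only an illustration in plain Java: RegionHit and SortedHitMerger are
hypothetical names, the sort values are modeled as raw byte[] rather than
Lucene BytesRef objects, and the comparison is unsigned lexicographic byte
order, which is assumed here to match how BytesRef values compare.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-region search hit: a row key plus the raw bytes of its sort field. */
final class RegionHit {
    final String rowKey;
    final byte[] sortValue;

    RegionHit(String rowKey, byte[] sortValue) {
        this.rowKey = rowKey;
        this.sortValue = sortValue;
    }
}

final class SortedHitMerger {
    /** Unsigned lexicographic comparison; a shorter array sorts before its extensions. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Combines the hits returned by every region and keeps the first topN in sort order. */
    static List<RegionHit> merge(List<List<RegionHit>> perRegion, int topN) {
        List<RegionHit> all = new ArrayList<>();
        for (List<RegionHit> hits : perRegion) {
            all.addAll(hits);
        }
        all.sort((x, y) -> compareBytes(x.sortValue, y.sortValue));
        return new ArrayList<>(all.subList(0, Math.min(topN, all.size())));
    }
}
```

Because only the byte values cross the RPC channel, the client never needs the
per-segment ords, which are not comparable across regions anyway.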


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-06-09 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046696#comment-13046696
]

Jason Rutherglen commented on HBASE-3529:
-

bq. Does that mean that in order to implement distributed search you'll
immediately convert this to HBase+Solr instead of HBase+Lucene

I think the distributed search capability has been removed from Lucene (I just
sent an email to Lucene dev)? We should add it back? Hence the possible Solr
integration.

bq. If so, what about NRTness that will be lost until Solr gets NRT search?

There's a Solr issue to add this though one wouldn't want to implement NRT
without LUCENE-3092 + SOLR-2565.


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-06-08 Thread Alex Baranau (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046016#comment-13046016
]

Alex Baranau commented on HBASE-3529:
-

Another problem we faced: there seems to be an issue in the TestLuceneCoprocessor
test life cycle, or something else:
* the testSearchRPC test fails if we run {{mvn clean -Dtest=TestLuceneCoprocessor test}};
the other 2 pass (it fails on the first assert: expected 20, but found 10)
* if I add @Ignore to the other two tests, i.e. the maven command runs only
testSearchRPC, it works well


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-06-08 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046023#comment-13046023
]

Jason Rutherglen commented on HBASE-3529:
-

Hi Alex, I have new code I will commit to Github.


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-06-08 Thread Alex Baranau (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046026#comment-13046026
]

Alex Baranau commented on HBASE-3529:
-

Thank you! Berlin is waiting! (kidding, we are going to leave very soon)


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-06-08 Thread Otis Gospodnetic (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046258#comment-13046258
]

Otis Gospodnetic commented on HBASE-3529:
-

A few more comments/questions for Jason:

* I see PKIndexSplitter usage for splitting the index when a region splits. I
see you split the index, open 2 IndexWriters for 2 new Lucene indices, but then
either you are not adding documents to them, or I'm not seeing it?

* Are there issues around distributed search? It looks like it wasn't in your
github branch.

* What will happen when a region changes its location/regionserver for whatever
reason? I see HDFS-2004 got -1ed and you said without that search will be
slow. Do you have an alternative plan?

* What is the reason for storing those 2 extra row fields? (the UID one and the
other one... I think it's called rowStr or something like that)

* What about storing the index in HBase itself? (a la Solandra, I suppose)
Would this be doable? Would it make things simpler in the sense that any
splitting or moving around, etc. may be handled by HBase and we wouldn't have
to make sure the Lucene index always mirrors what's in a region and make sure
it follows the region wherever it goes? Lars' idea/question, and I hope I
didn't misunderstand or misrepresent his ideas.


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-06-08 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046267#comment-13046267
]

Jason Rutherglen commented on HBASE-3529:
-

Otis, I think many of your questions have been addressed in this issue, though
indeed the comment trail is long at this point.

bq. Do you have an alternative plan?

https://issues.apache.org/jira/browse/HBASE-3529?focusedCommentId=13040465&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13040465

bq. Are there issues around distributed search? It looks like it wasn't in your
github branch

https://issues.apache.org/jira/browse/HBASE-3529?focusedCommentId=13042913&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13042913

bq. What about storing the index in HBase itself?

I think that's a great idea to test, though in a different Jira issue.

bq. PKIndexSplitter

That's LUCENE-2919. Given it's not been committed I may need to bring it over
into the HBase search source tree.


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-06-08 Thread Otis Gospodnetic (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046274#comment-13046274
 ] 

Otis Gospodnetic commented on HBASE-3529:
-

Re 
https://issues.apache.org/jira/browse/HBASE-3529?focusedCommentId=13042913&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13042913

Does that mean that in order to implement distributed search you'll immediately 
convert this to HBase+Solr instead of HBase+Lucene, so that you don't have to 
do Lucene-level distributed search?  If so, what about NRTness that will be 
lost until Solr gets NRT search?


 Add search to HBase
 ---



Using the Apache Lucene library we can add freetext search to HBase. The
advantages of this are:
* HBase is highly scalable and distributed
* HBase is realtime
* Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
* Lucene offers many types of queries not currently available in HBase (eg,
AND, OR, NOT, phrase, etc)
* It's easier to build scalable realtime systems on top of an already
architecturally sound, scalable realtime data system, eg, HBase.
* Scaling realtime search will be as simple as scaling HBase.

Phase 1 - Indexing:
* Integrate Lucene into HBase such that an index mirrors a given region.
This means cascading add, update, and deletes between a Lucene index and an
HBase region (and vice versa).
* Define meta-data to mark a region as indexed, and use a Solr schema to
allow the user to define the fields and analyzers.
* Integrate with the HLog to ensure that index recovery can occur properly
(eg, on region server failure)
* Mirror region splits with indexes (use Lucene's IndexSplitter?)
* When a region is written to HDFS, also write the corresponding Lucene index
to HDFS.
* A row key will be the ID of a given Lucene document. The Lucene docstore
will explicitly not be used because the document/row data is stored in HBase.
We will need to solve what the best data structure for efficiently mapping a
docid -> row key is. It could be a docstore, field cache, column stride
fields, or some other mechanism.
* Write unit tests for the above

Phase 2 - Queries:
* Enable distributed Lucene queries
* Regions that have Lucene indexes are inherently available and may be
searched on, meaning there's no need for a separate search related system in
Zookeeper.
* Integrate search with HBase's RPC mechanism


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-06-02 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13042913#comment-13042913
]

Jason Rutherglen commented on HBASE-3529:
-

SOLR-1431 is updated to trunk. I'm tempted to start trying to plug in
Solr. I think the way to do this is to use the HTable.coprocessorExec
method (for the distributed search), where the Solr shards are of the form
'shards=start:hexstartkey,end:hexendkey'. Then HBase will take care of the
rest from an RPC perspective, eg, forwarding the request to the individual
HRegions running the SolrCoprocessor.

I think we'll use a single Solr schema per region, though we can add a
special delimiter in the field name to indicate that the prefix is the
column family, followed by the column name. Something like 'headers:subject'
may work. The main caveat is that fields marked as stored will not in fact
be stored in Lucene (because they're in HBase).
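
To make the two naming conventions concrete, here is a small sketch in plain
Java. SolrRegionNaming and its methods are hypothetical helpers, not part of
Solr or HBase; the ':' delimiter and the lowercase hex encoding are
assumptions taken from the examples above.

```java
final class SolrRegionNaming {
    /** Renders bytes as lowercase hex, two characters per byte. */
    static String toHex(byte[] key) {
        StringBuilder sb = new StringBuilder(key.length * 2);
        for (byte b : key) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Builds a shards parameter of the form 'shards=start:hexstartkey,end:hexendkey'. */
    static String shardsParam(byte[] startKey, byte[] endKey) {
        return "shards=start:" + toHex(startKey) + ",end:" + toHex(endKey);
    }

    /** Splits 'family:qualifier' at the first delimiter and returns {family, qualifier}. */
    static String[] splitField(String fieldName) {
        int idx = fieldName.indexOf(':');
        if (idx <= 0 || idx == fieldName.length() - 1) {
            throw new IllegalArgumentException("expected family:qualifier, got " + fieldName);
        }
        return new String[] { fieldName.substring(0, idx), fieldName.substring(idx + 1) };
    }
}
```

Splitting at the first ':' keeps any later colons inside the column name, so a
column family name itself must not contain the delimiter.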


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-05-27 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040465#comment-13040465
]

Jason Rutherglen commented on HBASE-3529:
-

After discussing with J-D (thanks!), we can place logic in the Lucene
Coprocessor preOpen method to find out whether any of the blocks of the Lucene
files in HDFS are not local (by asking the NameNode). If so, we can:

1) Rewrite, partially optimize, or fully optimize the index, thereby
rewriting the index files which causes them to 'go local'.

2) Extend the default placement policy and balancer to skip 'balancing'
Lucene files, because we want them to stay local.

3) Use HDFS-2004 to manually move non-local blocks to the local DataNode.

Of these, #3 is more complex and will likely be much more time consuming.

This functionality is important as it could currently be considered the only
'blocker' on putting HBase search into a test/production environment.
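
The locality check itself is simple once the replica locations are known. The
sketch below is plain Java with hypothetical names (IndexLocalityCheck,
shouldRewriteLocally); it assumes the caller has already collected, for each
block of the index files, the list of hosts holding a replica, for instance
from Hadoop's FileSystem.getFileBlockLocations and BlockLocation.getHosts. The
"rewrite as soon as one block is remote" policy is an assumption for
illustration only.

```java
import java.util.List;

final class IndexLocalityCheck {
    /** Counts blocks that have no replica on localHost. */
    static int countNonLocalBlocks(List<List<String>> replicaHostsPerBlock, String localHost) {
        int nonLocal = 0;
        for (List<String> hosts : replicaHostsPerBlock) {
            if (!hosts.contains(localHost)) {
                nonLocal++;
            }
        }
        return nonLocal;
    }

    /** Hypothetical policy: rewrite the index locally as soon as any block is remote. */
    static boolean shouldRewriteLocally(List<List<String>> replicaHostsPerBlock, String localHost) {
        return countNonLocalBlocks(replicaHostsPerBlock, localHost) > 0;
    }
}
```

A real preOpen hook would probably want a threshold rather than a strict
zero, since rewriting a large index for a single remote block may cost more
than it saves.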


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-05-26 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13039882#comment-13039882
]

Jason Rutherglen commented on HBASE-3529:
-

I opened HDFS-2004 to implement pinning HDFS files (in this case the Lucene
index files) to the local DataNode. I think this is necessary functionality
for HBase search because all index files need to be local (we're MMap'ing). I
think the common case is that a region server goes down and, when the new one
is brought up, the files will likely not be local?


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-05-13 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033281#comment-13033281
]

Jason Rutherglen commented on HBASE-3529:
-

I updated the Lucene version to the latest from trunk, which includes the new
asynchronous flushing of the RAM buffer. As expected, this has put index
creation using HDFS in line with Lucene (because the overhead from the
DataNode does not delay further indexing). The query times also look nearly
the same.

Lucene indexing duration: 57858 ms
Lucene query time #1: 14208 ms
Lucene query time #2: 7024 ms
Lucene query time #3: 6902 ms

HBase indexing duration: 50631 ms
HBase query time #1: 8625 ms
HBase query time #2: 7081 ms
HBase query time #3: 7139 ms


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-05-13 Thread stack (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033288#comment-13033288
]

stack commented on HBASE-3529:
--

Nice!


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-05-13 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033311#comment-13033311
]

Jason Rutherglen commented on HBASE-3529:
-

bq. Awesome stuff. These query times above are using the hacky (non-secure
non-checksummed) implementation of HDFS-347?

It's hackier than that. It's basically obtaining the java.io.File directly
from the FSInputStream. However it's a good baseline to benchmark against
things like HADOOP-6311 + HDFS-347. Those need to wait for a version of HBase
that works with Hadoop 0.22/trunk anyway?

{quote}
User defines some special property on a column family that they want to be
searchable, this property would include a solr schema which specifies analyzers
and fields
{quote}

Currently there's a DocumentTransformer class which needs to be implemented to
transform column-family edits into a Lucene document. That could use the Solr
schema for example or any other separate system to tokenize the byte[]s into a
Document.
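
As a rough illustration of the idea (not the actual DocumentTransformer
signature, which isn't shown here), such a transformer could be shaped like
the plain-Java sketch below. ColumnEdit, EditTransformer and
Utf8EditTransformer are hypothetical names, a Map stands in for a Lucene
Document, and every value is assumed to be UTF-8 text.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one edited cell of a row. */
final class ColumnEdit {
    final String family;
    final String qualifier;
    final byte[] value;

    ColumnEdit(String family, String qualifier, byte[] value) {
        this.family = family;
        this.qualifier = qualifier;
        this.value = value;
    }
}

/** Hypothetical transformer contract: row edits in, field name to text out. */
interface EditTransformer {
    Map<String, String> transform(byte[] rowKey, List<ColumnEdit> edits);
}

/** Treats every value as UTF-8 text and names fields 'family:qualifier'. */
final class Utf8EditTransformer implements EditTransformer {
    @Override
    public Map<String, String> transform(byte[] rowKey, List<ColumnEdit> edits) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("row", new String(rowKey, StandardCharsets.UTF_8));
        for (ColumnEdit e : edits) {
            doc.put(e.family + ":" + e.qualifier, new String(e.value, StandardCharsets.UTF_8));
        }
        return doc;
    }
}
```

A schema-driven implementation would replace the fixed UTF-8 decoding with
per-field analyzers looked up by field name.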

{quote}User can now perform an arbitrary lucene search over the table,
resulting in completely up-to-date results? (ie spans both memstore and flushed
data)?{quote}

I think for now we need to offer an external commit on the index, as Lucene
only has near realtime search (eg, small segments will be written out, which
will overwhelm HDFS). LUCENE-2312 will implement realtime search (eg,
searching on the RAM buffer as it's being built). The recent LUCENE-3092 could
be used in the meantime to build segments in RAM, and only flush to HDFS when
it's too RAM consuming, then we would not need to force the user to 'commit'
the index.

To answer the question, yes, though today the indexing performance will not be
as good as when LUCENE-2312 is implemented or the user will need to 'commit'
the index to search on the latest data.

Getting all of Solr to work with this system is fairly doable. Each Solr
core would map to a region. Things like replication would be disabled. The
config files would be stored in HDFS (instead of the local filesystem). For
distributed queries, we need SOLR-1431, and then to implement distributed
networking using HBase RPC instead of Solr's HTTP RPC. There are other smaller
internal things that'd need to change in Solr. I think HBase RPC is aware of
where regions live etc so I don't think we need to worry about putting failover
logic into the distributed search code?

I'm going to post additional benchmarks shortly, eg, for 100,000 and 1 mil
documents.


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-05-11 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032015#comment-13032015
]

Jason Rutherglen commented on HBASE-3529:
-

I think the next round of benchmarking could involve showing that we need to
directly access the underlying block file in order to not lose performance when
running Lucene on HDFS. This is somewhat as per the comment on HDFS-347:

https://issues.apache.org/jira/browse/HDFS-347?focusedCommentId=13013719&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13013719

{quote}The next thing we wanted to look at was random I/O. There is a lot
more overhead on the datanode for this particular use case so this
could be a place where direct access could really excel{quote}

We can test using HDFS-941 vs. direct block file access using MMap (by
obtaining the local file path and the unix domain sockets). I think then we'll
show that for the Lucene case, we're on the right track by using direct file
access.


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-05-11 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032196#comment-13032196
]

Jason Rutherglen commented on HBASE-3529:
-

HDFS-941 isn't applying to trunk, and we'd need a semi-unique build of the
HDFSDirectory and benchmarking code updated to Hadoop trunk (as opposed to
Hadoop 0.20-append). Given that Unix Domain Sockets (HADOOP-6311) is for trunk
rather than 0.20-append, we may want to wait for a version of HBase that runs
on Hadoop trunk (the current direct file access works fine; Unix Domain
Sockets is only for security, not speed). Then we can put off benchmarking
HDFS-941 as well.


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-04-17 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020892#comment-13020892
]

Jason Rutherglen commented on HBASE-3529:
-

I updated the HBase search branch at Github and created complete instructions
for how to execute the benchmark. This should also help with examining the
code. The HBASE-SEARCH project contains 10,000 bz2 compressed wiki-en documents
which account for 100 MB of the download. The slightly modified Lucene
libraries are located in the lib/ directory (so that you do not need to
download the entire Lucene branch source).

https://github.com/jasonrutherglen/HBASE-SEARCH/blob/trunk/BENCHMARK.txt

The Lucene vs. HBase Search indexing and search times will be located in the
file:
target/surefire-reports/org.apache.hadoop.hbase.search.TestSearchBenchmark-output.txt

{noformat}
Benchmark Execution Instructions

Create a directory for the HBase Lucene installation. Then run the following:

git clone git://github.com/jasonrutherglen/HDFS-347-HBASE.git HDFS-347-HBASE
cd HDFS-347-HBASE
ant mvn-install
cd ..

git clone git://github.com/jasonrutherglen/HBASE-SEARCH.git HBASE-SEARCH
cd HBASE-SEARCH
cd lib
./install-libs.sh
cd ..
cd wiki-en
tar -jxvf 1.bz2
cd ..
mvn test -Dtest=TestSearchBenchmark
{noformat}

Feel free to let me know if there are problems or if you have questions.


[jira] [Commented] (HBASE-3529) Add search to HBase

2011-04-14 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019879#comment-13019879
]

Jason Rutherglen commented on HBASE-3529:
-

Here are some basic benchmark numbers. The code is more or less pushed to
Github. I need to verify it all works for a clean download of the various
parts, of which there are three: Lucene, the HDFS-347-modified Hadoop
0.20-append, and HBase with Search.

The architecture is to write out a single block per Lucene file. In this way
we can simply obtain one underlying java.io.File directly from the DFSClient.
The file is then MMap'ed using a modified version of the MMapDirectory called
HDFSDirectory.
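
The memory-mapping part can be illustrated with the JDK alone. The sketch
below is not the HDFSDirectory code; MappedFileReader is a hypothetical helper
that assumes the path of the single underlying block file has already been
obtained, and it simply maps that file read-only for random access.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedFileReader {
    /** Maps the whole file read-only; the mapping stays valid after the channel is closed. */
    static MappedByteBuffer mapReadOnly(Path file) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Absolute read at the given offset, without moving the buffer position. */
    static byte readAt(MappedByteBuffer buffer, int position) {
        return buffer.get(position);
    }
}
```

A single mapping is limited to 2 GB in Java, so a real directory
implementation would need to map larger files in several chunks.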

The benchmark shows that storing Lucene indexes into HDFS and reading directly
from HDFS is viable (as opposed to copying the files out of HDFS first to the
local filesystem).

Here are times in milliseconds, on the Wiki-EN corpus:

lucene indexing duration: 50202
lucene query time #1: 11780
lucene query time #2: 6211
lucene query time #3: 6181

hbase indexing duration: 70681
hbase query time #1: 8332
hbase query time #2: 6785
hbase query time #3: 6621

As you can see, indexing is about 41% slower when writing to HDFS (70681 ms
vs. 50202 ms). However, with the new changes going into Lucene (ie,
LUCENE-2324), a pause when flushing (due to HDFS overhead) will not slow down
indexing. So expect indexing parity soon.

The main query times to look at are #2 and #3, allowing for warmup of the
system IO cache in #1. HBase queries are somewhat slower (roughly 7-9% in
rounds #2 and #3) because each new DFSInputStream created must contact the
DataNode. We can optimize this; however, I think for now we're good.
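
To make the comparison concrete, the relative slowdowns can be computed from the figures above. This is only a small arithmetic sketch; the class and method names are hypothetical. It works out to roughly 41% for indexing and roughly 9% and 7% for query rounds #2 and #3.

```java
/** Hypothetical helper that turns the millisecond figures above into percentages. */
final class BenchmarkRatios {
    /** How much slower (positive) or faster (negative) measured is than baseline, in percent. */
    static double slowdownPercent(long baselineMillis, long measuredMillis) {
        return (measuredMillis - baselineMillis) * 100.0 / baselineMillis;
    }

    public static void main(String[] args) {
        // Figures copied from the benchmark above: plain Lucene first, HBase/HDFS second.
        System.out.printf("indexing: %.1f%%%n", slowdownPercent(50202, 70681));
        System.out.printf("query #1: %.1f%%%n", slowdownPercent(11780, 8332)); // includes cache warmup
        System.out.printf("query #2: %.1f%%%n", slowdownPercent(6211, 6785));
        System.out.printf("query #3: %.1f%%%n", slowdownPercent(6181, 6621));
    }
}
```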

Here are the queries being run (50 times per round); they are non-trivial.

states
unit*
uni*
u*d
un*d
united~0.75
united~0.6
unit~0.7
unit~0.5, // 2
doctitle:/.*[Uu]nited.*/
united OR states
united AND states
nebraska AND states
"united states"
"united states"~3

[jira] [Commented] (HBASE-3529) Add search to HBase

2011-04-12 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13018835#comment-13018835
]

Jason Rutherglen commented on HBASE-3529:
-

I'm working on profiling and optimizing the HDFS random access, so that the
Lucene HDFS queries are the same as native file system access using
NIOFSDirectory.

I think one extremely direct approach is to set the max block size to something
above all Lucene segment files (at runtime via the DFSClient.create method).
This will guarantee that there is only one underlying java.io.File per HDFS
file, and so random access will avoid navigating block structures (which
requires expensive network calls, a binary search, and object creation overhead).
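
A sketch of how such a block size might be chosen is below. It is an illustration only: the class and method names are hypothetical, and it assumes the HDFS client requires the block size to be a multiple of the checksum chunk size (io.bytes.per.checksum, 512 bytes by default), which should be verified against the Hadoop build in use. The resulting value would be passed as the block size argument when the file is created.

```java
/** Hypothetical helper for picking a per-file block size that holds a whole Lucene file. */
final class SingleBlockSizing {
    /**
     * Smallest multiple of bytesPerChecksum that is at least maxSegmentBytes
     * (and at least one checksum chunk), so the file never spills into a second block.
     */
    static long blockSizeFor(long maxSegmentBytes, int bytesPerChecksum) {
        if (maxSegmentBytes < 0 || bytesPerChecksum <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and chunk size positive");
        }
        long chunks = Math.max(1, (maxSegmentBytes + bytesPerChecksum - 1) / bytesPerChecksum);
        return chunks * bytesPerChecksum;
    }

    /** True when a file of fileBytes occupies exactly one block of blockSize. */
    static boolean fitsInOneBlock(long fileBytes, long blockSize) {
        return fileBytes <= blockSize;
    }

    public static void main(String[] args) {
        long ceiling = 5L * 1024 * 1024 * 1024; // assumed upper bound on any segment file: 5 GB
        long blockSize = blockSizeFor(ceiling, 512);
        System.out.println(blockSize + " " + fitsInOneBlock(ceiling, blockSize));
    }
}
```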

[jira] [Commented] (HBASE-3529) Add search to HBase

2011-04-07 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017210#comment-13017210
]

Jason Rutherglen commented on HBASE-3529:
-

I placed the HDFS-347 changes in a Github repository located at:
https://github.com/jasonrutherglen/HDFS-347-HBASE

[jira] [Commented] (HBASE-3529) Add search to HBase

2011-04-05 Thread Otis Gospodnetic (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016055#comment-13016055
]

Otis Gospodnetic commented on HBASE-3529:
-

Jason, what is the current state of this work? Does it work with the trunk?
Is there a list of issues/problems that need to be fixed before this can be
called working? Thanks!

[jira] [Commented] (HBASE-3529) Add search to HBase

2011-04-05 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016079#comment-13016079
]

Jason Rutherglen commented on HBASE-3529:
-

@Otis The next step is to benchmark the query performance, which may be degraded
due to the random positional read performance of HDFS. For this maybe we
should use: http://code.google.com/a/apache-extras.org/p/luceneutil/ Also, the
blocking issues should [ideally] be resolved. You can take a look at the Solr
one, SOLR-1431, and commit it.

[jira] [Commented] (HBASE-3529) Add search to HBase

2011-04-05 Thread Otis Gospodnetic (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016085#comment-13016085
]

Otis Gospodnetic commented on HBASE-3529:
-

Thanks Jason. What's the Solr dependency about? I thought your idea is to go
with pure Lucene-level HBase + indexing integration, not Solr. I do see you
mention Solr's schema in the initial comments in this issue, but can't find any
mentions of Solr in your patch. Could you please clarify the approach? Oh,
and if the ML is a better medium, I can move my questions there. Thanks.

[jira] [Commented] (HBASE-3529) Add search to HBase

2011-04-05 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016088#comment-13016088
]

Jason Rutherglen commented on HBASE-3529:
-

@Otis We can benchmark using Lucene in conjunction with HDFS-347, a more
streamlined version of which I'll make available on Github.
Implementing Solr for benchmarking would create too much overhead.

I think we may want to integrate with Solr [in the future] for out of the box
distributed queries, facets, and also to make use of the schema. I'll likely
open additional Solr related issues when we get there.

[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-17 Thread Ted Yu (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13008006#comment-13008006
]

Ted Yu commented on HBASE-3529:
---

postWALRestore would pass one WALEdit which is for one row.
postPut is for one row as well.

[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-16 Thread Andrew Purtell (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13007761#comment-13007761
]

Andrew Purtell commented on HBASE-3529:
---

@Todd Hosting subprojects sounds reasonable to me. We want to make a friendly
home for cool new work but can also accommodate downstream packagers who don't
want any kind of support implied.

[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-16 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13007768#comment-13007768
]

Jason Rutherglen commented on HBASE-3529:
-

bq. In DefaultDocumentTransformer, I think we should check whether row has
changed

Is it possible to modify multiple rows per postPut or postWALRestore? Are the
KeyValue(s) sorted by row? We probably want to group row modifications
together. Also, it seems that it's possible to update only a select few columns
of a row? So we may need to reload the entire row and index it again?

[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-16 Thread stack (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13007780#comment-13007780
]

stack commented on HBASE-3529:
--

@Andrew I'm against taking on src/contribs given past experience where they
tended to add friction to major core changes. With hbase up in Apache git, I
think it's easier for projects that are not in our src tree to follow along
(github makes doc'ing, etc., the related external project easy). Discussion
of the add-on up on hbase is grand (and encouraged I'd say since it lets the
rest of the hbase space know of the addition) but no src I'd say. Any changes
to core that an external project requires to work we should take on too (if
there is good justification).

[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-16 Thread Andrew Purtell (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13007782#comment-13007782
]

Andrew Purtell commented on HBASE-3529:
---

@Stack I didn't say contrib, I said sub projects.

[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-16 Thread stack (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13007796#comment-13007796
]

stack commented on HBASE-3529:
--

@Andrew Pardon me for my misread but I'd be agin keeping up subprojects too
because of the admin load. We don't need it IMO.

[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-03 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13002438#comment-13002438
]

Jason Rutherglen commented on HBASE-3529:
-

To get Solr distributed queries working across the searchable HBase cluster,
we'll need SOLR-1431 completed. Then in this issue, we'll implement the
underlying data transfer protocol using HBase RPC (instead of HTTP).

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000866#comment-13000866
]

Jason Rutherglen commented on HBASE-3529:
-

@Stack Thanks for the analysis. I forgot to mention that each subquery would
also require its own FSInputStream, which would be too many file descriptors.
The heap required for 25 bytes * 2 mil docs is 50MB, eg, is that too much?

I think we can go ahead with the positional read which'd only require an
FSInputStream per file, to be shared by all readers of that file (using
FileChannel.read(ByteBuffer dst, long position) underneath). Given the number
of blocks per Lucene file will be 10 and the blocks are of a fixed size, we
can divide the offset by the block size to efficiently obtain the block index? I
think it'll be efficient to translate a file offset into a local block file,
eg, I'm not sure why LocatedBlocks.findBlock uses a binary search because I'm
not familiar enough with HDFS. Then we'd just need to cache the
LocatedBlock(s), instead of looking them up from the DataNode on each small
read byte[1024] call.
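
The shared positional read mentioned above can be sketched with plain java.nio. This is a minimal sketch, not HDFS or Lucene code: SharedPositionalReader is a hypothetical name, and it assumes the block is available as a local file. FileChannel.read(ByteBuffer, long) does not modify the channel's own position, so one open file (one descriptor) can serve many reader threads.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical reader: one FileChannel per file, shared by all threads. */
final class SharedPositionalReader implements AutoCloseable {
    private final FileChannel channel;

    SharedPositionalReader(Path localFile) throws IOException {
        channel = FileChannel.open(localFile, StandardOpenOption.READ);
    }

    /** Fills dst[offset, offset + length) from the absolute file position. */
    void readFully(long position, byte[] dst, int offset, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(dst, offset, length);
        long pos = position;
        while (buf.hasRemaining()) {
            int n = channel.read(buf, pos); // positional read: channel position is untouched
            if (n < 0) {
                throw new EOFException("read past end of file at position " + pos);
            }
            pos += n;
        }
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("postings", ".bin");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, "0123456789abcdef".getBytes(StandardCharsets.US_ASCII));
        try (SharedPositionalReader reader = new SharedPositionalReader(tmp)) {
            byte[] chunk = new byte[4];
            reader.readFully(10, chunk, 0, chunk.length);
            System.out.println(new String(chunk, StandardCharsets.US_ASCII));
        }
    }
}
```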

In summary:

* DFSClient.DFSInputStream.getBlockRange looks fast enough for many calls per
second
* locatedBlocks.findBlock uses a binary search for some reason; that'll be a
bottleneck. Why can't we divide the offset by the block size?
Oh ok, that's because block sizes are variable. I guess if the number of
blocks is small the binary search will always be fast? Or we can detect if the
blocks are of the same size and divide to get the correct block?
* DFSClient.DFSInputStream.fetchBlockByteRange is a hotspot because it calls
chooseDataNode, whose return value [DNAddrPair] can be cached inside of
LocatedBlock?
* Later in fetchBlockByteRange we call
DFSClient.createClientDatanodeProtocolProxy() and make a local RPC call,
getBlockPathInfo. I think the results of this [BlockPathInfo] can be cached
into LocatedBlock as well?
* Then instead of instantiating a new BlockReader object, we can call
FileChannel.read(ByteBuffer b, long pos) directly?
* With this solution in place we can safely store documents in the docstore
without any worries, and in addition use the system that is most efficient in
Lucene today, all the while using the fewest file descriptors possible.
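
The block lookup question in the summary above can be illustrated as follows. This is a sketch under stated assumptions, not the LocatedBlocks implementation: BlockLocator is a hypothetical class that takes the block lengths of one file, uses division when every block except possibly the last has the same size, and falls back to a binary search over block start offsets otherwise.

```java
import java.util.Arrays;

/** Hypothetical offset-to-block lookup; not the HDFS LocatedBlocks code. */
final class BlockLocator {
    private final long[] blockStarts;    // start offset of each block, ascending
    private final long fileLength;
    private final long uniformBlockSize; // block size if the division shortcut is valid, else -1

    BlockLocator(long[] blockLengths) {
        blockStarts = new long[blockLengths.length];
        long offset = 0;
        boolean uniform = blockLengths.length > 0;
        for (int i = 0; i < blockLengths.length; i++) {
            if (blockLengths[i] <= 0) {
                throw new IllegalArgumentException("block lengths must be positive");
            }
            blockStarts[i] = offset;
            offset += blockLengths[i];
            // Every block except the last must match the first one in size.
            if (i < blockLengths.length - 1 && blockLengths[i] != blockLengths[0]) {
                uniform = false;
            }
        }
        // The last block may be shorter than the others, but not longer.
        if (uniform && blockLengths[blockLengths.length - 1] > blockLengths[0]) {
            uniform = false;
        }
        fileLength = offset;
        uniformBlockSize = uniform ? blockLengths[0] : -1;
    }

    /** Index of the block containing the given file offset. */
    int findBlock(long offset) {
        if (offset < 0 || offset >= fileLength) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
        if (uniformBlockSize > 0) {
            return (int) (offset / uniformBlockSize); // fixed-size blocks: O(1)
        }
        int i = Arrays.binarySearch(blockStarts, offset);
        return i >= 0 ? i : -i - 2; // not found: insertion point minus one
    }

    /** Offset within the containing block. */
    long offsetInBlock(long offset) {
        return offset - blockStarts[findBlock(offset)];
    }

    public static void main(String[] args) {
        BlockLocator fixed = new BlockLocator(new long[] {64, 64, 10});
        BlockLocator variable = new BlockLocator(new long[] {10, 25, 5});
        System.out.println(fixed.findBlock(100) + " " + variable.findBlock(22));
    }
}
```

With a handful of blocks per file either path is cheap; the larger savings suggested in the list come from caching the results of the per-read lookups rather than from the search itself.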

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13001209#comment-13001209
]

Jason Rutherglen commented on HBASE-3529:
-

We'll want to keep a single ConcurrentMergeScheduler per HRegionServer (rather
than per HRegion) even though there'll be an IndexWriter per HRegion (eg, the
default is to have a CMS per IW, which could potentially generate too many
threads). I'm wondering if there's a global attribute space to put the CMS so
that it can be reused across HRegions?
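
The sharing pattern asked about above can be sketched with java.util.concurrent alone. This is an analogy, not Lucene or HBase code: SharedMergePool stands in for a single merge scheduler per region server, RegionIndexer stands in for the per-region IndexWriter, and the thread cap is an arbitrary assumption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical process-wide pool, playing the role of one merge scheduler per server. */
final class SharedMergePool {
    static final int MAX_MERGE_THREADS = 3; // assumed cap for the whole process

    /** Lazy holder: the pool is created once, on first use, by the class loader. */
    private static final class Holder {
        static final ExecutorService POOL =
            Executors.newFixedThreadPool(MAX_MERGE_THREADS, new ThreadFactory() {
                private final AtomicInteger counter = new AtomicInteger();

                @Override
                public Thread newThread(Runnable task) {
                    Thread t = new Thread(task, "shared-merge-" + counter.incrementAndGet());
                    t.setDaemon(true); // do not keep the JVM alive for background merges
                    return t;
                }
            });
    }

    static ExecutorService get() {
        return Holder.POOL;
    }
}

/** Hypothetical per-region component that reuses the shared pool instead of owning threads. */
final class RegionIndexer {
    private final String regionName;
    private final ExecutorService mergePool = SharedMergePool.get();

    RegionIndexer(String regionName) {
        this.regionName = regionName;
    }

    /** Submits a placeholder merge task and returns its future. */
    Future<String> scheduleMerge() {
        Callable<String> merge = new Callable<String>() {
            @Override
            public String call() {
                return regionName + " merged";
            }
        };
        return mergePool.submit(merge);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(new RegionIndexer("region-a").scheduleMerge().get());
        System.out.println(new RegionIndexer("region-b").scheduleMerge().get());
    }
}
```

However many regions are opened, the number of merge threads stays bounded by the single pool, which is the concern raised about one scheduler per writer.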

[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-01 Thread ryan rawson (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13001269#comment-13001269
]

ryan rawson commented on HBASE-3529:

can you submit this to the proper jira? This isn't hdfs :)

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13001277#comment-13001277
]

Jason Rutherglen commented on HBASE-3529:
-

@Ryan Sure, I just wanted to iterate here a little bit, and then test it out
with the HDFSDirectory implementation, before submitting it to HDFS-347.

[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-01 Thread stack (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13001280#comment-13001280
]

stack commented on HBASE-3529:
--

@Jason Do you need to hack on hdfs first? Is it critical to making the search
work on hbase?

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13001285#comment-13001285
]

Jason Rutherglen commented on HBASE-3529:
-

bq. Do you need to hack on hdfs first? Is it critical to making the search work
on hbase?

Yes, HDFS as it is would make queries execute extremely slowly (because of
random small reads), also I don't know how to implement the HDFSDirectory (the
Lucene interface to the filesystem) without knowing how HDFS works. In this
case, we need to use NIO positional read underneath. I think the patch shows
NIO pos is doable and hopefully it'll be completed shortly, enough to implement
HDFSDirectory and then run a performance comparison of HDFSDirectory vs.
NIOFSDirectory. Eg, we'll build identical indexes in both dirs, run the same
queries and examine the difference in query speed.

[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-01 Thread stack (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13001289#comment-13001289
]

stack commented on HBASE-3529:
--

OK.

Why NIO positional read? How is that different from the pread that is already
in the dfsclient api? You don't like going via the Block API? Above you say
in parens '...(using FileChannel.read(ByteBuffer dst, long position)...' What
if the data is not local? Usually it is (99% of the time), but not always;
e.g. in time of failure or perhaps after a rebalance. Are you going to get the
FileChannel off the socket (that's the nio bit)?

You do get the bit that hdfs-347 is a naughty hack as is. A version that
respects 'security', where the 'cleared' fd is passed via unix domain sockets,
for the dfsclient to use going direct is probably what'll go in sometime soon
hopefully.

You are messing down deep below hbase in dfs. I'm a little worried that you'll
do a bunch of custom work that may work for your lucene directory
implementation but that it will be so particular, it won't be accepted back
into hdfs.

[jira] Commented: (HBASE-3529) Add search to HBase

2011-03-01 Thread Ted Yu (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13001292#comment-13001292
]

Ted Yu commented on HBASE-3529:
---

In certain deployments, the data node and region server are not on the same
machine. The above would pose a performance issue there.

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13001299#comment-13001299
]

Jason Rutherglen commented on HBASE-3529:
-

bq. Why NIO positional read? How is that different from the pread that is
already in the dfsclient api

I think the goal of HDFS-347 is it'll automatically switch between reading over
the network and reading locally? So the pread'll do one or the other?

bq. You going to get the FileChannel off the socket (thats the nio bit)?

That's just for the local file.

bq. What if the data is not local, usually it is ( 99% of the time), but is
not always; e.g. in time of failure or perhaps after a rebalance.

If we read off a socket I think there's going to be a serious degradation in
performance. I think that's an invariant of search?

{quote}A version that respects 'security', where the 'cleared' fd is passed via
unix domain sockets, for the dfsclient to use going direct is probably what'll
go in sometime soon hopefully.{quote}

That'll be good! I think this initial version (of HDFS modifications) is
simply to get things going, as these other [HDFS] improvements are added we can
use them and the DFSInputStream methods used by HDFSDirectory'll be the same?

{quote}You are messing down deep below hbase in dfs. I'm a little worried that
you'll do a bunch of custom work that may work for your lucene directory
implementation but that it will be so particular, it won't be accepted back
into hdfs.{quote}

If we need to pass the FD using Unix domain sockets then the HDFS work won't be
useful. If the UDS's enable positional read, then the [Lucene] HDFSDirectory
will work well.

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000283#comment-13000283
]

Jason Rutherglen commented on HBASE-3529:
-

I started on the search part, which is nice as it can utilize HBase's
Coprocessor RPC mechanism. The design issue is if we need to store a unique
[family, column, row, timestamp] per column/field into Lucene? Or perhaps this
only needs to be stored per column family? This'll be used on iteration of the
results from Lucene, which yields docids, we'll then lookup the values in the
doc store, call Get for each doc, and add the Result to the search response. I
think this is how it should work?

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000398#comment-13000398
]

stack commented on HBASE-3529:
--

You'll have to include row, column family, and qualifier at least if you are to
get from lucene back to the latest version of the cell, won't you? If you
want to index more than just the current version of a cell, you'll have to
include the hbase timestamp in the lucene index.

If your lucene indices are per column family, you could leave the column family
out of the lucene document and it can be picked up from context; that would
leave row, qualifier and timestamp in the lucene document.

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000667#comment-13000667
]

Jason Rutherglen commented on HBASE-3529:
-

Regarding HDFS-347 and the issues around fast local file access: I started
reimplementing HDFS-347; however, I realized it'll be fruitless without an
efficient [cached] way of finding the local file a given offset corresponds to.
Is there a way for the DFSClient to listen for changes to the DataNode and
then keep a memory resident 'cache' for the purpose of quickly accessing which
local file(s) a given positional read + length corresponds to?

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000693#comment-13000693
]

stack commented on HBASE-3529:
--

@Jason Which offset are you talking of? The storefile in hbase keeps offsets
in a file index. When we ask to read from a position in the hfile, dfsclient
does a quick calc to figure which block and then relatively, the offset into
the target block. Are you talking of something more fine grained or something
else?

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000697#comment-13000697
]

Jason Rutherglen commented on HBASE-3529:
-

Sorry, I thought through the file access a little more. I think we can use the
block local reader as is, because Lucene reads the postings sequentially, we
don't really need random file access (eg, the offset issue more or less goes
away), we simply need to allow seek'ing forward, and most postings will live
inside of a single (64 - 128MB) block. The issue with this system is we may
need to maintain an FSInputStream per thread per file because we probably don't
want to open a new FSInputStream per query given the overhead of creating and
destroying them? Will this cause issues with the max file descriptors?

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000706#comment-13000706
]

stack commented on HBASE-3529:
--

@Jason Currently HBase keeps all files open all the time (Yeah, users have to
up their ulimit if they have more than a smidgeon of data in hbase--requirement
#4 or #5).

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000721#comment-13000721
]

Jason Rutherglen commented on HBASE-3529:
-

Ah, going back to storing the row, qualifier and timestamp in a Lucene
document/docstore: that does require totally random reads. I wonder if
there's some efficient way to store row pointers in RAM (compression?) or a
Hadoop data structure that can be used? I think that storing this information
in the Lucene field cache is going to cause OOMs. It'd be great if we could
simply store a long that points to the exact row and column family we'd like to
reference, as that could easily be stored in RAM, and would possibly enable
faster lookup?

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13000730#comment-13000730
]

stack commented on HBASE-3529:
--

Are you thinking you could exploit hbase scan somehow? If so, how do you think
it would work?

What's a lucene docid? A long? Or a double? You could toBytes that and that'd
be the hbase row (HBase rows are byte arrays). The column family could be one
byte -- that'd give you 256 maximum column family names. Qualifier probably
has to be lucene document field name. You could try and keep these short.
Timestamp is a long. So that's two longs (docid + ts), one byte for cf, and
say, 8 characters for field name.. that's about 25 bytes or so per lucene doc.
Will that cause you to run out of mem?
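
The 25-byte estimate above can be made concrete with a small packing sketch. The names here (CellPointer, familyId, the 8-byte ASCII qualifier) are hypothetical and follow the arithmetic in this comment rather than any existing HBase or Lucene format; note that Lucene docids are ints in practice, so a real layout could be smaller.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical fixed-width pointer: docid (8) + timestamp (8) + family id (1) + qualifier (8). */
final class CellPointer {
    static final int QUALIFIER_BYTES = 8;
    static final int ENCODED_BYTES = 8 + 8 + 1 + QUALIFIER_BYTES; // 25 bytes

    /** Packs the fields into 25 bytes; short qualifiers are zero-padded. */
    static byte[] encode(long docId, long timestamp, byte familyId, String qualifier) {
        byte[] q = qualifier.getBytes(StandardCharsets.US_ASCII);
        if (q.length > QUALIFIER_BYTES) {
            throw new IllegalArgumentException("qualifier longer than " + QUALIFIER_BYTES + " bytes");
        }
        ByteBuffer buf = ByteBuffer.allocate(ENCODED_BYTES); // zero-filled by default
        buf.putLong(docId).putLong(timestamp).put(familyId).put(q);
        return buf.array();
    }

    static long docId(byte[] encoded) {
        return ByteBuffer.wrap(encoded).getLong(0);
    }

    static long timestamp(byte[] encoded) {
        return ByteBuffer.wrap(encoded).getLong(8);
    }

    static byte familyId(byte[] encoded) {
        return encoded[16];
    }

    static String qualifier(byte[] encoded) {
        int len = 0;
        while (len < QUALIFIER_BYTES && encoded[17 + len] != 0) {
            len++;
        }
        return new String(encoded, 17, len, StandardCharsets.US_ASCII);
    }

    /** Raw payload size for a given number of documents, ignoring JVM object overhead. */
    static long heapBytes(long docs) {
        return docs * ENCODED_BYTES;
    }

    public static void main(String[] args) {
        byte[] p = encode(42L, 1299000000000L, (byte) 1, "title");
        System.out.println(p.length + " bytes, qualifier=" + qualifier(p));
        System.out.println(heapBytes(2_000_000L) + " bytes for 2 million docs");
    }
}
```

At 25 bytes each, 2 million documents come to 50,000,000 bytes of raw payload, which matches the 50MB figure quoted elsewhere in this thread, provided the pointers are packed into large arrays rather than held as individual objects.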

[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-26 Thread Jason Rutherglen (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999789#comment-12999789
]

Jason Rutherglen commented on HBASE-3529:
-

It looks simple to change HDFS-347 (the HDFS-347-branch-20-append.txt patch) to
read using positional reads, I'm sure it's necessary as a block reader is
instantiated per DFSInputStream? read(long position, byte[] buffer, int offset,
int length) calls getBlockRange which is sync'd. Then the read method calls
fetchBlockByteRange which calls BlockReader.newBlockReader, eg, the blockreader
is per thread and isn't reused? So the contention would be in getBlockRange?
Perhaps there's not an issue, or not much of one, if the
HDFS-347-branch-20-append.txt patch (or something like it) is applied (using
HADOOP-6311)?

I guess the go ahead is to write a Lucene Directory that uses HDFS underneath,
that gains concurrency by using DFSInputStream.read(long position, ...)? Oh,
the other issue would be all the overhead from simply loading a byte[1024] (eg,
all the new object creation etc). Hmm... That'll be a problem.

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999528#comment-12999528
]

Jason Rutherglen commented on HBASE-3529:
-

Where is a good 'temp' directory to place the Lucene indexes relative to other
local HBase files?

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999533#comment-12999533
]

ryan rawson commented on HBASE-3529:

There are no local HBase files. You'll have to come up with something yourself,
I guess?

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999537#comment-12999537
]

Jason Rutherglen commented on HBASE-3529:
-

Maybe something relative to HDFS then?

[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread Andrew Purtell (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999542#comment-12999542
]

Andrew Purtell commented on HBASE-3529:
---

I mailed a comment back but it is not showing up fast enough.

We have internally been discussing the addition of a Coprocessor framework API
for reading and writing streams from/to the region data directory in HDFS.

[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread Andrew Purtell (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999545#comment-12999545
]

Andrew Purtell commented on HBASE-3529:
---

We have internally been discussing the addition of a Coprocessor framework API
for reading and writing streams from/to the region data directory in HDFS.

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999546#comment-12999546
]

ryan rawson commented on HBASE-3529:

It's going to be tricky, since with security some people may choose to run HDFS
and HBase as different users. Furthermore, most Hadoop installs have multiple
JBOD-style disks, and places like /tmp won't have much room (my /tmp has
2GB). If you can avoid local files as much as possible, I'd try to do that.

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999552#comment-12999552
]

Jason Rutherglen commented on HBASE-3529:
-

{quote}We have internally been discussing the addition of a Coprocessor
framework API for reading and writing streams from/to the region data directory
in HDFS.{quote}

This'd be good; however, for Lucene we'll need to directly access the local
filesystem for performance reasons, i.e., HDFS sounds like it's going to be
slower than going direct (at the moment). Because the indexes will be local,
we'll need to periodically sync the local index to HDFS. This isn't as
difficult as it sounds, because we can save off a Lucene commit point and write
the checkpoint's index files to HDFS, while letting other Lucene operations
proceed. I'd say we can move to writing directly to HDFS when HBase no longer
uses a heap-based block store (and instead relies on the system IO cache).
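
A rough sketch of the periodic sync described above, under loud assumptions: a
plain directory stands in for HDFS, the list of file names stands in for the
files referenced by a saved Lucene commit point, and index files are treated as
write-once so a file already present at the target is skipped. CheckpointSync
is a hypothetical name, not an HBase or Lucene class.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper for illustration only.
final class CheckpointSync {
    // Copies the files named by one commit point from the local index
    // directory to the target, skipping files the target already holds.
    // Returns the names that were actually copied, in iteration order.
    static List<String> sync(Path localIndex, Path target, Collection<String> commitFiles) {
        List<String> copied = new ArrayList<>();
        try {
            Files.createDirectories(target);
            for (String name : commitFiles) {
                Path dst = target.resolve(name);
                if (Files.exists(dst)) {
                    continue; // assumed write-once, so already synced
                }
                Files.copy(localIndex.resolve(name), dst);
                copied.add(name);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return copied;
    }
}
```

Because only the files of a saved commit point are read, indexing can keep
writing new segment files into the local directory while the copy runs.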

[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread Andrew Purtell (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999559#comment-12999559
]

Andrew Purtell commented on HBASE-3529:
---

Writing the indexes to HDFS is possible after LUCENE-2373? We get direct reads
from HDFS via HDFS-347 and the OS block cache can help there?

[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread Gary Helmling (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999562#comment-12999562
]

Gary Helmling commented on HBASE-3529:
--

Yeah, as Ryan mentions, with security, writing to HDFS via a coprocessor
extension will be easiest to enable.

I wonder if that plus HDFS-347 (which allows reading directly from the local FS
if the block exists on the local DN) would allow for good enough performance?
Of course, HDFS-347 itself is tricky from a security perspective.

If local disk writes are the only solution, then the best option may be to make
the user plan for it and explicitly specify a Lucene index path in the
coprocessor configuration.
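
One way the explicit index path could look, as a sketch only: the property key
below is made up for illustration, and java.util.Properties stands in for the
coprocessor configuration. The lookup fails loudly when the path is missing
rather than falling back to a small default such as /tmp.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper for illustration only.
final class IndexPathConfig {
    // Hypothetical key; not an existing HBase property.
    static final String INDEX_DIR_KEY = "hbase.coprocessor.lucene.index.dir";

    // Returns <configured base dir>/<regionName>, or throws if the base
    // directory was not configured explicitly.
    static Path indexDirFor(Properties conf, String regionName) {
        String base = conf.getProperty(INDEX_DIR_KEY);
        if (base == null || base.trim().isEmpty()) {
            throw new IllegalStateException(INDEX_DIR_KEY + " must be set explicitly");
        }
        return Paths.get(base.trim(), regionName);
    }
}
```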

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999565#comment-12999565
]

Jason Rutherglen commented on HBASE-3529:
-

bq. Writing the indexes to HDFS is possible after LUCENE-2373?

Right, that's implemented in trunk as the append codecs.
https://hudson.apache.org/hudson/job/Lucene-trunk/javadoc//contrib-misc/org/apache/lucene/index/codecs/appending/AppendingCodec.html

bq. We get direct reads from HDFS via HDFS-347 and the OS block cache can help
there?

BlockReaderLocal is sync'd on each method; that's something we outgrew in
Lucene a while back (in its place NIOFSDirectory is most used, with MMap
second). We'd likely have a couple of options here: write to HDFS and
[probably] slow queries to some extent, or write directly to a local directory
and have the mechanical overhead of copying index files in/out of HDFS.

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999579#comment-12999579
]

Jason Rutherglen commented on HBASE-3529:
-

Also, I'm curious about how the HLog works. It's archived into HDFS; is
there a difference between what's archived and what's live (and would
interleaving be necessary)? The reason the HLog needs to be replayed [I
think] is that deletes need to be executed. If we simply iterate/scan from a
given timestamp, we'd get the new rows; however, we'd miss executing deletes.
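
A toy illustration of that difference, not HBase code: LogEdit and IndexCatchUp
are hypothetical names, the "index" is just a set of row keys, and the table is
modelled as a map from live row to its write timestamp.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified stand-in for one HLog edit.
final class LogEdit {
    final long timestamp;
    final String row;
    final boolean delete;

    LogEdit(long timestamp, String row, boolean delete) {
        this.timestamp = timestamp;
        this.row = row;
        this.delete = delete;
    }
}

// Hypothetical helper for illustration only.
final class IndexCatchUp {
    // Replays edits at or after `since` in log order, applying deletes too.
    static Set<String> replay(Set<String> index, List<LogEdit> log, long since) {
        Set<String> result = new TreeSet<>(index);
        for (LogEdit e : log) {
            if (e.timestamp < since) {
                continue;
            }
            if (e.delete) {
                result.remove(e.row);
            } else {
                result.add(e.row);
            }
        }
        return result;
    }

    // Models a scan from `since`: only rows that still exist and were written
    // at or after `since` come back. A deleted row is simply absent from the
    // scan, so nothing removes its stale entry from the index.
    static Set<String> scan(Set<String> index, Map<String, Long> liveRows, long since) {
        Set<String> result = new TreeSet<>(index);
        for (Map.Entry<String, Long> e : liveRows.entrySet()) {
            if (e.getValue() >= since) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

Starting from an index holding rows a and b, a log of "delete a, put c" replays
to {b, c}, while the scan leaves the deleted row a in place and yields
{a, b, c}.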

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999595#comment-12999595
]

Jason Rutherglen commented on HBASE-3529:
-

In the RegionObserver/Coprocessor I don't think there are methods to access the
log replay (on server restart). Is that something that's planned?

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999598#comment-12999598
]

Jason Rutherglen commented on HBASE-3529:
-

To answer the previous question there's this issue: HBASE-3257

And on memstore flush, we'll do a Lucene index commit to ensure that when we
replay the HLog, we won't need to access [potentially] out of date HLog
entries. We can store the checkpoint meta-data into the Lucene commit, which
obviates the need to implement terms dict last term access.
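
A small model of that checkpoint idea, under stated assumptions: a plain Map
stands in for the metadata stored with the Lucene commit, sequence ids stand in
for HLog positions, and CheckpointedIndex and its key name are hypothetical.
On replay, entries at or below the committed sequence id are skipped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration only.
final class CheckpointedIndex {
    static final String SEQ_KEY = "hlog.lastSeqId"; // hypothetical metadata key

    private final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();
    private final Map<String, String> commitData = new HashMap<>();
    private long lastApplied = -1L;

    // Buffers a document and remembers the log sequence id it came from.
    void add(long seqId, String doc) {
        pending.add(doc);
        lastApplied = seqId;
    }

    // Called on memstore flush: makes pending documents durable and records
    // the last applied sequence id alongside the commit.
    void commit() {
        committed.addAll(pending);
        pending.clear();
        commitData.put(SEQ_KEY, Long.toString(lastApplied));
    }

    // Simulates a server crash: everything since the last commit is lost.
    void crash() {
        pending.clear();
        lastApplied = committedSeqId();
    }

    long committedSeqId() {
        String v = commitData.get(SEQ_KEY);
        return v == null ? -1L : Long.parseLong(v);
    }

    // Replays only log entries newer than the committed checkpoint.
    // Returns how many entries were applied.
    int replay(long[] seqIds, String[] docs) {
        long checkpoint = committedSeqId();
        int applied = 0;
        for (int i = 0; i < seqIds.length; i++) {
            if (seqIds[i] <= checkpoint) {
                continue; // already covered by the last commit
            }
            add(seqIds[i], docs[i]);
            applied++;
        }
        return applied;
    }

    List<String> documents() {
        List<String> all = new ArrayList<>(committed);
        all.addAll(pending);
        return all;
    }
}
```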

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999676#comment-12999676
]

Jason Rutherglen commented on HBASE-3529:
-

Is
https://issues.apache.org/jira/secure/attachment/12470743/HDFS-347-branch-20-append.txt
the patch applied to CDH? If so, the readChunk method isn't implemented. Is
there a plan to implement that, perhaps with NIO positional read? Implementing
readChunk would make storing the indexes in HDFS entirely tenable.

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999684#comment-12999684
]

ryan rawson commented on HBASE-3529:

HDFS-347 is not in CDH nor in branch-20-append.

As for a plan to implement it, perhaps you should?

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999687#comment-12999687
]

Jason Rutherglen commented on HBASE-3529:
-

{quote}HDFS-347 is not in CDH nor in branch-20-append.

As for a plan to implement it, perhaps you should?{quote}

Really? Ah, I guess I misread this:
https://issues.apache.org/jira/browse/HBASE-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12997267#comment-12997267

Sure, I can have a go at an NIO positional read version; it'll be a good
learning experience. Are there any caveats to be aware of?

[jira] Commented: (HBASE-3529) Add search to HBase

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999688#comment-12999688
]

ryan rawson commented on HBASE-3529:

I do not know; the whole thing is pretty greenfield. There are a few different
implementations of HDFS-347, and I haven't actually seen a credible attempt at
really getting it into a shipping Hadoop yet. The test patches are pretty
great, but they are POC and won't actually be shipping (due to Hadoop security).

You can give it a shot, but be warned you might not get much for your troubles
in terms of committed code.

[jira] Commented: (HBASE-3529) Add search to HBase

2011-02-25 Thread stack (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12999718#comment-12999718
]

stack commented on HBASE-3529:
--

@Jason We could get hdfs-347 applied to branch-0.20-append. Us HBasers are
going to talk it up, that folks should apply it to their hadoop since the
benefit is so great. CDH will have something like an hdfs-347 but probably not
till CDH4 (Todd talks of a version of hdfs-347 but one that will work w/
security -- see his patch up in hdfs-237 as opposed to the amended Dhruba patch
posted by Ryan). A hdfs-347 probably won't show in apache hadoop till
0.23/0.24 would be my guess.

[jira] Commented: (HBASE-3529) Add search to HBase