ported lucandra: lucene index on HBase
Hi,

Lucandra stores a Lucene index on Cassandra:
http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend

As the author of Lucandra writes: "I'm sure something similar could be built on hbase."

So here it is:
http://github.com/thkoch2001/lucehbase

This is only a first prototype and has not been tested on anything real yet. But if you're interested, please join me to get it production ready!

I propose to keep this thread on hbase-user and java-dev only.

Would it make sense to aim for this project to become an HBase contrib? Or a Lucene contrib?

Best regards,
Thomas Koch, http://www.koch.ro

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2215) paging collector
[ https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849639#action_12849639 ]

Michael McCandless commented on LUCENE-2215:

This is a neat collector!

I like the idea of chaining/filtering... couldn't we put this in core (under TFC/TSDC.create), but instead of doubling the 12 specialized (anonymous) impls we now have, just delegate? Ie, we'd make a FilteredCollector, taking another collector when it's created, and then on every collect call, only if the hit is "weak" enough (ie is worse than what the app provided as prev low score/doc) would it forward it to the delegate?

I guess we should test perf w/ (the new additions to benchmark -- yay!) to see if specializing the code (even anonymously) is warranted.

The indent whitespace needs to be fixed to 2 spaces...

> paging collector
>
> Key: LUCENE-2215
> URL: https://issues.apache.org/jira/browse/LUCENE-2215
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Affects Versions: 2.4, 3.0
> Reporter: Adam Heinz
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: IterablePaging.java, LUCENE-2215.patch,
> PagingCollector.java, TestingPagingCollector.java
>
> http://issues.apache.org/jira/browse/LUCENE-2127?focusedCommentId=12796898&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12796898
> Somebody assign this to Aaron McCurry and we'll see if we can get enough
> votes on this issue to convince him to upload his patch. :)

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
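The delegating collector suggested in the comment above could look roughly like the sketch below. It is only an illustration: `HitCollector` is a made-up two-argument interface, not Lucene's `Collector` API, and `FilteredCollector`, `prevLowScore` and `prevLowDoc` are names taken from or invented around the comment's wording. The tie-break on doc id (larger doc id counts as "worse") is an assumption as well.

```java
import java.util.ArrayList;
import java.util.List;

/** Demo entry point; prints the doc ids that reach the delegate. */
final class PagingDemo {
    public static void main(String[] args) {
        List<Integer> forwarded = new ArrayList<>();
        HitCollector sink = (doc, score) -> forwarded.add(doc);
        // Last hit of the previous page: score 2.0, doc 7 (made-up values).
        HitCollector paged = new FilteredCollector(sink, 2.0f, 7);
        paged.collect(1, 3.0f); // better score: already shown, dropped
        paged.collect(9, 2.0f); // tie on score, larger doc id: forwarded
        paged.collect(4, 1.5f); // worse score: forwarded
        System.out.println(forwarded);
    }
}

/** Hypothetical minimal collector interface; NOT Lucene's Collector. */
interface HitCollector {
    void collect(int doc, float score);
}

/** Forwards a hit to the delegate only if it ranks after the previous page's last hit. */
final class FilteredCollector implements HitCollector {
    private final HitCollector delegate;
    private final float prevLowScore; // score of the last hit on the previous page
    private final int prevLowDoc;     // doc id of that hit, used as tie-breaker

    FilteredCollector(HitCollector delegate, float prevLowScore, int prevLowDoc) {
        this.delegate = delegate;
        this.prevLowScore = prevLowScore;
        this.prevLowDoc = prevLowDoc;
    }

    @Override
    public void collect(int doc, float score) {
        // "Weak" enough: lower score, or equal score with a larger doc id (assumed order).
        boolean worse = score < prevLowScore
            || (score == prevLowScore && doc > prevLowDoc);
        if (worse) {
            delegate.collect(doc, score);
        }
    }
}
```

The extra comparison runs once per hit before the delegate sees it, which is the cost the benchmark mentioned in the comment would have to measure against specialized implementations.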
Re: Baby steps towards making Lucene's scoring more flexible...
On Mon, Mar 22, 2010 at 12:45 PM, Marvin Humphrey wrote:
> On Thu, Mar 18, 2010 at 05:16:23AM -0500, Michael McCandless wrote:
>> Also, will Lucy store the original stats?
>
> These?
>
>   * Total number of tokens in the field.
>   * Number of unique terms in the field.
>   * Doc boost.
>   * Field boost.

Also sum(tf). Robert can generate more :)

> That would depend on which Similarity the user specs for that field. In
> other words, it's just another data-reduction decision: if the Sim needs it,
> keep it, and if it doesn't, throw it away.

OK.

> Incidentally, what are you planning to do about field boost if it's not always
> 1.0? Are you going to store full 32-bit floats?

For starters, yes. We may (later) want to make a new attr that sets the #bits (levels/precision) you want... then uses packed ints to encode.

>> Ie so the chosen Sim can properly recompute all boost bytes (if it uses
>> those), for scoring models that "pivot" based on avg's of these stats?
>
> Yes, we could support that.
>
> It's not high on my todo-list for core Lucy, though: poor payoff for all the
> complexity it would introduce, particularly file format complexity with its
> heavy backwards compatibility burden. Right now, we only have the boost
> bytes, and the fact that they are used for length normalization, field boost,
> and doc boost is incidental. If we add all the raw stats, that's a bunch of
> stuff we have to support for a long time, yet which doesn't yield practical
> advantages for us yet.
>
> I'd be much more interested in finding a way to support such a feature as an
> extension.

I was specifically asking if Lucy will allow the user to force true average to be recomputed, ie, at commit time from the writer. It's more costly and often not needed (ie, once your index is large enough, new docs "typically" won't shift the average much). But I imagine some users will want "true average".
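The "#bits (levels/precision)" idea mentioned above can be illustrated with a plain linear quantizer. This is a sketch under assumptions, not what Lucene or Lucy does: the class name `BoostQuantizer`, the fixed `[min, max]` range, and the linear mapping are all invented for illustration. The resulting small integer codes are the kind of values a packed-int encoding could then store in `bits` bits each.

```java
/** Linear quantizer: maps a boost in [min, max] to an integer code of `bits` bits. */
final class BoostQuantizer {
    private final float min;
    private final float max;
    private final int maxCode; // highest code, 2^bits - 1

    BoostQuantizer(int bits, float min, float max) {
        if (bits < 1 || bits > 16 || !(max > min)) {
            throw new IllegalArgumentException("need 1..16 bits and max > min");
        }
        this.min = min;
        this.max = max;
        this.maxCode = (1 << bits) - 1;
    }

    /** Clamps the boost into [min, max] and rounds to the nearest level. */
    int encode(float boost) {
        float clamped = Math.max(min, Math.min(max, boost));
        return Math.round((clamped - min) / (max - min) * maxCode);
    }

    /** Returns the boost value that a code stands for. */
    float decode(int code) {
        return min + (max - min) * code / maxCode;
    }

    public static void main(String[] args) {
        BoostQuantizer q = new BoostQuantizer(8, 0.0f, 4.0f); // made-up range
        int code = q.encode(1.0f);
        System.out.println(code + " -> " + q.decode(code));
    }
}
```

The round-trip error is at most half a step, `(max - min) / (2^bits - 1) / 2`, so the number of bits directly trades index size against boost precision.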
>> > In any case, the proposal to start delaying Sim choice to search-time --
>> > while a nice feature for Lucene -- is a non-starter for Lucy. We can't do
>> > that because it would kill the cheap-Searcher model to generate boost bytes
>> > at Searcher construction time and cache them within the object. We need
>> > those boost bytes written to disk so we can mmap them and share them
>> > amongst many cheap Searchers.
>>
>> It'd seem like Lucy could re-gen the boost bytes if a different Sim
>> were selected, or, the current Sim hadn't yet computed & cached its
>> bytes? But then logically this means a "reader" needs write
>> permission to the index dir, which is not good...
>
> Whatever's reading the boost bytes can't tell the difference between process
> RAM and mmap'd RAM, so write-permission on the index dir isn't required.

Hmm if you could somehow soften this... so that a custom Sim could regen its boost bytes (if it needed to), write them into the index, and then "whoever's reading" can mmap... that'd buy you some flexibility back.

> What's trickier is that Schemas are not normally mutable, and that they are
> part of the index. You don't have to supply an Analyzer, or a Similarity, or
> anything else when opening a Searcher -- you just provide the location of the
> index, and the Schema gets deserialized from the latest schema_NNN.json file.
> That has many advantages, e.g. inadvertent Analyzer conflicts are pretty much
> a thing of the past for us.

That's nice... though... is it too rigid? Do users even want to pick a different analyzer at search time?

> But it makes your feature request of runtime settability for
> Similarity awkward to implement: by the time you have a Schema
> object to work with, the Searcher is already open.
>
>     Searcher searcher = new Searcher("/path/to/index");
>     Schema schema = searcher.getSchema();
>     schema.setSim("content", altSim); // Too late, and not implemented anyway.

I see...
>> > To my mind, these are all related data reduction tasks:
>> >
>> >   * Omit doc-boost and field-boost, replacing them with a single float
>> >     docXfield multiplier -- because you never need doc-boost on its own.
>> >   * Omit length-in-tokens, term-cardinality, doc-boost, and field-boost,
>> >     replacing them all with a single boost byte -- because for the kind of
>> >     scoring you want to do, you don't need all those raw stats.
>> >   * Omit the boost byte, because you don't need to do scoring at all.
>> >   * Omit positions because you don't need PhraseQueries, etc. to match.
>>
>> I wouldn't group this one with the others -- I mean technically it is
>> "data reduction" -- but omitting positions means certain queries
>> (PhraseQuery) won't work even in "match only" searching. Whereas the
>> rest of these examples affect how scoring is done (or whether it's
>> done).
>
> Couldn't disagree more. Omitting positions is *exactly* the kind of data
> reduction task which we know is safe to perform when a user specifically tells
> us they don't need PhraseQueries by specifying a MinimalSimi
[jira] Updated: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Smith updated LUCENE-2345:
--
Attachment: LUCENE-2345_3.0.patch

Here's a patch against 3.0 that provides the SegmentReaderFactory ability (not tested yet, but I'll be doing that shortly as I integrate this functionality).

It adds a SegmentReaderFactory.
The IndexWriter now has a getter and setter for setting this.
SegmentReader has a new protected method init() which is called after the segment reader has been initialized (to allow subclasses to hook this action and do additional initialization, etc).
Added 2 new IndexReader.open() calls that allow specifying the SegmentReaderFactory.

> Make it possible to subclass SegmentReader
> --
>
> Key: LUCENE-2345
> URL: https://issues.apache.org/jira/browse/LUCENE-2345
> Project: Lucene - Java
> Issue Type: Wish
> Components: Index
> Reporter: Tim Smith
> Fix For: 3.1
>
> Attachments: LUCENE-2345_3.0.patch
>
> I would like the ability to subclass SegmentReader for numerous reasons:
> * to capture initialization/close events
> * attach custom objects to an instance of a segment reader (caches,
>   statistics, so on and so forth)
> * override methods on segment reader as needed
> currently this isn't really possible
> I propose adding a SegmentReaderFactory that would allow creating custom
> subclasses of SegmentReader
> default implementation would be something like:
> {code}
> public class SegmentReaderFactory {
>   public SegmentReader get(boolean readOnly) {
>     return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
>   }
>   public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
>     return newSegmentReader(readOnly);
>   }
> }
> {code}
> It would then be made possible to pass a SegmentReaderFactory to IndexWriter
> (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open,
> etc)
> I could prepare a patch if others think this has merit
> Obviously, this API would be "experimental/advanced/will change in future"
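How a custom factory might be used is sketched below with stand-in classes, so that it compiles without Lucene. Everything prefixed `Stub` is hypothetical and only mimics the shape of the proposal; the real patch may differ. In particular, having `reopen` delegate to `get`, and the `newReader` hook, are assumptions made for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Demo entry point: opens a reader through a custom factory. */
final class FactoryDemo {
    public static void main(String[] args) {
        CachingSegmentReaderFactory factory = new CachingSegmentReaderFactory();
        StubSegmentReader reader = factory.get(true);
        System.out.println(reader.getClass().getSimpleName() + ", opened=" + factory.opened);
    }
}

/** Stand-in for SegmentReader; NOT the Lucene class. */
class StubSegmentReader {
    final boolean readOnly;
    StubSegmentReader(boolean readOnly) { this.readOnly = readOnly; }
    /** Hook called once the reader is fully constructed. */
    protected void init() {}
}

/** Stand-in for the proposed factory. */
class StubSegmentReaderFactory {
    public StubSegmentReader get(boolean readOnly) {
        StubSegmentReader reader = newReader(readOnly);
        reader.init(); // let subclasses run extra initialization
        return reader;
    }
    /** Subclasses override this to supply their own reader type. */
    protected StubSegmentReader newReader(boolean readOnly) {
        return new StubSegmentReader(readOnly);
    }
    /** Assumption: a reopen simply builds a fresh reader through get(). */
    public StubSegmentReader reopen(StubSegmentReader reader, boolean readOnly) {
        return get(readOnly);
    }
}

/** Custom reader carrying a per-segment cache, one use case named in the issue. */
class CachingSegmentReader extends StubSegmentReader {
    final Map<String, Object> cache = new HashMap<>();
    boolean initialized;
    CachingSegmentReader(boolean readOnly) { super(readOnly); }
    @Override protected void init() { initialized = true; }
}

/** Custom factory that counts how many readers it has opened. */
class CachingSegmentReaderFactory extends StubSegmentReaderFactory {
    int opened;
    @Override protected StubSegmentReader newReader(boolean readOnly) {
        opened++;
        return new CachingSegmentReader(readOnly);
    }
}
```

Calling `init()` from the factory rather than from the constructor means subclass fields such as `cache` are already assigned when the hook runs.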
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849728#action_12849728 ]

Shai Erera commented on LUCENE-2345:

bq. The IndexWriter now has a getter and setter for setting this

If this is not expected to change during the lifetime of IW, I think it should be added to IWC when you upgrade the patch to 3.1.
[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader
[ https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849731#action_12849731 ]

Tim Smith commented on LUCENE-2345:
---
that was my plan
Re: Baby steps towards making Lucene's scoring more flexible...
On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote:
> >> Also, will Lucy store the original stats?
> >
> > These?
> >
> >   * Total number of tokens in the field.
> >   * Number of unique terms in the field.
> >   * Doc boost.
> >   * Field boost.
>
> Also sum(tf). Robert can generate more :)

Hmm, aren't "Total number of tokens in the field" and sum(tf) normally equivalent? I guess there might be analyzers for which that isn't true, e.g. those which perform synonym-injection?

In any case, "sum(tf)" is probably a better definition, because it makes no ancillary claims...

> > Incidentally, what are you planning to do about field boost if it's not
> > always 1.0? Are you going to store full 32-bit floats?
>
> For starters, yes.

OK, how are those going to be encoded? IEEE 754? Big-endian?

http://en.wikipedia.org/wiki/Endianness#Floating-point_and_endianness

> We may (later) want to make a new attr that sets
> the #bits (levels/precision) you want... then uses packed ints to
> encode.

I'm concerned that the bit-wise entropy of floats may make them a poor match for compression via packed ints. We'll probably get a compressed representation which is larger than the original.

Are there any standard algorithms out there for compressing IEEE 754 floats? RLE works, but only with certain data patterns.

... [ time passes ] ...

Hmm, maybe not:

http://stackoverflow.com/questions/2238754/compression-algorithm-for-ieee-754-data

> I was specifically asking if Lucy will allow the user to force true
> average to be recomputed, ie, at commit time from the writer.

That's theoretically possible. We'd have to implement the reader the same way we have DeletionsReader -- the most recent segment may contain data which applies to older segments.

Here's the DeletionsReader code, which searches backwards through the segments looking for a particular file:

    /* Start with deletions files in the most recently added segments and work
     * backwards. The first one we find which addresses our segment is the
     * one we need. */
    for (i = VA_Get_Size(segments) - 1; i >= 0; i--) {
        Segment *other_seg = (Segment*)VA_Fetch(segments, i);
        Hash *metadata
            = (Hash*)Seg_Fetch_Metadata_Str(other_seg, "deletions", 9);
        if (metadata) {
            Hash *files = (Hash*)CERTIFY(
                Hash_Fetch_Str(metadata, "files", 5), HASH);
            Hash *seg_files_data
                = (Hash*)Hash_Fetch(files, (Obj*)my_seg_name);
            if (seg_files_data) {
                Obj *count = (Obj*)CERTIFY(
                    Hash_Fetch_Str(seg_files_data, "count", 5), OBJ);
                del_count = (i32_t)Obj_To_I64(count);
                del_file = (CharBuf*)CERTIFY(
                    Hash_Fetch_Str(seg_files_data, "filename", 8), CHARBUF);
                break;
            }
        }
    }

What we'd do is write the regenerated boost bytes for *all* segments to the most recent segment. It would be roughly analogous to building up an NRT reader.

> > What's trickier is that Schemas are not normally mutable, and that they are
> > part of the index. You don't have to supply an Analyzer, or a Similarity,
> > or anything else when opening a Searcher -- you just provide the location
> > of the index, and the Schema gets deserialized from the latest
> > schema_NNN.json file. That has many advantages, e.g. inadvertent Analyzer
> > conflicts are pretty much a thing of the past for us.
>
> That's nice... though... is it too rigid? Do users even want to pick
> a different analyzer at search time?

It's not common. To my mind, the way a field is tokenized is part of its field definition, thus the Analyzer is part of the field definition, thus the analyzer is part of the schema and needs to be stored with the index.

Still, we support different Analyzers at search time by way of QueryParser. QueryParser's constructor requires a Schema, but also accepts an optional Analyzer which if supplied will be used instead of the Analyzers from the Schema.

> > Maybe aggressive automatic data-reduction makes more sense in the context of
> > "flexible matching", which is more expansive than "flexible scoring"?
>
> I think so. Maybe it shouldn't be called a Similarity (which to me
> (though, carrying a heavy curse of knowledge burden...) means
> "scoring")? Matcher?

Heh. "Matcher" is taken. It's a crucial class, too, roughly combining the roles of Lucene's Scorer and DocIDSetIterator.

The first alternative that comes to mind is "Relevance", because not only can one thing's relevance to another be continuously variable (i.e. score), it can also be binary: relevant/not-relevant (i.e. match).

But I don't see why "Relevance", "Matcher", or anything else would be so much better than "Similarity". I think this is your hang up. ;)

> > I'm +0 (FWIW) on search-time Sim settability for Lucene. It's a nice
> > feature, but I don't
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849806#action_12849806 ]

Jason Rutherglen commented on LUCENE-2324:
--
Michael, I'm guessing this patch needs to be updated as per LUCENE-2329?

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2324-no-pooling.patch
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.
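The "N fully independent RAM segments" approach from the issue summary can be pictured with the toy model below. It is a sketch only: `PerThreadIndexer` and `RamSegment` are invented names, documents are plain strings, and a "flush" just moves the buffered docs into an in-memory list that stands in for an on-disk segment. Binding each thread to one buffer via a `ThreadLocal` is one possible assignment policy, assumed here for simplicity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy indexer with N independent in-RAM buffers that flush on their own. */
final class PerThreadIndexer {
    private final RamSegment[] segments;
    private final List<List<String>> flushed =
        Collections.synchronizedList(new ArrayList<>());
    private final AtomicInteger nextSlot = new AtomicInteger();
    private final ThreadLocal<RamSegment> mySegment;

    PerThreadIndexer(int numSegments, int maxBufferedDocs) {
        RamSegment[] segs = new RamSegment[numSegments];
        for (int i = 0; i < numSegments; i++) {
            segs[i] = new RamSegment(maxBufferedDocs, flushed);
        }
        this.segments = segs;
        // Each thread is bound to one buffer, assigned round-robin on first use.
        this.mySegment = ThreadLocal.withInitial(
            () -> segs[Math.floorMod(nextSlot.getAndIncrement(), segs.length)]);
    }

    void addDocument(String doc) { mySegment.get().add(doc); }

    void flushAll() {
        for (RamSegment s : segments) s.flush();
    }

    /** Snapshot of the flushed "segments"; a later merge step would combine them. */
    List<List<String>> flushedSegments() { return new ArrayList<>(flushed); }

    public static void main(String[] args) throws InterruptedException {
        PerThreadIndexer indexer = new PerThreadIndexer(2, 3);
        Runnable work = () -> {
            for (int i = 0; i < 5; i++) {
                indexer.addDocument(Thread.currentThread().getName() + "-" + i);
            }
        };
        Thread a = new Thread(work, "a");
        Thread b = new Thread(work, "b");
        a.start(); b.start();
        a.join(); b.join();
        indexer.flushAll();
        System.out.println(indexer.flushedSegments().size() + " segments flushed");
    }
}

/** One private buffer; flushes independently of the others. */
final class RamSegment {
    private final List<String> docs = new ArrayList<>();
    private final int maxBufferedDocs;
    private final List<List<String>> sink; // thread-safe list standing in for disk

    RamSegment(int maxBufferedDocs, List<List<String>> sink) {
        this.maxBufferedDocs = maxBufferedDocs;
        this.sink = sink;
    }

    synchronized void add(String doc) {
        docs.add(doc);
        if (docs.size() >= maxBufferedDocs) flush();
    }

    synchronized void flush() {
        if (docs.isEmpty()) return;
        sink.add(new ArrayList<>(docs));
        docs.clear();
    }
}
```

Because each buffer has its own lock, one buffer flushing does not block threads that are adding to another one, which is the concurrency benefit the summary describes.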
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849808#action_12849808 ]

Jason Rutherglen commented on LUCENE-2324:
--
Actually, I just browsed the patch again, I don't think it implements private doc writers as of yet?

I think you're right, we can get this issue completed. LUCENE-2312's path looks clear at this point. Shall I take a whack at it?
[jira] Updated: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-2324:
--
Attachment: (was: lucene-2324-no-pooling.patch)
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849819#action_12849819 ]

Michael Busch commented on LUCENE-2324:
---
Hey Jason,

Disregard my patch here. I just experimented with removal of pooling, but then did LUCENE-2329 instead. TermsHash and TermsHashPerThread are now much simpler, because all the pooling code is gone after 2329 was committed. Should make it a little easier to get this patch done.

Sure it'd be awesome if you could provide a patch here. I can help you, we should just frequently post patches here so that we don't both work on the same areas.
[jira] Commented: (LUCENE-2215) paging collector
[ https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849843#action_12849843 ]

Grant Ingersoll commented on LUCENE-2215:
-
Mike, don't you think, though, that through a fairly simple update of some of the clauses to appropriately short circuit things we can just hook this into the existing collectors w/o any need for delegation or changes? Let me try a patch. Now that the benchmark stuff is in, we should be able to test.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849844#action_12849844 ]

Jason Rutherglen commented on LUCENE-2324:
--
Michael, I'm working on a patch and will post one (hopefully) shortly.
[jira] Commented: (LUCENE-2215) paging collector
[ https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849851#action_12849851 ]

Uwe Schindler commented on LUCENE-2215:
---
Hey, and I want to fix the NaN thing in TSDC: LUCENE-2271

Maybe when we delegate, we can also use my cool code that switches the delegate to remove on comparison after the queue is full.
[jira] Commented: (LUCENE-2215) paging collector
[ https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849863#action_12849863 ]

Michael McCandless commented on LUCENE-2215:

bq. ...through a fairly simple update of some of the clauses to appropriately short circuit things we can just hook this into the existing collectors w/o any need for delegation or changes? Let me try a patch. Now that the benchmark stuff is in, we should be able to test.

This'd make me nervous... Ie I don't think we should insert bytecodes for the 99.9% of searches that wouldn't make use of this, even if we can't uncover a slowdown with benchmarking.

We should still benchmark it though (I'm curious)... we should also benchmark the delegate solution.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849899#action_12849899 ]

Michael Busch commented on LUCENE-2324:
---
Awesome!
[jira] Created: (LUCENE-2346) Explore other in-memory postinglist formats for realtime search
Explore other in-memory postinglist formats for realtime search

Key: LUCENE-2346
URL: https://issues.apache.org/jira/browse/LUCENE-2346
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: 3.1

The current in-memory posting list format might not be optimal for searching. VInt decoding performance and the lack of skip lists would arguably be the biggest bottlenecks.

For LUCENE-2312 we should investigate other formats. Some ideas:

- PFOR or packed ints for posting slices?
- Maybe even int[] slices instead of byte slices? This would be great for search performance, but the additional memory overhead might not be acceptable.
- For realtime search it's usually desirable to evaluate the most recent documents first. So using backward pointers instead of forward pointers and having the postinglist pointer point to the most recent docID in a list is something to consider.
- Skipping: if we use fixed-length postings ([packed] ints) we can do binary search within a slice. We can also locate a pointer then without scanning and thus skip entire slices quickly. Is that sufficient or would we need more skipping layers, so that it's possible to skip directly to particular slices?

It would be awesome to find a format that doesn't slow down "normal" indexing, but is very efficient for in-memory searches. If we can't find such a fits-all format, we should have a separate indexing chain for real-time indexing.
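The "Skipping" idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Lucene's in-memory format: it assumes a slice holds ascending docIDs as plain ints, and the class `IntPostingSlice` with its methods is hypothetical.

```java
import java.util.Arrays;

/** Hypothetical fixed-width posting slice: ascending docIDs stored as plain ints. */
final class IntPostingSlice {
    private final int[] docIds; // only the first `size` entries are valid
    private final int size;

    IntPostingSlice(int[] docIds, int size) {
        if (size < 1 || size > docIds.length) {
            throw new IllegalArgumentException("size must be in [1, docIds.length]");
        }
        this.docIds = docIds;
        this.size = size;
    }

    /** Largest (most recently added) docID; enough to decide whether the whole slice can be skipped. */
    int lastDocId() {
        return docIds[size - 1];
    }

    /** Index of the first docID >= target, or size() if every docID is smaller. */
    int advance(int target) {
        int idx = Arrays.binarySearch(docIds, 0, size, target);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        return idx >= 0 ? idx : -(idx + 1);
    }

    int docAt(int index) {
        return docIds[index];
    }

    int size() {
        return size;
    }
}
```

A reader could compare the target against `lastDocId()` to skip a whole slice without scanning it, and call `advance` only on the slice that may contain the target. Whether this is enough, or whether extra skipping layers are needed, is the open question the issue raises.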
[jira] Created: (LUCENE-2347) Dump WordNet to SOLR Synonym format
Dump WordNet to SOLR Synonym format

Key: LUCENE-2347
URL: https://issues.apache.org/jira/browse/LUCENE-2347
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Affects Versions: 3.0.1
Reporter: Bill Bell

This enhancement allows you to dump v2 of WordNet to SOLR synonym format! Get all your syns loaded easily.

1. You can load all synonyms from http://wordnetcode.princeton.edu/2.0/ WordNet V2 to SOLR by first using the Syns2Index program http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/wordnet/Syns2Index.html
   Get WNprolog from http://wordnetcode.princeton.edu/2.0/
2. We modified this program to work with SOLR (see attached) on amidev.kaango.com in /vol/src/lucene/contrib/wordnet
   vi /vol/src/lucene/contrib/wordnet/src/java/org/apache/lucene/wordnet/Syns2Solr.java
3. Run ant
4. java -classpath /vol/src/lucene/build/contrib/wordnet/lucene-wordnet-3.1-dev.jar org.apache.lucene.wordnet.Syns2Solr prolog/wn_s.pl solr > index_synonyms.txt
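To make the conversion concrete, here is a rough sketch of the kind of transformation the steps above describe. It is not the attached Syns2Solr.java. It assumes that wn_s.pl lines have the form `s(synset_id,w_num,'word',ss_type,sense_number,tag_count).`, that apostrophes inside a word are doubled, and that the target format is one line of comma-separated equivalent terms per synset. The class `PrologSynonyms` and its method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: groups WordNet prolog "s(...)" facts by synset id. */
final class PrologSynonyms {
    // Assumed line shape, e.g.: s(100001740,1,'entity',n,1,11).
    private static final Pattern S_LINE =
            Pattern.compile("^s\\((\\d+),\\d+,'(.*)',[a-z],\\d+,\\d+\\)\\.$");

    /** Returns one comma-separated line per synset that has at least two words. */
    static List<String> toSolrLines(List<String> prologLines) {
        Map<String, List<String>> bySynset = new LinkedHashMap<>();
        for (String line : prologLines) {
            Matcher m = S_LINE.matcher(line.trim());
            if (!m.matches()) {
                continue; // ignore anything that is not an s(...) fact
            }
            // Assumption: a doubled apostrophe encodes a literal apostrophe.
            String word = m.group(2).replace("''", "'");
            bySynset.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(word);
        }
        List<String> out = new ArrayList<>();
        for (List<String> words : bySynset.values()) {
            if (words.size() > 1) { // a single word has no synonyms to emit
                out.add(String.join(",", words));
            }
        }
        return out;
    }
}
```

The sketch does not escape commas inside terms and does not lowercase or deduplicate words; a real converter would probably need to decide on both.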
[jira] Updated: (LUCENE-2347) Dump WordNet to SOLR Synonym format
[ https://issues.apache.org/jira/browse/LUCENE-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Bell updated LUCENE-2347:

Attachment: Syns2Solr.java

> Dump WordNet to SOLR Synonym format
>
> Key: LUCENE-2347
> URL: https://issues.apache.org/jira/browse/LUCENE-2347
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Affects Versions: 3.0.1
> Reporter: Bill Bell
> Attachments: Syns2Solr.java
>
> This enhancement allows you to dump v2 of WordNet to SOLR synonym format! Get all your syns loaded easily.
> 1. You can load all synonyms from http://wordnetcode.princeton.edu/2.0/ WordNet V2 to SOLR by first using the Syns2Index program http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/wordnet/Syns2Index.html
>    Get WNprolog from http://wordnetcode.princeton.edu/2.0/
> 2. We modified this program to work with SOLR (see attached) on amidev.kaango.com in /vol/src/lucene/contrib/wordnet
>    vi /vol/src/lucene/contrib/wordnet/src/java/org/apache/lucene/wordnet/Syns2Solr.java
> 3. Run ant
> 4. java -classpath /vol/src/lucene/build/contrib/wordnet/lucene-wordnet-3.1-dev.jar org.apache.lucene.wordnet.Syns2Solr prolog/wn_s.pl solr > index_synonyms.txt
[jira] Commented: (LUCENE-2215) paging collector
[ https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849961#action_12849961 ]

Grant Ingersoll commented on LUCENE-2215:

Yeah, but one could make the argument, Mike, that the existing "optimizations" are useless for the most common case, since I think it's safe to say most applications implement paging. Of course, that being said, most users don't page all that deeply. Also, for something like Solr that prefetches the top 50 it might not be good, either.

Still, in my mind it is one additional boolean check, as in:

{code}
if ( (current stuff) || (pagingInfoPresent == true && paging check) ) ...
{code}

pagingInfoPresent can be determined at construction time and that whole clause would be short-circuited very quickly. That being said, delegation could be done at construction time, too, and it separates things more cleanly.

I'll try to put up my version tomorrow.

> paging collector
>
> Key: LUCENE-2215
> URL: https://issues.apache.org/jira/browse/LUCENE-2215
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Affects Versions: 2.4, 3.0
> Reporter: Adam Heinz
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: IterablePaging.java, LUCENE-2215.patch, PagingCollector.java, TestingPagingCollector.java
>
> http://issues.apache.org/jira/browse/LUCENE-2127?focusedCommentId=12796898&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12796898
> Somebody assign this to Aaron McCurry and we'll see if we can get enough votes on this issue to convince him to upload his patch. :)
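The inline check proposed above might look roughly like the sketch below. It is a simplified stand-in, not Lucene's TopScoreDocCollector: `InlinePagingCollector`, its fields, and the `minCompetitiveScore` threshold (playing the role of "(current stuff)") are hypothetical. It also assumes hits are ordered by descending score with ties broken by ascending docID.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified collector; not Lucene's TopScoreDocCollector. */
final class InlinePagingCollector {
    private final boolean pagingInfoPresent;  // decided once, at construction time
    private final float prevLowScore;         // lowest score returned on the previous page
    private final int prevLowDoc;             // docID of that lowest-ranked hit
    private final float minCompetitiveScore;  // stands in for "(current stuff)"
    private final List<Integer> accepted = new ArrayList<>();

    /** No paging context: the paging clause is short-circuited on every hit. */
    InlinePagingCollector(float minCompetitiveScore) {
        this(minCompetitiveScore, false, 0f, 0);
    }

    /** Paging context supplied by the application from the previous page. */
    InlinePagingCollector(float minCompetitiveScore, float prevLowScore, int prevLowDoc) {
        this(minCompetitiveScore, true, prevLowScore, prevLowDoc);
    }

    private InlinePagingCollector(float minCompetitiveScore, boolean pagingInfoPresent,
                                  float prevLowScore, int prevLowDoc) {
        this.minCompetitiveScore = minCompetitiveScore;
        this.pagingInfoPresent = pagingInfoPresent;
        this.prevLowScore = prevLowScore;
        this.prevLowDoc = prevLowDoc;
    }

    void collect(int doc, float score) {
        // One extra boolean test per hit when no paging info was given.
        if (score < minCompetitiveScore
                || (pagingInfoPresent && !ranksAfterPreviousPage(doc, score))) {
            return;
        }
        accepted.add(doc);
    }

    /** Assumed ordering: higher score first, lower docID first on ties. */
    private boolean ranksAfterPreviousPage(int doc, float score) {
        return score < prevLowScore || (score == prevLowScore && doc > prevLowDoc);
    }

    List<Integer> accepted() {
        return accepted;
    }
}
```

Whether the extra branch is measurable in the common non-paging case is exactly what the benchmark discussion in this thread is meant to settle.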
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849965#action_12849965 ]

Jason Rutherglen commented on LUCENE-2324:

I'm a little confused by the flushedDocCount and remap-deletes conversion portions of DocWriter. flushedDocCount is used as a global counter; however, when we move to per thread doc writers, it won't be global anymore. Is there a different (easier) way to perform remap deletes?

> Per thread DocumentsWriters that write their own private segments
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Fix For: 3.1
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores and "normal" segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO & CPU.
[jira] Commented: (LUCENE-2215) paging collector
[ https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850002#action_12850002 ]

Shai Erera commented on LUCENE-2215:

bq. since I think it's safe to say most applications implement paging

Let's be careful about the semantics here, Grant. Most if not all applications implement paging indeed, but I believe only FEW actually store user contexts between searches. PagingCollector relies on the application to store the lowest ranking doc that was returned previously, which means storing context between a user's searches.

I agree w/ Mike's statement that 99.9% of the searches would never run that code, which is why I've proposed a delegation/wrapper approach from the beginning. I also think that we should make some allowances here and there, for the non-common case, and introduce better software design rather than specialized code. A Collector filter approach for some rare (or even less common) cases seems very reasonable to me.

Also, I think that if we add to TSDC a create method which takes into account the previously scored lowest doc, it will confuse people. Now they will need to think "where do I get this low score from?" Perhaps after I see the code it won't seem such a bad thing, but I have a feeling TSDC and TFC should be left on their own, and extreme paging stuff should either be its own specialized collector, or a wrapper.

> paging collector
>
> Key: LUCENE-2215
> URL: https://issues.apache.org/jira/browse/LUCENE-2215
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Affects Versions: 2.4, 3.0
> Reporter: Adam Heinz
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: IterablePaging.java, LUCENE-2215.patch, PagingCollector.java, TestingPagingCollector.java
>
> http://issues.apache.org/jira/browse/LUCENE-2127?focusedCommentId=12796898&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12796898
> Somebody assign this to Aaron McCurry and we'll see if we can get enough votes on this issue to convince him to upload his patch. :)
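A minimal sketch of the delegation/wrapper approach argued for in the comment above, and of the context an application would have to keep between searches. Nothing here is Lucene API: `HitCollector` is a hypothetical stand-in for a collector interface, and `PagingFilterCollector` and `PageContext` are illustrative names. The ordering assumption is descending score with ties broken by ascending docID.

```java
/** Hypothetical minimal collector interface standing in for a real collector API. */
interface HitCollector {
    void collect(int doc, float score);
}

/** Wrapper that forwards only hits ranking after the previous page's last hit. */
final class PagingFilterCollector implements HitCollector {
    private final HitCollector delegate;
    private final float prevLowScore;
    private final int prevLowDoc;

    PagingFilterCollector(HitCollector delegate, float prevLowScore, int prevLowDoc) {
        this.delegate = delegate;
        this.prevLowScore = prevLowScore;
        this.prevLowDoc = prevLowDoc;
    }

    @Override
    public void collect(int doc, float score) {
        // Assumed ordering: higher score first, lower docID first on ties.
        boolean afterPreviousPage =
                score < prevLowScore || (score == prevLowScore && doc > prevLowDoc);
        if (afterPreviousPage) {
            delegate.collect(doc, score);
        }
    }
}

/** Context the application must store between a user's searches. */
final class PageContext {
    final float lowScore; // score of the lowest-ranked hit already shown
    final int lowDoc;     // docID of that hit

    PageContext(float lowScore, int lowDoc) {
        this.lowScore = lowScore;
        this.lowDoc = lowDoc;
    }

    /** Wraps only when a previous page exists; otherwise the base collector is untouched. */
    static HitCollector wrap(HitCollector base, PageContext previous) {
        return previous == null
                ? base
                : new PagingFilterCollector(base, previous.lowScore, previous.lowDoc);
    }
}
```

With this shape, searches that carry no paging context use the base collector unchanged, so the common path pays nothing extra. Paged searches pay one more call per hit, which is the trade-off against specialized code that the thread proposes to benchmark.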
[jira] Created: (LUCENE-2348) DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers

Key: LUCENE-2348
URL: https://issues.apache.org/jira/browse/LUCENE-2348
Project: Lucene - Java
Issue Type: Bug
Components: Search
Affects Versions: 2.9.2
Reporter: Trejkaz

DuplicateFilter currently works by building a single doc ID set, without taking into account that getDocIdSet() will be called once per segment and only with each segment's local reader.
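To illustrate the mismatch described above, here is a toy sketch, not a proposed fix and not the DuplicateFilter code. It models each segment as an array of key values indexed by segment-local docID, and shows that a per-segment result must be expressed in local docIDs even though duplicate tracking has to span segments. `SegmentedDedup` and its method are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model: each segment is an array of key values indexed by local docID. */
final class SegmentedDedup {
    /**
     * Returns one bit set per segment, using segment-local docIDs, that keeps
     * only the first occurrence of each key across all segments (in order).
     */
    static List<BitSet> keepFirstOccurrence(List<String[]> segments) {
        Set<String> seen = new HashSet<>(); // shared across segments
        List<BitSet> perSegment = new ArrayList<>();
        for (String[] keys : segments) {
            BitSet keep = new BitSet(keys.length); // sized and indexed per segment
            for (int localDoc = 0; localDoc < keys.length; localDoc++) {
                if (seen.add(keys[localDoc])) {
                    keep.set(localDoc);
                }
            }
            perSegment.add(keep);
        }
        return perSegment;
    }
}
```

The sketch relies on segments being visited in a fixed order with state shared between calls, which a real filter may not be able to assume.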
[jira] Updated: (LUCENE-2348) DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
[ https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Trejkaz updated LUCENE-2348:

Component/s: contrib/* (was: Search)

Changing to contrib, only just realised it was in that location...

> DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
>
> Key: LUCENE-2348
> URL: https://issues.apache.org/jira/browse/LUCENE-2348
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/*
> Affects Versions: 2.9.2
> Reporter: Trejkaz
>
> DuplicateFilter currently works by building a single doc ID set, without taking into account that getDocIdSet() will be called once per segment and only with each segment's local reader.
[jira] Commented: (LUCENE-2323) reorganize contrib modules
[ https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850012#action_12850012 ]

Robert Muir commented on LUCENE-2323:

Committed 927696 (and 927697 for the solr piece).

Will keep the issue open and work on a patch for the next part.

> reorganize contrib modules
>
> Key: LUCENE-2323
> URL: https://issues.apache.org/jira/browse/LUCENE-2323
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/*
> Reporter: Robert Muir
> Assignee: Robert Muir
> Attachments: LUCENE-2323.patch
>
> it would be nice to reorganize contrib modules, so that they are bundled together by functionality.
> For example:
> * the wikipedia contrib is a tokenizer, i think really belongs in contrib/analyzers
> * there are two highlighters, i think could be one highlighters package.
> * there are many queryparsers and queries in different places in contrib