[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13568108#comment-13568108 ] Robert Muir commented on LUCENE-4682: - {quote} Not sure exactly what to conclude since a byte's frequency is FST dependent ... {quote} So we can just add it as another param to the massive FST.Builder ctor! Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: fstByteStats.txt, kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13567367#comment-13567367 ] Robert Muir commented on LUCENE-4682: - There's a pretty interesting approach here (http://www.aclweb.org/anthology/W/W09/W09-1505.pdf) that might be a good tradeoff, so we aren't stuck with a black and white situation (array arcs or not). instead we could (in theory) waste a few bits/mark/escape bytes in a similar way so we can do an approximate binary search... Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13567434#comment-13567434 ] Dawid Weiss commented on LUCENE-4682: - Yep, it's similar to what I was suggesting -- instead of expanding the subnodes into a full array encode a tree-like structure that would allow binary-searching right away. They still use a binary search over this packed representation; I'd just encode the binary tree directly somehow. I guess it's a matter of checking which will be faster/more efficient. (can't devote any time for it now, sorry). Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13557229#comment-13557229 ] Commit Tag Bot commented on LUCENE-4682: [branch_4x commit] Robert Muir http://svn.apache.org/viewvc?view=revisionrevision=1435141 LUCENE-4677, LUCENE-4682, LUCENE-4678, LUCENE-3298: Merged /lucene/dev/trunk:r1432459,1432466,1432472,1432474,1432522,1432646,1433026,1433109 Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552497#comment-13552497 ] Dawid Weiss commented on LUCENE-4682: - Yeah, there are many ideas layered on top of each other and it's gotten beyond the point of being easy to comprehend. As for the next bit -- in any implementation I've seen this leads to significant reduction in automaton size. But I'm not saying it's the optimal way to do it, perhaps there are other encoding options that would reach similar compression levels without the added complexity. Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552614#comment-13552614 ] Michael McCandless commented on LUCENE-4682: I tried removing NEXT opto in building the all-English-Wikipedia-terms FST and it was a big hit: * With NEXT: 59267794 bytes * Without NEXT: 82543993 bytes So FST would be ~39% larger if we remove NEXT ... however lookup sped up from 726 ns per lookup to 636 ns. But: we could get this speedup today, if we just fixed setting of a NEXT arc's target to be lazy instead. Today it's very costly for non-array arcs because we scan to the end of all nodes to set the target, even if the caller isn't going to use it, which is really ridiculous. I also tested delta-coding the arc target instead of the abs vInt we have today ... it wasn't a real test, instead I just measured how many bytes the vInt delta would be vs how many bytes the vInt abs it today, and the results were disappointing: * Abs vInt (what we do today): 26681349 bytes * Delta vInt: 25479316 bytes Which is surprising ... I guess we don't see much locality for the nodes ... or, eg the common suffixes freeze early on and then lots of future nodes refer to them. Maybe, we can find a way to do NEXT without the confusing per-node-reverse-bytes? Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552619#comment-13552619 ] Robert Muir commented on LUCENE-4682: - {quote} So FST would be ~39% larger if we remove NEXT {quote} But according to your notes above, we have 28% waste for this (with a long output). Are we making the right tradeoff? {quote} Maybe, we can find a way to do NEXT without the confusing per-node-reverse-bytes? {quote} Or, not do it at all if we cant figure it out? The reversing holds back other improvements so benchmarking it by itself could be misleading. Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552624#comment-13552624 ] Dawid Weiss commented on LUCENE-4682: - bq. I also tested delta-coding the arc target instead of the abs vInt we have today ... I did such experiments when I was working on that paper. Remember -- you don't publish negative results, unfortunately. Obviously I didn't try everything but: 1) NEXT was important, 2) the structure of the FST doesn't yield to easy local deltas; it's not easily separable and pointers typically jump all over. bq. Which is surprising ... I guess we don't see much locality for the nodes ... or, eg the common suffixes freeze early on and then lots of future nodes refer to them. Not really that surprising -- you encode common suffixes after all so most of them will appear in a properly sized sample. This is actually why the strategy of moving nodes around works too -- you move those that are super frequent but they'll most likely be reordered at the top suffix frequencies of the automaton anyway. Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552633#comment-13552633 ] Michael McCandless commented on LUCENE-4682: {quote} bq. So FST would be ~39% larger if we remove NEXT But according to your notes above, we have 28% waste for this (with a long output). Are we making the right tradeoff? {quote} Wait: the 28% waste comes from the array arcs (unrelated to NEXT?). To fix that I think we should use a skip list? Surely the bytes required to encode the skip list are less than our waste today. {quote} bq. Maybe, we can find a way to do NEXT without the confusing per-node-reverse-bytes? Or, not do it at all if we cant figure it out? The reversing holds back other improvements so benchmarking it by itself could be misleading. {quote} I don't think we should drop NEXT unless we have some alternative? 39% increase is size is non-trivial! I know reversing held back delta-code of the node target, but, that looks like it won't gain much. What else is it holding back? Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552641#comment-13552641 ] Robert Muir commented on LUCENE-4682: - {quote} Wait: the 28% waste comes from the array arcs (unrelated to NEXT?). To fix that I think we should use a skip list? Surely the bytes required to encode the skip list are less than our waste today. {quote} {quote} I know reversing held back delta-code of the node target, but, that looks like it won't gain much. What else is it holding back? {quote} I mean in general NEXT/reversing adds more complexity here which makes it harder to try different things? Like a big doberman and BEWARE sign scaring off developers :) Its a big part of what sent me over the edge trying to refactor FST to have a store abstraction (LUCENE-4593). But fortunately you did that anyway... It would be really really really good for FSTs long term to do things like remove reversing, remove packed (fold these optos or at least most of them in by default), etc. Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552653#comment-13552653 ] Michael McCandless commented on LUCENE-4682: bq. I mean in general NEXT/reversing adds more complexity here which makes it harder to try different things? Like a big doberman and BEWARE sign scaring off developers LOL :) But yeah I agree. bq. Its a big part of what sent me over the edge trying to refactor FST to have a store abstraction (LUCENE-4593). But fortunately you did that anyway... Right but it's not good if bus factor is 1 ... it's effectively dead code when that happens. bq. It would be really really really good for FSTs long term to do things like remove reversing, remove packed (fold these optos or at least most of them in by default), etc. +1, except that NEXT buys us a much smaller FST now. We can't just drop it ... we need some way to simplify it (eg somehow stop reversing). Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552184#comment-13552184 ] Michael McCandless commented on LUCENE-4682: OK I ran several performance tests, comparing trunk (lots of waste), the up to 25% waste, and 0 waste (no array arcs). First, I tested luceneutil on all of English Wikipedia, on an optimized index: Base = trunk (lots of waste), comp = up to 25% waste: {noformat} TaskQPS base StdDevQPS comp StdDev Pct diff Respell 167.37 (3.3%) 128.36 (3.2%) -23.3% ( -28% - -17%) Fuzzy2 83.54 (8.0%) 74.17 (7.2%) -11.2% ( -24% -4%) PKLookup 359.62 (2.9%) 346.88 (4.5%) -3.5% ( -10% -4%) Wildcard 28.48 (4.3%) 28.25 (3.9%) -0.8% ( -8% -7%) Fuzzy1 82.77 (7.1%) 82.36 (7.7%) -0.5% ( -14% - 15%) HighSloppyPhrase0.78 (4.6%)0.78 (4.5%) -0.4% ( -9% -9%) IntNRQ3.38 (8.1%)3.37 (8.1%) -0.2% ( -15% - 17%) MedSloppyPhrase 28.13 (2.4%) 28.13 (2.4%) -0.0% ( -4% -4%) LowSloppyPhrase 25.10 (1.8%) 25.11 (1.8%) 0.0% ( -3% -3%) Prefix3 18.40 (4.8%) 18.41 (5.3%) 0.1% ( -9% - 10%) MedTerm 65.42 (18.2%) 65.46 (17.7%) 0.1% ( -30% - 43%) LowTerm 316.34 (6.9%) 317.45 (7.1%) 0.4% ( -12% - 15%) LowPhrase 23.30 (2.1%) 23.39 (1.8%) 0.4% ( -3% -4%) HighTerm 52.11 (18.6%) 52.33 (18.5%) 0.4% ( -30% - 46%) OrHighMed 25.65 (6.7%) 25.76 (7.2%) 0.4% ( -12% - 15%) AndHighHigh 19.42 (1.8%) 19.52 (2.0%) 0.5% ( -3% -4%) OrHighLow 24.19 (7.1%) 24.36 (7.4%) 0.7% ( -12% - 16%) OrHighHigh 14.32 (6.6%) 14.45 (7.3%) 0.9% ( -12% - 15%) HighSpanNear3.30 (3.9%)3.33 (3.1%) 0.9% ( -5% -8%) LowSpanNear8.94 (4.3%)9.04 (3.4%) 1.1% ( -6% -9%) AndHighMed 78.66 (1.6%) 79.82 (1.4%) 1.5% ( -1% -4%) MedSpanNear 30.45 (3.1%) 30.91 (2.8%) 1.5% ( -4% -7%) MedPhrase 130.74 (5.8%) 133.03 (5.9%) 1.8% ( -9% - 14%) AndHighLow 571.07 (3.5%) 582.83 (2.8%) 2.1% ( -4% -8%) HighPhrase 11.62 (6.1%) 11.88 (6.4%) 2.2% ( -9% - 15%) {noformat} trunk .tip file was 23928963 and comp was 17125654 bytes (~28% smaller!). Then, base=trunk, comp=0 waste (no array arcs): {noformat} TaskQPS base StdDevQPS comp StdDev Pct diff PKLookup 359.26 (2.3%) 152.91 (5.3%) -57.4% ( -63% - -51%) Respell 168.82 (3.4%) 128.85 (2.2%) -23.7% ( -28% - -18%) Fuzzy2 85.59 (8.2%) 75.03 (6.6%) -12.3% ( -25% -2%) LowTerm 331.01 (7.9%) 324.30 (7.2%) -2.0% ( -15% - 14%) HighTerm 54.75 (19.2%) 54.00 (19.2%) -1.4% ( -33% - 45%) MedTerm 68.68 (18.5%) 68.04 (19.0%) -0.9% ( -32% - 44%) AndHighMed 80.50 (1.4%) 79.78 (1.4%) -0.9% ( -3% -1%) Wildcard 29.13 (4.5%) 28.89 (4.0%) -0.8% ( -8% -8%) LowPhrase 23.88 (2.3%) 23.69 (1.5%) -0.8% ( -4% -3%) AndHighLow 584.30 (2.9%) 582.17 (2.9%) -0.4% ( -6% -5%) Fuzzy1 83.84 (6.9%) 83.62 (6.7%) -0.3% ( -12% - 14%) AndHighHigh 19.88 (1.8%) 19.84 (1.5%) -0.2% ( -3% -3%) LowSloppyPhrase 25.61 (1.9%) 25.56 (2.0%) -0.2% ( -4% -3%) LowSpanNear9.20 (3.6%)9.20 (3.9%) 0.0% ( -7% -7%) MedSpanNear 31.32 (2.7%) 31.36 (2.9%) 0.1% ( -5% -5%) MedSloppyPhrase 28.67 (2.3%) 28.73 (2.2%) 0.2% ( -4% -4%) Prefix3 18.83 (4.9%) 18.88 (5.4%) 0.3% ( -9% - 11%)
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552189#comment-13552189 ] Robert Muir commented on LUCENE-4682: - Well, i think actually we shouldnt allow 25% waste. maybe only something like 10%. In these waste-cases, doing array-arcs is a really inefficient/overkill way to just gain binary search... I think if it would be too wasteful (e.g. exceed 10%), we should just insert some skip points or something. Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552218#comment-13552218 ] Dawid Weiss commented on LUCENE-4682: - bq. I think if it would be too wasteful (e.g. exceed 10%), we should just insert some skip points or something I had the same thought. We only go with the extremes -- either a linked list (essentially) or an array. A tree structure would be in between and would give logN searches and optimal space utilization. Obviously it'd also complicate the code. :) Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552374#comment-13552374 ] Robert Muir commented on LUCENE-4682: - I agree: about the code complication, its hard to try different things currently :) I think this is due to the reversing and BIT_TARGET_NEXT? does this really save us more than the bloat of not being able to delta-encode 'target' ? The packed case dereferences to nodeID and looks that up from packed ints right? But I wonder about that too: is the cost of this additional indirection in space worth it versus just delta-encoding with vint? Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551932#comment-13551932 ] Michael McCandless commented on LUCENE-4682: A couple more ideas: * Since the root arc is [usually?] cached ... we [usually] shouldn't make the root node into an array? * The building process sometimes has freedom in where the outputs are pushed ... so in theory we could push the outputs forwards if it would mean fewer wasted bytes on the prior node ... this would be a tricky optimization problem I think. Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551934#comment-13551934 ] Michael McCandless commented on LUCENE-4682: Maybe we should just tighten up the FST thresholds for when we make an array arc: {noformat} /** * @see #shouldExpand(UnCompiledNode) */ final static int FIXED_ARRAY_SHALLOW_DISTANCE = 3; // 0 = only root node. /** * @see #shouldExpand(UnCompiledNode) */ final static int FIXED_ARRAY_NUM_ARCS_SHALLOW = 5; /** * @see #shouldExpand(UnCompiledNode) */ final static int FIXED_ARRAY_NUM_ARCS_DEEP = 10; {noformat} When I print out the waste, it's generally the smaller nodes that have higher proportional waste: {noformat} [java] waste: 44 numArcs=16 perArc=2.75 [java] waste: 20 numArcs=11 perArc=1.8181819 [java] waste: 13 numArcs=5 perArc=2.6 [java] waste: 20 numArcs=12 perArc=1.666 [java] waste: 60 numArcs=20 perArc=3.0 [java] waste: 0 numArcs=5 perArc=0.0 [java] waste: 48 numArcs=15 perArc=3.2 [java] waste: 16 numArcs=5 perArc=3.2 [java] waste: 20 numArcs=6 perArc=3.333 [java] waste: 8 numArcs=6 perArc=1.334 [java] waste: 24 numArcs=8 perArc=3.0 [java] waste: 32 numArcs=9 perArc=3.556 [java] waste: 17 numArcs=7 perArc=2.4285715 [java] waste: 13 numArcs=5 perArc=2.6 [java] waste: 17 numArcs=6 perArc=2.833 [java] waste: 28 numArcs=8 perArc=3.5 [java] waste: 20 numArcs=16 perArc=1.25 [java] waste: 44 numArcs=15 perArc=2.934 [java] waste: 28 numArcs=13 perArc=2.1538463 [java] waste: 28 numArcs=15 perArc=1.867 {noformat} Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551938#comment-13551938 ] Robert Muir commented on LUCENE-4682: - As an experiment i turned off array arcs for kuromoji in my trunk checkout: FST before: [java] 53645 nodes, 253185 arcs, 1535612 bytes... done after: [java] 53645 nodes, 253185 arcs, 1228816 bytes... done JAR before: rw-rw-r- 1 rmuir rmuir 4581420 Jan 12 09:56 lucene-analyzers-kuromoji-4.1-SNAPSHOT.jar after: rw-rw-r- 1 rmuir rmuir 4306792 Jan 12 09:56 lucene-analyzers-kuromoji-5.0-SNAPSHOT.jar Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551939#comment-13551939 ] Michael McCandless commented on LUCENE-4682: Even more than the 271,187 I measured (20% smaller FST), I think because the FST is now smaller we use fewer bytes writing the delta-coded node addresses ... Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551940#comment-13551940 ] Robert Muir commented on LUCENE-4682: - in the fixedArray case: {code} // write a false first arc: writer.writeByte(ARCS_AS_FIXED_ARRAY); writer.writeVInt(nodeIn.numArcs); // placeholder -- we'll come back and write the number // of bytes per arc (int) here: // TODO: we could make this a vInt instead writer.writeInt(0); fixedArrayStart = writer.getPosition(); {code} I think we should actually make that TODO line a writeByte. If it turns out the max arcSize is 255 i think we should just not encode as array arcs (just save our position before we write ARCS_AS_FIXED_ARRAY, rewind to that, and encode normally) This would reduce the overhead of array-arcs, but also maybe prevent some worst cases causing waste as a side effect. Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551944#comment-13551944 ] Dawid Weiss commented on LUCENE-4682: - bq. Even more than the 271,187 I measured (20% smaller FST), I think because the FST is now smaller we use fewer bytes writing the delta-coded node addresses Yes, these things are all tightly coupled. Dawid Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551950#comment-13551950 ] Michael McCandless commented on LUCENE-4682: Another datapoint: the FreeDB suggester (tool in luceneutil to create/test it) is 1.05 GB FST, and has 87.5 MB wasted bytes (~8%). Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552057#comment-13552057 ] Michael McCandless commented on LUCENE-4682: +1 This is much cleaner (write header in the end). I built the AnalyzingSuggester for FreeDB: trunk is 1.046 GB and with patch it's 0.917 GB = ~9% smaller! Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552061#comment-13552061 ] Robert Muir commented on LUCENE-4682: - I can cleanup+commit the patch with the heuristic commented out (so we still get the cutover to vint, which i think is an obvious win?) This way we can benchmark and make sure the heuristic is set appropriately/doesnt hurt performance? Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552063#comment-13552063 ] Michael McCandless commented on LUCENE-4682: +1 Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552065#comment-13552065 ] Dawid Weiss commented on LUCENE-4682: - +1. Nice. Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552067#comment-13552067 ] Uwe Schindler commented on LUCENE-4682: --- +1 Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552071#comment-13552071 ] Robert Muir commented on LUCENE-4682: - ok i committed the vInt for maxBytesPerArc, but left out the heuristic (so we still have the waste!!!) Here's the comment i added: {code} // TODO: try to avoid wasteful cases: disable doFixedArray in that case /* * * LUCENE-4682: what is a fair heuristic here? * It could involve some of these: * 1. how busy the node is: nodeIn.inputCount relative to frontier[0].inputCount? * 2. how much binSearch saves over scan: nodeIn.numArcs * 3. waste: numBytes vs numBytesExpanded * * the one below just looks at #3 if (doFixedArray) { // rough heuristic: make this 1.25 waste factor a parameter to the phd ctor int numBytes = lastArcStart - startAddress; int numBytesExpanded = maxBytesPerArc * nodeIn.numArcs; if (numBytesExpanded numBytes*1.25) { doFixedArray = false; } } */ {code} I think it would just be best to do some performance benchmarks and figure this out. I know all the kuromoji waste is at node.depth=1 exactly. Also I indexed all of geonames with this heuristic and it barely changed the FST size: trunk FST: 45296685 packedFST: 39083451 vint maxBytesPerArc: FST: 45052386 packedFST: 39083451 vint maxBytesPerArc+heuristic: FST: 44988400 packedFST: 39029108 So the waste and heuristic doesn't affect all FSTs, only certain ones. Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552072#comment-13552072 ] Commit Tag Bot commented on LUCENE-4682: [trunk commit] Robert Muir http://svn.apache.org/viewvc?view=revisionrevision=1432522 LUCENE-4682: vInt-encode maxBytesPerArc Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs
[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552075#comment-13552075 ] Robert Muir commented on LUCENE-4682: - Another simple idea: instead of boolean allowArrayArcs we just make this a float: acceptableArrayArcOverhead (or maybe a better name). you would pass 0 to disable array arcs completely (and we'd internally still have our boolean allowArrayArcs and not waste time computing stuff if this is actually = 0) Reduce wasted bytes in FST due to array arcs Key: LUCENE-4682 URL: https://issues.apache.org/jira/browse/LUCENE-4682 Project: Lucene - Core Issue Type: Improvement Components: core/FSTs Reporter: Michael McCandless Priority: Minor Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch When a node is close to the root, or it has many outgoing arcs, the FST writes the arcs as an array (each arc gets N bytes), so we can e.g. bin search on lookup. The problem is N is set to the max(numBytesPerArc), so if you have an outlier arc e.g. with a big output, you can waste many bytes for all the other arcs that didn't need so many bytes. I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 1535612 = ~18% wasted. It would be nice to reduce this. One thing we could do without packing is: in addNode, if we detect that number of wasted bytes is above some threshold, then don't do the expansion. Another thing, if we are packing: we could record stats in the first pass about which nodes wasted the most, and then in the second pass (paack) we could set the threshold based on the top X% nodes that waste ... Another idea is maybe to deref large outputs, so that the numBytesPerArc is more uniform ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org