[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-31 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13568108#comment-13568108
 ] 

Robert Muir commented on LUCENE-4682:
-

{quote}
Not sure exactly what to conclude since a byte's frequency is FST dependent ...
{quote}

So we can just add it as another param to the massive FST.Builder ctor!

 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: fstByteStats.txt, kuromoji.wasted.bytes.txt, 
 LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13567367#comment-13567367
 ] 

Robert Muir commented on LUCENE-4682:
-

There's a pretty interesting approach here 
(http://www.aclweb.org/anthology/W/W09/W09-1505.pdf)
that might be a good tradeoff, so we aren't stuck with a black and white 
situation (array arcs or not).

instead we could (in theory) waste a few bits/mark/escape bytes in a similar 
way so we can do 
an approximate binary search...


 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-30 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13567434#comment-13567434
 ] 

Dawid Weiss commented on LUCENE-4682:
-

Yep, it's similar to what I was suggesting -- instead of expanding the subnodes 
into a full array encode a tree-like structure that would allow 
binary-searching right away. They still use a binary search over this packed 
representation; I'd just encode the binary tree directly somehow. I guess it's 
a matter of checking which will be faster/more efficient. (can't devote any 
time for it now, sorry).

 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-18 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13557229#comment-13557229
 ] 

Commit Tag Bot commented on LUCENE-4682:


[branch_4x commit] Robert Muir
http://svn.apache.org/viewvc?view=revisionrevision=1435141

LUCENE-4677, LUCENE-4682, LUCENE-4678, LUCENE-3298: Merged 
/lucene/dev/trunk:r1432459,1432466,1432472,1432474,1432522,1432646,1433026,1433109


 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-14 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552497#comment-13552497
 ] 

Dawid Weiss commented on LUCENE-4682:
-

Yeah, there are many ideas layered on top of each other and it's gotten beyond 
the point of being easy to comprehend.

As for the next bit -- in any implementation I've seen this leads to 
significant reduction in automaton size. But I'm not saying it's the optimal 
way to do it, perhaps there are other encoding options that would reach similar 
compression levels without the added complexity.


 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552614#comment-13552614
 ] 

Michael McCandless commented on LUCENE-4682:


I tried removing NEXT opto in building the all-English-Wikipedia-terms FST and 
it was a big hit:

  * With NEXT: 59267794 bytes

  * Without NEXT: 82543993 bytes

So FST would be ~39% larger if we remove NEXT ... however lookup sped up from 
726 ns per lookup to 636 ns.  But: we could get this speedup today, if we just 
fixed setting of a NEXT arc's target to be lazy instead.  Today it's very 
costly for non-array arcs because we scan to the end of all nodes to set the 
target, even if the caller isn't going to use it, which is really ridiculous.

I also tested delta-coding the arc target instead of the abs vInt we have 
today ... it wasn't a real test, instead I just measured how many bytes the 
vInt delta would be vs how many bytes the vInt abs it today, and the results 
were disappointing:

  * Abs vInt (what we do today): 26681349 bytes

  * Delta vInt: 25479316 bytes

Which is surprising ... I guess we don't see much locality for the nodes ... 
or, eg the common suffixes freeze early on and then lots of future nodes refer 
to them.

Maybe, we can find a way to do NEXT without the confusing 
per-node-reverse-bytes?

 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-14 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552619#comment-13552619
 ] 

Robert Muir commented on LUCENE-4682:
-

{quote}
So FST would be ~39% larger if we remove NEXT
{quote}

But according to your notes above, we have 28% waste for this (with a long 
output).
Are we making the right tradeoff?

{quote}
Maybe, we can find a way to do NEXT without the confusing 
per-node-reverse-bytes?
{quote}

Or, not do it at all if we cant figure it out? The reversing holds back other 
improvements so
benchmarking it by itself could be misleading.


 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-14 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552624#comment-13552624
 ] 

Dawid Weiss commented on LUCENE-4682:
-

bq. I also tested delta-coding the arc target instead of the abs vInt we have 
today ...

I did such experiments when I was working on that paper. Remember -- you don't 
publish negative results, unfortunately. Obviously I didn't try everything but: 
1) NEXT was important, 2) the structure of the FST doesn't yield to easy local 
deltas; it's not easily separable and pointers typically jump all over.

bq. Which is surprising ... I guess we don't see much locality for the nodes 
... or, eg the common suffixes freeze early on and then lots of future nodes 
refer to them.

Not really that surprising -- you encode common suffixes after all so most of 
them will appear in a properly sized sample. This is actually why the strategy 
of moving nodes around works too -- you move those that are super frequent but 
they'll most likely be reordered at the top suffix frequencies of the 
automaton anyway.


 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552633#comment-13552633
 ] 

Michael McCandless commented on LUCENE-4682:


{quote}
bq. So FST would be ~39% larger if we remove NEXT

But according to your notes above, we have 28% waste for this (with a long 
output).
Are we making the right tradeoff?
{quote}

Wait: the 28% waste comes from the array arcs (unrelated to NEXT?).  To fix 
that I think we should use a skip list?  Surely the bytes required to encode 
the skip list are less than our waste today.

{quote}
bq. Maybe, we can find a way to do NEXT without the confusing 
per-node-reverse-bytes?

Or, not do it at all if we cant figure it out? The reversing holds back other 
improvements so
benchmarking it by itself could be misleading.
{quote}

I don't think we should drop NEXT unless we have some alternative?  39% 
increase is size is non-trivial!

I know reversing held back delta-code of the node target, but, that looks like 
it won't gain much.  What else is it holding back?

 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-14 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552641#comment-13552641
 ] 

Robert Muir commented on LUCENE-4682:
-

{quote}
Wait: the 28% waste comes from the array arcs (unrelated to NEXT?). To fix that 
I think we should use a skip list? Surely the bytes required to encode the skip 
list are less than our waste today.
{quote}

{quote}
I know reversing held back delta-code of the node target, but, that looks like 
it won't gain much. What else is it holding back?
{quote}

I mean in general NEXT/reversing adds more complexity here which makes it 
harder to try different things? Like a big doberman and BEWARE sign scaring off 
developers :)

Its a big part of what sent me over the edge trying to refactor FST to have a 
store abstraction (LUCENE-4593). But fortunately you did that anyway...

It would be really really really good for FSTs long term to do things like 
remove reversing, remove packed (fold these optos or at least most of them in 
by default), etc.


 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552653#comment-13552653
 ] 

Michael McCandless commented on LUCENE-4682:


bq. I mean in general NEXT/reversing adds more complexity here which makes it 
harder to try different things? Like a big doberman and BEWARE sign scaring off 
developers 

LOL :)

But yeah I agree.

bq. Its a big part of what sent me over the edge trying to refactor FST to have 
a store abstraction (LUCENE-4593). But fortunately you did that anyway...

Right but it's not good if bus factor is 1 ... it's effectively dead code when 
that happens.

bq. It would be really really really good for FSTs long term to do things like 
remove reversing, remove packed (fold these optos or at least most of them in 
by default), etc.

+1, except that NEXT buys us a much smaller FST now.  We can't just drop it ... 
we need some way to simplify it (eg somehow stop reversing).

 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552184#comment-13552184
 ] 

Michael McCandless commented on LUCENE-4682:


OK I ran several performance tests, comparing trunk (lots of waste),
the up to 25% waste, and 0 waste (no array arcs).

First, I tested luceneutil on all of English Wikipedia, on an optimized index:

Base = trunk (lots of waste), comp = up to 25% waste:

{noformat}
TaskQPS base  StdDevQPS comp  StdDev
Pct diff
 Respell  167.37  (3.3%)  128.36  (3.2%)  
-23.3% ( -28% -  -17%)
  Fuzzy2   83.54  (8.0%)   74.17  (7.2%)  
-11.2% ( -24% -4%)
PKLookup  359.62  (2.9%)  346.88  (4.5%)   
-3.5% ( -10% -4%)
Wildcard   28.48  (4.3%)   28.25  (3.9%)   
-0.8% (  -8% -7%)
  Fuzzy1   82.77  (7.1%)   82.36  (7.7%)   
-0.5% ( -14% -   15%)
HighSloppyPhrase0.78  (4.6%)0.78  (4.5%)   
-0.4% (  -9% -9%)
  IntNRQ3.38  (8.1%)3.37  (8.1%)   
-0.2% ( -15% -   17%)
 MedSloppyPhrase   28.13  (2.4%)   28.13  (2.4%)   
-0.0% (  -4% -4%)
 LowSloppyPhrase   25.10  (1.8%)   25.11  (1.8%)
0.0% (  -3% -3%)
 Prefix3   18.40  (4.8%)   18.41  (5.3%)
0.1% (  -9% -   10%)
 MedTerm   65.42 (18.2%)   65.46 (17.7%)
0.1% ( -30% -   43%)
 LowTerm  316.34  (6.9%)  317.45  (7.1%)
0.4% ( -12% -   15%)
   LowPhrase   23.30  (2.1%)   23.39  (1.8%)
0.4% (  -3% -4%)
HighTerm   52.11 (18.6%)   52.33 (18.5%)
0.4% ( -30% -   46%)
   OrHighMed   25.65  (6.7%)   25.76  (7.2%)
0.4% ( -12% -   15%)
 AndHighHigh   19.42  (1.8%)   19.52  (2.0%)
0.5% (  -3% -4%)
   OrHighLow   24.19  (7.1%)   24.36  (7.4%)
0.7% ( -12% -   16%)
  OrHighHigh   14.32  (6.6%)   14.45  (7.3%)
0.9% ( -12% -   15%)
HighSpanNear3.30  (3.9%)3.33  (3.1%)
0.9% (  -5% -8%)
 LowSpanNear8.94  (4.3%)9.04  (3.4%)
1.1% (  -6% -9%)
  AndHighMed   78.66  (1.6%)   79.82  (1.4%)
1.5% (  -1% -4%)
 MedSpanNear   30.45  (3.1%)   30.91  (2.8%)
1.5% (  -4% -7%)
   MedPhrase  130.74  (5.8%)  133.03  (5.9%)
1.8% (  -9% -   14%)
  AndHighLow  571.07  (3.5%)  582.83  (2.8%)
2.1% (  -4% -8%)
  HighPhrase   11.62  (6.1%)   11.88  (6.4%)
2.2% (  -9% -   15%)
{noformat}

trunk .tip file was 23928963 and comp was 17125654 bytes (~28% smaller!).

Then, base=trunk, comp=0 waste (no array arcs):

{noformat}
TaskQPS base  StdDevQPS comp  StdDev
Pct diff
PKLookup  359.26  (2.3%)  152.91  (5.3%)  
-57.4% ( -63% -  -51%)
 Respell  168.82  (3.4%)  128.85  (2.2%)  
-23.7% ( -28% -  -18%)
  Fuzzy2   85.59  (8.2%)   75.03  (6.6%)  
-12.3% ( -25% -2%)
 LowTerm  331.01  (7.9%)  324.30  (7.2%)   
-2.0% ( -15% -   14%)
HighTerm   54.75 (19.2%)   54.00 (19.2%)   
-1.4% ( -33% -   45%)
 MedTerm   68.68 (18.5%)   68.04 (19.0%)   
-0.9% ( -32% -   44%)
  AndHighMed   80.50  (1.4%)   79.78  (1.4%)   
-0.9% (  -3% -1%)
Wildcard   29.13  (4.5%)   28.89  (4.0%)   
-0.8% (  -8% -8%)
   LowPhrase   23.88  (2.3%)   23.69  (1.5%)   
-0.8% (  -4% -3%)
  AndHighLow  584.30  (2.9%)  582.17  (2.9%)   
-0.4% (  -6% -5%)
  Fuzzy1   83.84  (6.9%)   83.62  (6.7%)   
-0.3% ( -12% -   14%)
 AndHighHigh   19.88  (1.8%)   19.84  (1.5%)   
-0.2% (  -3% -3%)
 LowSloppyPhrase   25.61  (1.9%)   25.56  (2.0%)   
-0.2% (  -4% -3%)
 LowSpanNear9.20  (3.6%)9.20  (3.9%)
0.0% (  -7% -7%)
 MedSpanNear   31.32  (2.7%)   31.36  (2.9%)
0.1% (  -5% -5%)
 MedSloppyPhrase   28.67  (2.3%)   28.73  (2.2%)
0.2% (  -4% -4%)
 Prefix3   18.83  (4.9%)   18.88  (5.4%)
0.3% (  -9% -   11%)

[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552189#comment-13552189
 ] 

Robert Muir commented on LUCENE-4682:
-

Well, i think actually we shouldnt allow 25% waste. maybe only something like 
10%.

In these waste-cases, doing array-arcs is a really inefficient/overkill way to 
just 
gain binary search... I think if it would be too wasteful (e.g. exceed 10%), we 
should 
just insert some skip points or something.



 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-13 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552218#comment-13552218
 ] 

Dawid Weiss commented on LUCENE-4682:
-

bq. I think if it would be too wasteful (e.g. exceed 10%), we should 
just insert some skip points or something

I had the same thought. We only go with the extremes -- either a linked list 
(essentially) or an array. A tree structure would be in between and would give 
logN searches and optimal space utilization. Obviously it'd also complicate the 
code. :)

 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552374#comment-13552374
 ] 

Robert Muir commented on LUCENE-4682:
-

I agree: about the code complication, its hard to try different things 
currently :)

I think this is due to the reversing and BIT_TARGET_NEXT? does this really save 
us more
than the bloat of not being able to delta-encode 'target' ?

The packed case dereferences to nodeID and looks that up from packed ints 
right? But 
I wonder about that too: is the cost of this additional indirection in space 
worth it
versus just delta-encoding with vint?


 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551932#comment-13551932
 ] 

Michael McCandless commented on LUCENE-4682:


A couple more ideas:

  * Since the root arc is [usually?] cached ... we [usually] shouldn't make the 
root node into an array?

  * The building process sometimes has freedom in where the outputs are pushed 
... so in theory we could push the outputs forwards if it would mean fewer 
wasted bytes on the prior node ... this would be a tricky optimization problem 
I think.

 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor

 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551934#comment-13551934
 ] 

Michael McCandless commented on LUCENE-4682:


Maybe we should just tighten up the FST thresholds for when we make an array 
arc:
{noformat}
  /**
   * @see #shouldExpand(UnCompiledNode)
   */
  final static int FIXED_ARRAY_SHALLOW_DISTANCE = 3; // 0 = only root node.

  /**
   * @see #shouldExpand(UnCompiledNode)
   */
  final static int FIXED_ARRAY_NUM_ARCS_SHALLOW = 5;

  /**
   * @see #shouldExpand(UnCompiledNode)
   */
  final static int FIXED_ARRAY_NUM_ARCS_DEEP = 10;
{noformat}

When I print out the waste, it's generally the smaller nodes that have higher 
proportional waste:
{noformat}
 [java] waste: 44 numArcs=16 perArc=2.75
 [java] waste: 20 numArcs=11 perArc=1.8181819
 [java] waste: 13 numArcs=5 perArc=2.6
 [java] waste: 20 numArcs=12 perArc=1.666
 [java] waste: 60 numArcs=20 perArc=3.0
 [java] waste: 0 numArcs=5 perArc=0.0
 [java] waste: 48 numArcs=15 perArc=3.2
 [java] waste: 16 numArcs=5 perArc=3.2
 [java] waste: 20 numArcs=6 perArc=3.333
 [java] waste: 8 numArcs=6 perArc=1.334
 [java] waste: 24 numArcs=8 perArc=3.0
 [java] waste: 32 numArcs=9 perArc=3.556
 [java] waste: 17 numArcs=7 perArc=2.4285715
 [java] waste: 13 numArcs=5 perArc=2.6
 [java] waste: 17 numArcs=6 perArc=2.833
 [java] waste: 28 numArcs=8 perArc=3.5
 [java] waste: 20 numArcs=16 perArc=1.25
 [java] waste: 44 numArcs=15 perArc=2.934
 [java] waste: 28 numArcs=13 perArc=2.1538463
 [java] waste: 28 numArcs=15 perArc=1.867
{noformat}

 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor

 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551938#comment-13551938
 ] 

Robert Muir commented on LUCENE-4682:
-

As an experiment i turned off array arcs for kuromoji in my trunk checkout:

FST
before: [java]   53645 nodes, 253185 arcs, 1535612 bytes...   done
after:  [java]   53645 nodes, 253185 arcs, 1228816 bytes...   done

JAR
before: rw-rw-r-  1 rmuir rmuir 4581420 Jan 12 09:56 
lucene-analyzers-kuromoji-4.1-SNAPSHOT.jar
after:  rw-rw-r- 1 rmuir rmuir  4306792 Jan 12 09:56 
lucene-analyzers-kuromoji-5.0-SNAPSHOT.jar

 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551939#comment-13551939
 ] 

Michael McCandless commented on LUCENE-4682:


Even more than the 271,187 I measured (20% smaller FST), I think because the 
FST is now smaller we use fewer bytes writing the delta-coded node addresses ...

 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551940#comment-13551940
 ] 

Robert Muir commented on LUCENE-4682:
-

in the fixedArray case:
{code}
// write a false first arc:
writer.writeByte(ARCS_AS_FIXED_ARRAY);
writer.writeVInt(nodeIn.numArcs);
// placeholder -- we'll come back and write the number
// of bytes per arc (int) here:
// TODO: we could make this a vInt instead
writer.writeInt(0);
fixedArrayStart = writer.getPosition();
{code}

I think we should actually make that TODO line a writeByte.

If it turns out the max arcSize is  255 i think we should just not encode as 
array arcs (just save our position before we write ARCS_AS_FIXED_ARRAY, rewind 
to that, and encode normally)

This would reduce the overhead of array-arcs, but also maybe prevent some worst 
cases causing waste as a side effect.


 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-12 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551944#comment-13551944
 ] 

Dawid Weiss commented on LUCENE-4682:
-

bq. Even more than the 271,187 I measured (20% smaller FST), I think because 
the FST is now smaller we use fewer bytes writing the delta-coded node 
addresses 

Yes, these things are all tightly coupled.

Dawid

 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13551950#comment-13551950
 ] 

Michael McCandless commented on LUCENE-4682:


Another datapoint: the FreeDB suggester (tool in luceneutil to create/test it) 
is 1.05 GB FST, and has 87.5 MB wasted bytes (~8%).

 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552057#comment-13552057
 ] 

Michael McCandless commented on LUCENE-4682:


+1

This is much cleaner (write header in the end).

I built the AnalyzingSuggester for FreeDB: trunk is 1.046 GB and with patch 
it's 0.917 GB = ~9% smaller!

 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552061#comment-13552061
 ] 

Robert Muir commented on LUCENE-4682:
-

I can cleanup+commit the patch with the heuristic commented out (so we still 
get the cutover to vint, which i think is an obvious win?)

This way we can benchmark and make sure the heuristic is set 
appropriately/doesnt hurt performance?

 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552063#comment-13552063
 ] 

Michael McCandless commented on LUCENE-4682:


+1

 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-12 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552065#comment-13552065
 ] 

Dawid Weiss commented on LUCENE-4682:
-

+1. Nice.


 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552067#comment-13552067
 ] 

Uwe Schindler commented on LUCENE-4682:
---

+1

 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552071#comment-13552071
 ] 

Robert Muir commented on LUCENE-4682:
-

ok i committed the vInt for maxBytesPerArc, but left out the heuristic (so we 
still have the waste!!!)

Here's the comment i added:
{code}
// TODO: try to avoid wasteful cases: disable doFixedArray in that case
/* 
 * 
 * LUCENE-4682: what is a fair heuristic here?
 * It could involve some of these:
 * 1. how busy the node is: nodeIn.inputCount relative to 
frontier[0].inputCount?
 * 2. how much binSearch saves over scan: nodeIn.numArcs
 * 3. waste: numBytes vs numBytesExpanded
 * 
 * the one below just looks at #3
if (doFixedArray) {
  // rough heuristic: make this 1.25 waste factor a parameter to the phd 
ctor
  int numBytes = lastArcStart - startAddress;
  int numBytesExpanded = maxBytesPerArc * nodeIn.numArcs;
  if (numBytesExpanded  numBytes*1.25) {
doFixedArray = false;
  }
}
*/
{code}

I think it would just be best to do some performance benchmarks and figure this 
out.
I know all the kuromoji waste is at node.depth=1 exactly.

Also I indexed all of geonames with this heuristic and it barely changed the 
FST size:

trunk
FST: 45296685
packedFST: 39083451
vint maxBytesPerArc:
FST: 45052386
packedFST: 39083451
vint maxBytesPerArc+heuristic:
FST: 44988400
packedFST: 39029108

So the waste and heuristic doesn't affect all FSTs, only certain ones.


 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-12 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552072#comment-13552072
 ] 

Commit Tag Bot commented on LUCENE-4682:


[trunk commit] Robert Muir
http://svn.apache.org/viewvc?view=revisionrevision=1432522

LUCENE-4682: vInt-encode maxBytesPerArc


 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

2013-01-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552075#comment-13552075
 ] 

Robert Muir commented on LUCENE-4682:
-

Another simple idea: instead of boolean allowArrayArcs we just make this a 
float: acceptableArrayArcOverhead (or maybe a better name).

you would pass 0 to disable array arcs completely (and we'd internally still 
have our boolean allowArrayArcs and not waste 
time computing stuff if this is actually = 0)

 Reduce wasted bytes in FST due to array arcs
 

 Key: LUCENE-4682
 URL: https://issues.apache.org/jira/browse/LUCENE-4682
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/FSTs
Reporter: Michael McCandless
Priority: Minor
 Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch


 When a node is close to the root, or it has many outgoing arcs, the FST 
 writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
 search on lookup.
 The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
 arc e.g. with a big output, you can waste many bytes for all the other arcs 
 that didn't need so many bytes.
 I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
 1535612 = ~18% wasted.
 It would be nice to reduce this.
 One thing we could do without packing is: in addNode, if we detect that 
 number of wasted bytes is above some threshold, then don't do the expansion.
 Another thing, if we are packing: we could record stats in the first pass 
 about which nodes wasted the most, and then in the second pass (paack) we 
 could set the threshold based on the top X% nodes that waste ...
 Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
 more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org