[GitHub] incubator-joshua issue #48: Fixed crashing when using Trie based KenLM model...

2016-08-30 Thread KellenSunderland
Github user KellenSunderland commented on the issue:

https://github.com/apache/incubator-joshua/pull/48
  
Hey Matt, exactly right for the summary.

I'm actually not sure how often collisions happen in general; it may be worth 
measuring.  Collisions that cause a crash in this scenario happen about once 
in 100k-250k translation requests.  That is a less likely scenario than a 
plain state collision, though, because you need a collision that gives you an 
out-of-range word id as a unigram.  

We've tested turning off state sharing between KenLMs, and that does also 
solve the crashing issue.  I don't know the details of KenLM and Joshua well 
enough to judge whether it would be a better solution than the one I 
provided.  If someone with more knowledge than I have provides a new PR, 
I'll happily +1 it.  

One downside to just turning off state sharing is that we will still get 
collisions; we just won't get crashes.  I think that when we have collisions 
(even with a single model) we usually get an incorrect result, not a crash. 

There are some other implementation approaches that could also fix the 
issue (I'd like to hear which one Kenneth thinks would be best).  One idea 
would be to define a standard hash function for the State struct, so that we 
could use the State itself as the key of a normal unordered_map.  Then we 
wouldn't need this multimap.
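
A rough sketch of that idea, to make it concrete.  Everything here is an 
assumption for illustration: the `State` below is a simplified stand-in (not 
KenLM's actual struct), and `kMaxContext`, `StateHash`, `StatePool` and 
`Intern` are hypothetical names.  The point is only that when the full state 
is the key, a hash collision lands two states in the same bucket and the 
equality check still tells them apart.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Hypothetical, simplified stand-in for an LM state: the context word ids
// plus how many of them are in use. Not KenLM's real definition.
constexpr std::size_t kMaxContext = 4;  // assumed (order - 1), illustration only

struct State {
  std::uint32_t words[kMaxContext];
  unsigned char length;
};

// Two states are equal when their valid context words match; unused slots
// beyond `length` are ignored.
inline bool operator==(const State& a, const State& b) {
  if (a.length != b.length) return false;
  for (unsigned char i = 0; i < a.length; ++i) {
    if (a.words[i] != b.words[i]) return false;
  }
  return true;
}

// Hash only the valid prefix of `words`, consistent with operator== above.
struct StateHash {
  std::size_t operator()(const State& s) const {
    assert(s.length <= kMaxContext);
    std::size_t h = std::hash<unsigned>()(s.length);
    for (unsigned char i = 0; i < s.length; ++i) {
      // Boost-style hash_combine step.
      h ^= std::hash<std::uint32_t>()(s.words[i]) + 0x9e3779b9 + (h << 6) + (h >> 2);
    }
    return h;
  }
};

// The State itself is the key, so the map resolves hash collisions by
// comparing full states instead of trusting the hash value alone.
using StatePool = std::unordered_map<State, int, StateHash>;

// Returns the id stored for `s`, assigning the next free id on first sight.
inline int Intern(StatePool& pool, const State& s) {
  auto it = pool.find(s);
  if (it != pool.end()) return it->second;
  int id = static_cast<int>(pool.size());
  pool.emplace(s, id);
  return id;
}
```

Interning the same context twice returns the same id, while a different 
context gets a new one even if its hash happened to match.  Keeping one such 
pool per model would also avoid handing a state from one vocabulary to 
another model, though I haven't measured what that costs.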




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-joshua issue #48: Fixed crashing when using Trie based KenLM model...

2016-08-30 Thread mjpost
Github user mjpost commented on the issue:

https://github.com/apache/incubator-joshua/pull/48
  
Holy smokes, thanks for tracking this down. So if I understand correctly, 
this only occurs under the following circumstances:

- decoding with multiple KenLM language models
- built with different vocabularies (the usual case)
- a hash collision occurs and returns a state containing an ID that is 
invalid in the calling KenLM

Do you have any idea how often hash collisions actually occur?

I wonder if turning off sharing of KenLM states across LMs would also have 
worked, with little to no effect on performance.

