Github user KellenSunderland commented on the issue:
https://github.com/apache/incubator-joshua/pull/48
Hey Matt, your summary is exactly right.
I'm actually not sure how often collisions happen in general; it may be worth
measuring. Collisions that cause a crash in this scenario happen about once in
every 100k-250k translation requests. That is less likely than a plain state
collision, though, because the collision has to give you an out-of-range word
id as a unigram.
We've tested turning off state sharing between KenLMs, and indeed that also
solves the crashing issue. I don't think I know the details of KenLM and
Joshua well enough to properly judge whether this would be a better solution
than the one I provided. If someone with more knowledge than I have provides a
new PR, I'll happily +1 it.
One downside to just turning off state sharing is that we will still get
collisions; we just won't get crashes. I think that if we have collisions
(even with a single model) we usually get an incorrect result rather than a
crash.
There are some other implementation approaches that could also be considered
to fix the issue (I'd like to hear what Kenneth thinks would be the best
approach). One idea would be to define a standard hash function for the State
struct, so that we could use the State itself as the key of a normal
unordered_map. Then we wouldn't need this multimap.
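To make that last idea concrete, here is a rough sketch of what I have in
mind. The State below is a simplified, hypothetical stand-in and not the real
KenLM struct, and the names kMaxContext, StateHash, StateIdMap and
GetOrAssignId are made up for illustration. The hash is an FNV-1a-style mix,
chosen only as an example.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical, simplified stand-in for a KenLM-style state.
// This is NOT the real KenLM State definition.
constexpr std::size_t kMaxContext = 4;

struct State {
  std::uint32_t words[kMaxContext];  // only the first `length` ids are meaningful
  unsigned char length;
};

// Equality over the meaningful part of the state only.
inline bool operator==(const State &a, const State &b) {
  if (a.length != b.length) return false;
  for (std::size_t i = 0; i < a.length; ++i) {
    if (a.words[i] != b.words[i]) return false;
  }
  return true;
}

// FNV-1a-style mixing over the used word ids (illustrative choice of hash).
struct StateHash {
  std::size_t operator()(const State &s) const {
    std::uint64_t h = 0xcbf29ce484222325ULL;
    for (std::size_t i = 0; i < s.length; ++i) {
      h ^= s.words[i];
      h *= 0x100000001b3ULL;
    }
    h ^= s.length;
    h *= 0x100000001b3ULL;
    return static_cast<std::size_t>(h);
  }
};

// The State itself is the key, so two different states that happen to share
// a hash value are still kept apart by operator==.
using StateIdMap = std::unordered_map<State, long, StateHash>;

// Returns the id already stored for `s`, or assigns the next free id.
inline long GetOrAssignId(StateIdMap &ids, const State &s) {
  auto result = ids.emplace(s, static_cast<long>(ids.size()));
  return result.first->second;
}
```

The point of keying on the State rather than on its hash is that a hash
collision then only costs an extra comparison inside the map; it can no
longer hand back the wrong state.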