I'm seeing errors like this one (using backwards codecs):
java.lang.ArrayIndexOutOfBoundsException: Index 69 out of bounds for length 33
    at org.apache.lucene.codecs.lucene50.ForUtil.readBlock(ForUtil.java:196)
    at org.apache.lucene.codecs.lucene50.Lucene50PostingsReader$EverythingEnum.r
You might be able to get something “good enough” with one of the pattern
tokenizers, see: https://lucene.apache.org/solr/guide/8_6/tokenizers.html.
Won’t be 100% of course.
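To make the pattern idea concrete, here is a rough sketch in plain java.util.regex. It is not the Lucene or Solr API; it only mimics the mode where each regex match becomes a token. The class name and the regex itself are made up for illustration, and a real pattern would need tuning against your actual input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternTokenSketch {
    // Hypothetical pattern: an identifier, optionally dot-qualified
    // (e.g. foo.bar), or a run of digits. Everything else is a separator.
    private static final Pattern TOKEN = Pattern.compile(
        "[A-Za-z_][A-Za-z0-9_]*(?:\\.[A-Za-z_][A-Za-z0-9_]*)*|\\d+");

    // Emit every regex match as one token, in order of appearance.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Call, foo.bar, x1, 42, when, it, fails]
        System.out.println(tokenize("Call foo.bar(x1, 42) when it fails."));
    }
}
```

Note that the qualified name survives as one token while the trailing period of the sentence is dropped, which is the kind of "good enough" behavior a pattern can give you. Punctuation-heavy code (operators, brackets) is lost entirely with this particular pattern.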
And Paul’s comments are well taken, especially since your input will be
inconsistent, I’d guess. How much you want to bet t
Hello Trevor,
I don’t know of an analyzer for mixes of code and text, but I know of
an analyzer for mixes of code and formulæ.
Clearly, you could build a custom analyzer that would tokenize
differently depending on whether you’re in code or in text. That’s
not super hard.
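As a rough illustration of that mode-switching idea, here is a sketch in plain Java rather than an actual Lucene Analyzer. It assumes code spans are marked with backticks, which is purely an assumption for the example; real input would likely need a different heuristic. The class name and both patterns are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ModeSwitchingTokenizerSketch {
    // Assumption: code is delimited by backticks. Not a general rule.
    private static final Pattern CODE_SPAN = Pattern.compile("`([^`]*)`");
    // Text mode: runs of letters or digits, lowercased before matching.
    private static final Pattern TEXT_TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");
    // Code mode: identifiers, numbers, or single punctuation characters.
    private static final Pattern CODE_TOKEN =
        Pattern.compile("[A-Za-z_][A-Za-z0-9_]*|\\d+|[^\\sA-Za-z0-9_]");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher span = CODE_SPAN.matcher(input);
        int pos = 0;
        while (span.find()) {
            // Text before the code span is tokenized in text mode.
            String text = input.substring(pos, span.start());
            collect(TEXT_TOKEN, text.toLowerCase(Locale.ROOT), tokens);
            // The span contents keep their case and punctuation.
            collect(CODE_TOKEN, span.group(1), tokens);
            pos = span.end();
        }
        // Whatever follows the last code span is text again.
        collect(TEXT_TOKEN, input.substring(pos).toLowerCase(Locale.ROOT), tokens);
        return tokens;
    }

    private static void collect(Pattern p, String s, List<String> out) {
        Matcher m = p.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
    }
}
```

For example, `tokenize("Use `i++` Here")` yields `[use, i, +, +, here]`: the prose is lowercased and stripped of punctuation, while the code keeps its operators as separate tokens.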
However, where thin