Commit e80ee7fff85918e68c212757c0e6c4bddbdb5ab6 in lucene-solr's branch
refs/heads/branch_7x from [~rcmuir]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e80ee7f ]
LUCENE-8192: always enforce index-time offsets are correct with
BaseTokenStreamTestCase
> Remove offsetsAreCorrect from BaseTokenStreamTestCase
> -
>
> Key: LUCENE-8192
> URL: https://issues.apache.org/jira/browse/LUCENE-8192
> Project: Lucene - Core
> Issue Type: Bug
>
Commit e595541ef3f9642632ac85d03c62616b5f70f1e4 in lucene-solr's branch
refs/heads/master from [~rcmuir]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e595541 ]
LUCENE-8192: always enforce index-time offsets are correct with
BaseTokenStreamTestCase
> Remove offsetsAreCorrect from BaseTokenStreamTestCase
yay!
These checks weren't really "under" the
boolean, but it was difficult to see that.
I moved them in the latest patch to make this more obvious, but it doesn't
change the logic.
[
https://issues.apache.org/jira/browse/LUCENE-8192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Muir updated LUCENE-8192:
Attachment: LUCENE-8192.patch
posInc checks that IndexWriter will do too.
I'll update the patch.
step? It removes some useless leniency.
Robert Muir updated LUCENE-8192:
Attachment: LUCENE-8192_take_two.patch
seems to be a higher bar, and even tests for
filters that claim to support graphs (SynonymGraphFilter) screw this up?
Just at a glance, it seems like we want to separate these concerns. The first
one should not be optional.
Robert Muir updated LUCENE-8192:
Attachment: LUCENE-8192_prototype.patch
Robert Muir created LUCENE-8192:
---
Summary: Remove offsetsAreCorrect from BaseTokenStreamTestCase
Key: LUCENE-8192
URL: https://issues.apache.org/jira/browse/LUCENE-8192
Project: Lucene - Core
Hey all, I wrote a small addition to BaseTokenStreamTestCase, so when a test fails, it
gives some useful debugging output explaining the token stream. Makes it
pretty easy to get your tests / offsets configured properly.
Any comments? Is BaseTokenStreamTestCase the right place to add this
utility logic to?
is still useful, but no longer so drastic. So sorry for being
unclear. 🤓 Maybe I should change or remove the last sentence in my comment to
remove the misunderstanding.
> Should BaseTokenStreamTestCase catch analyzers that create duplicate tokens?
>
>
> Key: LUCENE-7622
> URL: https://issues.apache.org/jira/browse/LUCENE-7622
> Project: Lucene - Core
text by duplicating them
didn't do this the
statistics would be wrong. I agree, for this case it would be better to have a
separate field, but some people like to have it in the same one.
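To make the statistics point concrete, here is a toy sketch in plain Java (not Lucene's real classes; `Token` and `termFreq` are invented stand-ins): the indexer counts every emitted token toward term frequency, so an analyzer that emits the same term twice at one position (posInc=0) doubles that term's tf even though the text contains it once.

```java
import java.util.*;

public class DuplicateTokenStats {
    // A token as (term, positionIncrement), mimicking how a TokenStream
    // reports terms: posInc == 0 means "same position as the previous token".
    public record Token(String term, int posInc) {}

    // Term frequency as the indexer would see it: every emitted token counts,
    // even if it occupies the same position as another copy of the same term.
    public static Map<String, Integer> termFreq(List<Token> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (Token t : tokens) tf.merge(t.term(), 1, Integer::sum);
        return tf;
    }
}
```

With input text "wifi router" and a filter that duplicates "wifi" at the same position, tf("wifi") comes out as 2, which is exactly the statistics skew discussed above.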
I'm not planning on pursuing this further now ... I
think it's maybe too anal to insist on this from all analyzers ... so I'm
posting the patch here in case anyone else gets itchy!
Michael McCandless created LUCENE-7622:
--
Summary: Should BaseTokenStreamTestCase catch analyzers that
create duplicate tokens?
Key: LUCENE-7622
URL: https://issues.apache.org/jira/browse/LUCENE-7622
call restoreState();
> clearAttributes() is not needed before restoreState().
>
> If you don’t do this, your filter will work incorrectly if other filters
> come **after** it.
>
>
>
> The assertion in BaseTokenStreamTestCase is therefore correct and really
> mandatory. There ar
assertion in BaseTokenStreamTestCase is therefore correct and really
mandatory. There are many filters that show how to do this token inserting
correctly.
Uwe
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de <http://www.thetaphi.de/>
eMail: u...@thetaphi.de
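Uwe's rule can be illustrated with a tiny stand-in for Lucene's AttributeSource (a plain map, not the real API; all names here are invented for the sketch): restoreState() overwrites every captured attribute, so clearAttributes() immediately before it is redundant, but when you build a brand-new token, clearAttributes() is mandatory or stale values leak through.

```java
import java.util.*;

public class AttributeStateDemo {
    // Toy stand-in for Lucene's AttributeSource: one mutable bag of
    // attributes shared by every filter in the chain.
    private final Map<String, String> attrs = new HashMap<>();

    public void set(String key, String value) { attrs.put(key, value); }
    public String get(String key) { return attrs.get(key); }

    // captureState(): snapshot of all current attribute values.
    public Map<String, String> captureState() { return new HashMap<>(attrs); }

    // restoreState(): overwrites *every* captured attribute, which is why
    // calling clearAttributes() right before it buys you nothing.
    public void restoreState(Map<String, String> state) {
        attrs.clear();
        attrs.putAll(state);
    }

    // clearAttributes(): reset everything; mandatory when emitting a brand
    // new token so no stale values from the previous token survive.
    public void clearAttributes() { attrs.clear(); }
}
```

A token-inserting filter captures state for the original token, emits the inserted token, then restores the captured state; forgetting clearAttributes() when synthesizing the inserted token is what leaves dirty state behind.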
Hi all
While writing the unit tests for a new token filter I came across an
issue(?) with BaseTokenStreamTestCase.assertTokenStreamContents(): it goes
to some length to assure that clearAttributes() was called for every token
produced by the filter under test.
I suppose this helps most of the time.
It's not really a use case: you have to clear attributes when creating
a new token or you will have dirty state that is not appropriate...
On Fri, May 16, 2014 at 12:28 AM, Nitzan Shaked wrote:
> Hi all
>
> While writing the unit tests for a new token filter I came across an
> issue(?) with BaseTokenStreamTestCase.assertTokenStreamContents()
e.g.,
oal.analysis.miscellaneous.EmptyTokenStream. Remove EmptyTokenizer from
test-framework.
> Fix IndexWriter working together with EmptyTokenizer and EmptyTokenStream
> (without CharTermAttribute), fix BaseTokenStreamTestCase
> --
>
> Key: LUCENE-4656
> URL: https://issues.apache.org/jira/browse/LUCENE-4656
s in affected files. I will commit this later and
backport.
patch.
core! Only in 4.x's TestDocument!
I ran the tests with your patch and they passed, so +1. +1 to
removing EmptyTokenizer too.
that horrible piece of sh* :-)
/java to queryparser/src/test at least as an
improvement, since it is kinda funky.
CharTermAttribute), fix BaseTokenStreamTestCase
(was: Fix EmptyTokenizer)
If you want to see what the test strings look like
now, have a look at ant test -Dtestcase=TestMockAnalyzer
-Dtestmethod=testRandomStrings -Dtests.verbose=true
randomRealistic so in
that case we get whole words in the same unicode block (good for stemmers),
also sometimes uses randomRegexpIshString, so we get lots of punctuation (good
for tokenizers/filters, etc)
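The "whole words from the same Unicode block" idea can be sketched with a hypothetical helper (not the actual test-framework code): pick an inclusive code point range for the block and draw every character of a word from it, so stemmers see plausible same-script words.

```java
import java.util.Random;

public class RandomWordGen {
    // Sketch: generate a random word whose characters all come from one
    // Unicode block, given as an inclusive code point range [minCp, maxCp].
    public static String randomWordFromBlock(Random r, int minCp, int maxCp, int maxLen) {
        int len = 1 + r.nextInt(maxLen);  // word length in [1, maxLen]
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < len; i++) {
            sb.appendCodePoint(minCp + r.nextInt(maxCp - minCp + 1));
        }
        return sb.toString();
    }
}
```

Mixing this with a regexp-ish generator (lots of punctuation) then covers both the stemmer-friendly and tokenizer-hostile cases described above.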
can use this too.
> improve BaseTokenStreamTestCase random string generation
>
>
> Key: LUCENE-3911
> URL: https://issues.apache.org/jira/browse/LUCENE-3911
> Project: Lucene - Java
Great!
_TestUtil string generation methods too :)
short words, since the maxWordLength we
pass in is really a max...
but we would want that to be the exact number of elements. I'll improve this.
Robert Muir updated LUCENE-3911:
Attachment: LUCENE-3911.patch
improve BaseTokenStreamTestCase random string generation
Key: LUCENE-3911
URL: https://issues.apache.org/jira/browse/LUCENE-3911
Project: Lucene - Java
Issue Type: Task
> BaseTokenStreamTestCase should test analyzers on real-ish content
> -
>
> Key: LUCENE-3905
> URL: https://issues.apache.org/jira/browse/LUCENE-3905
> Project: Lucene - Java
>
for ngram love...
It's an improvement!
The ngram filters are unfortunately not OK: they use up tons of RAM when you
send random/big tokens through them, because they don't have the same 1024
character limit... I think we should open a new issue for them... in fact I
think repairing them could make a good GSoC!
the filter versions of these the same way?
e.g. if i have mocktokenizer + (edge)ngramfilter, are they ok?
They are limited to the first 1024 chars, but that
doesn't mean they can't implement end() correctly so that at least
highlighting on multivalued fields etc works.
tokenizers...
BaseTokenStreamTestCase should test analyzers on real-ish content
-
Key: LUCENE-3905
URL: https://issues.apache.org/jira/browse/LUCENE-3905
Project: Lucene - Java
Issue Type
> Make BaseTokenStreamTestCase a bit more evil
>
>
> Key: LUCENE-3894
> URL: https://issues.apache.org/jira/browse/LUCENE-3894
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Michael McCandless
A trivial test (testHugeDoc) found the
IO-311 bug; what if we
didn't have that silly test?
I'll add a patch.
Rob!
Michael McCandless resolved LUCENE-3894.
Resolution: Fixed
d' is not really new, it's from commons-io! We
should open a bug over there...
The read method needs to use the incoming offset (ie, pass
location + offset, not location, as 2nd arg to input.read)? Does testHugeDoc
then pass?
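The offset bug described here is easy to reproduce with any custom Reader: read(cbuf, off, len) must write starting at off, not at index 0 of the buffer. A minimal correct sketch (toy class, not the commons-io code under discussion):

```java
import java.io.Reader;

public class RepeatingReader extends Reader {
    // Toy Reader illustrating the contract: read(cbuf, off, len) writes
    // into cbuf starting at 'off', and returns -1 (never 0) at EOF.
    private final String data;
    private int pos = 0;

    public RepeatingReader(String data) { this.data = data; }

    @Override
    public int read(char[] cbuf, int off, int len) {
        if (pos >= data.length()) return -1;  // EOF: -1, not 0
        int n = Math.min(len, data.length() - pos);
        // correct: honor the incoming offset as the destination start
        data.getChars(pos, pos + n, cbuf, off);
        pos += n;
        return n;
    }

    @Override public void close() {}
}
```

A buggy implementation that writes at index 0 regardless of off passes trivial tests (which always call read(buf, 0, len)) and only fails once a caller like testHugeDoc fills a large buffer incrementally.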
The test for icutokenizer now passes
(spoonfeeding caught a bug).
But, now testHugeDoc fails... (not a random test).
I had to fix Edge/NGramTokenizers to work w/ spoon feeding, but otherwise no
analyzers seem to be failing, at least on one run...
I had to do some sneaky things with MockTokenizer to work around its state
machine...
Make BaseTokenStreamTestCase a bit more evil
Key: LUCENE-3894
URL: https://issues.apache.org/jira/browse/LUCENE-3894
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
MockGraphTokenFilter into tests.
> basetokenstreamtestcase should fail if tokenstream starts with posinc=0
> ---
>
> Key: LUCENE-3848
> URL: https://issues.apache.org/jira/bro
nc=0' to posinc=1 anyway.
+1
I didn't integrate Mike's nice MockGraphTokenFilter *yet* but will do this
under a separate issue: its likely to expose a few bugs :)
Robert Muir updated LUCENE-3848:
Fix Version/s: (was: 3.6)
MockGraphTokenFilter we can use to randomly insert fake graph
arcs...
basetokenstreamtestcase should fail if tokenstream starts with posinc=0
---
Key: LUCENE-3848
URL: https://issues.apache.org/jira/browse/LUCENE-3848
Project: Lucene - Java
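The invariant this issue asks the test framework to enforce can be sketched as a simple check over a stream's position increments (toy helper, not the real assertion code): a leading posInc of 0 would stack the first token on a nonexistent previous position.

```java
import java.util.List;

public class PosIncChecker {
    // The first token of a stream must have positionIncrement >= 1; every
    // later token may legally use 0 to share a position with its predecessor.
    public static boolean startsLegally(List<Integer> posIncs) {
        return posIncs.isEmpty() || posIncs.get(0) >= 1;
    }
}
```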
Robert Muir resolved LUCENE-3717.
-
Resolution: Fixed
* computing end() from the trimmed length
* not calling correctOffset
* not checking return value of Reader.read causing bugs in some situations
(e.g. empty stringreader)
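The "empty stringreader" pitfall from the last bullet is plain java.io behavior: read(char[], int, int) returns -1 immediately for empty input, and code that uses the return value as a length without checking for -1 computes garbage offsets. A minimal defensive sketch (hypothetical helper name):

```java
import java.io.IOException;
import java.io.StringReader;

public class EmptyReaderCheck {
    // Returns the number of chars actually read, mapping the -1 EOF marker
    // to 0 so callers never treat it as a negative length.
    public static int safeReadLength(StringReader r, char[] buf) {
        try {
            int n = r.read(buf, 0, buf.length);
            return n < 0 ? 0 : n;  // -1 means EOF, not "minus one char"
        } catch (IOException e) {
            return 0;  // closed reader: nothing read
        }
    }
}
```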
It just remains to add the random test to all remaining tokenstreams...
> Add fake charfilter to BaseTokenStreamTestCase to find offsets bugs
> ---
>
> Key: LUCENE-3717
>
I'm attaching the current patch as a start, but I think we
should check every tokenizer/filter/etc and just clean this up.
charfilters.
* WikipediaTokenizer broken in many ways, in general the tokenizer keeps a ton
of state variables, but never resets this state.
patch fixes these but I'm sure adding more tests to the remaining filters will
find more bugs.
all using checkRandomData (I think most are), just to see if we have any other
bugs sitting out there.
It would be nice to have these offsets all under control for the next release.
Add fake charfilter to BaseTokenStreamTestCase to find offsets bugs
---
Key: LUCENE-3717
URL: https://issues.apache.org/jira/browse/LUCENE-3717
Project: Lucene - Java
Issue
Robert Muir updated LUCENE-3717:
Attachment: LUCENE-3717.patch