Re: Problem with porter stemming

2016-03-14 Thread Benson Margulies
Stemming is an inherently limited process. It doesn't know about the word 'news', it just has a rule about 's'. Some of us sell commercial products that do more complex linguistic processing that knows about which words are which. There may be open source implementations of similar technology.
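
To make the "rule about 's'" concrete, here is a minimal sketch of the plural-stripping rules from step 1a of the published Porter algorithm. It is written from the algorithm's description, not copied from Lucene's stemmer, and the class and method names are made up for illustration. It shows why 'news' comes out as 'new': the rule only looks at the suffix.

```java
// Illustrative only: the suffix rules of Porter step 1a.
// Step1aSketch and step1a are hypothetical names, not Lucene API.
final class Step1aSketch {

    /** Applies the four step-1a suffix rules, longest suffix first. */
    static String step1a(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // caresses -> caress
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // ponies -> poni
        }
        if (word.endsWith("ss")) {
            return word;                                 // caress stays caress
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // cats -> cat, news -> new
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"cats", "news", "caress", "ponies"}) {
            System.out.println(w + " -> " + step1a(w));
        }
    }
}
```

Nothing in the rule distinguishes a plural from a word that merely ends in 's', which is the limitation described above.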

Re: Text dependent analyzer

2015-04-17 Thread Benson Margulies
If you want tokenization to depend on sentences, and you insist on being inside Lucene, you have to be a Tokenizer. Your tokenizer can set an attribute on the token that ends a sentence. Then, downstream, filters can read-ahead tokens to get the full sentence and buffer tokens as needed. On
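
The buffering idea can be sketched in plain Java, outside Lucene. The Tok class and its endsSentence flag are hypothetical stand-ins for a token carrying a custom attribute; a real filter would use Lucene's attribute and state-capturing machinery instead.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: Tok and endsSentence are assumptions standing in for
// a Lucene token with a custom "ends a sentence" attribute.
final class SentenceBufferSketch {

    static final class Tok {
        final String term;
        final boolean endsSentence;

        Tok(String term, boolean endsSentence) {
            this.term = term;
            this.endsSentence = endsSentence;
        }
    }

    /** Buffers tokens until a flagged token arrives, then emits the whole sentence. */
    static List<List<String>> sentences(List<Tok> stream) {
        List<List<String>> out = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        for (Tok t : stream) {
            buffer.add(t.term);
            if (t.endsSentence) {
                out.add(buffer);
                buffer = new ArrayList<>();
            }
        }
        if (!buffer.isEmpty()) {
            out.add(buffer); // trailing tokens that never saw an end marker
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> stream = new ArrayList<>();
        stream.add(new Tok("hello", false));
        stream.add(new Tok("world", true));
        stream.add(new Tok("bye", false));
        System.out.println(sentences(stream));
    }
}
```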

Re: A codec moment or pickle

2015-02-12 Thread Benson Margulies
codecs may not work as expected! Maybe try it out, was just an idea :-) Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Benson Margulies [mailto:bimargul...@gmail.com] Sent: Thursday

Re: A codec moment or pickle

2015-02-12 Thread Benson Margulies
Robert, Let me lay out the scenario. Hardware has .5T of Index is relatively small. Application profiling shows a significant amount of time spent codec-ing. Options as I see them: 1. Use DPF complete with the irritation of having to have this spurious codec name in the on-disk format that has

Re: A codec moment or pickle

2015-02-12 Thread Benson Margulies
WHOOPS. First sentence was, until just before I clicked 'send', Hardware has .5T of RAM. Index is relatively small (20g) ... On Thu, Feb 12, 2015 at 4:51 PM, Benson Margulies ben...@basistech.com wrote: Robert, Let me lay out the scenario. Hardware has .5T of Index is relatively small

Re: A codec moment or pickle

2015-02-12 Thread Benson Margulies
means that I'm really married to a process of making releases that mirror Lucene releases. On Thu, Feb 12, 2015 at 5:33 AM, Benson Margulies ben...@basistech.com wrote: Based on reading the same comments you read, I'm pretty doubtful that Codec.getDefault() is going to work. It seems

A codec moment or pickle

2015-02-11 Thread Benson Margulies
I have a class that extends FilterCodec. Written against Lucene 4.9, it uses the Lucene49Codec. Dropped into a copy of Solr with Lucene 4.10, it discovers that this codec is read-only in 4.10. Is there some way to code one of these to get 'the default codec' and not have to chase versions?

A really hairy token graph case

2014-10-24 Thread Benson Margulies
Consider a case where we have a token which can be subdivided in several ways. This can happen in German. We'd like to represent this with positionIncrement/positionLength, but it does not seem possible. Once the position has moved out from one set of 'subtokens', we see no way to move it back

Re: A really hairy token graph case

2014-10-24 Thread Benson Margulies
...@gmail.com wrote: HI Benson: This is the case with n-gramming (though you have a more complicated start chooser than most I imagine). Does that help get your ideas unblocked? Will -Original Message- From: Benson Margulies [mailto:bimargul...@gmail.com] Sent: Friday, October 24, 2014 4

Re: Why does this search fail?

2014-08-27 Thread Benson Margulies
Does google actually support *? On Wed, Aug 27, 2014 at 9:54 AM, Milind mili...@gmail.com wrote: I see. This is going to be extremely difficult to explain to end users. It doesn't work as they would expect. Some of the tokenizing rules are already somewhat confusing. Their expectation is

Re: searching with stemming

2014-06-09 Thread Benson Margulies
You should construct an analysis chain that does what you need. Read the source of the relevant analyzer and pick the tokenizer and filter(s) that you need, and don't include stemming. On Mon, Jun 9, 2014 at 5:57 AM, Jamie ja...@mailarchiva.com wrote: Greetings Our app currently uses

Re: searching with stemming

2014-06-09 Thread Benson Margulies
Are you using Solr? If so you are on the wrong mailing list. If not, why do you need a non-anonymous analyzer at all? On Jun 9, 2014 6:55 AM, Jamie ja...@mailarchiva.com wrote: To me, it seems strange that these default analyzers, don't provide constructors that enable one to override

Re: searching with stemming

2014-06-09 Thread Benson Margulies
. On Jun 9, 2014 7:02 AM, Jamie ja...@mailarchiva.com wrote: I am not using Solr. I am using the default analyzers... On 2014/06/09, 12:59 PM, Benson Margulies wrote: Are you using Solr? If so you are on the wrong mailing list. If not, why do you need a non-anonymous analyzer at all

Re: Confuse with Kuromoji

2014-04-06 Thread Benson Margulies
You must know what language each text is in, and use an appropriate analyzer. Some people do this by using a separate field (text_eng, text_spa, text_jpn). Other people put some extra information at the beginning of the field, and then make an analyzer that peeks in order to dispatch to the
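
Both approaches can be sketched without Lucene. The field-naming pattern follows the text_eng / text_spa / text_jpn examples above. The "lang|text" prefix encoding is purely an assumption, since the message does not say what the extra information looks like, and the class and method names are made up.

```java
// Sketch only: the "lang|text" encoding and all names here are hypothetical.
final class LanguageDispatchSketch {

    /** Field-per-language approach: derive the field name from a language code. */
    static String fieldFor(String lang) {
        return "text_" + lang;
    }

    /**
     * Prefix approach: peek at a three-letter code in front of the text and
     * split it off, so the caller can dispatch to the right analyzer.
     * Returns {language, text}.
     */
    static String[] splitTagged(String value) {
        int bar = value.indexOf('|');
        if (bar != 3) {
            throw new IllegalArgumentException("expected a three-letter code followed by '|'");
        }
        return new String[] {value.substring(0, 3), value.substring(4)};
    }

    public static void main(String[] args) {
        System.out.println(fieldFor("jpn"));
        String[] parts = splitTagged("eng|some English text");
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```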

Re: Confuse with Kuromoji

2014-04-06 Thread Benson Margulies
the language, or ... you can run a language detector. They are less accurate for short strings, or ... you can process it in _all_ of the languages and OR up the results. On 4/6/2014 4:51 AM, Benson Margulies wrote: You must know what language each text is in, and use an appropriate

Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Benson Margulies
It sounds like you've been asked to implement Named Entity Recognition. OpenNLP has some capability here. There are also, um, commercial alternatives. On Thu, Feb 20, 2014 at 6:24 AM, Yann-Erwan Perio ye.pe...@gmail.com wrote: On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar geetgang...@gmail.com

Re: LUCENE-5388 AbstractMethodError

2014-01-30 Thread Benson Margulies
If you are sensitive to things being committed to trunk, that suggests that you are building your own jars and using the trunk. Are you perfectly sure that you have built, and are using, a consistent set of jars? It looks as if you've got some trunk-y stuff and some 4.6.1 stuff. On Thu, Jan 30,

Re: How is incrementToken supposed to detect the lack of reset()?

2014-01-08 Thread Benson Margulies
-a-lucene-tokenstream/20630673#20630673 Regards, Mindaugas On Tue, Jan 7, 2014 at 9:45 PM, Benson Margulies ben...@basistech.com wrote: Yes I Do. On Tue, Jan 7, 2014 at 3:59 PM, Robert Muir rcm...@gmail.com wrote: Benson, do you want to open an issue to fix this constructor to not take

Re: How is incrementToken supposed to detect the lack of reset()?

2014-01-08 Thread Benson Margulies
, Sure, why not - I'm just not sure if my approach (of setting reader in reset()) is preferred over yours (using this.input instead of input in ctor)? Or are they both equally good? m. On Wed, Jan 8, 2014 at 12:18 PM, Benson Margulies ben...@basistech.com wrote: If you'd like to join

How is incrementToken supposed to detect the lack of reset()?

2014-01-07 Thread Benson Margulies
In 4.6.0, org.apache.lucene.analysis.BaseTokenStreamTestCase#checkResetException fails if incrementToken fails to throw if there's a missing reset. How am I supposed to organize this in a Tokenizer? A quick look at CharTokenizer did not reveal any code for the purpose.

Re: How is incrementToken supposed to detect the lack of reset()?

2014-01-07 Thread Benson Margulies
Tokenizer.java for the state machine logic. In general you should not have to do anything if the tokenizer is well-behaved (e.g. close calls super.close() and so on). On Tue, Jan 7, 2014 at 2:50 PM, Benson Margulies bimargul...@gmail.com wrote: In 4.6.0

Re: How is incrementToken supposed to detect the lack of reset()?

2014-01-07 Thread Benson Margulies
. i think its confusing and contributes to bugs that you have to have logic in e.g. the ctor THEN ALSO in reset(). if someone does it correctly in the ctor, but they only test one time, they might think everything is working.. On Tue, Jan 7, 2014 at 3:23 PM, Benson Margulies ben

Where is the source for the .dat files in Kuromoji?

2013-12-02 Thread Benson Margulies
There are a handful of binary files in ./src/resources/org/apache/lucene/analysis/ja/dict/ with filenames ending in .dat. Trailing around in the source, it seems as if at least one of these derives from a source file named unk.def. In turn, this file comes from a dependency. Should the build

Re: Where is the source for the .dat files in Kuromoji?

2013-12-02 Thread Benson Margulies
. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Benson Margulies [mailto:ben...@basistech.com] Sent: Monday, December 02, 2013 6:12 PM To: java-user@lucene.apache.org; Christian Moen

Re: Where is the source for the .dat files in Kuromoji?

2013-12-02 Thread Benson Margulies
, Christian Moen アティリカ株式会社 http://www.atilika.com On Dec 3, 2013, at 2:11 AM, Benson Margulies ben...@basistech.com wrote: There are a handful of binary files in ./src/resources/org/apache/lucene/analysis/ja/dict/ with filenames ending in .dat. Trailing around in the source, it seems

Re: Modify the StandardTokenizerFactory to concatenate all words

2013-11-05 Thread Benson Margulies
How would you expect to recognize that 'Toy Story' is a thing? On Tue, Nov 5, 2013 at 6:32 PM, Kevin glidekensing...@gmail.com wrote: Currently I'm using StandardTokenizerFactory which tokenizes the words based on spaces. For Toy Story it will create tokens toy and story. Ideally, I would

Threads and LuceneTestCase in 3.6.0

2013-10-31 Thread Benson Margulies
I just backported some code to 3.6.0, and it includes tests that use org.apache.lucene.analysis.BaseTokenStreamTestCase#checkRandomData(java.util.Random, org.apache.lucene.analysis.Analyzer, int, int) The tests that use this method fail in 3.6.0 in ways that suggest that multiple threads are

Re: new consistency check for token filters in 4.5.1

2013-10-30 Thread Benson Margulies
- From: Benson Margulies [mailto:ben...@basistech.com] Sent: Wednesday, October 30, 2013 12:30 AM To: java-user@lucene.apache.org Subject: new consistency check for token filters in 4.5.1 My token filter has no end() method at all. Am I required to have an end method

new consistency check for token filters in 4.5.1

2013-10-29 Thread Benson Margulies
My token filter has no end() method at all. Am I required to have an end() method? BaseLinguisticsTokenFilterTest.testSegmentationReadings:175-Assert.assertTrue:41-Assert.fail:88 super.end()/clearAttributes() was not called correctly in end()

Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?

2013-10-28 Thread Benson Margulies
I'm working on a tool that wants to construct analyzers 'at arms length' -- a bit like from a solr schema -- so that multiple dueling analyzers could be in their own class loaders at one time. I want to just define a simple configuration for char filters, tokenizer, and token filter. So it would be,

Re: Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?

2013-10-28 Thread Benson Margulies
OK, so, here I go again making a public idiot of myself. Could it be that the tokenizer factory is 'relatively recent' as in since 4.1? On Mon, Oct 28, 2013 at 7:39 AM, Benson Margulies ben...@basistech.com wrote: I'm working on a tool that wants to construct analyzers 'at arms length

Re: Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?

2013-10-28 Thread Benson Margulies
analyzers-commons module (since 4.0). They are no longer part of Solr. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Benson Margulies [mailto:ben...@basistech.com] Sent: Monday, October 28

Re: Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?

2013-10-28 Thread Benson Margulies
. I don't suppose there are some guidelines? On Mon, Oct 28, 2013 at 9:43 AM, Benson Margulies ben...@basistech.com wrote: Just how 'experimental' is the SPI system at this point, if that's a reasonable question? On Mon, Oct 28, 2013 at 8:41 AM, Uwe Schindler u...@thetaphi.de wrote: Hi

Anyone interested in a worked-out example of the SPIs for analyzer components?

2013-10-28 Thread Benson Margulies
I just built myself a sort of Solr-schema-in-a-test-tube. It's a class that builds a classloader on some JAR files and then uses the SPI mechanism to manufacture Analyzer objects made out of tokenizers and filters. I can make this visible in github, or even attach it to a JIRA, if anyone is

Re: Handling special characters in Lucene 4.0

2013-10-20 Thread Benson Margulies
It might be helpful if you would explain, at a higher level, what you are trying to accomplish. Where do these things come from? What higher-level problem are you trying to solve? On Sun, Oct 20, 2013 at 7:12 PM, saisantoshi saisantosh...@gmail.com wrote: Thanks. So, if I understand correctly,

Re: Exploiting a whole lot of memory

2013-10-10 Thread Benson Margulies
On Wed, Oct 9, 2013 at 7:18 PM, Michael McCandless luc...@mikemccandless.com wrote: On Wed, Oct 9, 2013 at 7:13 PM, Benson Margulies ben...@basistech.com wrote: On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless luc...@mikemccandless.com wrote: DirectPostingsFormat? It stores all

Re: Exploiting a whole lot of memory

2013-10-09 Thread Benson Margulies
it as the postings guy, is that the whole recipe?. Does it make sense to extend it any further to any of the other codec pieces? Mike McCandless http://blog.mikemccandless.com On Tue, Oct 8, 2013 at 5:45 PM, Benson Margulies ben...@basistech.com wrote: Consider a Lucene index consisting of 10m

Re: Exploiting a whole lot of memory

2013-10-09 Thread Benson Margulies
On Wed, Oct 9, 2013 at 7:18 PM, Michael McCandless luc...@mikemccandless.com wrote: On Wed, Oct 9, 2013 at 7:13 PM, Benson Margulies ben...@basistech.com wrote: On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless luc...@mikemccandless.com wrote: DirectPostingsFormat? It stores all

Analyzer classes versus the constituent components

2013-10-08 Thread Benson Margulies
Is there some advice around about when it's appropriate to create an Analyzer class, as opposed to just Tokenizer and TokenFilter classes? The advantage of the constituent elements is that they allow the consuming application to add more filters. The only disadvantage I see is that the following

Exploiting a whole lot of memory

2013-10-08 Thread Benson Margulies
Consider a Lucene index consisting of 10m documents with a total disk footprint of 3G. Consider an application that treats this index as read-only, and runs very complex queries over it. Queries with many terms, some of them 'fuzzy' and 'should' terms and a dismax. And, finally, consider doing all

Re: Exploiting a whole lot of memory

2013-10-08 Thread Benson Margulies
at 5:45 PM, Benson Margulies ben...@basistech.com wrote: Consider a Lucene index consisting of 10m documents with a total disk footprint of 3G. Consider an application that treats this index as read-only, and runs very complex queries over it. Queries with many terms, some of them 'fuzzy

Re: Exploiting a whole lot of memory

2013-10-08 Thread Benson Margulies
Oh, drat, I left out an 's'. I got it now. On Tue, Oct 8, 2013 at 7:40 PM, Benson Margulies ben...@basistech.com wrote: Mike, where do I find DirectPostingFormat? On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless luc...@mikemccandless.com wrote: DirectPostingsFormat? It stores all

Re: How to make good use of the multithreaded IndexSearcher?

2013-10-01 Thread Benson Margulies
://blog.mikemccandless.com On Tue, Oct 1, 2013 at 7:09 AM, Adrien Grand jpou...@gmail.com wrote: Hi Benson, On Mon, Sep 30, 2013 at 5:21 PM, Benson Margulies ben...@basistech.com wrote: The multithreaded index searcher fans out across segments. How aggressively does 'optimize' reduce

How to make good use of the multithreaded IndexSearcher?

2013-09-30 Thread Benson Margulies
The multithreaded index searcher fans out across segments. How aggressively does 'optimize' reduce the number of segments? If the segment count goes way down, is there some other way to exploit multiple cores?
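
The fan-out can be pictured with a plain-Java sketch: one task per segment finds that segment's best hits, and the partial results are merged. Here a "segment" is just a list of scores and every name is hypothetical; this is not how IndexSearcher is implemented, only an illustration of why the available parallelism is bounded by the segment count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: segments are modeled as lists of scores; names are made up.
final class SegmentFanOutSketch {

    /** Submits one task per segment, then merges the per-segment top k. */
    static List<Double> topK(List<List<Double>> segments, int k, ExecutorService pool) {
        List<Future<List<Double>>> parts = new ArrayList<>();
        for (List<Double> segment : segments) {
            Callable<List<Double>> task = () -> best(segment, k);
            parts.add(pool.submit(task));
        }
        List<Double> merged = new ArrayList<>();
        for (Future<List<Double>> part : parts) {
            try {
                merged.addAll(part.get());
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }
        return best(merged, k);
    }

    /** Returns the k highest scores in descending order. */
    private static List<Double> best(List<Double> scores, int k) {
        List<Double> sorted = new ArrayList<>(scores);
        sorted.sort(Collections.reverseOrder());
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<List<Double>> segments = Arrays.asList(
                    Arrays.asList(1.0, 9.0), Arrays.asList(8.0, 2.0));
            System.out.println(topK(segments, 2, pool));
        } finally {
            pool.shutdown();
        }
    }
}
```

With a single segment the loop submits a single task, so extra threads in the pool sit idle; that is the concern behind the question about 'optimize'.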

Re: org.apache.lucene.analysis.icu.ICUNormalizer2Filter -- why Token?

2013-09-16 Thread Benson Margulies
should be useful: there is a JIRA for it, but it has some unresolved issues https://issues.apache.org/jira/browse/LUCENE-4072 On Sun, Sep 15, 2013 at 7:05 PM, Benson Margulies bimargul...@gmail.com wrote: Can anyone shed light as to why this is a token filter and not a char filter? I'm

org.apache.lucene.analysis.icu.ICUNormalizer2Filter -- why Token?

2013-09-16 Thread Benson Margulies
Can anyone shed light as to why this is a token filter and not a char filter? I'm wishing for one of these _upstream_ of a tokenizer, so that the tokenizer's lookups in its dictionaries are seeing normalized contents.
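
A small plain-Java sketch of why normalizing upstream matters. It uses the JDK's java.text.Normalizer with NFKC as a stand-in for ICU normalization, and a one-entry set as a stand-in for a tokenizer dictionary; both are assumptions for illustration, and ICUNormalizer2Filter's own defaults may differ.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch only: DICTIONARY and lookup are hypothetical stand-ins for a
// tokenizer's dictionary lookup; JDK NFKC stands in for ICU normalization.
final class NormalizeFirstSketch {

    static final Set<String> DICTIONARY = new HashSet<>(Arrays.asList("ABC"));

    /** Looks the text up, optionally normalizing it before the lookup. */
    static boolean lookup(String text, boolean normalizeFirst) {
        String key = normalizeFirst
                ? Normalizer.normalize(text, Normalizer.Form.NFKC)
                : text;
        return DICTIONARY.contains(key);
    }

    public static void main(String[] args) {
        String fullwidth = "\uFF21\uFF22\uFF23"; // fullwidth forms of A, B, C
        System.out.println(lookup(fullwidth, false)); // raw text misses the dictionary
        System.out.println(lookup(fullwidth, true));  // normalized text finds it
    }
}
```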

Re: PositionLengthAttribute

2013-09-07 Thread Benson Margulies
in the original that might as well be blamed for any given component. On Fri, Sep 6, 2013 at 9:37 PM, Robert Muir rcm...@gmail.com wrote: On Fri, Sep 6, 2013 at 9:32 PM, Benson Margulies ben...@basistech.com wrote: On Fri, Sep 6, 2013 at 9:28 PM, Robert Muir rcm...@gmail.com wrote: its the latter

Re: PositionLengthAttribute

2013-09-07 Thread Benson Margulies
On Sat, Sep 7, 2013 at 8:39 AM, Robert Muir rcm...@gmail.com wrote: On Sat, Sep 7, 2013 at 7:44 AM, Benson Margulies ben...@basistech.com wrote: In Japanese, compounds are just decompositions of the input string. In other languages, compounds can manufacture entire tokens from thin air

Re: LookaheadTokenFilter

2013-09-07 Thread Benson Margulies
nextToken() calls peekToken(). That seems to prevent my lookahead processing from seeing that item later. Am I missing something? On Fri, Sep 6, 2013 at 9:15 PM, Benson Margulies ben...@basistech.com wrote: I think that the penny just dropped, and I should not be using this class. If I call

Re: LookaheadTokenFilter

2013-09-07 Thread Benson Margulies
the buffered tokens, and to insert your own tokens when afterPosition() is called ... Mike McCandless http://blog.mikemccandless.com On Sat, Sep 7, 2013 at 1:10 PM, Benson Margulies ben...@basistech.com wrote: nextToken() calls peekToken(). That seems to prevent my lookahead processing from seeing

Re: LookaheadTokenFilter

2013-09-07 Thread Benson Margulies
, thanks! Mike McCandless http://blog.mikemccandless.com On Sat, Sep 7, 2013 at 3:40 PM, Benson Margulies ben...@basistech.com wrote: I think I had better build you a test case for this situation, and attach it to a JIRA. On Sat, Sep 7, 2013 at 3:33 PM, Michael McCandless luc

Re: LookaheadTokenFilter

2013-09-06 Thread Benson Margulies
On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies ben...@basistech.com wrote: I'm trying to work through the logic of reading ahead until I've seen a marker for the end of a sentence, then applying some analysis

PositionLengthAttribute

2013-09-06 Thread Benson Margulies
I'm confused by the comment about compound components here. If a single token fissions into multiple tokens, then what belongs in the PositionLengthAttribute? I'm wanting to store a fraction in here! Or is the idea to store N in the 'mother' token and then '1' in each of the babies?
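
The second reading (N in the 'mother', 1 in each 'baby') can be worked through with plain integers. The sketch below assumes the compound is emitted first and that its first part shares its position with a position increment of 0; the token values and all names are illustrative, not taken from any analyzer's actual output.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: computes the graph arc (from -> to) each token occupies,
// given position increments and position lengths in emission order.
final class PositionGraphSketch {

    static List<String> arcs(String[] terms, int[] posInc, int[] posLen) {
        List<String> out = new ArrayList<>();
        int pos = -1; // positions start before the first token
        for (int i = 0; i < terms.length; i++) {
            pos += posInc[i];                 // where this token starts
            int end = pos + posLen[i];        // where this token ends
            out.add(terms[i] + ":" + pos + "->" + end);
        }
        return out;
    }

    public static void main(String[] args) {
        // Compound ABCD spans two positions; AB and CD span one each.
        String[] terms = {"ABCD", "AB", "CD"};
        int[] posInc = {1, 0, 1};
        int[] posLen = {2, 1, 1};
        System.out.println(arcs(terms, posInc, posLen));
    }
}
```

Under these assumptions ABCD covers the same stretch of the graph as AB followed by CD, with only whole numbers stored and no fractions needed.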

Re: LookaheadTokenFilter

2013-09-06 Thread Benson Margulies
. public boolean incrementToken() throws IOException { if (positions.getMaxPos() < 0) { peekSentence(); } return nextToken(); } On Fri, Sep 6, 2013 at 8:13 AM, Benson Margulies ben...@basistech.com wrote: On Fri, Sep 6, 2013 at 7:31 AM, Michael

Re: LookaheadTokenFilter

2013-09-06 Thread Benson Margulies
it. On Fri, Sep 6, 2013 at 9:10 PM, Benson Margulies ben...@basistech.com wrote: Michael, I'm apparently not fully deconfused yet. I've got a very simple incrementToken function. It calls peekToken to stack up the tokens. afterPosition is never called; I expected it to be called as each

Re: PositionLengthAttribute

2013-09-06 Thread Benson Margulies
On Fri, Sep 6, 2013 at 9:28 PM, Robert Muir rcm...@gmail.com wrote: its the latter. the way its designed to work i think is illustrated best in kuromoji analyzer where it heuristically decompounds nouns: if it decompounds ABCD into AB + CD, then the tokens are AB and CD. these both have

LookaheadTokenFilter

2013-09-05 Thread Benson Margulies
This useful-looking item is in the test-framework jar. Is there some subtle reason that it isn't in the common analyzer jar? Some reason why I'd regret using it?

LookaheadTokenFilter

2013-09-05 Thread Benson Margulies
I'm trying to work through the logic of reading ahead until I've seen a marker for the end of a sentence, then applying some analysis to all of the tokens of the sentence, and then changing some attributes of each token to reflect the results. The queue of tokens for a position is just a State, so

Re: Issue with documentation for org.apache.lucene.analysis.synonym.SynonymMap.Builder.add() method

2012-09-06 Thread Benson Margulies
On Thu, Sep 6, 2012 at 1:59 PM, Robert Muir rcm...@gmail.com wrote: Thanks for reporting this Mark. I think it was not intended to have actual null characters here (or probably anywhere in javadocs). Our javadocs checkers should be failing on stuff like this... On Thu, Sep 6, 2012 at 1:52

Payload class

2012-08-29 Thread Benson Margulies
I'm failing to find advice in MIGRATE.txt on how to replace 'new Payload(...)' in migrating to 4.0. What am I missing?

ResourceLoader?

2012-08-29 Thread Benson Margulies
Our Solr 3.x code used init(ResourceLoader) and then called the loader to read a file. What's the new approach to reading content from files in the 'usual place'?

Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
That's what I meant, thanks. On Wed, Aug 29, 2012 at 10:20 AM, Robert Muir rcm...@gmail.com wrote: On Wed, Aug 29, 2012 at 10:10 AM, Benson Margulies ben...@basistech.com wrote: Our Solr 3.x code used init(ResourceLoader) and then called the loader to read a file. What's the new

Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
I'm confused. Isn't inform/ResourceLoader deprecated? But your example uses it? On Wed, Aug 29, 2012 at 10:20 AM, Robert Muir rcm...@gmail.com wrote: On Wed, Aug 29, 2012 at 10:10 AM, Benson Margulies ben...@basistech.com wrote: Our Solr 3.x code used init(ResourceLoader) and then called

Using a char filter in solr createComponents

2012-08-29 Thread Benson Margulies
I'm close to the bottom of my list here. I've got an Analyzer that, in 3.1, set up a CharFilter in the tokenStream method. So now I have to migrate that to createComponents. Can someone give me a shove in the right direction?

Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
On Wed, Aug 29, 2012 at 10:30 AM, Robert Muir rcm...@gmail.com wrote: On Wed, Aug 29, 2012 at 10:27 AM, Benson Margulies ben...@basistech.com wrote: I'm confused. Isn't inform/ResourceLoader deprecated? But your example uses it? Where is it deprecated? What does the deprecation message

Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
Hang on: [deprecation] org.apache.solr.util.plugin.ResourceLoaderAware in org.apache.solr.util.plugin has been deprecated On Wed, Aug 29, 2012 at 10:30 AM, Robert Muir rcm...@gmail.com wrote: On Wed, Aug 29, 2012 at 10:27 AM, Benson Margulies ben...@basistech.com wrote: I'm confused

Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
On Wed, Aug 29, 2012 at 10:42 AM, Robert Muir rcm...@gmail.com wrote: Right and what does the @deprecated message say :) Yes, indeed, sorry. I got caught in a maze of twisty passages and my brain turned off. I'm better now. On Wed, Aug 29, 2012 at 10:40 AM, Benson Margulies ben

reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
I've read the javadoc through a few times, but I confess that I'm still feeling dense. Are all tokenizers responsible for implementing some way of retaining the contents of their reader, so that a call to reset without a call to setReader rewinds? I note that CharTokenizer doesn't implement

Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
. The fact that CharTokenizer is doing 'reset()-like-stuff' in here is bogus IMO, but I dont think it will cause any bugs. Don't emulate it :) On Wed, Aug 29, 2012 at 3:29 PM, Benson Margulies ben...@basistech.com wrote: I've read the javadoc through a few times, but I confess that I'm still

Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
Some interlinear commentary on the doc. * Resets this stream to the beginning. To me this implies a rewind. As previously noted, I don't see how this works for the existing implementations. * As all TokenStreams must be reusable, * any implementations which have state that needs to be

Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
I think I'm beginning to get the idea. Is the following plausible? At the bottom of the stack, there's an actual source of data -- like a tokenizer. For one of those, reset() is a bit silly, and something like setReader is the brains of the operation. Some number of other components may be

Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
If I'm following, you've created a division of labor between setReader and reset. We have a tokenizer that has a good deal of state, since it has to split the input into chunks. If I'm following here, you'd recommend that we do nothing special in setReader, but have #reset fix up all the state on
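
The division of labor discussed in this thread can be sketched as a tiny stand-alone class: setReader only hands over the input, reset activates it and clears per-stream state, and reading before reset fails loudly. This is not Lucene's Tokenizer; the class, its method names, and the whitespace splitting are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Sketch only: a toy whitespace splitter illustrating a setReader/reset contract.
final class ResetContractSketch {
    private Reader pending; // handed over by setReader, not yet readable
    private Reader input;   // readable only after reset()
    private final StringBuilder term = new StringBuilder();

    /** Stores the new input; does nothing else. */
    void setReader(Reader reader) {
        pending = reader;
        input = null;
    }

    /** Activates the pending input and clears per-stream state. */
    void reset() {
        if (pending == null) {
            throw new IllegalStateException("setReader() must be called before reset()");
        }
        input = pending;
        pending = null;
        term.setLength(0);
    }

    /** Returns the next whitespace-delimited term, or null at end of input. */
    String next() {
        if (input == null) {
            throw new IllegalStateException("reset() was not called");
        }
        term.setLength(0);
        try {
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (term.length() > 0) {
                        break; // end of the current term
                    }
                } else {
                    term.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return term.length() > 0 ? term.toString() : null;
    }

    public static void main(String[] args) {
        ResetContractSketch t = new ResetContractSketch();
        t.setReader(new StringReader("reset then read"));
        t.reset();
        for (String s = t.next(); s != null; s = t.next()) {
            System.out.println(s);
        }
    }
}
```

Because the reader only becomes usable inside reset, a consumer that forgets reset gets an exception on the first read instead of silently wrong output.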

Re: DisjunctionMaxQuery and scoring

2012-04-20 Thread Benson Margulies
Uwe and Robert, Thanks. David and I are two peas in one pod here at Basis. --benson On Fri, Apr 20, 2012 at 2:33 AM, Uwe Schindler u...@thetaphi.de wrote: Hi, Ah sorry, I misunderstood, you wanted to score the duplicate match lower! To achieve this, you have to change the coord function in

DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
I am trying to solve a problem using DisjunctionMaxQuery. Consider a query like: a:b OR c:d OR e:f OR ... name:richard OR name:dick OR name:dickie OR name:rich ... At most, one of the richard names matches. So the match score gets dragged down by the long list of things that don't match, as
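
The arithmetic behind the complaint can be sketched in plain Java. The first method models a sum of clause scores scaled by a coord factor of matched/total clauses, as in the classic TF-IDF similarity of that era; the second models a disjunction-max combination (best clause plus a tie-breaker times the rest). The scores are invented and the names are hypothetical; this is an illustration, not Lucene's scoring code.

```java
// Sketch only: made-up scores, hypothetical names.
final class DisMaxSketch {

    /** Sum of clause scores scaled by matched / total clauses. */
    static double coordSum(double[] clauseScores) {
        double sum = 0;
        int matched = 0;
        for (double s : clauseScores) {
            if (s > 0) {
                sum += s;
                matched++;
            }
        }
        return clauseScores.length == 0 ? 0 : sum * matched / clauseScores.length;
    }

    /** Best clause score plus tieBreaker times the sum of the others. */
    static double disMax(double[] clauseScores, double tieBreaker) {
        double max = 0;
        double sum = 0;
        for (double s : clauseScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        // Four name variants; only one of them matches the document.
        double[] nameVariants = {0.0, 2.0, 0.0, 0.0};
        System.out.println(coordSum(nameVariants));    // dragged down by non-matches
        System.out.println(disMax(nameVariants, 0.0)); // unaffected by non-matches
    }
}
```

With one match out of four variants, the coord-scaled sum is a quarter of the matching clause's score, while the disjunction-max value stays at the full score.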

Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
On Thu, Apr 19, 2012 at 1:34 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 19, 2012 at 1:26 PM, Benson Margulies bimargul...@gmail.com wrote: I am trying to solve a problem using DisjunctionMaxQuery. Consider a query like: a:b OR c:d OR e:f OR ... name:richard OR name:dick

Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
Turning on disableCoord for a nested boolean query does not seem to change the overall maxCoord term as displayed in explain.

Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
On Thu, Apr 19, 2012 at 4:21 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 19, 2012 at 3:49 PM, Benson Margulies bimargul...@gmail.com wrote: On Thu, Apr 19, 2012 at 1:34 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 19, 2012 at 1:26 PM, Benson Margulies bimargul...@gmail.com

Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
On Thu, Apr 19, 2012 at 5:10 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 19, 2012 at 5:05 PM, Benson Margulies bimargul...@gmail.com wrote: On Thu, Apr 19, 2012 at 4:21 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 19, 2012 at 3:49 PM, Benson Margulies bimargul...@gmail.com

Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
I see why I'm so confused, but I think I need to construct a simpler test case. My top-level BooleanQuery, which has disableCoord=false, has 22 clauses. All but three are ordinary SHOULD TermQueries. The remainder are a spanNear and a nested BooleanQuery, and an empty PhraseQuery (that's a bug).

Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
to accomplish this? On Thu, Apr 19, 2012 at 7:37 PM, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 19, 2012 at 6:36 PM, Benson Margulies bimargul...@gmail.com wrote: I see why I'm so confused, but I think I need to construct a simpler test case. My top-level BooleanQuery, which has

Repeatability of results

2012-04-02 Thread Benson Margulies
We've observed something that, in some ways, is not surprising. If you take a set of documents that are close in 'score' to some query, and shuffle them in different orders and then see what results you get in what order from the reference query, the scores will vary according to the

Re: Repeatability of results

2012-04-02 Thread Benson Margulies
should not change as a function of insertion order... Well, I assumed that TF-IDF would wiggle. Do you have a small test case? Since this surprises you, I will build a test case. Mike McCandless http://blog.mikemccandless.com On Mon, Apr 2, 2012 at 5:28 PM, Benson Margulies bimargul

Re: Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Benson Margulies
fileformat.info On Mar 30, 2012, at 1:04 PM, Denis Brodeur denisbrod...@gmail.com wrote: Thanks Robert. That makes sense. Do you have a link handy where I can find this information? i.e. word boundary/punctuation for any unicode character set? On Fri, Mar 30, 2012 at 12:57 PM, Robert Muir

Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
I've posted a self-contained test case to github of a mystery. git://github.com/bimargulies/lucene-4-update-case.git The code can be seen at https://github.com/bimargulies/lucene-4-update-case/blob/master/src/test/java/org/apache/lucene/BadFieldTokenizedFlagTest.java. I write a doc to an index,

A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
Under LUCENE-1458, LUCENE-2111: Flexible Indexing, CHANGES.txt appears to be missing one critical hint. If you have existing code that called IndexReader.terms(), where do you start to get a FieldsEnum?

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
this instead of MultiFields.getFields(indexReader).iterator(); which I came up with by fishing around for myself? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Benson Margulies [mailto:bimargul

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
. Mike McCandless http://blog.mikemccandless.com On Tue, Mar 6, 2012 at 8:50 AM, Benson Margulies bimargul...@gmail.com wrote: Under LUCENE-1458, LUCENE-2111: Flexible Indexing, CHANGES.txt appears to be missing one critical hint. If you have existing code that called IndexReader.terms

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
Oh, I see, I didn't read far enough down. Well, the patch still repairs a bug in the code fragment relative to the Term enumeration.

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
Oh, ouch, there's no SegmentReader.getReader, I was reading IndexWriter. Sorry. On Tue, Mar 6, 2012 at 9:14 AM, Benson Margulies bimargul...@gmail.com wrote: On Tue, Mar 6, 2012 at 8:56 AM, Uwe Schindler u...@thetaphi.de wrote: AtomicReader.fields

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
etc that take Term don't analyze any text. Instead usually higher-level things like QueryParsers analyze text into Terms. On Tue, Mar 6, 2012 at 8:35 AM, Benson Margulies bimargul...@gmail.com wrote: I've posted a self-contained test case to github of a mystery. git://github.com

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies bimargul...@gmail.com wrote: On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir rcm...@gmail.com wrote: I think the issue is that your analyzer is standardanalyzer, yet field text value is value-1 Robert, Why is this field analyzed at all? It's

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
that MultiFields will be fine. --benson Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Benson Margulies [mailto:bimargul...@gmail.com] Sent: Tuesday, March 06, 2012 3:15 PM To: java-user

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 9:33 AM, Robert Muir rcm...@gmail.com wrote: On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies bimargul...@gmail.com wrote: On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir rcm...@gmail.com wrote: I think the issue is that your analyzer is standardanalyzer, yet field text

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
analyzing StringField when we shouldn't... Mike McCandless http://blog.mikemccandless.com On Tue, Mar 6, 2012 at 9:33 AM, Robert Muir rcm...@gmail.com wrote: On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies bimargul...@gmail.com wrote: On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir rcm...@gmail.com

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
it, otherwise it should be pkg-private. Oh! I'll rework the patch again, then. I might include some commentary in MultiFields at all. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Benson Margulies

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
for sneaking around this in the mean time? On Tue, Mar 6, 2012 at 9:58 AM, Benson Margulies bimargul...@gmail.com wrote: On Tue, Mar 6, 2012 at 9:47 AM, Uwe Schindler u...@thetaphi.de wrote: String field is analyzed, but with KeywordTokenizer, so all should be fine. I filed LUCENE-3854

What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Benson Margulies
Sorry, I'm coming up empty in Google here.

Re: What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Benson Margulies
To reduce noise slightly I'll stay on this thread. I'm looking at this file, and not seeing a pointer to what to do about QueryParser. Are jar file rearrangements supposed to be in that file? I think that I don't have the right jar yet; all I'm seeing is the 'surround' package.

Re: What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Benson Margulies
- o.a.l.queryparser.classic.TokenMgrError -Original Message- From: Benson Margulies [mailto:bimargul...@gmail.com] Sent: Monday, March 05, 2012 11:15 AM To: java-user@lucene.apache.org Subject: Re: What replaces IndexReader.openIfChanged in Lucene 4.0? To reduce noise slightly I'll stay on this thread. I'm

Updating a document.

2012-03-04 Thread Benson Margulies
I am walking down the documents in an index by number, and I find that I want to update one. The updateDocument API only works on queries and terms, not numbers. So I can call remove and add, but, then, what's the document's number after that? Or is that not a meaningful question until I make a
