Re: BooleanWeight.scorer() gives a TermScorer

2014-08-07 Thread Robert Muir
This can happen in some cases: for example, if you are doing a disjunction of foo and bar with the coordination factor disabled, and the segment has no postings for bar. In this case the optimum scorer to return is just a TermScorer for foo. On Thu, Aug 7, 2014 at 12:42 PM, Christian Reuschling
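For illustration, the situation described looks roughly like this (Lucene 4.x API; field and term names are made up):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// A two-term disjunction with the coordination factor disabled.
BooleanQuery query = new BooleanQuery(true);   // true = disable coord
query.add(new TermQuery(new Term("body", "foo")), Occur.SHOULD);
query.add(new TermQuery(new Term("body", "bar")), Occur.SHOULD);
// On a segment that has no postings for "bar", the Weight can hand back the
// plain TermScorer for "foo": with coord disabled the scores are identical.
```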

Re: stemming irregular plurals?

2014-07-29 Thread Robert Muir
You can put this thing before your stemmer, with a custom map of exceptions: http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/StemmerOverrideFilter.html On Tue, Jul 29, 2014 at 10:03 AM, Robert Nikander rob.nikan...@gmail.com wrote: Hi, I created
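A hedged sketch of that setup against the 4.x API (the tokenizer, stemmer and exception entries are made up; the Builder-based StemmerOverrideFilter constructor is assumed):

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilter;
import org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilter.StemmerOverrideMap;
import org.apache.lucene.util.Version;

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    try {
      // Custom map of exceptions: irregular plurals the stemmer would get wrong.
      StemmerOverrideFilter.Builder builder = new StemmerOverrideFilter.Builder(true); // ignore case
      builder.add("mice", "mouse");
      builder.add("geese", "goose");
      StemmerOverrideMap overrides = builder.build();

      Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_48, reader);
      // The override filter rewrites and keyword-marks matching tokens,
      // so the stemmer that follows leaves them alone.
      TokenStream sink = new StemmerOverrideFilter(source, overrides);
      sink = new PorterStemFilter(sink);
      return new TokenStreamComponents(source, sink);
    } catch (java.io.IOException e) {
      throw new RuntimeException(e);
    }
  }
};
```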

Re: Invalid fieldsStream maxPointer (file truncated?): maxPointer=2966205946, length=2966208512

2014-07-23 Thread Robert Muir
On Wed, Jul 23, 2014 at 6:03 AM, Harald Kirsch harald.kir...@raytion.com wrote: Hi, below is an exception I get from one Solr core. According to https://issues.apache.org/jira/browse/LUCENE-5617 the check that leads to the exception was introduced recently. Two things are worth mentioning:

Re: Invalid fieldsStream maxPointer (file truncated?): maxPointer=2966205946, length=2966208512

2014-07-23 Thread Robert Muir
On Wed, Jul 23, 2014 at 7:29 AM, Harald Kirsch harald.kir...@raytion.com wrote: (As a side note: after truncating the file to the expected size+16, at least the core starts up again. Have not tested anything else yet.) After applying your truncation fix, is it possible for you to run the

Re: Invalid fieldsStream maxPointer (file truncated?): maxPointer=2966205946, length=2966208512

2014-07-23 Thread Robert Muir
On Wed, Jul 23, 2014 at 7:29 AM, Harald Kirsch harald.kir...@raytion.com wrote: File system is xfs hosted on a corporate file share somewhere. Sorry, I forgot to ask: how do you access this? Is it mounted over NFS?

Re: Invalid fieldsStream maxPointer (file truncated?): maxPointer=2966205946, length=2966208512

2014-07-23 Thread Robert Muir
Hey, thank you for following up! On Wed, Jul 23, 2014 at 8:46 AM, Harald Kirsch harald.kir...@raytion.com wrote: On 23.07.2014 13:38, Robert Muir wrote: On Wed, Jul 23, 2014 at 7:29 AM, Harald Kirsch harald.kir...@raytion.com wrote: (As a side note: after truncating the file

Re: mmap confusion in lucene

2014-07-14 Thread Robert Muir
Your code isn't doing what you think it is doing. You need to ensure things aren't eliminated by the compiler. On Mon, Jul 14, 2014 at 5:57 AM, wangzhijiang999 wangzhijiang...@aliyun.com wrote: Hi everybody, I found a problem confused me when I tested the mmap feature in lucene. I

Re: SortedDocValuesField

2014-06-26 Thread Robert Muir
Don't use RAMDirectory: it's not very performant and really intended for e.g. testing and so on. Also, using a RAMDirectory here defeats the purpose: the idea behind using a DocValuesField in most cases is to keep (most of) such data structures out of heap memory. The data structures and even the
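A minimal sketch of the suggestion (Lucene 4.x API; the path and field name are made up): write the doc values into an on-disk directory so they stay out of the Java heap.

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

// FSDirectory (typically MMapDirectory under the hood) instead of RAMDirectory.
Directory dir = FSDirectory.open(new File("/path/to/index"));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
IndexWriter writer = new IndexWriter(dir, config);

Document doc = new Document();
doc.add(new SortedDocValuesField("category", new BytesRef("books")));
writer.addDocument(doc);
writer.close();
```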

[ANNOUNCE] Apache Lucene 4.9.0 released

2014-06-25 Thread Robert Muir
25 June 2014, Apache Lucene™ 4.9.0 available. The Lucene PMC is pleased to announce the release of Apache Lucene 4.9.0. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires

Re: Changing field lengthnorm to store length

2014-06-19 Thread Robert Muir
No they do not. The method is: public abstract long computeNorm(FieldInvertState state); On Thu, Jun 19, 2014 at 1:54 PM, Nalini Kartha nalinikar...@gmail.com wrote: Thanks for the info! We're more interested in changing the lengthnorm function vs using additional stats for scoring so
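One hedged way to plug a different length norm in under the Lucene 4.x API (a sketch, not code from the thread): DefaultSimilarity computes its norm roughly as boost / sqrt(length) inside lengthNorm(), so that is the hook to override. The value is still squeezed into a single byte by the encodeNormValue/decodeNormValue pair mentioned in this thread, so precision stays limited unless those are changed as well.

```java
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class RawLengthSimilarity extends DefaultSimilarity {
  @Override
  public float lengthNorm(FieldInvertState state) {
    // Store the raw token count instead of boost / sqrt(length).
    // Note: still byte-quantized by the default norm encoding.
    return state.getLength();
  }
}
```

The same instance must be set on both the IndexWriterConfig (so norms are written with it) and the IndexSearcher (so they are decoded consistently).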

Re: Changing field lengthnorm to store length

2014-06-19 Thread Robert Muir
methods on the TFIDFSimilarity class - public byte encodeNormValue(float f) public float decodeNormValue(byte b) On Thu, Jun 19, 2014 at 12:08 PM, Robert Muir rcm...@gmail.com wrote: No they do not. The method is: public abstract long computeNorm(FieldInvertState state); On Thu, Jun 19

Re: Indexing size increase 20% after switching from lucene 4.4 to 4.5 or 4.8 with BinaryDocValuesField

2014-06-17 Thread Robert Muir
Again, because merging is based on byte size, you have to be careful how you measure (hint: use LogDocMergePolicy). Otherwise you are comparing apples and oranges. Separately, your configuration is using experimental codecs like disk/memory which aren't as heavily benchmarked etc. as the default
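A minimal sketch of the measurement hint (Lucene 4.x API; the version constant and analyzer are illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogDocMergePolicy;
import org.apache.lucene.util.Version;

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
// Merge decisions based on document counts rather than byte sizes, so indexes
// built from the same documents on 4.4 and 4.8 end up with comparable segment
// geometry before their sizes are compared.
config.setMergePolicy(new LogDocMergePolicy());
```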

Re: Hunspell low level interface in Lucene 4.8

2014-06-16 Thread Robert Muir
PM, Robert Muir rcm...@gmail.com wrote: Can you just use the TokenStream API? That's the one we maintain and support... On Sat, Jun 14, 2014 at 10:42 AM, Michal Lopuszynski lop...@gmail.com wrote: Dear all, I am not much into searching, however, I used Lucene to do some text

Re: Hunspell low level interface in Lucene 4.8

2014-06-15 Thread Robert Muir
Can you just use the TokenStream API? That's the one we maintain and support... On Sat, Jun 14, 2014 at 10:42 AM, Michal Lopuszynski lop...@gmail.com wrote: Dear all, I am not much into searching, however, I used Lucene to do some text postprocessing (esp. stemming) using low level tools

Re: Indexing size increase 20% after switching from lucene 4.4 to 4.5 or 4.8 with BinaryDocValuesField

2014-06-14 Thread Robert Muir
They are still encoded the same way, so likely you aren't testing apples to apples (e.g. different number of segments or whatever). On Fri, Jun 13, 2014 at 8:28 PM, Zhao, Gang gz...@ea.com wrote: I used Lucene 4.4 to create an index for some documents. One of the indexing fields is

Re: search performance

2014-06-03 Thread Robert Muir
Check and make sure you are not opening an indexreader for every search. Be sure you don't do that. On Mon, Jun 2, 2014 at 2:51 AM, Jamie ja...@mailarchiva.com wrote: Greetings Despite following all the recommended optimizations (as described at
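A minimal sketch of the usual pattern (the `directory` and `query` variables are assumed to exist elsewhere): share one SearcherManager and acquire/release per query instead of opening a reader for every search.

```java
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TopDocs;

// Created once, shared by all search threads.
SearcherManager manager = new SearcherManager(directory, new SearcherFactory());

// Per query: borrow a searcher, fetch the top N, release it.
IndexSearcher searcher = manager.acquire();
try {
  TopDocs hits = searcher.search(query, 10);
} finally {
  manager.release(searcher);
}

// After index changes, refresh instead of reopening from scratch.
manager.maybeRefresh();
```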

Re: search performance

2014-06-03 Thread Robert Muir
No, you are incorrect. The point of a search engine is to return top-N most relevant. If you insist you need to open an indexreader on every single search, and then return huge amounts of docs, maybe you should use a database instead. On Tue, Jun 3, 2014 at 6:42 AM, Jamie ja...@mailarchiva.com

Re: search performance

2014-06-03 Thread Robert Muir
. Jamie On 2014/06/03, 1:17 PM, Robert Muir wrote: No, you are incorrect. The point of a search engine is to return top-N most relevant. If you insist you need to open an indexreader on every single search, and then return huge amounts of docs, maybe you should use a database instead

[ANNOUNCE] Apache Lucene 4.8.1 released

2014-05-20 Thread Robert Muir
May 2014, Apache Lucene™ 4.8.1 available. The Lucene PMC is pleased to announce the release of Apache Lucene 4.8.1. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text

Re: Merger performance degradation on 3.6.1

2014-05-16 Thread Robert Muir
addIndexes doesn't call maybeMerge, so I think you are just getting into a situation with too many segments, so applying deletes is slow. Can you try calling IndexWriter.maybeMerge() after you call addIndexes? (it won't have immediate impact; you have to do some merges to get your index healthy
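A minimal sketch of the suggestion against the 3.6-era API (directory and config names are made up):

```java
import org.apache.lucene.index.IndexWriter;

IndexWriter writer = new IndexWriter(targetDir, config);
try {
  // addIndexes copies the segments over but does not schedule merges itself.
  writer.addIndexes(sourceDir1, sourceDir2);
  // Ask the merge policy to start reducing the segment count; the benefit only
  // shows up once the resulting merges actually run.
  writer.maybeMerge();
  writer.commit();
} finally {
  writer.close();
}
```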

Re: ConcurrentModificationException in ICU analyzer

2014-05-15 Thread Robert Muir
This looks like a bug in ICU? I'll try to reproduce it. We are also a little out of date, maybe they've already fixed it. Thank you for reporting this. On Fri, May 9, 2014 at 12:14 PM, feedly team feedly...@gmail.com wrote: I am using the 4.7.0 ICU analyzer (via elastic search) and noticed this

Re: ConcurrentModificationException in ICU analyzer

2014-05-14 Thread Robert Muir
fyi: this bug was already found and fixed in ICU's trunk: http://bugs.icu-project.org/trac/ticket/10767 On Wed, May 14, 2014 at 4:32 AM, Robert Muir rcm...@gmail.com wrote: This looks like a bug in ICU? I'll try to reproduce it. We are also a little out of date, maybe they've already fixed

Re: ConcurrentModificationException in ICU analyzer

2014-05-14 Thread Robert Muir
I opened https://issues.apache.org/jira/browse/LUCENE-5671 for now, if you are able to use the latest release of ICU, it should prevent the bug. On Wed, May 14, 2014 at 11:47 AM, Robert Muir rcm...@gmail.com wrote: fyi: this bug was already found and fixed in ICU's trunk: http://bugs.icu

Re: No Compound Files

2014-04-29 Thread Robert Muir
I think NoMergePolicy.NO_COMPOUND_FILES and NoMergePolicy.COMPOUND_FILES should be removed, and replaced with NoMergePolicy.INSTANCE. If you want to change whether CFS is used by IndexWriter flush, you need to set that in IndexWriterConfig. On Tue, Apr 29, 2014 at 8:03 AM, Varun Thacker
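A hedged sketch of the distinction (4.x-era API assumed; NO_COMPOUND_FILES/COMPOUND_FILES still existed at the time and were later folded into NoMergePolicy.INSTANCE):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.util.Version;

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
// Merging never runs with NoMergePolicy, so its compound-file flag is moot.
config.setMergePolicy(NoMergePolicy.COMPOUND_FILES);   // NoMergePolicy.INSTANCE in later releases
// What matters for newly flushed segments is the IndexWriterConfig setting.
config.setUseCompoundFile(false);
```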

Re: No Compound Files

2014-04-29 Thread Robert Muir
On Tue, Apr 29, 2014 at 8:14 AM, Shai Erera ser...@gmail.com wrote: If we only offer NoMP.INSTANCE, what would it do w/ merged segments? always compound? always not-compound? It doesn't merge, though.

[ANNOUNCE] Apache Lucene 4.7.2 released.

2014-04-15 Thread Robert Muir
April 2014, Apache Lucene™ 4.7.2 available. The Lucene PMC is pleased to announce the release of Apache Lucene 4.7.2. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires

Re: Strange behavior of ShingleFilter in Lucene 4.6

2014-04-02 Thread Robert Muir
Did you really mean to shingle twice (shingleanalyzerwrapper just wraps the analyzer with a shinglefilter, then the code wraps that with another shinglefilter again) ? On Wed, Apr 2, 2014 at 1:42 PM, Natalia Connolly natalia.v.conno...@gmail.com wrote: Hello, I am very confused about what
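For reference, shingling once looks roughly like this (Lucene 4.6-era API; analyzer choice and shingle sizes are illustrative):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

Analyzer base = new StandardAnalyzer(Version.LUCENE_46);
// The wrapper already appends a ShingleFilter to the wrapped analyzer's chain,
// so there is no need to wrap its token stream in another ShingleFilter.
Analyzer shingler = new ShingleAnalyzerWrapper(base, 2, 3);   // bigrams and trigrams
```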

Re: Strange behavior of ShingleFilter in Lucene 4.6

2014-04-02 Thread Robert Muir
This, this is, is, is a , and so on. Is there another way I could do it? Thank you, Natalia On Wed, Apr 2, 2014 at 2:40 PM, Robert Muir rcm...@gmail.com wrote: Did you really mean to shingle twice (shingleanalyzerwrapper just wraps the analyzer with a shinglefilter, then the code wraps that with another

[ANNOUNCE] Apache Lucene 4.6.1 released

2014-01-28 Thread Robert Muir
January 2014, Apache Lucene™ 4.6.1 available. The Lucene PMC is pleased to announce the release of Apache Lucene 4.6.1. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires

Re: How is incrementToken supposed to detect the lack of reset()?

2014-01-07 Thread Robert Muir
See Tokenizer.java for the state machine logic. In general you should not have to do anything if the tokenizer is well-behaved (e.g. close calls super.close() and so on). On Tue, Jan 7, 2014 at 2:50 PM, Benson Margulies bimargul...@gmail.com wrote: In 4.6.0,

Re: How is incrementToken supposed to detect the lack of reset()?

2014-01-07 Thread Robert Muir
is: public MyTokenizer(Reader reader, ) { super(reader); myWrappedInputDevice = new MyWrappedInputDevice(this.input); } On Tue, Jan 7, 2014 at 2:59 PM, Robert Muir rcm...@gmail.com wrote: See Tokenizer.java for the state machine logic. In general you should not have

Re: Lucene42DocValuesProducer used in 4.5 and 4.6

2013-12-17 Thread Robert Muir
It is correct. The format of normalization factors has not changed since 4.2. On Tue, Dec 17, 2013 at 10:49 AM, Torben Greulich torben.greul...@s24.com wrote: Hi, we had an OOM error in Solr and were confused about one part of the stackTrace where Lucene42DocValuesProducer.ramBytesUsed is

Re: PostingsHighlighter/PassageFormatter has zero matches for some results

2013-10-15 Thread Robert Muir
the usual thing and performance is not a particular concern). Thanks, Jon On Mon, Oct 14, 2013 at 9:58 PM, Robert Muir rcm...@gmail.com wrote: are your documents large? try PostingsHighlighter(int) ctor with a larger value than DEFAULT_MAX_LENGTH. sounds like the passages you see with matches

Re: PostingsHighlighter/PassageFormatter has zero matches for some results

2013-10-15 Thread Robert Muir
offhand that it's silently enforcing a limit ... Mike McCandless http://blog.mikemccandless.com On Tue, Oct 15, 2013 at 9:31 AM, Robert Muir rcm...@gmail.com wrote: Thanks Jon. Ill add some stuff to the javadocs here to try to make it more obvious. On Tue, Oct 15, 2013 at 5:54 AM, Jon

Re: PostingsHighlighter/PassageFormatter has zero matches for some results

2013-10-15 Thread Robert Muir
On Tue, Oct 15, 2013 at 9:59 AM, Michael McCandless luc...@mikemccandless.com wrote: Well, unfortunately, this is a trap that users do hit. By requiring the user to think about the limit on creating PostingsHighlighter, he/she would think about it and realize they are in fact setting a limit.

Re: PostingsHighlighter/PassageFormatter has zero matches for some results

2013-10-15 Thread Robert Muir
On Tue, Oct 15, 2013 at 10:57 AM, Michael McCandless luc...@mikemccandless.com wrote: On Tue, Oct 15, 2013 at 10:11 AM, Robert Muir rcm...@gmail.com wrote: On Tue, Oct 15, 2013 at 9:59 AM, Michael McCandless luc...@mikemccandless.com wrote: Well, unfortunately, this is a trap that users do hit

Re: PostingsHighlighter/PassageFormatter has zero matches for some results

2013-10-14 Thread Robert Muir
did you try the latest release? There are some bugs fixed... On Mon, Oct 14, 2013 at 2:11 PM, Jon Stewart j...@lightboxtechnologies.com wrote: Hello, I've observed that when using PostingsHighlighter in Lucene 4.4 that some of the responsive documents in TopDocs will have zero matches in the

Re: PostingsHighlighter/PassageFormatter has zero matches for some results

2013-10-14 Thread Robert Muir
is greater than zero. Jon On Mon, Oct 14, 2013 at 5:24 PM, Robert Muir rcm...@gmail.com wrote: did you try the latest release? There are some bugs fixed... On Mon, Oct 14, 2013 at 2:11 PM, Jon Stewart j...@lightboxtechnologies.com wrote: Hello, I've observed that when using PostingsHighlighter

Re: org.apache.lucene.analysis.icu.ICUNormalizer2Filter -- why Token?

2013-09-16 Thread Robert Muir
Mostly because our tokenizers like StandardTokenizer will tokenize the same way regardless of normalization form or whether it's normalized at all? But for other tokenizers, such a CharFilter should be useful: there is a JIRA for it, but it has some unresolved issues

Re: org.apache.lucene.analysis.icu.ICUNormalizer2Filter -- why Token?

2013-09-16 Thread Robert Muir
That would be great! On Mon, Sep 16, 2013 at 1:41 PM, Benson Margulies ben...@basistech.com wrote: Thanks, I might pitch in. On Mon, Sep 16, 2013 at 12:58 PM, Robert Muir rcm...@gmail.com wrote: Mostly because our tokenizers like StandardTokenizer will tokenize the same way regardless

Re: PositionLengthAttribute

2013-09-07 Thread Robert Muir
On Sat, Sep 7, 2013 at 7:44 AM, Benson Margulies ben...@basistech.com wrote: In Japanese, compounds are just decompositions of the input string. In other languages, compounds can manufacture entire tokens from thin air. In those cases, it's something of a question how to decide on the offsets.

Re: PositionLengthAttribute

2013-09-06 Thread Robert Muir
On Fri, Sep 6, 2013 at 8:03 PM, Benson Margulies ben...@basistech.com wrote: I'm confused by the comment about compound components here. If a single token fissions into multiple tokens, then what belongs in the PositionLengthAttribute. I'm wanting to store a fraction in here! Or is the idea

Re: PositionLengthAttribute

2013-09-06 Thread Robert Muir
On Fri, Sep 6, 2013 at 9:32 PM, Benson Margulies ben...@basistech.com wrote: On Fri, Sep 6, 2013 at 9:28 PM, Robert Muir rcm...@gmail.com wrote: its the latter. the way its designed to work i think is illustrated best in kuromoji analyzer where it heuristically decompounds nouns

Re: Lucene index customization

2013-08-24 Thread Robert Muir
FieldType myType = new FieldType(TextField.TYPE_NOT_STORED); myType.setIndexOptions(IndexOptions.DOCS_ONLY); document.add(new Field(title, some title, myType)); document.add(new Field(body, some contents, myType)); ... On Sat, Aug 24, 2013 at 3:27 AM, Airway Wong airwayw...@gmail.com wrote: Hi,
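Expanded into a compilable form (the string literals, imports and freeze() call are added here for illustration; the 4.x FieldInfo.IndexOptions enum is assumed):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.FieldInfo.IndexOptions;

// Index only document IDs: no term frequencies, positions or offsets.
FieldType myType = new FieldType(TextField.TYPE_NOT_STORED);
myType.setIndexOptions(IndexOptions.DOCS_ONLY);
myType.freeze();

Document document = new Document();
document.add(new Field("title", "some title", myType));
document.add(new Field("body", "some contents", myType));
```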

Re: problem found with DiskDocValuesFormat

2013-08-22 Thread Robert Muir
On Thu, Aug 22, 2013 at 1:48 AM, Sean Bridges sean.brid...@gmail.com wrote: Is there a supported DocValuesFormat that doesn't load all the values into ram? Not with any current release, but in lucene 4.5 if all goes well, the official implementation will work that way (I spent essentially the

Re: problem found with DiskDocValuesFormat

2013-08-21 Thread Robert Muir
On Wed, Aug 21, 2013 at 11:30 AM, Sean Bridges sean.brid...@gmail.com wrote: What is the recommended way to use DiskDocValuesFormat in production if we can't reindex when we upgrade? I'm not going to recommend using any experimental codecs in production, but... 1. with 4.3 jar file:

Re: How to retrieve value of NumericDocValuesField in similarity

2013-08-12 Thread Robert Muir
There is a unit test demonstrating this at a very basic level here: http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/lucene/core/src/test/org/apache/lucene/search/TestDocValuesScoring.java On Mon, Aug 12, 2013 at 10:43 AM, Ross Woolf r...@rosswoolf.com wrote: The JavaDocs for
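The linked test shows the full wiring through a Similarity; as a smaller sketch (field name hypothetical), the per-segment lookup itself is just:

```java
import java.io.IOException;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.NumericDocValues;

// Reads a per-document long from a NumericDocValuesField within one segment.
static long featureValue(AtomicReader reader, String field, int docID) throws IOException {
  NumericDocValues values = reader.getNumericDocValues(field);
  return values == null ? 0L : values.get(docID);
}
```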

Re: How to retrieve value of NumericDocValuesField in similarity

2013-08-12 Thread Robert Muir
, 2013 at 8:54 AM, Robert Muir rcm...@gmail.com wrote: There is a unit test demonstrating this at a very basic level here: http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/lucene/core/src/test/org/apache/lucene/search/TestDocValuesScoring.java On Mon, Aug 12, 2013 at 10:43 AM, Ross

Re: How to retrieve value of NumericDocValuesField in similarity

2013-08-12 Thread Robert Muir
On Mon, Aug 12, 2013 at 8:48 AM, Ross Woolf r...@rosswoolf.com wrote: Okay, just for clarity sake, what you are saying is that if I make the FieldCache call it won't actually create and impose the loading time of the FieldCache, but rather just use the NumericDocValuesField instead. Is this

Re: How to retrieve value of NumericDocValuesField in similarity

2013-08-12 Thread Robert Muir
On Mon, Aug 12, 2013 at 11:06 AM, Shai Erera ser...@gmail.com wrote: Or, you'd like to keep FieldCache API for sort of back-compat with existing features, and let the app control the caching by using an explicit RamDVFormat? Yes. In the future ideally fieldcache goes away and is a

Re: WeakIdentityMap high memory usage

2013-08-10 Thread Robert Muir
On Thu, Aug 8, 2013 at 11:31 AM, Michael McCandless luc...@mikemccandless.com wrote: A number of users have complained about the apparent RAM usage of WeakIdentityMap, and it adds complexity to ByteBufferIndexInput to do this tracking ... I think defaulting the unmap hack to off is best for

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-08-09 Thread Robert Muir
, Robert Muir rcm...@gmail.com wrote: On Thu, Aug 8, 2013 at 11:18 AM, Tom Burton-West tburt...@umich.edu wrote: Sure I should be able to build a lucene core and give it a try. I probably won't run it until tomorrow night though because right now I'm running some other tests on the machine

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-08-08 Thread Robert Muir
Hi Tom, I committed a fix for the root cause (https://issues.apache.org/jira/browse/LUCENE-5156). Thanks for reporting this! I don't know if it's feasible for you to build a lucene-core.jar from branch_4x and run CheckIndex with that jar file to confirm it really addresses the issue: if this is

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-08-08 Thread Robert Muir
On Thu, Aug 8, 2013 at 11:18 AM, Tom Burton-West tburt...@umich.edu wrote: Sure I should be able to build a lucene core and give it a try. I probably won't run it until tomorrow night though because right now I'm running some other tests on the machine I would run CheckIndex from and disk I/O

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-08-02 Thread Robert Muir
Thanks, this is what I expected. I opened an issue to remove seek by Ord from this vectors format. On Aug 2, 2013 2:13 PM, Tom Burton-West tburt...@umich.edu wrote: Thanks Robert, Looks like it switches between seekCeil and seekExact: main prio=10 tid=0x0e79a000 nid=0x5fe5 runnable

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-08-01 Thread Robert Muir
On Thu, Aug 1, 2013 at 6:40 PM, Tom Burton-West tburt...@umich.edu wrote: Hi all, OK, I really should have titled the post, CheckIndex limit with large tvd files? I started a new CheckIndex run about 1:00 pm on Tuesday and it seems to be stuck again looking at termvectors. I gave

Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?

2013-07-30 Thread Robert Muir
On Tue, Jul 30, 2013 at 8:41 AM, Michael McCandless luc...@mikemccandless.com wrote: I think that's ~ 110 billion, not trillion, tokens :) Are you certain you don't have any term vectors? Even if your index has no term vectors, CheckIndex goes through all docIDs trying to load them, but

Re: A SPI class of type org.apache.lucene.codecs.Codec with name 'Lucene42' does not exist

2013-07-13 Thread Robert Muir
Open a bug with the Android team... the problem is Android isn't Java (and doesn't implement/follow the spec) On Sat, Jul 13, 2013 at 4:31 AM, VIGNESH S vigneshkln...@gmail.com wrote: Hi, I did not strip META-INF/services and it contains the files. Even when I combined with other jars, I

Re: A SPI class of type org.apache.lucene.codecs.Codec with name 'Lucene42' does not exist

2013-07-13 Thread Robert Muir
:51 PM, VIGNESH S vigneshkln...@gmail.com wrote: Hi Robert, Thanks for your reply. If possible,can you please explain why this new class loading mechanism was introduced in Lucene 4 Thanks and Regards Vignesh On Sat, Jul 13, 2013 at 6:56 PM, Robert Muir rcm...@gmail.com

Re: build of trunk hangs

2013-06-22 Thread Robert Muir
I don't see anything abnormal. This is what happens when it downloads dependencies. The replicator module must pull in 2MB of jars from various places. If you are impatient during this process and press ^C, then that really only makes matters worse as it then leaves a .lck file in your ivy cache, and future

Re: TestGrouping.Java seems to combine multiple tests into one huge test

2013-06-18 Thread Robert Muir
On Tue, Jun 18, 2013 at 9:48 AM, Tom Burton-West tburt...@umich.edu wrote: Hello, I'm trying to understand BlockGroupingCollector. I thought I would start by running the tests in the debugger. However the only test I can find is

Re: FieldCache DocValues Filter

2013-06-06 Thread Robert Muir
FieldCacheTermsFilter will use your docvalues field. It's confusing: I think we should rename FieldCacheXXX to DocValuesXXX. On Thu, Jun 6, 2013 at 2:22 AM, Arun Kumar K arunk...@gmail.com wrote: Hi Guys, I was trying to better the filtering mechanism for my use case. When I use the existing
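A minimal sketch (field and term values are made up; `searcher` is an existing IndexSearcher): the filter reads through the FieldCache API, which is backed by the SortedDocValues field when one exists.

```java
import org.apache.lucene.search.FieldCacheTermsFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;

// Matches documents whose "category" value is one of the listed terms.
Filter categoryFilter = new FieldCacheTermsFilter("category", "books", "music");
TopDocs hits = searcher.search(new MatchAllDocsQuery(), categoryFilter, 10);
```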

Re: FieldCache DocValues Filter

2013-06-06 Thread Robert Muir
the filter will create a DocValues for that field using FieldCache. Arun On Thu, Jun 6, 2013 at 3:49 PM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Jun 6, 2013 at 5:35 AM, Robert Muir rcm...@gmail.com wrote: Its confusing: I think we should rename FieldCacheXXX

Re: SloppyPhraseScorer behavior

2013-04-19 Thread Robert Muir
It's a bug, and it's already fixed for 4.3 (coming soon): https://issues.apache.org/jira/browse/LUCENE-4888 On Fri, Apr 19, 2013 at 1:09 PM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: When writing a custom codec, I encountered an issue in SloppyPhraseScorer. I am using

Re: DiskDocValuesFormat

2013-04-14 Thread Robert Muir
Your stack trace is incomplete: it doesn't even show where the OOM occurred. On Sun, Apr 14, 2013 at 7:48 PM, Wei Wang welshw...@gmail.com wrote: Unfortunately, I got another problem. My index has 9 segments (9 dvdd files) with total size is about 22GB. The merging step eventually failed and

Re: Forcemerge running out of memory

2013-04-11 Thread Robert Muir
Merging BinaryDocValues doesn't use any RAM; it streams the values from the segments it's merging directly to the newly written segment. So if you have this problem, it's unrelated to merging: it means you don't have enough RAM to support all the stuff you are putting in these BinaryDocValues

Re: DocValues space usage

2013-04-09 Thread Robert Muir
On Tue, Apr 9, 2013 at 8:22 AM, Wei Wang welshw...@gmail.com wrote: DocValues makes fast per doc value lookup possible, which is nice. But it brings other interesting issues. Assume there are 100M docs and 200 NumericDocValuesFields, this ends up with huge number of disk and memory usage,

Re: DocValues space usage

2013-04-09 Thread Robert Muir
On Tue, Apr 9, 2013 at 9:06 AM, Wei Wang welshw...@gmail.com wrote: Thanks for the hint. Could you point to some Codec that might do this for some types, even just as an side effect as you mentioned? It will be helpful to have something to start with. Have a look at diskdv/ codec in the

Re: Migrating SnowballAnalyzer to 4.1

2013-03-16 Thread Robert Muir
On Sat, Mar 16, 2013 at 12:57 AM, Steve Rowe sar...@gmail.com wrote: Thanks for the explanation. I ran a lucene/benchmark alg comparing the two stemmers on trunk on my Macbook Pro with Oracle Java 1.7.0_13, and it looks like the situation hasn't changed much. The original-algorithm

Re: Migrating SnowballAnalyzer to 4.1

2013-03-15 Thread Robert Muir
Porter says the Porter2 stemmer is better[1]. Robert Muir (who wrote EnglishAnalyzer), if you're reading, what do you think? This was intentional actually. The default was a tradeoff of benefits (which affect less than 5% of english vocabulary, if you read around the snowball site), versus a much

Re: Lucene version naming of index files

2013-03-14 Thread Robert Muir
On Thu, Mar 14, 2013 at 7:22 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: Hi list, a stupid question about the naming of the index files. While using lucene (and solr) 4.2 I still see files with Lucene41 in the name. This is somewhat confusing if lucene 4.x produces files with

[ANNOUNCE] Apache Lucene 4.2 released

2013-03-11 Thread Robert Muir
March 2013, Apache Lucene™ 4.2 available. The Lucene PMC is pleased to announce the release of Apache Lucene 4.2. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text

Re: DiskDocValues vs Lucene42Codec

2013-03-08 Thread Robert Muir
The underlying data formats are different. For example, because Lucene42Codec will load terms into RAM, it uses an FST. But DiskDV uses a simpler storage format for the terms that's more suitable for being disk-resident. There are also different compression block sizes and so on in use. You can
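For context, a hedged sketch of how a single field can be switched to the disk-resident format while the rest keeps the Lucene42 default (4.2-era codecs module assumed; the field name is made up, and `config` is an IndexWriterConfig created elsewhere):

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.DocValuesFormat;
import org.apache.lucene.codecs.diskdv.DiskDocValuesFormat;
import org.apache.lucene.codecs.lucene42.Lucene42Codec;

Codec codec = new Lucene42Codec() {
  @Override
  public DocValuesFormat getDocValuesFormatForField(String field) {
    // Keep one heavy field disk-resident; everything else uses the RAM-resident default.
    if ("hugeBinaryField".equals(field)) {
      return new DiskDocValuesFormat();
    }
    return super.getDocValuesFormatForField(field);
  }
};
config.setCodec(codec);
```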

Re: FST-based suggesters: recent changes, binary compatibility of automata

2013-03-02 Thread Robert Muir
On Fri, Mar 1, 2013 at 11:16 AM, Oliver Christ ochr...@ebscohost.com wrote: I've seen some changes in trunk regarding the data format of Lucene's FST-based suggesters, and wonder whether the automata created by trunk builds/next Lucene version are/will be binary-compatible to the ones created

Re: Setting Similarity classes in Benchmark .alg scripts

2013-02-06 Thread Robert Muir
Just to be sure what you are trying to do: A) compare the relevance of different similarities? this is something the benchmark.quality package (actually pretty much unrelated from the rest of the benchmark package!) does, if you have some e.g. TREC collection or whatever to test with. B) compare

Re: CompressingStoredFieldsFormat doesn't show improvement

2013-01-31 Thread Robert Muir
The top method here is your random string generation. are you indexing random data? On Thu, Jan 31, 2013 at 12:46 AM, arun k arunk...@gmail.com wrote: Hi, Please find the snapshots here. http://picpaste.com/Lucene3.0.2-G00Z5FfX.png http://picpaste.com/Lucene4.1-LsxpcQk0.png Arun On

Re: Chinese analyzer

2013-01-24 Thread Robert Muir
On Thu, Jan 24, 2013 at 9:25 AM, Jerome Lanneluc jerome_lanne...@fr.ibm.com wrote: Note the 2 tokens in the second sample when I would expect to have only one token with the (55401 57046) characters. I could not figure out if I'm doing something wrong, or if this is a bug in the Chinese

Re: Chinese analyzer

2013-01-24 Thread Robert Muir
On Thu, Jan 24, 2013 at 10:53 AM, Jerome Lanneluc jerome_lanne...@fr.ibm.com wrote: It looks like my attachment was lost. It referred to org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer. I think this analyzer will not properly tokenize text outside of the BMP: it pretty much only works

Re: Document term vectors in Lucene 4

2013-01-17 Thread Robert Muir
Which statistics in particular (which methods)? On Thu, Jan 17, 2013 at 5:10 AM, Jon Stewart j...@lightboxtechnologies.com wrote: Thanks very much for your reply, Ian. I am using SlowCompositeReaderWrapper because I am also retrieving the term frequency statistics for the corpus (at the end

Re: potential memory leak when using RAMDirectory ,CloseableThreadLocal and a thread pool .

2013-01-03 Thread Robert Muir
On Thu, Jan 3, 2013 at 12:16 PM, Alon Muchnick a...@datonics.com wrote: value org.apache.lucene.index.TermInfosReader$ThreadResources --- termInfoCache |org.apache.lucene.util.cache.SimpleLRUCache termEnum |org.apache.lucene.index.SegmentTermEnum You aren't using lucene 3.6.2 if

Re: Retrieving granular scores back from Lucene/SOLR

2012-12-26 Thread Robert Muir
On Tue, Dec 25, 2012 at 11:30 PM, Vishwas Goel vishw...@gmail.com wrote: Hi, I am looking to get a bit more information back from SOLR/Lucene about the query/document pair scores. This would include field level scores, overall text relevance score, Boost value, BF value etc. Use

Re: Retrieving granular scores back from Lucene/SOLR

2012-12-26 Thread Robert Muir
On Wed, Dec 26, 2012 at 6:15 AM, Vishwas Goel vishw...@gmail.com wrote: Use Scorer.getChildren()/freq()/getWeight() in your collector you can walk the scorer hierarchy, associate scorers with specific terms and queries, and determine which scorers matched which documents and with what

[ANNOUNCE] Apache Lucene 3.6.2 released

2012-12-25 Thread Robert Muir
25 December 2012, Apache Lucene™ 3.6.2 available. The Lucene PMC and Santa Claus are pleased to announce the release of Apache Lucene 3.6.2. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any

Re: Lucene (4.0), junit, failed to delete _0_nrm.cfs

2012-12-09 Thread Robert Muir
Maybe get lucene-test-framework.jar, extend LuceneTestCase, and use newDirectory and so on. If you have files still open, this will fail the test and give you a stacktrace of where you initially opened the file. On Sun, Dec 9, 2012 at 12:28 PM, Clemens Wyss DEV clemens...@mysign.ch wrote: Hi
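A minimal sketch of the suggested setup, assuming the 4.x lucene-test-framework helpers (the test body itself is made up):

```java
import org.apache.lucene.analysis.MockAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.LuceneTestCase;

public class MyIndexingTest extends LuceneTestCase {
  public void testNothingLeftOpen() throws Exception {
    Directory dir = newDirectory();   // wraps the directory with leak tracking
    IndexWriter writer = new IndexWriter(dir,
        newIndexWriterConfig(TEST_VERSION_CURRENT, new MockAnalyzer(random())));
    writer.addDocument(new Document());
    writer.close();
    // If any file handle is still open, closing the tracked directory fails the
    // test with a stack trace showing where the file was originally opened.
    dir.close();
  }
}
```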

Re: CheckIndex ArrayIndexOutOfBounds error for merged index

2012-12-05 Thread Robert Muir
On Wed, Dec 5, 2012 at 1:30 PM, Tom Burton-West tburt...@umich.edu wrote: java.version=1.6.0_16 Tom, can you use a newer Java version for this? That's pretty old, and seeing such a crazy field number worries me that it's some JVM bug. You could even try to run the CheckIndex itself with a newer

Re: CheckIndex ArrayIndexOutOfBounds error for merged index

2012-12-05 Thread Robert Muir
I'm particularly thinking it's something like http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5091921 We tried to add workarounds to Lucene to dodge problems from this, but really a newer unaffected version would be safer. On Wed, Dec 5, 2012 at 1:47 PM, Robert Muir rcm...@gmail.com wrote

Re: CheckIndex ArrayIndexOutOfBounds error for merged index

2012-12-05 Thread Robert Muir
On Wed, Dec 5, 2012 at 2:27 PM, Tom Burton-West tburt...@umich.edu wrote: Thanks Robert, I've asked our sysadmins to install a more recent Java version for testing. I'll report back if it fails with the newer Java version. Please let us know either way!

Re: Does anyone have tips on managing cached filters?

2012-11-27 Thread Robert Muir
On Tue, Nov 27, 2012 at 6:17 AM, Trejkaz trej...@trypticon.org wrote: Ah, yeah... I should have been clearer on what I meant there. If you want to make a filter which relies on data that isn't in the index, there is no mechanism for invalidation. One example of it is if you have a filter

Re: Does anyone have tips on managing cached filters?

2012-11-27 Thread Robert Muir
On Wed, Nov 28, 2012 at 12:27 AM, Trejkaz trej...@trypticon.org wrote: On Wed, Nov 28, 2012 at 2:09 AM, Robert Muir rcm...@gmail.com wrote: I don't understand how a filter could become invalid even though the reader has not changed. I did state two ways in my last email, but just to re

Re: Does anyone have tips on managing cached filters?

2012-11-26 Thread Robert Muir
On Thu, Nov 22, 2012 at 11:10 PM, Trejkaz trej...@trypticon.org wrote: As for actually doing the invalidation, CachingWrapperFilter itself doesn't appear to have any mechanism for invalidation at all, so I imagine I will be building a variation of it with additional methods to invalidate

Re: TokenStreamComponents in Lucene 4.0

2012-11-20 Thread Robert Muir
On Tue, Nov 20, 2012 at 6:26 AM, Carsten Schnober schno...@ids-mannheim.dewrote: Thanks, Uwe! I think what changed in comparison to Lucene 3.6 is that reset() is called upon initialization, too, instead of after processing the first document only, right? There is no such change: this step

Re: Performance of IndexSearcher.explain(Query)

2012-11-20 Thread Robert Muir
On Tue, Nov 20, 2012 at 6:18 PM, Trejkaz trej...@trypticon.org wrote: I have a feature I wanted to implement which required a quick way to check whether an individual document matched a query or not. IndexSearcher.explain seemed to be a good fit for this. The query I tested was just a

Re: Superset Similarity?

2012-11-16 Thread Robert Muir
On Fri, Nov 16, 2012 at 5:18 PM, Tom Burton-West tburt...@umich.edu wrote: Hi Otis, I hope this is not off-topic, Apparently in Lucene similarity does not have to be set at index time: Actually in the general case it does. IndexWriter calls the Similarity's computeNorm method at

Re: com.sun.jdi.InvocationException occurred invoking method

2012-11-14 Thread Robert Muir
On Wed, Nov 14, 2012 at 4:04 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: Hi list, while walking through the code with debugger (eclipse juno) I get the following: com.sun.jdi.InvocationException occurred invoking method. This is while trying to see

Re: com.sun.jdi.InvocationException occurred invoking method

2012-11-14 Thread Robert Muir
On Wed, Nov 14, 2012 at 5:38 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: AFAIK eclipse is just an ide and using the java debugger, so this is then a java debugger problem? http://stackoverflow.com/questions/4123628/com-sun-jdi-invocationexception-occurred-invoking-method I have

Re: CJKWidthFilter vs ICUFoldingFilter

2012-11-14 Thread Robert Muir
On Wed, Nov 14, 2012 at 9:47 AM, Scott Smith ssm...@mainstreamdata.com wrote: Reading the documentation for these two filters seems to imply that CJKWidthFilter is a subset of ICUFoldingFilter. Is that true? I'm basically using the CjkAnalyzer (from Lucene 4.0) but adding ICUFoldingFilter

Re: content disappears in the index

2012-11-13 Thread Robert Muir
On Mon, Nov 12, 2012 at 10:47 PM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: By the way, why does TrimFilter option updateOffset defaults to false, just keep it backwards compatible? In my opinion this option should be removed. TokenFilters shouldn't muck with offsets, for a lot of

Re: questions on PerFieldSimilarityWrapper

2012-11-07 Thread Robert Muir
coord() and queryNorm() work on the query as a whole, which may span multiple fields. On Wed, Nov 7, 2012 at 5:23 PM, Joel Barry jmb...@gmail.com wrote: Hi folks, I have a question on PerFieldSimilarityWrapper. It seems that it is not possible to get per-field behavior on queryNorm() and

Re: Using DocValues with CollationKeyAnalyzer

2012-11-06 Thread Robert Muir
Hi Christoph: in my opinion, (ICU)Collation should actually be implemented as DocValues just as you propose: e.g. we'd deprecate the Analyzer and just offer a (ICU)CollationFields that provide an easy way to do this, so you would just add one of these to your Lucene Document. I started a
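A hedged sketch of the proposed direction using today's pieces (locale, field name and value are made up): compute the ICU sort key yourself and store it as doc values, with no analyzer involved.

```java
import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.util.BytesRef;

Collator collator = Collator.getInstance(new ULocale("de"));
byte[] sortKey = collator.getCollationKey("Müller").toByteArray();

Document doc = new Document();
// The binary sort key goes straight into doc values; sorting on "name_sort"
// then follows German collation order.
doc.add(new SortedDocValuesField("name_sort", new BytesRef(sortKey)));
```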

Re: using CharFilter to inject a space

2012-11-03 Thread Robert Muir
On Sat, Nov 3, 2012 at 7:35 PM, Igal @ getRailo.org i...@getrailo.org wrote: hi, I want to make sure that every comma (,) and semi-colon (;) is followed by a space prior to tokenizing. the idea is to then use a WhitespaceTokenizer which will keep commas but still split the phrase in a case
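One hedged way to realize this (an assumption about the approach, not the poster's code): put a MappingCharFilter in front of the WhitespaceTokenizer; the mappings below are illustrative.

```java
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
builder.add(",", ", ");   // every comma is followed by a space
builder.add(";", "; ");   // every semicolon is followed by a space
NormalizeCharMap map = builder.build();

Reader input = new StringReader("one,two;three");
Reader corrected = new MappingCharFilter(map, input);
// The tokenizer now sees "one, two; three" and splits on whitespace,
// keeping the punctuation attached to the preceding token.
Tokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_40, corrected);
```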
