SpanNearQuery and matching spans inside the first span

2011-12-05 Thread Trejkaz
Supposing I have a document with just "hi there" as the text. If I do a span query like this: near(near(term('hi'), term('there'), slop=0, forwards), term('hi'), slop=1, any-direction) that returns no hits. However, if I do a span query like this: near(near(term('hi'), term('there'), s
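
A minimal sketch of the nested span query being described, using the SpanTermQuery/SpanNearQuery API (the field name "text" is an assumption):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class NestedSpanExample {
        public static SpanQuery build() {
            // Inner span: "hi" immediately followed by "there" (slop 0, in order).
            SpanQuery hi = new SpanTermQuery(new Term("text", "hi"));
            SpanQuery there = new SpanTermQuery(new Term("text", "there"));
            SpanQuery inner = new SpanNearQuery(new SpanQuery[] {hi, there}, 0, true);

            // Outer span: the inner match within slop 1 of "hi" again, any order.
            SpanQuery hiAgain = new SpanTermQuery(new Term("text", "hi"));
            return new SpanNearQuery(new SpanQuery[] {inner, hiAgain}, 1, false);
        }
    }

The question in the thread is why the outer query matches nothing on "hi there" even though the inner span clearly does.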

Re: Query that returns all docs that contain a field

2011-12-19 Thread Trejkaz
On Mon, Dec 19, 2011 at 9:05 PM, Paul Taylor wrote: > I was looking for a Query that returns all documents that contain a > particular field, it doesn't matter what the value of the field is, just that > the document contains the field. If you don't care about performance (or if it runs fast enough
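
A sketch of the low-tech query hinted at here (Lucene 3.x API; "myField" is a placeholder). An open-ended term range matches every document that has at least one indexed term in the field, which is exactly "contains the field", at the cost of enumerating all of its terms:

    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermRangeQuery;

    public class HasFieldQuery {
        public static Query build() {
            // Equivalent to the query syntax myField:[* TO *].
            return new TermRangeQuery("myField", null, null, true, true);
        }
    }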

Remoting Lucene

2012-01-09 Thread Trejkaz
Hi all. I want to access a Lucene index remotely. I'm aware of a couple of options for it which seem to operate more or less at the IndexSearcher level - send a query, get back results. But in our case, we use IndexReader directly for building statistics, which is too slow to do via individual q

Re: comparing index fields within a query

2012-01-23 Thread Trejkaz
On Mon, Jan 23, 2012 at 11:31 PM, Jamie wrote: > Ian > > Thanks. I'll have to read up about it. I have a lot of comparisons to make, > so cannot precompute the values. How many is a lot? If it were 100 or so I would still be tempted to do all 4,950 comparisons and find some sensible way to store

Re: NGraming document for similar documents matching

2012-01-26 Thread Trejkaz
On Fri, Jan 27, 2012 at 10:41 AM, Saurabh Gokhale wrote: > I wanted to check if Ngraming the document contents (space is not the > issue) would do any good for better matching? Currently I see Ngram is > mostly used for auto complete or spell checking but is this useful for > similarity search? I

Lucene appears to use memory maps after unmapping them

2012-01-31 Thread Trejkaz
Hi all. I've found a rather frustrating issue which I can't seem to get to the bottom of. Our application will crash with an access violation around the time when the index is closed, with various indications of what's on the stack, but the common things being SegmentTermEnum.next and MMapIndexIn

Re: Lucene appears to use memory maps after unmapping them

2012-01-31 Thread Trejkaz
On Wed, Feb 1, 2012 at 11:30 AM, Robert Muir wrote: > the problem is caused by searching indexreaders after you closed them. > > in general we can try to add more and more safety, but at the end of the day, > if you close an indexreader while a search is running, you will have problems. > > So be

Re: Lucene appears to use memory maps after unmapping them

2012-01-31 Thread Trejkaz
On Wed, Feb 1, 2012 at 1:14 PM, Robert Muir wrote: > > No, I don't think you should use close at all, because your problem is > you are calling close() when its unsafe to do so (you still have other > threads that try to search the reader after you closed it). > > Instead of trying to fix the bugs
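
One common way to make "don't close while searching" concrete is reference counting: every search pins the reader, and a close() elsewhere only drops its own reference. A sketch (not necessarily what the thread settled on):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.TopDocs;

    public class PinnedSearch {
        public static TopDocs search(IndexReader reader) throws IOException {
            reader.incRef();        // pin: a concurrent close() only decrements the count
            try {
                IndexSearcher searcher = new IndexSearcher(reader);
                return searcher.search(new MatchAllDocsQuery(), 10);
            } finally {
                reader.decRef();    // the final decRef is what actually releases the buffers
            }
        }
    }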

Merging results from two searches on two separate Searchers

2012-02-14 Thread Trejkaz
Hi all. We have 1..N indexes for each time someone adds some data. Each time they can choose different tokenisation settings. Because of this, each text index has its own query parser instance. Because each query parser could generate a different Query (though I guess whether they do or not is ano

Re: Merging results from two searches on two separate Searchers

2012-02-14 Thread Trejkaz
On Wed, Feb 15, 2012 at 11:46 AM, Uwe Schindler wrote: > Scores are only compatible if the query is the same, which is not the case > for you. > So you cannot merge hits from different queries. So I guess in the case where the different query parsers happen to generate the same query, it's safe

Re: Hanging with fixed thread pool in the IndexSearcher multithread code

2012-02-19 Thread Trejkaz
On Mon, Feb 20, 2012 at 12:07 PM, Uwe Schindler wrote: > See my response. The problem is not in Lucene; its in general a problem of > fixed > thread pools that execute other callables from within a callable running at > the > moment in the same thread pool. Callables are simply waiting for each
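
The deadlock comes from the searcher's per-segment callables queueing behind the very tasks that submitted them. A sketch of the usual workaround, giving IndexSearcher its own executor (names are placeholders):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;

    public class SearcherPools {
        public static IndexSearcher create(IndexReader reader) {
            // A dedicated (here unbounded) pool for per-segment work, so it can never
            // be starved by the outer tasks that run the searches themselves.
            ExecutorService segmentPool = Executors.newCachedThreadPool();
            return new IndexSearcher(reader, segmentPool);
        }
    }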

Re: How to add DocValues Field to a document in an optimal manner.

2012-02-29 Thread Trejkaz
On Thu, Mar 1, 2012 at 6:20 PM, Sudarshan Gaikaiwari wrote: > Hi > > https://builds.apache.org/job/Lucene-trunk/javadoc/core/org/apache/lucene/document/DocValuesField.html > > The documentation at the above link indicates that the optimal way to > add a DocValues field is to create it once and cha

Re: Range queries in successive positions

2012-03-01 Thread Trejkaz
On Fri, Mar 2, 2012 at 6:22 PM, su ha wrote: > Hi, > I'm new to Lucene. I've indexed some documents with Lucene and need to > sanitize them to ensure > that they do not have any social security numbers (3-digits 2-digits > 4-digits). > > (How) Can I write a query (with the QueryParser) that searche

Re: Sort.INDEXORDER works incorrectly?

2012-04-17 Thread Trejkaz
On Wed, Apr 18, 2012 at 9:27 AM, Vladimir Gubarkov wrote: > Hi, dear Lucene specialists, > > The only explanation I could think of is the new TieredMergePolicy > instead of old LogMergePolicy. Could it be that because of > TieredMergePolicy merges not adjacent segments - this results in not > pres

Re: Lucene's internal doc ID space

2012-05-12 Thread Trejkaz
On Fri, May 11, 2012 at 9:56 PM, Jong Kim wrote: > 2. If Lucene can recycle old IDs, it would be even better if I could force > it to re-use a particular doc ID when updating a document by deleting old > one and creating new one. This scheme will allow me to reference this doc > ID from another do

Re: Approches/semantics for arbitrarily combining boolean and proximity search operators?

2012-05-16 Thread Trejkaz
On Thu, May 17, 2012 at 7:11 AM, Chris Harris wrote: > but also crazier ones, perhaps like > > agreement w/5 (medical and companion) > (dog or dragon) w/5 (cat and cow) > (daisy and (dog or dragon)) w/25 (cat not cow) [skip] Everything in your post matches our experience. We ended up writing some

Re: Store a query in a database for later use

2012-05-17 Thread Trejkaz
On Fri, May 18, 2012 at 6:23 AM, Jamie Johnson wrote: > I think you want to have a look at the QueryParser classes.  Not sure > which you're using to start with but probably the default QueryParser > should suffice. There are (at least) two catches though: 1. The semantics of a QueryParser might

Re: Approches/semantics for arbitrarily combining boolean and proximity search operators?

2012-05-25 Thread Trejkaz
On Sat, May 26, 2012 at 12:07 PM, Chris Harris wrote: > > Alternatively, if you insist that query > > merger w/5 (medical and agreement) > > should match document "medical x x x merger x x x agreement" > > then you can propagate 2x the parent's slop value down to child queries. This is in fact ex

Re: easy one? IN and OR stopword help

2012-06-07 Thread Trejkaz
On Fri, Jun 8, 2012 at 5:35 AM, Jack Krupansky wrote: > Well, if you have defined OR/or and IN/in as stopwords, what is it you expect > other than for the analyzer to ignore those terms (which with a boolean “AND” > means match nothing)? Is this behaviour really logical? If I search for a sing

Re: QueryParser and BooleanQuery

2012-07-23 Thread Trejkaz
On Mon, Jul 23, 2012 at 10:16 PM, Deepak Shakya wrote: > Hey Jack, > > Can you let me know how I should do that? I am using the Lucene 3.6 version > and I don't see any parse() method for StandardAnalyzer. In your case, presumably at indexing time you should be using a PerFieldAnalyzerWrapper with
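
A sketch of the PerFieldAnalyzerWrapper arrangement being suggested, against the 3.6 API (the "id" field and the choice of analyzers are made up for illustration):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    public class Analyzers {
        public static Analyzer build() {
            // Exact-match analysis for "id", StandardAnalyzer for every other field.
            Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
            perField.put("id", new KeywordAnalyzer());
            return new PerFieldAnalyzerWrapper(
                    new StandardAnalyzer(Version.LUCENE_36), perField);
        }
    }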

Re: Usage of NoMergePolicy and its potential implications

2012-07-25 Thread Trejkaz
On Thu, Jul 26, 2012 at 5:38 AM, Simon Willnauer wrote: > you really shouldn't do that! If you use lucene as a Primary key > generator why don't you build your own on top. Just add one layer that > accepts the document and returns the PID and internally put it in an > ID field. Using no merge poli

Re: Why does this query slow down Lucene?

2012-08-15 Thread Trejkaz
On Thu, Aug 16, 2012 at 11:27 AM, zhoucheng2008 wrote: > > +(title:21 title:a title:day title:once title:a title:month) Looks like you have a fairly big boolean query going on here, and some of the terms you're using are really common ones like "a". Are you using AND or OR for the default operat

Re: Finding the most matching (cf. similar) document to another one

2012-09-07 Thread Trejkaz
On Fri, Sep 7, 2012 at 6:12 PM, Jochen Hebbrecht wrote: > Hi qibaoyuan, > > I tried your second solution, using the scoring data. I think in this way, > I could use MoreLikeThis. All documents with a score > X are a possible > match :-). FWIW, there is also BooleanQuery#setMinimumNumberShouldMatc
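
For reference, the BooleanQuery knob mentioned here looks like this against the pre-5.0 (mutable) API; the field and terms are placeholders:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class MinShouldMatchExample {
        public static BooleanQuery build() {
            BooleanQuery query = new BooleanQuery();
            query.add(new TermQuery(new Term("body", "cat")), BooleanClause.Occur.SHOULD);
            query.add(new TermQuery(new Term("body", "dog")), BooleanClause.Occur.SHOULD);
            query.add(new TermQuery(new Term("body", "fish")), BooleanClause.Occur.SHOULD);
            // Require at least two of the three optional clauses to match.
            query.setMinimumNumberShouldMatch(2);
            return query;
        }
    }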

Re: SpanNearQuery distance issue

2012-09-19 Thread Trejkaz
On Thu, Sep 20, 2012 at 4:28 AM, vempap wrote: > Hello All, > > I have an issue with respect to the distance measure of SpanNearQuery in > Lucene. Let's say I have the following two documents: > > DocID: 6, content:"1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 1001 > 1002 1003 1004 1005 1006 1007 1008

Re: international stop set?

2012-10-26 Thread Trejkaz
On Sat, Oct 27, 2012 at 1:53 PM, Tom wrote: > Hello, > > using Lucene 4.0.0b, I am trying to get a superset of all stop words (for > an international app). > I have looked around, and not found anything specific. Is this the way to go? > > CharArraySet internationalSet = new CharArraySet(Version.L
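
One way to build such a superset in 4.0 is to union the default stop sets shipped with the per-language analyzers; a sketch (the initial size and the three languages shown are arbitrary):

    import org.apache.lucene.analysis.de.GermanAnalyzer;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;
    import org.apache.lucene.analysis.util.CharArraySet;
    import org.apache.lucene.util.Version;

    public class StopSets {
        public static CharArraySet international() {
            CharArraySet set = new CharArraySet(Version.LUCENE_40, 256, true);
            set.addAll(EnglishAnalyzer.getDefaultStopSet());
            set.addAll(GermanAnalyzer.getDefaultStopSet());
            set.addAll(FrenchAnalyzer.getDefaultStopSet());
            return set;
        }
    }

Whether one merged set is wise is a separate question: a stop word in one language can be a meaningful content word in another.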

Re: Excessive use of IOException without proper documentation

2012-11-04 Thread Trejkaz
On Mon, Nov 5, 2012 at 4:25 AM, Michael-O <1983-01...@gmx.net> wrote: > Continuing my answer from above. Have you ever worked with the Spring > Framework? They apply a very nice exception translation pattern. All > internal exceptions are turned to specialized unchecked exceptions like > Authentica

Can I still use SearcherManager in this situation?

2012-11-06 Thread Trejkaz
In our application, most users sit around in read-only mode all the time but there is one place where write access can occur, which is essentially scripted at the moment. (*) Currently, we start out opening an IndexReader. When the caller declares that they are going to start writing, we open an I

Re: Can I still use SearcherManager in this situation?

2012-11-07 Thread Trejkaz
On Wed, Nov 7, 2012 at 10:11 PM, Ian Lea wrote: > 4.0 has maybeRefreshBlocking which "is useful if you want to guarantee > that the next call to acquire() will return a refreshed instance". > You don't say what version you're using. > > If you're stuck on 3.6.1 can you do something with refreshIfN
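
Roughly the 4.0 usage being referred to, as a sketch (assumes an open IndexWriter and the writer-based SearcherManager constructor):

    import java.io.IOException;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.SearcherFactory;
    import org.apache.lucene.search.SearcherManager;
    import org.apache.lucene.search.TopDocs;

    public class ManagedSearch {
        public static SearcherManager open(IndexWriter writer) throws IOException {
            return new SearcherManager(writer, true, new SearcherFactory());
        }

        public static TopDocs refreshAndSearch(SearcherManager manager)
                throws IOException, InterruptedException {
            // Block until the next acquire() is guaranteed to see the latest changes.
            manager.maybeRefreshBlocking();
            IndexSearcher searcher = manager.acquire();
            try {
                return searcher.search(new MatchAllDocsQuery(), 10);
            } finally {
                manager.release(searcher);
            }
        }
    }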

Re: Can I still use SearcherManager in this situation?

2012-11-07 Thread Trejkaz
On Wed, Nov 7, 2012 at 11:10 PM, Ian Lea wrote: > > Sorry, didn't notice that refreshIfNeeded is protected. It's not only protected... but the class is final as well (the method might as well be private so that it doesn't give a false sense of hope that it can be overridden.) I might have to clo

Re: Can I still use SearcherManager in this situation?

2012-11-09 Thread Trejkaz
On Thu, Nov 8, 2012 at 8:29 AM, Trejkaz wrote: > It's not only protected... but the class is final as well (the method > might as well be private so that it doesn't give a false sense of hope > that it can be overridden.) > > I might have to clone the whole class just

Performance of IndexSearcher.explain(Query)

2012-11-20 Thread Trejkaz
I have a feature I wanted to implement which required a quick way to check whether an individual document matched a query or not. IndexSearcher.explain seemed to be a good fit for this. The query I tested was just a BooleanQuery with two TermQuery inside it, both with MUST. I ran an empty query t

Re: Using Lucene 2.3 indices with Lucene 4.0

2012-11-20 Thread Trejkaz
On Wed, Nov 21, 2012 at 12:33 AM, Ramprakash Ramamoorthy wrote: > On Tue, Nov 20, 2012 at 5:42 PM, Danil ŢORIN wrote: > >> Ironically most of the changes are in unicode handling and standard >> analyzer ;) >> > > Ouch! It hurts then ;) What we did going from 2 -> 3 (and in some cases where passi

Re: Performance of IndexSearcher.explain(Query)

2012-11-20 Thread Trejkaz
On Wed, Nov 21, 2012 at 10:40 AM, Robert Muir wrote: > Explain is not performant... but the comment is fair I think? It's more of a > worst-case, depends on the query. > Explain is going to rewrite the query/create the weight and so on just to > advance() the scorer to that single doc > So if this

Does anyone have tips on managing cached filters?

2012-11-22 Thread Trejkaz
I recently implemented the ability for multiple users to open the index in the same process ("whoa", you might think, but this has been a single user application forever and we're only just making the platform capable of supporting more than that.) I found that filters are being stored twice and s

Re: Does anyone have tips on managing cached filters?

2012-11-27 Thread Trejkaz
On Tue, Nov 27, 2012 at 9:31 AM, Robert Muir wrote: > On Thu, Nov 22, 2012 at 11:10 PM, Trejkaz wrote: > >> >> As for actually doing the invalidation, CachingWrapperFilter itself >> doesn't appear to have any mechanism for invalidation at all, so I >> imagine I

Re: Does anyone have tips on managing cached filters?

2012-11-27 Thread Trejkaz
On Wed, Nov 28, 2012 at 2:09 AM, Robert Muir wrote: > > I don't understand how a filter could become invalid even though the reader > has not changed. I did state two ways in my last email, but just to re-iterate: (1): The filter reflects a query constructed from lines in a text file. If some ot

Re: Does anyone have tips on managing cached filters?

2012-11-28 Thread Trejkaz
On Wed, Nov 28, 2012 at 6:28 PM, Robert Muir wrote: > My point is really that lucene (especially clear in 4.0) assumes > indexreaders are immutable points in time. I don't think it makes sense for > us to provide any e.g. filtercaching or similar otherwise, because this is > a key simplification t

Re: Does anyone have tips on managing cached filters?

2012-11-28 Thread Trejkaz
On Thu, Nov 29, 2012 at 4:57 PM, Trejkaz wrote: > doubt we're not Rats. Accidentally double-negatived that. I doubt we are the only ones. TX

Difference in behaviour between LowerCaseFilter and String.toLowerCase()

2012-11-29 Thread Trejkaz
Hi all. Trying to figure out what I was doing wrong in some of my own code, I looked to LowerCaseFilter, since I thought I remembered it doing this correctly, and lo and behold, it failed the same test I had written. Is this a bug or an intentional difference in behaviour? @Test public

Re: Difference in behaviour between LowerCaseFilter and String.toLowerCase()

2012-11-30 Thread Trejkaz
On Fri, Nov 30, 2012 at 8:22 PM, Ian Lea wrote: > Sounds like a side effect of possibly different, locale-dependent, > results of using String.toLowerCase() and/or Character.toLowerCase(). > > http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#toLowerCase() > specifically mentions Turk
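
A quick plain-Java illustration of the locale sensitivity being pointed at; no Lucene involved:

    import java.util.Locale;

    public class TurkishLowerCase {
        public static void main(String[] args) {
            String term = "TITLE";
            // Root locale gives "title"; a Turkish locale gives "tıtle" with a
            // dotless ı (U+0131), which no longer matches the root-locale form.
            System.out.println(term.toLowerCase(Locale.ROOT));
            System.out.println(term.toLowerCase(new Locale("tr", "TR")));
        }
    }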

Re: Difference in behaviour between LowerCaseFilter and String.toLowerCase()

2012-12-03 Thread Trejkaz
On Tue, Dec 4, 2012 at 10:09 AM, Vitaly Funstein wrote: > If you don't need to support case-sensitive search in your application, > then you may be able to get away with adding string fields to your > documents twice - lowercase version for indexing only, and verbatim to > store. Actually, I will

Re: Lucene 4.0, Serialization

2012-12-04 Thread Trejkaz
On Tue, Dec 4, 2012 at 8:33 PM, BIAGINI Nathan wrote: > I need to send a class containing Lucene elements such as `Query` over the > network using EJB and of course this class needs to be serialized. I marked > my class as `Serializable` but it does not seem to be enough: > > org.apache.lucene

Re: RE: Stemming and Wildcard - or fire and water

2013-01-04 Thread Trejkaz
On Sat, Jan 5, 2013 at 4:06 AM, Klaus Nesbigall wrote: > The actual behavior doesn't work either. > The English word "families" will not be found if the user types the query > familie* > So why solve the problem by postulating one opinion as right and another as > wrong? > A simple flag which

Re: Is StandardAnalyzer good enough for multi languages...

2013-01-08 Thread Trejkaz
On Wed, Jan 9, 2013 at 6:30 AM, saisantoshi wrote: > Does Lucene StandardAnalyzer work for all the languages for tokenizing before > indexing (since we are using Java, I think the content is converted to UTF-8 > before tokenizing/indexing)? No. There are multiple cases where it chooses not to brea

Re: Is StandardAnalyzer good enough for multi languages...

2013-01-08 Thread Trejkaz
On Wed, Jan 9, 2013 at 10:57 AM, Steve Rowe wrote: > Trejkaz (and maybe Sai too): ICUTokenizer in Lucene's icu module may be of > interest to you, along with the token filters in that same module. - Steve ICUTokenizer sounds like it's implementing UAX #29, which is exac

Re: Is StandardAnalyzer good enough for multi languages...

2013-01-09 Thread Trejkaz
On Wed, Jan 9, 2013 at 5:25 PM, Steve Rowe wrote: > Dude. Go look. It allows for per-script specialization, with (non-UAX#29) > specializations by default for Thai, Lao, Myanmar and Hebrew. See > DefaultICUTokenizerConfig. It's filled with exactly the opposite of what you > were describing

Re: Custom Query Syntax/Parser

2013-01-28 Thread Trejkaz
On Tue, Jan 29, 2013 at 3:42 AM, Andrew Gilmartin wrote: > When I first started using Lucene, Lucene's Query classes were not suitable > for use with the Visitor pattern and so I created my own query class > equivalents and other more specialized ones. Lucene's classes might have > changed since

Re: How to properly use updatedocument in lucene.

2013-01-31 Thread Trejkaz
On Thu, Jan 31, 2013 at 11:05 PM, Michael McCandless wrote: > It's confusing, but you should never try to re-index a document you > retrieved from a searcher, because certain index-time details (eg, > whether a field was tokenized) are not preserved in the stored > document. > > Instead, you shoul

Migrating from using doc IDs to using application IDs from the FieldCache

2013-01-31 Thread Trejkaz
Hi all. We have an application which has been around for so long that it's still using doc IDs to key to an external database. Obviously this won't work forever (even in Lucene 3.x we had to use a custom merge policy to keep it working) so we want to introduce application IDs eventually. We have

Re: Lightweight detection of whether a keyword is CJK or not (language detection)

2013-03-10 Thread Trejkaz
On Sun, Mar 10, 2013 at 8:19 PM, Gili Nachum wrote: > Answering myself for next generations' sake. > Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS does the job. How about 㒨? TX
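
The point of the 㒨 example: that character lives in one of the extension blocks, so a check against CJK_UNIFIED_IDEOGRAPHS alone misses it. A broader (still not exhaustive) sketch:

    import java.lang.Character.UnicodeBlock;

    public class CjkCheck {
        // Later Java versions add further extension blocks (C, D, ...), so this
        // list is deliberately conservative rather than complete.
        public static boolean isCjkIdeograph(int codePoint) {
            UnicodeBlock block = UnicodeBlock.of(codePoint);
            return block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
                || block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
                || block == UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS;
        }
    }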

Re: how to search a certain number of document without travelling all related documents

2013-03-14 Thread Trejkaz
On Tue, Mar 12, 2013 at 10:42 PM, Hu Jing wrote: > so my question is how to achieve a non-sorting query method that can > return results incrementally without traversing all the unnecessary docs. > > Does Lucene supply some strategies to implement this? If you want the result as soon as possible, just pa

Re: getLocale of SortField

2013-07-09 Thread Trejkaz
On Wed, Jul 10, 2013 at 12:53 AM, Uwe Schindler wrote: > Hi, > > there is no more locale-based sorting in Lucene 4.x. It was deprecated in 3.x, > so you should get a warning about deprecation already! I wasn't sure about this because we are on 3.6 and I didn't see a deprecation warning in our cod

Re: getLocale of SortField

2013-07-10 Thread Trejkaz
On Wed, Jul 10, 2013 at 4:20 PM, Uwe Schindler wrote: > Hi, > > The "fast" replacement (meaning sorting works as fast as without collating) is to > index the fields > used for sorting with CollationKeyAnalyzer ([snip]). The Collator you get > from e.g. the locale. [snip] > The better way is, as menti
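
A sketch of the CollationKeyAnalyzer approach Uwe describes, against the 4.x API; in practice you would route only the dedicated sort field through it (e.g. via PerFieldAnalyzerWrapper):

    import java.text.Collator;
    import java.util.Locale;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.collation.CollationKeyAnalyzer;
    import org.apache.lucene.util.Version;

    public class SortFieldAnalyzer {
        public static Analyzer forLocale(Locale locale) {
            // Index the sort field through a collation-aware analyzer; sorting on
            // the resulting terms is then an ordinary, fast string sort.
            Collator collator = Collator.getInstance(locale);
            return new CollationKeyAnalyzer(Version.LUCENE_40, collator);
        }
    }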

Callback for commits

2013-08-11 Thread Trejkaz
Hi all. Is there some kind of callback where we can be notified about commits? Sometimes a call to commit() doesn't actually commit anything (e.g. if there is nothing in memory at the time.) I'm not really sure what's wrong with assuming it does commit something, because it's another developer as

Re: Stream Closed Exception and Lock Obtain Failed Exception while reading the file in chunks iteratively.

2013-09-01 Thread Trejkaz
On Mon, Sep 2, 2013 at 4:10 PM, Ankit Murarka wrote: > There's a reason why the Writer is being opened every time inside a while loop. I > usually open the writer in the main method itself as suggested by you and pass a > reference to it. However, what I have observed is that if my file contains > more than 4 l

Unicode normalisation *before* tokenisation?

2011-01-16 Thread Trejkaz
Hi all. I discovered there is a normalise filter now, using ICU's Normalizer2 (org.apache.lucene.analysis.icu.ICUNormalizer2Filter). However, as this is a filter, various problems can result if used with StandardTokenizer. One in particular is half-width Katakana. Supposing you start out with t

Re: Unicode normalisation *before* tokenisation?

2011-01-16 Thread Trejkaz
On Mon, Jan 17, 2011 at 11:53 AM, Robert Muir wrote: > On Sun, Jan 16, 2011 at 7:37 PM, Trejkaz wrote: >> So I guess I have two questions: >>    1. Is there some way to do filtering to the text before >> tokenisation without upsetting the offsets reported by the tokeniser? &

Re: AW: Best practices for multiple languages?

2011-01-19 Thread Trejkaz
On Thu, Jan 20, 2011 at 9:08 AM, Paul Libbrecht wrote: >>> Wouldn't it be better to prefer precise matches (a field that is >>> analyzed with StandardAnalyzer for example) but also allow matches are >>> stemmed. >> >> StandardAnalyzer isn't quite precise, is it?  StandardFilter does some >> kind o

Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-12 Thread Trejkaz
On Fri, Mar 11, 2011 at 10:03 PM, shrinath.m wrote: > I am trying to index content within certain HTML tags, how do I index it? > Which is the best parser/tokenizer available to do this? This doesn't really answer the question, but I think it will help... The features you want to look for: 1.

a faster way to addDocument and get the ID just added?

2011-03-28 Thread Trejkaz
Hi all. I'm trying to parallelise writing documents into an index. Let's set aside the fact that 3.1 is much better at this than 3.0.x... but I'm using 3.0.3. One of the things I need to know is the doc ID of each document added so that we can add them into auxiliary database tables which are ke

Re: a faster way to addDocument and get the ID just added?

2011-03-29 Thread Trejkaz
On Tue, Mar 29, 2011 at 11:21 PM, Erick Erickson wrote: > I'm always skeptical of storing the doc IDs since they can > change out from underneath you (just delete even a single > document and optimize). We never delete documents. Even when a feature request came in to update documents (i.e. dele

Re: a faster way to addDocument and get the ID just added?

2011-03-30 Thread Trejkaz
On Wed, Mar 30, 2011 at 8:21 PM, Simon Willnauer wrote: > Before trunk (and I think > its in 3.1 also) merge only merged continuous segments so the actual > per-segment ID might change but the global document ID doesn't if you > only add documents. But this should not be considered a feature. In >

Re: Using IndexWriterConfig repeatedly in 3.1

2011-04-01 Thread Trejkaz
On Sat, Apr 2, 2011 at 7:07 AM, Christopher Condit wrote: > I see in the JavaDoc for IndexWriterConfig that: > "Note that IndexWriter makes a private clone; if you need to > subsequently change settings use IndexWriter.getConfig()." > > However when I attempt to use the same IndexWriterConfig to c

Re: switching between Query parsers

2011-04-18 Thread Trejkaz
On Thu, Apr 14, 2011 at 9:44 PM, shrinath.m wrote: > Consider this case : > > Lucene index contains documents with these fields : > title > author > publisher > > I have coded my app to use MultiFieldQueryParser so that it queries all > fields. > Now if user types something like "author:tom" in s

Re: Immutable OpenBitSet?

2011-04-28 Thread Trejkaz
On Thu, Apr 28, 2011 at 6:13 PM, Uwe Schindler wrote: > In general a *newly* created object that was not yet seen by any other > thread is always safe. This is why I said, set all bits in the ctor. This is > easy to understand: Before the ctor returns, the object's contents and all > references li

Re: MultiFieldQueryParser with default AND and stopfilter

2011-06-08 Thread Trejkaz
On Wed, Jun 8, 2011 at 6:52 PM, Elmer wrote: > the parsed query becomes: > > '+(title:the) +(title:project desc:project)'. > > So, the problem is that docs that have the term 'the' only appearing in > their desc field are excluded from the results. Subclass MFQP and override getFieldQuery. If th

Re: Corrupt segments file full of zeros

2011-06-28 Thread Trejkaz
On Wed, Jun 29, 2011 at 2:24 AM, Michael McCandless wrote: > Here's the issue: > >    https://issues.apache.org/jira/browse/LUCENE-3255 > > It's because we read the first 0 int to be an ancient segments file > format, and the next 0 int to mean there are no segments.  Yuck! > > This format pre-dat

StandardAnalyzer compatibility between 2.3 and 3.0

2011-07-10 Thread Trejkaz
Hi all. I created a test using Lucene 2.3. When run, this generates a single token: public static void main(String[] args) throws Exception { String string = "\u0412\u0430\u0441\u0438\u0301\u043B\u044C\u0435\u0432"; StandardAnalyzer analyser = new StandardAnalyzer();

Re: Searching for Empty Field

2011-07-14 Thread Trejkaz
On Fri, Jul 15, 2011 at 10:02 AM, Trieu, Jason T wrote: > Hi all, > > I read postings about searching for empty field with but did not find any > cases of successful search using query language syntax itself(-myField:[* TO > *] for example). We have been using: -myField:* You would need to us
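
One programmatic form of that advice, as a 3.x sketch ("myField" is a placeholder): a purely negative clause matches nothing on its own, so the exclusion is paired with MatchAllDocsQuery.

    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermRangeQuery;

    public class MissingFieldQuery {
        public static Query build() {
            BooleanQuery query = new BooleanQuery();
            query.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
            // Subtract every document with at least one indexed term in myField.
            query.add(new TermRangeQuery("myField", null, null, true, true),
                      BooleanClause.Occur.MUST_NOT);
            return query;
        }
    }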

Re: Searching for Empty Field

2011-07-15 Thread Trejkaz
On Fri, Jul 15, 2011 at 4:45 PM, Uwe Schindler wrote: > Hi, > >> The crappy thing is that to actually detect if there are any tokens in the >> field >> you need to make a TokenStream which can be used to read the first token >> and then rewind again.  I'm not sure if there is such a thing in Luce

Rewriting other query types into span queries and two questions about this

2011-08-04 Thread Trejkaz
Hi all. I am writing a custom query parser which strongly resembles StandardQueryParser (I use a lot of the same processors and builders, with a slightly customised config handler and a completely new syntax parser written as an ANTLR grammar.) My parser has additional syntax for span queries. T

Re: Grouping Clauses to Preserve Order of Boolean Precedence

2011-08-04 Thread Trejkaz
On Fri, Aug 5, 2011 at 1:57 AM, Jim Swainston wrote: > So if the Text input is: > > Marketing AND Smith OR Davies > > I want my program to work out that this should be grouped as the following > (as AND has higher precedence than OR): > > (Marketing AND Smith) OR Davies. > > I'm effectively lookin

Re: Rewriting other query types into span queries and two questions about this

2011-08-07 Thread Trejkaz
On Mon, Aug 8, 2011 at 8:58 AM, Michael Sokolov wrote: > Can you do something approximately equivalent like: > > within(5, 'my', and('cat', 'dog')) -> > within(5, 'my', within(5, 'cat', 'dog') ) > > Might not be exactly the same in terms of distances (eg "cat x x x my x x x > dog") might match the

Re: Rewriting other query types into span queries and two questions about this

2011-08-07 Thread Trejkaz
On Mon, Aug 8, 2011 at 10:00 AM, Trejkaz wrote: > >    within(5, 'my', and('cat', 'dog')) -> within(5, 'my', within(10, 'cat', > 'dog') ) To extend my example and maybe make it a bit more hellish, take this one:

Strange change to query parser behaviour in recent versions

2011-08-17 Thread Trejkaz
Hi all. Suppose I am searching for - 限定 In 3.0, QueryParser would parse this as a phrase query. In 3.3, it parses it as a boolean query, but offers an option to treat it like a phrase. Why would the default be not to do this? Surely you would always want it to become a phrase query. The new p

Re: Strange change to query parser behaviour in recent versions

2011-08-20 Thread Trejkaz
On Fri, Aug 19, 2011 at 11:05 AM, Chris Hostetter wrote: > > See LUCENE-2458 for the backstory. > > the argument was that while phrase queries were historically generated by > the query parser when a single (whitespace delimited) "chunk" of query > parser input produced multiple tokens, that logi

Re: Strange change to query parser behaviour in recent versions

2011-08-21 Thread Trejkaz
On Sat, Aug 20, 2011 at 7:00 PM, Robert Muir wrote: > On Sat, Aug 20, 2011 at 3:34 AM, Trejkaz wrote: > >> >> As an aside, Google's behaviour seems to follow the "old" way.  For >> instance, [[ 限定 ]] returns 640,000,000 hits and [[ 限 定 ]] returns

IndexWriter.ramSizeInBytes() no longer returns to 0 after commit()?

2011-08-22 Thread Trejkaz
Hi all. We are using IndexWriter with no limits set and managing the commits ourselves, mainly so that we can ensure they are done at the same time as other (non-Lucene) commits. After upgrading from 3.0 ~ 3.3, we are seeing a change in ramSizeInBytes() behaviour where it is no longer resetting t

Re: IndexWriter.ramSizeInBytes() no longer returns to 0 after commit()?

2011-08-23 Thread Trejkaz
On Wed, Aug 24, 2011 at 4:45 AM, Michael McCandless wrote: > Hmm... this looks like a side-effect of LUCENE-2680, which was merged > back from trunk to 3.1. > > So the problem is, IW recycles the RAM it has allocated, and so this > method is returning the allocated RAM, even if those buffers are n

Re: Possible to use span queries to avoid stepping over null index positions

2011-08-27 Thread Trejkaz
On Sat, Aug 27, 2011 at 2:30 AM, wrote: > Hello, > In our indexes we have a field that is a combination of other various > metadata fields (i.e. subject, from, to, etc.). Each field that is added has > a null position at the beginning. As an example, in Luke the field data looks > like: > > nu

Re: Extracting all documents for a given search

2011-09-18 Thread Trejkaz
On Mon, Sep 19, 2011 at 3:50 AM, Charlie Hubbard wrote: > Here was the prior API I was calling: > >        Hits hits = getSearcher().search( query, filter, sort ); > > The new API: > >        TopDocs hits = getSearcher().search( query, filter, startDoc + > length, sort ); > > So the question is wh

JapaneseAnalyser filter ordering

2014-01-15 Thread Trejkaz
The current ordering of JapaneseAnalyser's token filters is as follows: 1. JapaneseBaseFormFilter 2. JapanesePartOfSpeechStopFilter 3. CJKWidthFilter (similar to NormaliseFilter) 4. StopFilter 5. JapaneseKatakanaStemFilter 6. LowerCaseFilter Our existing support for E

MultiFieldAttribute is deprecated but the replacement is not documented

2014-01-22 Thread Trejkaz
In 3.6.2, I notice MultiFieldAttribute is deprecated. So I looked to the docs to find the replacement: https://lucene.apache.org/core/3_6_2/api/contrib-queryparser/org/apache/lucene/queryParser/standard/config/MultiFieldAttribute.html ...and the Deprecated note doesn't say what we're supposed to

Re: exporting a query to String with default operator = AND ?

2014-01-25 Thread Trejkaz
On Sat, Jan 25, 2014 at 4:29 AM, Olivier Binda wrote: > I would like to serialize a query into a string (A) and then to unserialize > it back into a query (B) > > I guess that a solution is > A) query.toString() > B) StandardQueryParser().parse(query,"") If your custom query parser uses the new q

Re: limitation on token-length for KeywordAnalyzer?

2014-01-27 Thread Trejkaz
On Mon, Jan 27, 2014 at 3:48 AM, Andreas Brandl wrote: > Is there some limitation on the length of fields? How do I get around this? [cut] > My overall goal is to index (arbitrary sized) text files and run a regular > expression search using lucene's RegexpQuery. I suspect the > KeywordAnalyzer to

Highlighting text, do I seriously have to reimplement this from scratch?

2014-02-03 Thread Trejkaz
Hi all. I'm trying to find a precise and reasonably efficient way to highlight all occurrences of terms in the query, only highlighting fields which match the corresponding fields used in the query. This seems like it would be a fairly common requirement in applications. We have an existing implem

Re: Highlighting text, do I seriously have to reimplement this from scratch?

2014-02-05 Thread Trejkaz
On Wed, Feb 5, 2014 at 4:16 AM, Earl Hood wrote: > Our current solution is to do highlighting on the client-side. When > search happens, the search results from the server includes the parsed > query terms so the client has an idea of which terms to highlight vs > trying to reimplement a complete

Re: Limiting the fields a user can query on

2014-02-20 Thread Trejkaz
On Thu, Feb 20, 2014 at 1:43 PM, Jamie Johnson wrote: > Is there a way to limit the fields a user can query by when using the > standard query parser or a way to get all fields/terms that make up a query > without writing custom code for each query subclass? If you mean StandardQueryParser, you c

Re: encoding problem when retrieving document field value

2014-03-03 Thread Trejkaz
On Tue, Mar 4, 2014 at 4:44 AM, Jack Krupansky wrote: > What is the hex value for that second character returned that appears to > display as an apostrophe? Hex 92 (decimal 146) is listed as "Private Use > 2", so who knows what it might display as. Well, if they're dealing with HTML, then it wil

Re: searching with stemming

2014-06-09 Thread Trejkaz
On Mon, Jun 9, 2014 at 7:57 PM, Jamie wrote: > Greetings > > Our app currently uses language specific analysers (e.g. EnglishAnalyzer, > GermanAnalyzer, etc.). We need an option to disable stemming. What's the > recommended way to do this? These analyzers do not include an option to > disable stem
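
One route is to rebuild the analyzer chain by hand without the stemmer. A sketch against the 4.x Analyzer API, approximating EnglishAnalyzer minus PorterStemFilter (the possessive-stripping filter is also omitted here for brevity, so this is not a drop-in copy):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    public class NoStemEnglishAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
            TokenStream stream = new StandardFilter(Version.LUCENE_46, source);
            stream = new LowerCaseFilter(Version.LUCENE_46, stream);
            // Same default stop set as EnglishAnalyzer, but no stemming filter.
            stream = new StopFilter(Version.LUCENE_46, stream,
                    EnglishAnalyzer.getDefaultStopSet());
            return new TokenStreamComponents(source, stream);
        }
    }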

Reading a v2 index in v4

2014-06-09 Thread Trejkaz
Hi all. The inability to read people's existing indexes is essentially the only thing stopping us upgrading to v4, so we're stuck indefinitely on v3.6 until we find a way around this issue. As I understand it, Lucene 4 added the notion of codecs which can precisely choose how to read and write th

Re: Reading a v2 index in v4

2014-06-09 Thread Trejkaz
On Mon, Jun 9, 2014 at 10:17 PM, Adrien Grand wrote: > Hi, > > It is not possible to read 2.x indices from Lucene 4, even with a > custom codec. For instance, Lucene 4 needs to hook into > SegmentInfos.read to detect old 3.x indices and force the use of the > Lucene3x codec since these indices don

Is it possible to rewrite a MultiPhraseQuery to a SpanQuery?

2014-08-18 Thread Trejkaz
Someone asked if it was possible to do a SpanNearQuery between a TermQuery and a MultiPhraseQuery. Sadly, you can only use SpanNearQuery with other instances of SpanQuery, so we have a gigantic method where we rewrite as many queries as possible to SpanQuery. For instance, TermQuery can trivially
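
For what it's worth, the MultiPhraseQuery case can be approximated along these lines: each position becomes a SpanOrQuery over its alternative terms, and the positions are chained with an ordered SpanNearQuery. The sketch below ignores the query's explicit position gaps, so it is an approximation rather than an exact rewrite:

    import java.util.List;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.MultiPhraseQuery;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanOrQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class MultiPhraseToSpan {
        public static SpanQuery rewrite(MultiPhraseQuery query) {
            List<Term[]> positions = query.getTermArrays();
            SpanQuery[] clauses = new SpanQuery[positions.size()];
            for (int i = 0; i < positions.size(); i++) {
                Term[] terms = positions.get(i);
                SpanQuery[] alternatives = new SpanQuery[terms.length];
                for (int j = 0; j < terms.length; j++) {
                    alternatives[j] = new SpanTermQuery(terms[j]);
                }
                clauses[i] = alternatives.length == 1
                        ? alternatives[0] : new SpanOrQuery(alternatives);
            }
            return new SpanNearQuery(clauses, query.getSlop(), true);
        }
    }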

Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-18 Thread Trejkaz
Unrelated to my previous mail to the list, but related to the same investigation... The following test program just indexes a phrase of nonsense words and then queries for one of the words using the same analyser. The same analyser is being used both for indexing and for querying, yet in th

Re: Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-18 Thread Trejkaz
Also in case it makes a difference, we're using Lucene v3.6.2. TX

Re: Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-19 Thread Trejkaz
On Tue, Aug 19, 2014 at 5:27 PM, Uwe Schindler wrote: > Hi, > > You forgot to close (or commit) IndexWriter before opening the reader. Huh? The code I posted is closing it: try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(Version.LUCENE_36, analyser))) {

Re: Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-19 Thread Trejkaz
Lucene 4.9 gives much the same result. import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.ja.JapaneseAnalyzer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.document.TextField; import

Re: Can some terms from analysis be silently dropped when indexing? Because I'm pretty sure I'm seeing that happen.

2014-08-25 Thread Trejkaz
It seems like nobody knows the answer, so I'm just going to file a bug. TX

Re: ArrayIndexOutOfBoundsException: -65536

2014-10-13 Thread Trejkaz
Bit of thread necromancy here, but I figured it was relevant because we get exactly the same error. On Thu, Jan 19, 2012 at 12:47 AM, Michael McCandless wrote: > Hmm, are you certain your RAM buffer is 3 MB? > > Is it possible you are indexing an absurdly enormous document...? We're seeing a cas

Re: OutOfMemoryError indexing large documents

2014-11-26 Thread Trejkaz
On Wed, Nov 26, 2014 at 2:09 PM, Erick Erickson wrote: > Well > 2> seriously consider the utility of indexing a 100+M file. Assuming > it's mostly text, lots and lots and lots of queries will match it, and > it'll score pretty low due to length normalization. And you probably > can't return it to
