Re: When does Query Parser do its analysis ?

2012-02-01 Thread Doron Cohen
> > In my particular case I add album catalog no to my index as a keyword > field, but of course if the catalog number contains a space as they often > do (i.e. cad 6) there is a mismatch. I've now changed my indexing to index > the value as 'cad6', removing spaces. Now if the query sent to the quer

Re: Phrase Queries vs. SpanTermQueries exact phrases vs. stop words

2012-02-01 Thread Doron Cohen
> int gap = (pp[pp.length-1] - pp[0]) - (pp.length - 1); > > Don't want to cause an IndexOutOfBoundsException Right... that's what I meant with "(boundary cases)"...

Re: Phrase Queries vs. SpanTermQueries exact phrases vs. stop words

2012-01-31 Thread Doron Cohen
Hi, Code here ignores PhraseQuery (PQ) 's positions: int[] pp = PQ.getPositions(); These positions have extra gaps when stop words are removed. To accommodate this, the overall extra gap can be added to the slop: int gap = (pp[pp.length] - pp[0]) - (pp.length - 1); // (+/- bounda
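
The gap computation discussed in this thread can be sketched as a self-contained helper (a hypothetical `PhraseGap` class; `pp` stands for the ascending positions array a parsed PhraseQuery would return, and the last element is indexed with `pp.length - 1` to avoid the IndexOutOfBoundsException noted in the follow-up):

```java
public class PhraseGap {
    // pp: ascending term positions of a parsed phrase query
    // (as PhraseQuery.getPositions() would return them).
    static int extraGap(int[] pp) {
        // The extra gap left by removed stop words is the span of the
        // positions minus the slots the remaining terms themselves occupy.
        // Note the last index is pp.length - 1, not pp.length.
        return (pp[pp.length - 1] - pp[0]) - (pp.length - 1);
    }

    public static void main(String[] args) {
        // "new york": positions 0,1 -> no stop words removed, gap 0
        System.out.println(extraGap(new int[] {0, 1}));
        // "statue of liberty" with "of" removed: positions 0,2 -> gap 1
        System.out.println(extraGap(new int[] {0, 2}));
    }
}
```

Adding the returned gap to the slop lets the phrase still match across the positions the analyzer skipped.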

Re: Taxonomy indexer debug

2011-11-28 Thread Doron Cohen
> > Could you minimize this to a small stand-alone program that does not work > > as expected? > > This will be hard, because of the bug only appearing after a couple of days > or more and i'm starting to think that it is triggered by high data > volumes. I'll try to minimize the code and serve mor

Re: Taxonomy indexer debug

2011-11-28 Thread Doron Cohen
Sequence of operations seems logical, I don't see straight away why this does not work. Could you minimize this to a small stand-alone program that does not work as expected? This would allow us to recreate the problem here and debug it. It is interesting that facet 3.5 is used with core 3.4 and queries 3.4

Re: Taxonomy indexer debug

2011-11-26 Thread Doron Cohen
> > However there are at least two issues with this: > 1) the info would be in the lower level of the internal index writer, and > not in that of the categories logic. > 2) one cannot just call super.openIndexWriter(directory, openMode) and > modify the result before returning it, because once IW i

Re: Taxonomy indexer debug

2011-11-25 Thread Doron Cohen
> > I'm having an issue with using NRT and Tax. After a couple of days of > running continuously , the taxonomyreader doesn't return results anymore > (but taxindex has them). Taxonomy Reader does not support NRT - see https://issues.apache.org/jira/browse/LUCENE-3441 ("Add NRT support to Taxonom

Re: Scoring in Lucene

2011-10-07 Thread Doron Cohen
To my understanding this stems from V(q) · V(d) (see the "*Conceptual Scoring Formula*") - the elements in those vectors are *Tf-idf* values, and so, implementation wise (see the "*Practical Scoring Function*"), idf(t) is multiplied by itself: once for the query and once for the document. HTH, Do
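
As a rough illustration, assuming the classic 1 + ln(N/(df+1)) idf variant of Lucene's DefaultSimilarity (an assumption about the version in use; check the Similarity javadocs for yours), a single-term score carries the idf factor twice, once from the query weight and once from the term weight:

```java
public class IdfSquared {
    // Classic DefaultSimilarity-style idf (assumed formula: 1 + ln(N/(df+1))).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        double i = idf(10, 1000);
        // The practical scoring function multiplies idf once for the query
        // and once for the document, so the factor enters squared:
        System.out.println(i * i);
    }
}
```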

Re: TaxWriter leakage?

2011-10-04 Thread Doron Cohen
LUCENE-3484 is resolved. Mihai, could you give it a try and see if this solves the NPE problem in your setup? You would need to download a nightly build that contains the fix - see the issue for revision numbers... On Tue, Oct 4, 2011 at 7:51 PM, Mihai Caraman wrote: > > (org.myapp.search.CustomL

Re: TaxWriter leakage?

2011-10-04 Thread Doron Cohen
On Tue, Oct 4, 2011 at 11:29 AM, Mihai Caraman wrote: > I also think that there is nothing special in the second restart, except > > > that by that time there were other servlets up (?) which were able > to > > trigger simultaneous AddDoc requests, exposing this bug... > > > > Makes sense? >

Re: TaxWriter leakage?

2011-10-04 Thread Doron Cohen
Looking into this with Shai I think we see how this can happen, in this code of LTW: private ParentArray getParentArray() throws IOException { if (parentArray==null) { // [1] if (reader == null) { reader = openReader(); } parentArray = new ParentArray(); // [2]

Re: Please help me with a basic question...

2011-05-20 Thread Doron Cohen
Hi Rich, SeetSpotSimilarity looks promising. Does it not favor shorter docs by not > normalizing or does it make some attempt to standardize? > > > - using e.g. SeetSpotSimilarity which does not favor shorter documents. > SweetSpotSimilarity (I misspelled it previously) defines a range of lengths

Re: Please help me with a basic question...

2011-05-19 Thread Doron Cohen
Hi Rich, If I understand correctly you are concerned that short documents are preferred too much over long ones, is this really the case? It would help to understand what goes on to look at the Explanation of the score for say two result documents - one that you think is ranked too low, and one tha

Re: SpanNearQuery - inOrder parameter

2011-05-19 Thread Doron Cohen
Hi Greg, On Thu, May 19, 2011 at 12:26 PM, Gregory Tarr wrote: > We let our users decide whether they want to force the order or not, so > in effect they pass in "inOrder". > > I would have to detect a repeated term and change the parameter as a > result of that in order to work around this - I'd r

Re: SpanNearQuery - inOrder parameter

2011-05-19 Thread Doron Cohen
Hi Greg, I created http://issues.apache.org/jira/browse/LUCENE-3120 for this problem, and attached there a more general test that exposes this problem, based on your test case. I am not sure yet that this is indeed a problem to be fixed with regard to span queries (see more there in JIRA) but at

Re: How to implement a proximity search using LINES as slop

2011-02-10 Thread Doron Cohen
IIUC what you are trying to achieve I think the following could help, without setting all words in a line to be in the same position: At indexing, set a position increment of N (e.g. 100) at line start tokens. This would set a position gap of N between last token of line x to first token of line x+
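
A minimal sketch of the position assignment described above (a hypothetical `LineGapPositions` class with an assumed gap of 100; this only models the resulting positions, not an actual Lucene TokenStream with position increments):

```java
import java.util.ArrayList;
import java.util.List;

public class LineGapPositions {
    static final int LINE_GAP = 100; // assumed N; any value > max tokens per line works

    // Returns the position each token would get if the first token of every
    // line carried a position increment of LINE_GAP instead of 1.
    static List<Integer> positions(List<List<String>> lines) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;
        for (List<String> line : lines) {
            boolean lineStart = true;
            for (String tok : line) {
                pos += lineStart ? LINE_GAP : 1; // big increment only at line start
                lineStart = false;
                out.add(pos);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // two lines: "a b" and "c"
        System.out.println(positions(List.of(List.of("a", "b"), List.of("c"))));
        // [99, 100, 199]
    }
}
```

Tokens within a line stay 1 apart, while crossing a line boundary costs ~100 positions, so a proximity query with slop below 100 effectively stays within one line, and slop around k*100 spans roughly k lines.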

Re: Creating an index with multiple values for a single field

2011-01-10 Thread Doron Cohen
On Mon, Jan 10, 2011 at 7:44 PM, Ryan Aylward wrote: > We do leverage synonyms but they are not appropriate for this case. We use > synonyms for words that are truly synonymous for the entire index such as > "inc" and "incorporated". Those words are always interchangeable. However, > many of the

Re: lucene locking

2010-12-16 Thread Doron Cohen
> > > I have an app that seems to be locking on some search calls. I am > including > > the stacktrace for the blocked and blocker thread. > Is it deadlock for sure? No search deadlock fixes were done since 2.1.0, so perhaps it is something else... > "TP-Processor177" daemon prio=10 tid=0x00

Re: Custom scoring for searhing geographic objects

2010-12-15 Thread Doron Cohen
Also, when taking the Similarity suggestion below note two things in Lucene's default behavior that you seem to wish to avoid: The first is IDF - but only for multi-term queries - otherwise ignore this comment. For multi term queries to only consider term frequency and doc length, you may want to

Re: Forcing specific index file names

2010-12-15 Thread Doron Cohen
> I could make an exception in the patch creation program to detect > that there is a lucene directory, and diff the .cfs files, even if > they have different names, but was seeing if I can avoid that > so the patch program can be agnostic about the contents of the > directory tree. > Doing only th

Re: Searcher#setSimilarity clarifications

2009-04-28 Thread Doron Cohen
Searcher is quite light. It is the index reader that is heavier. So create a single index reader, and for each of the similarities to be used concurrently, create a searcher over that single reader, set its similarity, and so on. Doron On Mon, Apr 27, 2009 at 7:53 PM, Rakesh Sinha wrote: > I am looki

Re: Error: there are more terms than documents...

2009-04-24 Thread Doron Cohen
On Thu, Apr 23, 2009 at 11:52 PM, wrote: > I figured it out. We are using Hibernate Search and in my ORM class I > am doing the following: > > @Field(index=Index.TOKENIZED,store=Store.YES) > protected String objectId; > > So when I persisted a new object to our database I was inadvertently > cre

Re: exponential boosts

2009-04-24 Thread Doron Cohen
On Fri, Apr 24, 2009 at 12:28 AM, Steven Bethard wrote: > On 4/23/2009 2:08 PM, Marcus Herou wrote: > > But perhaps one could use a FieldCache somehow ? > > Some code snippets that may help. I add the PageRank value as a field of > the documents I index with Lucene like this: > >Document docum

Re: exponential boosts

2009-04-23 Thread Doron Cohen
> > I think we are doing similar things, at least I am trying to implement > document boosting with pagerank. Having issues of how to apply the scoring > of > specific docs without actually reindexing them. I feel something should be > done > at query time which looks at external data but do not know h

Re: Error: there are more terms than documents...

2009-04-23 Thread Doron Cohen
On Thu, Apr 23, 2009 at 10:39 PM, wrote: > I'm getting a strange error when I make a Lucene (2.2.0) query: > > java.lang.RuntimeException: there are more terms than documents in field > "objectId", but it's impossible to sort on tokenized fields > Is it possible that, for at least one document,

Re: Why is CustomScoreQuery limited to ValueSourceQuery type?

2009-04-22 Thread Doron Cohen
:09 AM, Doron Cohen wrote: > > It could, but (historically and) currently it doesn't... :) > > I actually have code for this. > > Would you like open a JIRA issue for this - I'll attach my wrapper there? > > Done. > > https://issues.apache.org/jira/browse/LUCEN

Re: Why is CustomScoreQuery limited to ValueSourceQuery type?

2009-04-21 Thread Doron Cohen
It could, but (historically and) currently it doesn't... :) I actually have code for this. Would you like to open a JIRA issue for this - I'll attach my wrapper there? Doron On Tue, Apr 21, 2009 at 7:58 PM, Steven Bethard wrote: > On 4/21/2009 12:47 AM, Doron Cohen wrote: > &g

Re: changing term freq in indexing time

2009-04-21 Thread Doron Cohen
ur score. I looked at an old thread - > Search for synonyms - implemenetation for review : > . > > http://mail-archives.apache.org/mod_mbox/lucene-java-user/200603.mbox/%3c39b0fb508e5d7540aca5ad57225e150d392...@xmail.me.corp.entopia.com%3e > > I don't know if it's part of lucene now.

Re: changing term freq in indexing time

2009-04-21 Thread Doron Cohen
Depending on the problem you are trying to solve there may be other solutions to it, not requiring setting wrong (?) values for term frequencies. If you can explain what you are trying to solve, people on the list may be able to suggest such alternatives. - Doron On Sun, Apr 19, 2009 at 2:39 PM, l

Re: IndexWriter update method

2009-04-21 Thread Doron Cohen
*IndexWriter.deleteDocuments *(Query query) may be handy too (but note that i

Re: Why is CustomScoreQuery limited to ValueSourceQuery type?

2009-04-21 Thread Doron Cohen
CustomScoreQuery expects the VSQs to have a score for document matching the (main) subQuery - this does not hold for arbitrary queries. On Sat, Apr 18, 2009 at 2:35 AM, Steven Bethard wrote: > CustomScoreQuery only allows the secondary queries to be of type > ValueSourceQuery instead of allowing

Re: Fields with the same name?? - Was Re: Payloads and tokenizers

2008-08-20 Thread Doron Cohen
On Tue, Aug 19, 2008 at 2:15 AM, Antony Bowesman <[EMAIL PROTECTED]> wrote: > > Thanks for your time and I appreciate your valuable insight Doron. > Antony > I'm glad I could help! Doron

Re: Index of Lucene

2008-08-18 Thread Doron Cohen
On Mon, Aug 18, 2008 at 7:28 AM, blazingwolf7 <[EMAIL PROTECTED]>wrote: > > Thanks for the info. But do you know where this is actually performed in > Lucene? I mean the method involved, that will calculate the value before > storing it into the index. I tracked it to one method known as lengthNorm()

Re: Fields with the same name?? - Was Re: Payloads and tokenizers

2008-08-18 Thread Doron Cohen
> > payload and the other part for storing, i.e. something like this: >> >>Token token = new Token(...); >>token.setPayload(...); >>SingleTokenTokenStream ts = new SingleTokenTokenStream(token); >> >>Field f1 = new Field("f","some-stored-content",Store.YES,Index.NO); >>Field f2

Re: Index of Lucene

2008-08-17 Thread Doron Cohen
Norms information comes mainly from lengths of documents - allowing the search time scoring to take into account the effect of document lengths (actually field length within a document). In practice, norms stored within the index may include other information, such as index time boosts - for a docu
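
For illustration, the classic DefaultSimilarity-style length norm is 1/sqrt(numTerms) (an assumption about the version in use; index-time boosts would be multiplied into the stored norm as described above):

```java
public class LengthNorm {
    // Sketch of the lengthNorm() mentioned in this thread, assuming the
    // classic DefaultSimilarity formula: shorter fields get a larger norm,
    // so matches in them score higher, all else being equal.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm(4));   // 0.5
        System.out.println(lengthNorm(100)); // 0.1
    }
}
```

Note the stored norm is further quantized to a single byte per document per field, so small length differences may round to the same value.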

Re: Payloads and tokenizers

2008-08-17 Thread Doron Cohen
> > Implementing payloads via Tokens explicitly prevents the use of payloads > for untokenized fields, as they only support field.stringValue(). There > seems no way to override this. I assume you already know this but just to make sure what I meant was clear - on tokenization but still indexing

Re: Case Sensitivity

2008-08-16 Thread Doron Cohen
Hi Sergey, seems like cases 4 and 5 are equivalent, both meaning case insensitive, right? Otherwise please explain the difference. If it is required to support both case sensitive (cases 1,2,3) and case insensitive (case 4/5) then both forms must be saved in the index - in two separate fields (as Er

Re: Case Sensitivity

2008-08-14 Thread Doron Cohen
> > In example I want to show what I stored field as Field.Index.NO_NORMS > > As I understand it means what field contains original string > despite what analyzer I chose(StandardAnalyzer by default). > This would be achieved by UN_TOKENIZED. The NO_NORMS just guides Lucene to avoid normalizin

Re: Payloads and tokenizers

2008-08-14 Thread Doron Cohen
IIRC first versions of patches that added payloads support had this notion of payload by field rather than by token, but later it was modified to be by token only. I have seen two code patterns to add payloads to tokens. The first one created the field text with a reserved separator/delimiter whi

Re: Number range search

2008-08-13 Thread Doron Cohen
The code seems correct (although it doesn't show which analyzer was used at indexing). Note that when adding numbers like this there's no real point in analyzing them, so I would add that field as UN_TOKENIZED. This would be more efficient, and would also comply with the query parser, which does not

Re: Query to ignore certain phrases

2008-08-12 Thread Doron Cohen
> > I think it should look something like this > > "white house" NOT "russian white house"~1 "a b c"~1 just matches more 'easily' than "a b c". It will match for instance "a b d c". The NOT however excludes all documents which match this, unlike requested logic. In fact, Q1: "a b" NOT "a

Re: Query to ignore certain phrases

2008-08-12 Thread Doron Cohen
I can't see how to accomplish this without writing some special code, and not just because of query parsing. Phrases are searched by iterating the participating term positions and when a match is found say for "b c" there is no way to know whether another query "a b c d" matches exactly the corres

Re: Highlight huge documents

2008-08-11 Thread Doron Cohen
I believe Highlighter.setMaxDocBytesToAnalyze(int byteCount) should be used for this. On Mon, Aug 11, 2008 at 11:40 AM, <[EMAIL PROTECTED]> wrote: > Hello > > I am using Highlighter to highlight query terms in documents getting from a > database founded from lucene search. > My problem is that wh

Re: Re : Stop search process when a given number of hits is reached

2008-08-09 Thread Doron Cohen
> > Ok, I'm not near any documentation now, but I think > throwing an exception is overkill. As I remember > all you have to do is return false from your collector > and that'll stop the search. But verify that. > That would have been much cleaner, however collect() is a void, so throwing a (runti

Re: Need help searching

2008-08-09 Thread Doron Cohen
> > > writer = new IndexWriter("C:\\", new StandardAnalyzer(), true); > > Term term = new Term("line", "KOREA"); > > PhraseQuery query = new PhraseQuery(); > > query.add(term); > StandardAnalyzer - used here while indexing - applies lowercasing. The query is created programmatically - i.e. without

Re: Deleting and adding docs

2008-08-09 Thread Doron Cohen
> > doc.add(new Field(ID_FIELD, id, Field.Store.YES, Field.Index.NO)); > writer.deleteDocuments(new Term(ID_FIELD, id)); > int i = reader.deleteDocuments(new Term(ID_FIELD, id)); //i returns 0 > Both failed. I try to delete one id value that I know for sure it was added > in the first step. > For

Re: Stop search process when a given number of hits is reached

2008-08-07 Thread Doron Cohen
Nothing built in that I'm aware of will do this, but it can be done by searching with your own HitCollector. There is a related feature - stop search after a specified time - using TimeLimitedCollector. It is not released yet, see issue LUCENE-997. In short, the collector's collect() method is invo
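
A sketch of the stop-after-N-hits idea, using a stand-in collector class rather than Lucene's real HitCollector (an assumption for illustration; since collect() returns void, throwing a runtime exception is the way to abort the search loop, as noted in this thread):

```java
public class StopAfterN {
    static class StopCollectingException extends RuntimeException {}

    // Stand-in for a Lucene hit collector: the search loop calls
    // collect(doc, score) once per matching document, in docid order.
    static class FirstNCollector {
        final int max;
        int count;
        FirstNCollector(int max) { this.max = max; }
        void collect(int doc, float score) {
            if (count == max) throw new StopCollectingException(); // abort search
            count++; // ... record the hit here ...
        }
    }

    public static void main(String[] args) {
        FirstNCollector c = new FirstNCollector(3);
        try {
            for (int doc = 0; doc < 10; doc++) c.collect(doc, 1.0f); // simulated search
        } catch (StopCollectingException e) {
            // expected: enough hits were collected, search aborted early
        }
        System.out.println(c.count); // 3
    }
}
```

The caller must catch the exception itself; wrapping the search() call in try/catch keeps the early exit invisible to the rest of the application.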

Re: CustomScoreQuery and BooleanQuery

2008-08-07 Thread Doron Cohen
When combining any sub-queries, a scorer has at least two things to decide: which docs to match and, once matched, how to score. BooleanQuery applies specific logic for this, and some queries allow some control of the way to score. For current CustomScoreQuery things are more straightforward -

Re: Concurrent query benchmarks

2008-06-09 Thread Doron Cohen
On Tue, Jun 10, 2008 at 3:50 AM, Otis Gospodnetic < [EMAIL PROTECTED]> wrote: > Hi Glen, > > Thanks for sharing. Does your benchmarking tool build on top of > contrib/benchmark? (not sure if that one lets you specify the number of > concurrent threads -- if it does not, perhaps this is an opportu

Re: How to add PageRank score with lucene's relevant score in sorting

2008-06-01 Thread Doron Cohen
Hi Jarvis, > I have a problem that how to "combine" two score to sort the search > result documents. > for example I have 10 million pages in lucene index , and i know their > pagerank scores. i give a query to it , every docs returned have a > lucene-score, mark it as R (relevant score)

Re: Opening an index directory inside a jar

2008-06-01 Thread Doron Cohen
> > : The crux of the issue seems to be that lucene cannot open segments file > that > : is inside the jar (under luceneFiles/index directory) > > i'm not entirely sure why it would have problems finding the segments > file, but a larger problem is that Lucene needs random access which (last > time

Re: IndexReader.reopen memory leak

2008-06-01 Thread Doron Cohen
Hi John, IndexReader newInner=in.reopen(); > if (in!=newInner) > { >in.close(); >this.in=newInner; > >// code to clean up my data >_cache.clear(); >_indexData.load(this, true); >init(_fieldConfig); > } > Just to be sure on this, could

Re: LUCENE-933 / SOLR-261

2008-03-18 Thread Doron Cohen
hi Jake, yes it was committed in Lucene - this is visible in the JIRA issue if you switch to the "Subversion Commits" tab, where you can also see the actual diffs that took place. Best, Doron On Tue, Mar 18, 2008 at 7:14 PM, Jake Mannix <[EMAIL PROTECTED]> wrote: > Hey folks, > I was wonder

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-13 Thread Doron Cohen
On Thu, Mar 13, 2008 at 9:30 PM, Doron Cohen <[EMAIL PROTECTED]> wrote: > Hi Daniel, LUCENE-1228 fixes a problem in IndexWriter.commit(). > I suspect this can be related to the problem you see though I am not sure. > Could you try with the patch there? > Thanks, > Doron

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-13 Thread Doron Cohen
Hi Daniel, LUCENE-1228 fixes a problem in IndexWriter.commit(). I suspect this can be related to the problem you see though I am not sure. Could you try with the patch there? Thanks, Doron On Thu, Mar 13, 2008 at 10:46 AM, Michael McCandless < [EMAIL PROTECTED]> wrote: > > Daniel Noll wrote: > >

Re: Extracting terms from a query splitting a phrase.

2008-02-10 Thread Doron Cohen
PhraseQuery.extractTerms() returns the terms making up the phrase, and so it is not adequate for 'finding' a single term that represents the phrase query, one that represents the searched entire text. It seems you are trying to obtain a string that can be matched against the displayed text for e.g

Re: problem with Whitespace analyzer

2008-02-10 Thread Doron Cohen
It should be the parentheses, which are part of the query syntax. Try escaping them - \( \) Also see http://lucene.apache.org/java/2_3_0/queryparsersyntax.html#Escaping%20Special%20Characters Doron On Sun, Feb 10, 2008 at 9:03 AM, saikrishna venkata pendyala < [EMAIL PROTECTED]> wrote: > Hi, > > I am fa

Re: recall/precision with lucene

2008-02-10 Thread Doron Cohen
Take a look at the quality package under contrib/benchmark. Regards, Doron On Sat, Feb 9, 2008 at 2:59 AM, Panos Konstantinidis <[EMAIL PROTECTED]> wrote: > Hello I am a new lucene user. I am trying to calculate the > recall/precision of > a query and I was wondering if lucene provides an easy w

Re: Performance guarantees and index format

2008-02-08 Thread Doron Cohen
I was once involved in modifying a search index implementation (not Lucene) to write posting lists so that they can be traversed (only) in reverse order. Docids were preserved but you got higher IDs first. This was a non-trivial code change. Now the suggestion to (optionally) order merged segments

Re: appending field to an existing index

2008-01-31 Thread Doron Cohen
This may help: http://www.nabble.com/Updating-Lucene-Index-with-Unstored-fields-tt15188818.html#a15188818 Doron On Thu, Jan 31, 2008 at 2:42 AM, John Wang <[EMAIL PROTECTED]> wrote: > Hi all: > >We have a large index and it is difficult to reindex. > >We want to add another field to the

Re: A small doubt related to write.lock

2008-01-30 Thread Doron Cohen
Hi Ajay, IndexReader.unlock() is a brute force call to be used by applications/users knowing that a lock can be safely removed. finalize() on the other hand is a method that Java will call when garbage collecting a no-more-referenced object. So it is often a place for cleanup code. However the pr

Re: contrib/benchmark Quality

2008-01-30 Thread Doron Cohen
Hi Grant, I initially thought of doing so, but after working on the Million Queries Track, where running the 10,000 queries could take more than a day (depending on the settings) and where indexing was done once and took a few days, I felt that tighter control is needed than that provided by the b

Re: Some Help needed in search.

2008-01-29 Thread Doron Cohen
You can add a phrase on the writer field. I.e. with high boost of 3 and low boost of 2, writing 'h' for 'heading' and 'w' for 'writer', try this query: h:sachin^3 d:tendulkar^3 w:sachin^2 w:tendulkar^2 w:"Sachin Tendulkar"^6 On Jan 29, 2008 9:23 AM, Sure <[EMAIL PROTECTED]> wrote: > > Hi Al

Re: Basic Named Entity Indexing

2008-01-08 Thread Doron Cohen
On Jan 8, 2008 11:48 PM, chris.b <[EMAIL PROTECTED]> wrote: > > Wrapping the whitespaceanalyzer with the ngramfilter it creates unigrams > and > the ngrams that i indicate, while maintaining the whitespaces. :) > The reason i'm doing this is because I only wish to index names with more > than one t

Re: Query processing with Lucene

2008-01-08 Thread Doron Cohen
< [EMAIL PROTECTED]> wrote: > Doron Cohen wrote: > > Hi Marjan, > > > > Lucene process the query in what can be called > > one-doc-at-a-time. > > > > For the example query - x y - (not the phrase query "x y") - all > > documents containi

Re: Sorting on tokenized fields

2008-01-08 Thread Doron Cohen
Hi Michael, I think you mean the exception thrown when you search and sort with a field that was not yet indexed: RuntimeException: field "BBC" does not appear to be indexed I think the current behavior is correct, otherwise an application might (by a bug) attempt to sort by a wrong field, th

Re: Basic Named Entity Indexing

2008-01-08 Thread Doron Cohen
Hi Chris, A null pointer exception can be caused by not checking newToken for null after this line: Token newToken = input.next() I think Hoss meant to call next() on the input as long as returned tokens do not satisfy the check for being a named entity. Also, this code assumes white space i

Re: Query processing with Lucene

2008-01-08 Thread Doron Cohen
Hi Marjan, Lucene processes the query in what can be called one-doc-at-a-time. For the example query - x y - (not the phrase query "x y") - all documents containing either x or y are considered a match. When processing the query - x y - the posting lists of these two index terms are traversed, and
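
The one-doc-at-a-time traversal of the two posting lists can be sketched with plain sorted int arrays (a hypothetical `UnionPostings` class; a real scorer would also compute a score for each matched doc from whichever terms hit it):

```java
import java.util.ArrayList;
import java.util.List;

public class UnionPostings {
    // Walk two sorted posting lists in parallel and visit each matching
    // docid exactly once, in increasing order - the union ("x OR y") case.
    static List<Integer> union(int[] x, int[] y) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < x.length || j < y.length) {
            int dx = i < x.length ? x[i] : Integer.MAX_VALUE;
            int dy = j < y.length ? y[j] : Integer.MAX_VALUE;
            int doc = Math.min(dx, dy); // next matching doc across both lists
            out.add(doc); // a scorer would score 'doc' here, then move on
            if (dx == doc) i++;
            if (dy == doc) j++;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(union(new int[] {1, 4, 7}, new int[] {2, 4, 9}));
        // [1, 2, 4, 7, 9]
    }
}
```

A doc appearing in both lists (doc 4 above) is visited once, with both terms contributing to its score.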

Re: Question regarding adding documents

2008-01-07 Thread Doron Cohen
Or, very similar, wrap the 'real' analyzer A with your analyzer that delegates to A but also keeps the returned tokens, possibly by using a CachingTokenFilter. On Jan 7, 2008 7:11 AM, Daniel Noll <[EMAIL PROTECTED]> wrote: > On Monday 07 January 2008 11:35:59 chris.b wrote: > > is it possible to

Re: StopWords problem

2007-12-27 Thread Doron Cohen
AIL PROTECTED]> wrote: > Doron Cohen wrote: > > On Dec 27, 2007 11:49 AM, Liaqat Ali <[EMAIL PROTECTED]> wrote: > > > > > >> I got your point. The program given does not give not any error during > >> compilation and it is interpreted well.

Re: StopWords problem

2007-12-27 Thread Doron Cohen
On Dec 27, 2007 11:49 AM, Liaqat Ali <[EMAIL PROTECTED]> wrote: > I got your point. The program given does not give any error during > compilation and it is interpreted well. But it does not create any > index. When the StandardAnalyzer() is called without a Stopwords list it > works well, b

Re: StopWords problem

2007-12-27 Thread Doron Cohen
This is not a self contained program - it is incomplete, and it depends on files on *your* disk... Still, can you show why you're saying it indexes stopwords? Can you print here few samples of IndexReader.terms().term()? BR, Doron On Dec 27, 2007 10:22 AM, Liaqat Ali <[EMAIL PROTECTED]> wrote:

Re: StopWords problem

2007-12-27 Thread Doron Cohen
Hi Liagat, This part of the code seems correct and should work, so problem must be elsewhere. Can you post a short program that demonstrates the problem? You can start with something like this: Document doc = new Document(); doc.add(new Field("text",URDU_STOP_WORDS[0] +

Re: StopWords problem

2007-12-26 Thread Doron Cohen
On Dec 26, 2007 10:33 PM, Liaqat Ali <[EMAIL PROTECTED]> wrote: > Using javac -encoding UTF-8 still raises the following error. > > urduIndexer.java : illegal character: \65279 > ? > ^ > 1 error > > What am I doing wrong? > If you have the stop-words in a file, say one word in a line, they can be

Re: Modifying StopAnalyzer

2007-12-26 Thread Doron Cohen
> > Can we modify the StopAnalyzer to insert stop words of > another language, instead of English, like Urdu given below: > public static final String[] URDU_STOP_WORDS = { "پر", "کا", "کی", "کو" }; > "new StandardAnalyzer(URDU_STOP_WORDS)" should work. Regards, Doron

Re: problem in indexing documents

2007-12-25 Thread Doron Cohen
> > >document.add(new Field("contents",sb.toString(), > > Field.Store.NO, Field.Index.TOKENIZED)); > In addition, for tokenized but not stored like here, the Field() constructor that takes a Reader param can be handy here. Regards, Doron

Re: document deletion problem

2007-12-20 Thread Doron Cohen
> And, btw, I can still see the terms from the deleted documents when I do > the top terms etc... when will they be gone? > > thanks > > - Original Message > > From: Doron Cohen <[EMAIL PROTECTED]> > > To: java-user@lucene.apache.org > > Sent: Wedn

Re: document deletion problem

2007-12-19 Thread Doron Cohen
On Dec 19, 2007 5:45 PM, Tushar B <[EMAIL PROTECTED]> wrote: > Hi Doron, > > I was just playing around with deletion because I wanted to delete > documents due to spurious entries in one particular field. Could you tell me > how do I file a JIRA issue? > See Lucene's wiki, at page "HowToContribut

Re: document deletion problem

2007-12-19 Thread Doron Cohen
Hi Tushar, This is an interesting scenario! The problem arises from the way search() methods that return Hits are working: for start only 100 matching documents are collected, assuming that apps calling this method will not be interested in more documents than this, and that apps traversing all m

Re: Lucene multifield query problem

2007-12-18 Thread Doron Cohen
Hi Rakesh, It just occurred to me that your code has String searchCriteria = "Indoor*"; Assuming StandardAnalyzer used at indexing time, all text words were lowercased. Now, QueryParser by default does not lowercase wildcard queries. You can however instruct it to do so by calling: myQu

Re: Lucene multifield query problem

2007-12-18 Thread Doron Cohen
Hi Rakesh, Perhaps the confusion comes from the asymmetry between +X and -X. I.e., for the query: A B -C +D one might think that, similar to how -C only disqualifies docs containing C (but not qualifying docs not containing C), also +D only disqualifies docs not containing D. But this is i
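
The asymmetry can be modeled with a toy matcher over term sets (an illustration of the clause semantics described above, not Lucene's BooleanScorer):

```java
import java.util.Set;

public class BooleanClauses {
    // SHOULD terms only affect ranking (optional once a MUST exists);
    // +term (MUST) disqualifies docs lacking it; -term (MUST_NOT)
    // disqualifies docs containing it - but absence of a -term never
    // qualifies a doc by itself.
    static boolean matches(Set<String> doc, Set<String> should,
                           Set<String> must, Set<String> mustNot) {
        for (String t : must) if (!doc.contains(t)) return false;
        for (String t : mustNot) if (doc.contains(t)) return false;
        if (must.isEmpty()) { // no required terms: need at least one SHOULD hit
            for (String t : should) if (doc.contains(t)) return true;
            return should.isEmpty();
        }
        return true; // required terms satisfied; SHOULD terms are optional
    }

    public static void main(String[] args) {
        // query: a b -c +d
        Set<String> should = Set.of("a", "b");
        Set<String> must = Set.of("d");
        Set<String> not = Set.of("c");
        System.out.println(matches(Set.of("d"), should, must, not));      // true
        System.out.println(matches(Set.of("a", "b"), should, must, not)); // false
    }
}
```

So a doc containing only d matches (a and b merely boost), while a doc containing a and b but not d is disqualified, which is exactly the +/- asymmetry in question.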

Re: FuzzyQuery + QueryParser - I'm puzzled

2007-12-17 Thread Doron Cohen
See in Lucene FAQ: "Are Wildcard, Prefix, and Fuzzy queries case sensitive?" On Dec 17, 2007 11:27 AM, Helmut Jarausch <[EMAIL PROTECTED]> wrote: > Hi, > > please help I am totally puzzled. > > The same query, once with a direct call to FuzzyQuery > succeeds while the same query with QueryParse

Re: "Field weights"

2007-12-14 Thread Doron Cohen
It seems that documents having fewer fields satisfying the query are worth more than those satisfying more fields of the query, because the first ones are more "to the point". At least it seems like it in the example. If this makes sense I would try to compose a top level boolean query out of the one-

Re: Handling Indexed, Stored and Tokenized fields

2007-12-12 Thread Doron Cohen
Seems that PerFieldAnalyzerWrapper would be convenient here? Doron On Dec 12, 2007 10:41 PM, ts01 <[EMAIL PROTECTED]> wrote: > > Hi, > > We have a requirement to index as well as store multiple fields in a > document, each with its own special tokenizer. The following seems to > provide a way to

Re: Accessing parsed content in Nutch

2007-12-12 Thread Doron Cohen
You would probably get a better and quicker answer in the Nutch mailing lists: http://lucene.apache.org/nutch/mailing_lists.html Doron On Dec 12, 2007 11:16 PM, Developer Developer <[EMAIL PROTECTED]> wrote: > I believe nutch stores parsed content somewhere. Can you please let me > know > how I can

Re: Applying SpellChecker to a phrase

2007-12-11 Thread Doron Cohen
Yes that's right, my mistake. In fact even after reading your comment I was puzzled because PhraseScorer indeed requires *all* phrase-positions to be satisfied in order to match. The answer is that the OR logic is taken care of by MultipleTermPositions, so the scorer does not need to be aware of a

Re: Problem with termdocs.freq and other

2007-12-10 Thread Doron Cohen
> Seen as that solved all my problems (i think), Glad it helped! (btw it's always like this with debugging - others see stuff in my code that I don't)

Re: Problem with termdocs.freq and other

2007-12-10 Thread Doron Cohen
> while (termDocs.next()) { > termDocs.next(); > } For one, this loop calls next() twice in each iteration, so every second is skipped... ? "chris.b" <[EMAIL PROTECTED]> wrote on 10/12/2007 12:58:15: > > Here goes, > I'm developing an application using lucene which

Re: Applying SpellChecker to a phrase

2007-12-07 Thread Doron Cohen
smokey <[EMAIL PROTECTED]> wrote on 04/12/2007 16:54:32: > Thanks for the information on o.a.l.search.spans. > > I was thinking of parsing the phrase query string into a > sequence of terms, > then constructing a phrase query object using add(Term term, > int position) > method in org.apache.lucen

Re: Errors while running LIA code.

2007-12-06 Thread Doron Cohen
>1) Downloaded http://www.ehatchersolutions.com/downloads/ > LuceneInAction.zip - sorry, lucenebook.com is broken at the moment :( This one works too - http://www.manning.com/hatcher2/ --> Downloads --> Source Code

Re: Applying SpellChecker to a phrase

2007-12-03 Thread Doron Cohen
See below - smokey <[EMAIL PROTECTED]> wrote on 03/12/2007 05:14:23: > Suppose I have an index containing the terms impostor, > imposter, fraud, and > fruad, then presumably regardless of whether I spell impostor and fraud > correctly, Lucene SpellChecker will offer the improperly > spelled versi

Re: can we do partial optimization?

2007-12-03 Thread Doron Cohen
It doesn't make sense to optimize() after every document add. Lucene in fact implements logic in the spirit of what you describe below, when it decides to merge segments on the fly. There are various ways to tell Lucene how often to flush recently added/updated documents and what to merge. But

Re: SpellChecker performance and usage

2007-12-03 Thread Doron Cohen
I didn't have performance issues when using the spell checker. Can you describe what you tried and how long it took, so people can relate to that. AFAIK the spell checker in o.a.l.search.spell does not "expand a query by adding all the permutations of potentially misspelled word". It is based on b

Re: FSDirectory Again

2007-12-02 Thread Doron Cohen
This is from Lucene's CHANGES.txt: LUCENE-773: Deprecate the FSDirectory.getDirectory(*) methods that take a boolean "create" argument. Instead you should use IndexWriter's "create" argument to create a new index. (Mike McCandless) So you should create the FSDir with FSDirect

Re: multireader vs multisearcher

2007-12-02 Thread Doron Cohen
MultiReader is more efficient and is preferred when possible. MultiSearcher allows further functionality. Every time an index has more than a single segment (which is to say almost every index, except right after calling optimize()), opening an IndexReader (or an IndexSearcher) above that index is

Re: Where to place a filter...

2007-11-23 Thread Doron Cohen
Seems you ask whether to remove accents before or after stemming. Here is a discussion of a similar question (for Spanish) - http://www.nabble.com/Snowball-and-accents-filter...--tf3653720.html#a10207399 Hope this helps, Doron Christian Aschoff <[EMAIL PROTECTED]> wrote on 22/11/2007 21:27:20: > Hell

Re: Scoring for all the documents in the index relative to a query

2007-11-20 Thread Doron Cohen
You can also rely on the fact that by default documents are collected in docid order. You can therefore use your own hit collector that, when collecting doc with id n2, assuming the previous doc collected had id n1, would (know to) assign score 0 to all docs with n1 < id < n2. In other words, you can know
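
A toy version of the fill-in-zero-scores idea (a hypothetical helper; a streaming hit collector could instead emit the zeros for the docids between consecutive collected hits as they arrive):

```java
import java.util.Arrays;

public class AllDocScores {
    // Since hits arrive in increasing docid order, every docid between two
    // consecutive collected ids is known to score 0 for this query.
    static float[] scoresForAllDocs(int maxDoc, int[] hitDocs, float[] hitScores) {
        float[] all = new float[maxDoc]; // non-hits default to 0.0f
        for (int i = 0; i < hitDocs.length; i++) {
            all[hitDocs[i]] = hitScores[i];
        }
        return all;
    }

    public static void main(String[] args) {
        float[] s = scoresForAllDocs(5, new int[] {1, 3}, new float[] {0.9f, 0.4f});
        System.out.println(Arrays.toString(s)); // [0.0, 0.9, 0.0, 0.4, 0.0]
    }
}
```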

Re: Problem in Running Lucene Demo

2007-11-19 Thread Doron Cohen
Try "java -verbose" to see more info on class loading. Also try "java -classpath yourClassPath" from the command line. Note that classpath separators may differ between operating systems - e.g. ";" on Windows but ":" on Linux... Doron Liaqat Ali <[EMAIL PROTECTED]> wrote on 19/11/2007 15:43:30

Re: Customized search with Lucene?

2007-10-25 Thread Doron Cohen
"Lukas Vlcek" <[EMAIL PROTECTED]> wrote on 25/10/2007 10:25:23: > Doron, > > You definitely added few important (crucial) questions. There > are important > concerns and I am glad to hear that Lucene community is > debating them. I am > not an Lucene viscera expert thus I can hardly compare simple

Re: Customized search with Lucene?

2007-10-24 Thread Doron Cohen
g collective intelligence). This technique tends to capture > *association* between query words and individual document. > > The nice thing about this approach is that the network can give you > reasonable results even for words which haven't been used > before (as least > that
