Re: Boost doesn't work
Claude Libois writes: Hello. I'm using Lucene for an application and I want to boost the title of my documents. To do that I apply the setBoost method to the title field. However, when I look with Luke (1.6) I don't see any boost on this field, and when I do a search the score isn't changed. What's wrong? How do you search? I guess you cannot see a change unless you combine searches in different fields, since scores are normalized. Morus - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Boost doesn't work
Claude Libois writes: The explanation given by the IndexSearcher indicates that the boost of my title is 1.0 where it should be 10.0. I really don't understand what's wrong. AFAIK you cannot get the boost of a field from the index because it's not stored as such. It's folded into the field's length norm, or something like that, during indexing. Search the list archives for details. Morus
Re: Search performance with one index vs. many indexes
Jochen Franke writes: Topic: Search performance with large numbers of indexes vs. one large index. My questions are: - Is the size of the wordlist the problem? - Would we be a lot faster, when we have a smaller number of files per index? Sure. Look: index lookup of a word is O(ln(n)), where n is the number of words. Lookup of a word in k indexes of m words each is O(k * ln(m)). In the best case all word lists are distinct (purely theoretical), that is n = k*m, or m = n/k. For n = 15 million and k = 800: ln(n) = 16.5, while k*ln(n/k) = 7871. In a realistic case m is much bigger, since the word lists won't be distinct. But it's the linear factor k that bites you. In the worst case (all words in all indices) you get k*ln(n) = 13218.8. HTH Morus
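The arithmetic above can be checked with a few lines of Java (the class and method names here are mine, just for illustration):

```java
public class IndexLookupCost {
    // Relative cost of one term lookup: binary search over a sorted
    // term list is O(ln n) in the number of distinct terms n.
    static double oneIndex(long n) {
        return Math.log(n);
    }

    // Searching k separate indexes of m distinct terms each
    // costs k * ln(m), since every index must be consulted.
    static double manyIndexes(int k, long m) {
        return k * Math.log(m);
    }

    public static void main(String[] args) {
        long n = 15_000_000L; // distinct terms overall
        int k = 800;          // number of separate indexes
        System.out.printf("one index:             %.1f%n", oneIndex(n));           // 16.5
        System.out.printf("k indexes, best case:  %.1f%n", manyIndexes(k, n / k)); // ~7871
        System.out.printf("k indexes, worst case: %.1f%n", manyIndexes(k, n));     // ~13218.8
    }
}
```

The linear factor k dominates either way: even in the unrealistically favourable case of disjoint word lists, 800 small indexes cost roughly 475 times as much per lookup as one merged index.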
Re: help with boolean expression
Omar Didi writes: I have a problem understanding how Lucene would interpret this boolean expression: A AND B OR C. It neither returns the same count as (A AND B) OR C nor as A AND (B OR C). If anyone knows how it is interpreted I would be thankful. Thanks. A AND B OR C creates a query that requires A and B. C influences the score, but is neither sufficient nor required for a match. IMO query parser is broken for queries mixing AND and OR without explicit braces. My favorite example is `a AND b OR c AND d', which equals `a AND b AND c AND d' in query parser. I suggested a patch some time ago, but it's still pending in bugzilla. http://issues.apache.org/bugzilla/show_bug.cgi?id=25820 Don't know if it's still usable with current sources. Morus
Re: Sorting date stored in milliseconds time
Ben writes: I store my date in milliseconds; how can I sort on it? SortField has INT, FLOAT and STRING. Do I need to create a new sort class to sort the long value? Why do you need that precision? Remember: there's a price to pay. The memory required for sorting and the time to set up the sort cache depend on the number of different terms - dates, in your case. I can hardly think of an application where seconds are relevant; what do you need milliseconds for? Morus
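One way to avoid that cost is to index the date at a coarser resolution. A minimal sketch (class and method names are mine): day-resolution strings sort lexicographically in chronological order and collapse millions of distinct millisecond values into a few thousand terms, which keeps the sort cache small.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DayResolution {
    // Render a millisecond timestamp as a yyyyMMdd string.
    // Lexicographic order of these strings equals chronological order.
    static String toDayTerm(long millis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(millis));
    }

    public static void main(String[] args) {
        System.out.println(toDayTerm(0L));          // 19700101
        System.out.println(toDayTerm(86_400_000L)); // 19700102
    }
}
```

Index the resulting string in a keyword field and sort on it with SortField.STRING; only one term per distinct day ends up in the sort cache.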
Re: select where from query type in lucene
Miles Barr writes: On Fri, 2005-02-18 at 03:58 +0100, Miro Max wrote: how can I search for content where type=document or (type=document OR type=view)? Actually I can do it with (type:document OR type:entry) AND queryText as the query string, but does a better way exist to realize this? [...] Another alternative is to put each type in its own index and use a MultiSearcher to pull in the types you want. If the change rate of the index and the number of commonly used type combinations aren't too large, cached filters might be another alternative. Of course the filter would have to be recreated whenever the index changes. The advantage is that, wherever the filter is reused, you save searching for the types on each query, while keeping all documents within one index. Morus
RE: Concurrent searching re-indexing
Paul Mellor writes: 1. If IndexReader takes a snapshot of the index state when opened and then reads the files when searching, what would happen if the files it takes a snapshot of are deleted before the search is performed (as would happen with a reindexing in the period between opening an IndexSearcher and using it to search)? On Unix, open files are still there even if they are deleted (that is, there is no link (filename) to the file anymore, but the file's content still exists). On Windows you cannot delete open files, so Lucene AFAIK (I don't use Windows) postpones the deletion to a time when the file is closed. 2. Does a similar potential problem exist when optimising an index, if this combines all the segments into a single file? AFAIK optimising creates new files. The only problem that might occur is opening a reader during an index change, but that's handled by a lock. HTH Morus
Re: sounds like spellcheck
Aad Nales writes: Steps 2 and 3 have been discussed at length in this forum and have even made it to the sandbox. What I am left with is 1. My thinking is processing a series of replacement statements that go like: -- 'g' sounds like 'ch' if the immediate predecessor is an 's'; 'o' sounds like 'oo' if the immediate predecessor is a consonant -- But before I take this to the next step I am wondering if anybody has created or thought up alternative solutions? An implementation of a rule-based system to create such a pronunciation form can be found in a library called makelib that is part of an editor named leanedit. Unfortunately the website seems to be down. The lib is LGPL. If you're interested, I can send you a copy of the sources. The only ruleset available is German, though. Morus
Re: Disk space used by optimize
Bernhard Messer writes: However, three times the space sounds a bit too much, or I make a mistake in the book. :) There already was a discussion about disk usage during index optimize. Please have a look at the developers list at: http://mail-archives.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1797569 where I made some measurements of the disk usage within Lucene. At that time I proposed a patch which reduced the total used disk space from 3 times to a little more than 2 times the final index size. Together with Christoph we implemented some improvements to the optimization patch and finally committed the changes. Hmm. In the case that the index is in use (open reader), I doubt your patch makes a difference. In that case the disk space used by the non-optimized index will still be occupied even if the files are deleted (on Unix/Linux). What happens if disk space runs out during creation of the compound index? Will the non-compound files be a usable index? Otherwise you risk losing the index. Morus
Re: document numbers
Hi Jonathan, Yet another burning question :-). Can someone explain how the document numbers in Lucene documents work? For example, the TermDocs.doc() method returns the current doc number. How can I get this doc number if I just have a Document? I don't think you can. A document does not even have to be indexed yet. So either you're dealing with some document found in the index, in which case you should have the document number already, or you have a document independent of the index, in which case you have to analyze the document's content and count yourself. Note that term vector support might be useful if you're interested in more than one term (but that requires the document number again). Morus
RE: closing an IndexSearcher
Hi Cocula, And now here is code that works: the only difference from the previous version is the QueryParser call before new IndexWriter. The QueryParser.parse statement seems to close the IndexReader but I really can't figure out how. I rather suspect your OS/filesystem delays the effect of the close. QueryParser does not even know about your searcher. What OS are you using? Morus
Re: English and French documents together / analysis, indexing, searching
[EMAIL PROTECTED] writes: you could try to create a more complex query and expand it into both languages using different analyzers. Would this solve your problem? Would that mean I would have to actually conduct two searches (one in English and one in French), then merge the results and display them to the user? No. You could do a ( ( french-query ) OR ( english-query ) ) construct using one query. So query construction would be a bit more complex, but querying itself wouldn't change. The first thing I'd do in your case would be to look at the differences in the output of the English and French snowball stemmers. I don't speak any French, but you might even be able to use both stemmers on all texts. Morus
Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!
Owen Densmore writes: 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) apparently produces non-word stems, i.e. not really human readable. (Example: generate, generates, generated, generating - generat) Although in typical queries this is not important, because the result of the search is a document list, it *would* be important if we used the stems within a graphical navigation interface. So the question is: is there a way to have the stemmer produce English base forms of the words being stemmed? Rule-based stemmers such as Porter/Snowball cannot do that. But there are (commercial) dictionary-based tools that can, e.g. the Canoo lemmatizer. You might also have a look at egothor's stemmers, which are word-list based. HTH Morus
Re: Best way to find if a document exists, using Reader ...
Praveen Peddi writes: Does it make sense to call docFreq or termDocs (whichever is faster) before calling delete? IMO no. Calling termDocs is what Reader.delete(Term) does:

public final int delete(Term term) throws IOException {
  TermDocs docs = termDocs(term);
  if (docs == null) return 0;
  int n = 0;
  try {
    while (docs.next()) {
      delete(docs.doc());
      n++;
    }
  } finally {
    docs.close();
  }
  return n;
}

(The advantage of OSS is that you can look into its sources.) So it already uses termDocs to see if there's anything to do. I doubt that using docFreq would be much faster. In both cases the term is searched and -- if you don't have to delete anything -- not found. If it's found, docFreq might be faster, but in that case you have to delete and use termDocs anyway. Morus
Re: IndexSearcher and number of occurence
Bertrand VENZAL writes: I'm quite new to this mailing list. I have many difficulties finding the number of occurrences of a word in a document. I need to use IndexSearcher because of the query, but the score returned is not what I'm looking for. I found in the mailing list the class TermDocs, but it seems to work only with IndexReader. The use of a searcher does not prevent the use of a reader (in fact the searcher relies on a reader). So I'd use the searcher to find the document and a reader to get the frequency using IndexReader.termDocs. Depending on how many frequencies you're interested in, the term vector support might be of interest. HTH Morus
Re: HELP! Directory is NOT getting closed!
Joseph Ottinger writes: According to IndexWriter.java, line 246 (in 1.4.3's codebase), if closeDir is set, it's supposed to close the directory. That's fine - but that leads me to believe that for some reason closeDir is *not* set. Why? Under what circumstances would this not be true, and under what circumstances would you NOT want to close the Directory? From the sources you can see that it is true only if the directory was created by the IndexWriter itself. If you provide a directory to the IndexWriter, you have to close it yourself. HTH Morus
Re: Check to see if index is optimized
Crump, Michael writes: Is there a simple way to check and see if an index is already optimized? What happens if optimize is called on an already optimized index - does the call basically do a noop? Or is it still an expensive call? Why don't you just try that? E.g. using Luke. Or three lines of code... You will find that calling optimize on an optimized index does not change the index. (Optimized means just one segment and no deleted documents.) So I guess the answer to your first question can be found in the sources of optimize:

public synchronized void optimize() throws IOException {
  flushRamSegments();
  while (segmentInfos.size() > 1 ||
         (segmentInfos.size() == 1 &&
          (SegmentReader.hasDeletions(segmentInfos.info(0)) ||
           segmentInfos.info(0).dir != directory ||
           (useCompoundFile &&
            (!SegmentReader.usesCompoundFile(segmentInfos.info(0)) ||
             SegmentReader.hasSeparateNorms(segmentInfos.info(0))))))) {
    int minSegment = segmentInfos.size() - mergeFactor;
    mergeSegments(minSegment < 0 ? 0 : minSegment);
  }
}

segmentInfos is private in IndexWriter, so I suspect you cannot check that without modifying Lucene. HTH Morus
Re: Deleting index for DB indexing
mahaveer jain writes: I am using Lucene for my DB indexing. I have 2 columns which are Keyword fields. Now I want to delete index entries based on these 2 keywords. Is it possible? If not, what is the alternative? You can delete documents based on document number from an index reader, and you can get document numbers from searches. So if you can search for the documents to be deleted based on your keywords, there should be no problem deleting them... HTH Morus
Re: QueryParser, default operator
Paul writes: the following code

QueryParser qp = new QueryParser(itemContent, analyzer);
qp.setOperator(org.apache.lucene.queryParser.QueryParser.DEFAULT_OPERATOR_AND);
Query query = qp.parse(line, itemContent, analyzer);

doesn't produce the expected result, because a query foo bar results in: itemContent:foo itemContent:bar whereas foo AND bar results in +itemContent:foo +itemContent:bar If I understand the default operator correctly, the first query should have been expanded to the same as the latter one, shouldn't it? Try qp.parse(line). parse(String query, String field, Analyzer analyzer) is a static method that creates its own instance of QueryParser, which does not know anything about the settings of your qp object. Morus
Re: (Offtopic) The unicode name for a character
Hi Peter, The Question: In Java generally, is there an easy way to get the Unicode name of a character? (e.g. LATIN SMALL LETTER A from 'a') ... I'm considering taking the Unicode name for each character I encounter and regexping it against something like: ^LATIN .* LETTER (.) WITH .*$ ... to try and extract the single A-Z|a-z character. There used to be a list (ASCII) on some ftp server at unicode.org. I have a version 'UnicodeData.txt' here. It lists ~12000 characters in the form 01A4;LATIN CAPITAL LETTER P WITH HOOK;Lu;0;L;N;LATIN CAPITAL LETTER P HOOK;;;01A5; 01A5;LATIN SMALL LETTER P WITH HOOK;Ll;0;L;N;LATIN SMALL LETTER P HOOK;;01A4;;01A4 If you cannot find that list somewhere, I can mail you a copy. It would be a nice contribution if you could add your filter to Lucene's sandbox once it's finished. Morus
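A sketch of the approach, using the two UnicodeData.txt records quoted above (the class name and the case handling are my additions; the regex is adapted from the one proposed in the question):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeNameFold {
    // The second semicolon-separated field of a UnicodeData.txt record
    // is the character name; the regex pulls the base letter out of
    // names like "LATIN CAPITAL LETTER P WITH HOOK".
    private static final Pattern BASE =
        Pattern.compile("^LATIN (CAPITAL|SMALL) LETTER (.) WITH .*$");

    // Returns the plain A-Z/a-z letter for a record, or null if the
    // name doesn't match the LATIN ... WITH ... pattern.
    static String baseLetter(String dataLine) {
        String name = dataLine.split(";")[1];
        Matcher m = BASE.matcher(name);
        if (!m.matches()) return null;
        String letter = m.group(2);
        return m.group(1).equals("SMALL") ? letter.toLowerCase() : letter;
    }

    public static void main(String[] args) {
        // Sample records quoted in the mail above.
        System.out.println(baseLetter(
            "01A4;LATIN CAPITAL LETTER P WITH HOOK;Lu;0;L;N;LATIN CAPITAL LETTER P HOOK;;;01A5;")); // P
        System.out.println(baseLetter(
            "01A5;LATIN SMALL LETTER P WITH HOOK;Ll;0;L;N;LATIN SMALL LETTER P HOOK;;01A4;;01A4")); // p
    }
}
```

Parsing the full file is just a matter of applying baseLetter to each line and keeping the code-point-to-letter pairs that match.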
Re: Synonyms for AND/OR/NOT operators
Erik Hatcher writes: On Dec 21, 2004, at 3:04 AM, Sanyi wrote: What is the simplest way to add synonyms for AND/OR/NOT operators? I'd like to support two sets of operator words, so people can use either the original English operators or my custom ones for our local language. There are two options that I know of: 1) add synonyms during indexing and 2) add synonyms during querying. Generally this would be done using a custom analyzer. I guess you misunderstood the question. I think he wants to know how to create a query parser understanding something like 'a UND b' as well as 'a AND b', to support localized operator names (German in this case). AFAIK that can only be done by copying query parser's javacc source and adding the operators there. Shouldn't be difficult, though it's a bit ugly since it implies code duplication. And there will be no way of choosing the operators dynamically at runtime; one will need different query parsers for different languages. Morus
Re: Synonyms for AND/OR/NOT operators
Sanyi writes: Well, I guess I'd better recognize and replace the operator synonyms with their original format before passing them to QueryParser. I don't feel comfortable tampering with Lucene's source code. Apart from knowing how to compile Lucene (including the javacc code generation) you should only need to change

<DEFAULT> TOKEN : {
  <AND: ("AND" | "&&")>
| <OR: ("OR" | "||")>
| <NOT: ("NOT" | "!")>

to

<DEFAULT> TOKEN : {
  <AND: ("AND" | "your version of AND" | "&&")>
| <OR: ("OR" | "your version of OR" | "||")>
| <NOT: ("NOT" | "your version of NOT" | "!")>

in jakarta-lucene/src/java/org/apache/lucene/queryParser/QueryParser.jj Replacing the operators before parsing might be hard to do if you want to handle cases like "a AND b" OR c -- a query for the phrase "a AND b" or the token c -- correctly. Morus
Re: Queries difference
Alex Kiselevski writes: Hello, I want to know if there is a difference between the queries: +city(+London Amsterdam) +address(1_street 2_street) and +city(+London) +city(Amsterdam) +address(1_street) +address(2_street) I guess you mean city:(... and so on. The first query searches documents containing 'London' in city (scoring results that also contain 'Amsterdam' higher) and containing '1_street' or '2_street' in address. The second query searches for documents containing both 'London' and 'Amsterdam' in city and both '1_street' and '2_street' in address. Note that the '+' before 'London' in the second query doesn't mean anything. HTH Morus
Re: NUMERIC RANGE BOOLEAN
Erik Hatcher writes: TooManyClauses exception occurs when a query such as a RangeQuery expands to more than 1024 terms. I don't see how this could be the case in the query you provided - are you certain that is the query that generated the error? Why not? The terms might be 0003 0003.1 0003.11 ... So the question is, what do his terms look like... Morus
Re: Unexpected TermEnum behavior
Chris Hostetter writes: I thought it was documented in the TermEnum interface, but looking at it now I realize that not only does the TermEnum javadoc not explain it very well, but the class FilteredTermEnum (which implements TermEnum) actually documents the opposite behavior... public Term term() Returns the current Term in the enumeration. Initially invalid, valid after next() called for the first time. That's a documentation bug. Fixed in CVS. http://issues.apache.org/bugzilla/show_bug.cgi?id=32353 Morus
Re: hits.length() changes during delete process.
David Townsend writes: So the short question is: should the hits object be changing, and what is the best way to delete all the results of a search (it's a range query, so I can't use delete(Term term))? The hits object caches only part of the hits (initially the first 100 or so). This cache is extended, by repeating the search, when further hits are accessed. Since you deleted part of the hits at this point, your hits object changes. You should be able to get around this by either scanning the hits object from end to start instead of start to end, or by deleting with a different index reader; in the latter case the searcher should not see the deletions. Reversing the order might be preferable, since it implies only one search repetition. (Both suggestions untested.) The best way would probably be to avoid a Hits object entirely and delete the documents at the level where the hits object is created. Have a look at the sources for details. (Also untested; I never needed more than term-based deletions.) Morus
Re: indexReader close method
Helen Warren writes: //close the IndexReader object myReader.close(); //return results return hits; The myReader.close() line causes the IOException to be thrown. Are you sure it's the myReader.close() that fails? I'd suspect that to fail as soon as you want to do anything meaningful with the hits object you return. You need an open searcher/reader for that, and in general it should be the one you used during the search. This is assuming hits is an instance of class org.apache.lucene.search.Hits; its method Document doc(int n) relies on the searcher used for the search not being closed. So I'd suspect the IOException is thrown later. Of course removing the myReader.close(); will prevent the exception. You cannot close the reader as long as you want to access search results. In this case, the reader appears to close without error but even after I've called myReader.close() I can execute the maxDoc() method on that object and return results. Anybody shed any light? Yes: the source ;-) maxDoc does not access the index files but returns an integer stored in the class itself. Morus
Re: Numeric Range Restrictions: Queries vs Filters
Hoss writes: (c) Filtering. Filters in general make a lot of sense to me. They are a way to specify (at query time) that only a certain subset of the index should be considered for results. The Filter class has a very straightforward API that seems very easy to subclass to get the behavior I want. The Query API on the other hand ... I freely admit that I can't make heads or tails out of it. I don't even know where I would begin to try and write a new subclass of Query if I wanted to. I would think that most people who want to do a numeric range restriction on their data probably don't care about the scoring benefits of RangeQuery. Looking at the code base, the way DateFilter works seems like it provides an ideal solution to any sort of range restriction (not just dates) that *should* be more efficient than using RangeQuery when dealing with an unbounded value set. (Both approaches need to iterate over all of the terms in the specified field using TermEnum, but RangeQuery has to build up a set of BooleanQuery objects for each matching term, and then each of those queries has to help score the documents -- DateFilter on the other hand only has to maintain a single BitSet of documents that it finds as it iterates.) IMO there's another option, at least as long as the number of your documents isn't too high. Sorting already creates a list of all field values for some field that will be used during the search for sorting. Nothing prevents you from using that approach for search restriction also. The advantage is that you can create that list once and use it for different ranges until the index is changed, whereas a filter can only represent one range. The disadvantage is that you have to keep one value for each document in memory instead of one bit in a filter. I did that (before the sort code was introduced) for date queries in order to be able to sort and restrict searches on dates.
But I haven't thought about what a general API for such a solution might look like so far. Of course it depends on a number of questions which way is preferable: how often is the index modified, are range queries usually done for the same or different ranges, how many documents are indexed, and so on. Morus
Re: Help on the Query Parser
Terence Lai writes: Looks like the wildcard query disappeared. In fact, I am expecting text:"java* developer" to be returned. It seems to me that the QueryParser cannot handle a wildcard within a quoted string. That's not just QueryParser. Lucene itself doesn't handle wildcards within phrases. You could have a query text:"java* developer" if '*' isn't removed by the analyzer, but it would only search for the token 'java*', not any expansion of it. I guess this is not what you want. Morus
Re: Using multiple analysers within a query
Kauler, Leto S writes: Would anyone have any suggestions on how this could be done? I was thinking maybe the QueryParser would have to be changed/extended to accept a separator other than colon ':', something like '=' for example, to indicate that a clause is not to be tokenised. I suggested that in a recent discussion and Erik Hatcher objected that it isn't a good idea to require that users know which field to query in which way. I guess he is right. If your query isn't entered by users, you shouldn't use query parser in most cases anyway. Or perhaps this can all be done using a single analyser? Look at PerFieldAnalyzerWrapper. You will probably have to write a keyword analyzer (unless you can use the whitespace analyzer in your case). HTH Morus
Re: Using multiple analysers within a query
Erik Hatcher writes: If your query isn't entered by users, you shouldn't use query parser in most cases anyway. I'd go even further and say in all cases. If you use Lucene as a search server you have to provide the query somehow. E.g. we have a PHP application that sends queries to a Lucene search servlet. In this case it's justifiable to serialize the query into query parser syntax on the client side and have query parser read the query again on the server side. I don't recall any problems with the approach, since we clean up the user input before constructing the query. Morus
Re: WildcardTermEnum skipping terms containing numbers?!
Sanyi writes: If there's a bug, it should be tracked down, not worked around... Sure, but I'm working with 20 million records and it takes about 25 hours to re-index, so I'm looking for a way that doesn't require reindexing. Why reindex? My code was:

WildcardTermEnum wcenum = new WildcardTermEnum(reader, term);
while (wcenum.next()) {
  terms.add(new WeightedTerm(termgroup, wcenum.term().text()));
  //System.out.println(wcenum.term().text());
}

And it skipped lots of things it shouldn't have skipped. As stated at the end of my mail, I'd expect that to skip the first term in the enum. Is that what you miss, or do you lose more than one term? Morus
Re: WildcardTermEnum skipping terms containing numbers?!
Sanyi writes: Enumerating the terms using WildcardTermEnum and an IndexReader seems to be too buggy to use. If there's a bug, it should be tracked down, not worked around... But it looks OK to me:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.*;
import org.apache.lucene.document.*;
import org.apache.lucene.store.*;
import org.apache.lucene.search.*;

public class LuceneTest {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("foo", "blabla etc.. etc... c0la c0ca caca ccca", true, true, true));
    writer.addDocument(doc);
    writer.close();
    IndexReader reader = IndexReader.open(dir);
    WildcardTermEnum enum = new WildcardTermEnum(reader, new Term("foo", "c??a"));
    do {
      System.out.println(enum.term().text());
    } while ( enum.next() );
    WildcardQuery wq = new WildcardQuery(new Term("foo", "c??a"));
    Query q = wq.rewrite(reader);
    System.out.println(q.toString());
    reader.close();
  }
}

gives

c0ca
c0la
caca
ccca
foo:c0ca foo:c0la foo:caca foo:ccca

The only bug I see is in the docs, which claim enum.term() to be invalid before the first call to next(); that does not seem to be the case. So if you use while ( enum.next() ) { ... } you will lose the first term, whatever it is. Looking at the sources I find that this behaviour is shared by FuzzyTermEnum. Both implementations of the abstract FilteredTermEnum class call setEnum at the end of the constructor, which prepares the first result. Morus
Re: problems search number range
[EMAIL PROTECTED] writes: I need to solve this search: number: -10, range: -50 TO 5. I need help.. I can't find anything using Google. If your numbers are in the interval MIN..MAX and MIN < 0, you can shift that to a positive interval 0..(MAX-MIN) by subtracting MIN from each number. Alternatively you have to find a string representation providing the correct order for signed integers. E.g. -0010 -0001 0 1 00020 should work (in the range -..9), since '0' has a higher ASCII (Unicode) code than '-'. Of course the analyzer has to preserve the '-', and the '-' should not be eaten by the query parser in case you use it. I don't know if there are problems with that, but I suspect there are, at least for the query parser. Morus
Re: problems search number range
[EMAIL PROTECTED] writes: This solution was the first that I tried, but it does not run correctly, because when we try to sort these numbers in alphanumeric order we find that -0010 is higher than -0001. Right, I failed to see that. So you would have to use a complement for negative numbers as well, e.g. using -9989 for -10, -9998 for -1, ... But shifting the interval is easier, of course. Morus
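The interval-shift approach can be sketched as follows (the class name and the MIN/MAX bounds are mine; pick bounds that cover your data). After shifting, every encoded term is non-negative and zero-padded, so plain lexicographic term order equals numeric order and an ordinary range query works:

```java
public class SortableNumbers {
    static final int MIN = -50;  // smallest value we expect to index (assumption)
    static final int MAX = 500;  // largest value we expect to index (assumption)

    // Shift the value into 0..(MAX-MIN) and zero-pad it, so that
    // lexicographic order of the encoded terms equals numeric order.
    static String encode(int value) {
        if (value < MIN || value > MAX)
            throw new IllegalArgumentException("out of range: " + value);
        int width = Integer.toString(MAX - MIN).length();
        return String.format("%0" + width + "d", value - MIN);
    }

    public static void main(String[] args) {
        System.out.println(encode(-50)); // 000
        System.out.println(encode(-10)); // 040
        System.out.println(encode(5));   // 055
        System.out.println(encode(500)); // 550
    }
}
```

The range "-50 TO 5" from the question then becomes a query on the encoded terms 000 TO 055, with no '-' character left for the analyzer or query parser to mishandle.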
Re: Searching and indexing from different processes (applications)
K Kim writes: I just started to play around with Lucene. I was wondering if searching and indexing can be done simultaneously from different processes (two different processes). For example, searching is serviced from a web application, while indexing is done periodically from a stand-alone application. What would be the best way to implement this? Simply do it. The only things you have to keep in mind are: a) you cannot have more than one process/thread writing to Lucene; b) an index reader/searcher will not see updates unless it's closed and reopened. So all you need is your web app, your indexing process, and some way to inform the web app after indexing that it should reopen the index. Morus
Re: Phrase search for more than 4 words throws exception in QueryParser
Sanyi writes: How to perform phrase searches for more than four words? This works well with 1.4.2: "aa bb cc dd" I pass the query as a command line parameter on XP: \"aa bb cc dd\" QueryParser translates it to: text:aa text:bb text:cc text:dd Runs, searches, finds proper matches. This throws an exception in QueryParser: "aa bb cc dd ee" I pass the query as a command line parameter on XP: \"aa bb cc dd ee\" The exception's text is: org.apache.lucene.queryParser.ParseException: Lexical error at line 1, column 13. Encountered: <EOF> after : "aa bb cc dd Works for me on linux: java -cp lucene.jar org.apache.lucene.queryParser.QueryParser 'a b c d e f g h i j k l m n o p q r s t u v w x y z' a b c d e f g h i j k l m n o p q r s t u v w x y z Must be an XP command line problem. HTH Morus
Re: stopword AND validword throws exception
Sanyi writes: This query works as expected: validword AND stopword (throws out the stopword part and searches for validword) This query seems to crash: stopword AND validword (java.lang.ArrayIndexOutOfBoundsException: -1) Maybe it can't handle the case if it had to remove the very first part of the query?! Can anyone else test this for me? How can I overcome this problem? see bug: http://issues.apache.org/bugzilla/show_bug.cgi?id=9110 Morus
Re: stopword AND validword throws exception
Sanyi writes: Thanx for your replies guys. Now, I was trying to locate the latest patch for this problem group, and the last thread I've read about this is: http://issues.apache.org/bugzilla/show_bug.cgi?id=25820 It ends with an open question from Morus: If you want me to change the patch, let me know. That's no big deal. Did you change the patch since then? No. But this is an independent issue from the `stopword AND word' problem. The `stopword AND word' problem just has to be taken care of in that context also. Bug 25820 basically is about better handling of AND and OR in a query. Currently `a AND b OR c AND d' equals `a AND b AND c AND d' in query parser. Can I simply download the latest compiled development version of lucene.jar and will it fix my problem? If there are no current nightly builds, I guess you will have to get the sources from cvs directly. But the fix seems to be included in 1.4.2. see http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.96.2.4 item 5 Morus
Re: A TokenFilter to split words and numbers
william.sporrong writes: Does it have something to do with the QueryParser guessing what kind of query it is by examining the string and thus presuming that the first string should not be parsed into a PhraseQuery? QueryParser creates a PhraseQuery for words that are tokenized to more than one token. You should see that in the serialized query. Anyway, if there is a correct way to accomplish what I want could anyone please give me a hint? One way I thought about is pre-parsing the query and constructing several subqueries, i.e. PhraseQuerys and so on, and then combining them in a BooleanQuery, but I guess there is a nicer solution? I guess you could override the getFieldQuery method of query parser and change the way queries are generated. I have a similar problem with another Filter I'm trying to implement that should remove certain suffixes and replace them with a wildcard (bilar -> bil*). If you expect bil* to be executed as a wildcard/prefix query, this cannot work. The query parser parses the query, not the analyzer output. Again you might introduce such behaviour in getFieldQuery. Morus
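A sketch of the getFieldQuery override, assuming the 1.4-era QueryParser signature (the class name, field names, and the prefix-rewrite logic are made up; check the method signature in your Lucene version):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

// Hypothetical parser that could turn a token the analyzer rewrote to a
// suffix-stripped form (e.g. "bilar" -> "bil") into a PrefixQuery instead
// of an ordinary term/phrase query.
public class SuffixWildcardQueryParser extends QueryParser {
    public SuffixWildcardQueryParser(String field, Analyzer analyzer) {
        super(field, analyzer);
    }

    protected Query getFieldQuery(String field, Analyzer analyzer, String queryText)
            throws ParseException {
        Query q = super.getFieldQuery(field, analyzer, queryText);
        // ... inspect q (or re-run the analyzer on queryText) and, where a
        // token was reduced to a stem, build a prefix query yourself, e.g.:
        // return new PrefixQuery(new Term(field, "bil"));
        return q;
    }
}
```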
Re: jaspq: dashed numerical values tokenized differently
Daniel Taurat writes: Hi, I have just another stupid parser question: There seems to be a special handling of the dash sign - different from Lucene 1.2 at least in Lucene 1.4.RC3 StandardAnalyzer. Examples (1.4RC3): A document containing the string dash-test is matched by the following search expressions: dash test, dash*, dash-test. It is _not_ matched by the following search expressions: dash-*, dash-t*. If the string after the dash consists of digits, the behavior is different. E.g., a document containing the string dash-123 is matched by: dash*, dash-*, dash-123. It is not matched by: dash 123. Question: Is this, esp. the different behavior when parsing digits and characters, intentional and how can it be explained? Regards, Query parser was changed to treat '-' within words as part of the word. Before that change a query 'dash-test' was parsed as 'dash AND NOT test'. Now QP reads one word 'dash-test' which is analyzed. If the analyzer splits that into more than one token (standard analyzer does) a phrase query is created. The difference you see comes from standard analyzer, which tokenizes dash-test to the tokens dash and test, but keeps dash-123 as one token. Prefix queries aren't analyzed. Morus
Re: Locks and Readers and Writers
[EMAIL PROTECTED] writes: Hi Christoph, That's what I thought. But what I'm seeing is this: - open reader for searching (the reader is opening an index on a remote machine (via UNC) which takes a couple seconds) - meanwhile the other service opens an IndexWriter and adds a document (the index writer determines that it needs to merge so it tries to get a lock. since the reader is still opening, the IO exception is thrown) I believe that increasing the merge factor will reduce the opportunity for this to occur. But it will still occur at some point. I'm not sure what you mean by `opening an index on a remote machine (via UNC)' but have you made sure that lock files are put in the same directory for both processes (see the mailing list archive for details)? Also note that lucene's locking is known not to work on NFS (also see the list archive). I don't know if it works on SMB mounts. Morus
Re: Searching for a phrase that contains quote character
Daniel Naber writes: On Thursday 28 October 2004 19:03, Justin Swanhart wrote: Have you tried making a term query by hand and testing to see if it works? Term t = new Term(field, "this is a \"test\""); PhraseQuery pq = new PhraseQuery(t); That's not a proper PhraseQuery, it searches for *one* term `this is a "test"' which is probably not what one wants. You have to add the terms one by one to a PhraseQuery. Will spoke of a keyword field, in which case he would want to search for one term. Using a TermQuery makes more sense, though. Morus
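For reference, a proper PhraseQuery adds its terms one by one, all in the same field (field and words are made up; the 1.4-era API is assumed):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class PhraseExample {
    public static void main(String[] args) {
        PhraseQuery pq = new PhraseQuery();
        // one Term per word, not one Term holding the whole phrase
        pq.add(new Term("contents", "this"));
        pq.add(new Term("contents", "is"));
        pq.add(new Term("contents", "a"));
        pq.add(new Term("contents", "test"));
        System.out.println(pq.toString("contents")); // prints the phrase in quotes
    }
}
```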
Re: new version of NewMultiFieldQueryParser
Bill Janssen writes: Try to see the behavior if you want to have a single term query, just something like: robust .. and print out the query string ... Sure, that works fine. For instance, if you have the three default fields title, authors, and contents, the one-word search robust expands to title:robust authors:robust contents:robust just as it should. Try to see what is happening with Prefix, Wild, and Fuzzy searches ... Good point. My older version (see below) found these, but the new one doesn't. Oh, well, back to the working version. I knew there was some reason getFieldQuery wasn't sufficient. Wouldn't it be better to go on and override the methods creating these types of queries too? Morus
Re: Locks and Readers and Writers
Christoph Kiehl writes: AFAIK you should never open an IndexWriter and an IndexReader at the same time. You should use only one of them at a time but you may open as many IndexSearchers as you like for searching. You cannot open an IndexSearcher without opening an IndexReader (explicitly or implicitly). Morus
Re: Null or no analyzer
Erik Hatcher writes: however perhaps it should be. Or perhaps there are other options to solve this recurring dilemma folks have with Field.Keyword indexed fields and QueryParser? I think one could introduce a special syntax in query parser for keyword fields. Query parser wouldn't analyze them at all in this case. Something like field#Keyword or field#"keyword containing blanks". I haven't thought through all consequences for field#(keywordA keywordB otherfield:noKeyword) but I think it should be doable. Doesn't make query parser simpler, on the other hand. Morus
RE: Null or no analyzer
Aviran writes: You can use WhitespaceAnalyzer Can he? If `Elections 2004' is one token in the subject field (keyword), this will fail, since WhitespaceAnalyzer will tokenize that to `Elections' and `2004'. So I guess he has to write an identity analyzer himself unless there is one provided (which doesn't seem to be the case). The only alternatives are not using query parser or extending query parser with a keyword syntax, as far as I can see. Morus -Original Message- From: Rupinder Singh Mazara [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 19, 2004 11:23 AM To: Lucene Users List Subject: Null or no analyzer Hi All I have a question regarding selection of Analyzers during query parsing. I have three fields in my index: db_id, full_text, subject. All three are indexed; however, while indexing I specified to lucene to index db_id and subject but not tokenize them. I want to give a single search box in my application to enable searching for documents. Some query can look like motor cross rally; this will get fed to QueryParser to do the relevant parsing. However if the user enters Jhon Kerry subject:Elections 2004 I want to make sure that no analyzer is used for the subject field. How can that be done? This is because I expect the users to know the subject from a list of controlled vocabularies and also I am searching for documents that have the exact subject. I tried using the PerFieldAnalyzerWrapper, but how do I get hold of an Analyzer that does nothing but pass the text through to the Searcher?
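An identity ("keyword") analyzer is short to write. A sketch against the 1.4-era analysis API (the class name is made up), combined with PerFieldAnalyzerWrapper as the original poster intended:

```java
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Passes the whole field value through as a single token.
public class IdentityAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, final Reader reader) {
        return new TokenStream() {
            private boolean done = false;
            public Token next() throws IOException {
                if (done) return null;
                done = true;
                // read the entire input into one token
                StringBuffer sb = new StringBuffer();
                char[] buf = new char[256];
                for (int n = reader.read(buf); n > 0; n = reader.read(buf))
                    sb.append(buf, 0, n);
                return new Token(sb.toString(), 0, sb.length());
            }
        };
    }
}

// Usage sketch: analyze everything normally, leave 'subject' untouched:
// PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
// wrapper.addAnalyzer("subject", new IdentityAnalyzer());
```

Note this only controls analysis; the query parser will still split the query on whitespace unless the subject value is quoted, which is the syntax problem discussed in the rest of the thread.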
RE: QueryParsing
Rupinder Singh Mazara writes: hi erik and everyone else ok i will buy the book ;) but this still does not solve the problem of why String x = "\"jakarta apache\"~100"; is being translated as a PhraseQuery FULL_TEXT:"jakarta apache"~100 is the correct query being formed? or is there something wrong with the Proximity Search topic in the URL http://jakarta.apache.org/lucene/docs/queryparsersyntax.html A proximity search is done by a PhraseQuery with a slop. The slop makes the PhraseQuery perform a proximity search (so you can argue that the name is problematic). That's what query parser creates. SpanQueries were introduced later. Maybe you can get the effect of a proximity search with SpanQueries also, but that's not handled by the query parser. Morus
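In API terms, the "jakarta apache"~100 query the parser builds corresponds to a PhraseQuery with a slop of 100 (field name taken from the thread; 1.4-era API assumed):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class ProximityExample {
    public static void main(String[] args) {
        PhraseQuery pq = new PhraseQuery();
        pq.add(new Term("FULL_TEXT", "jakarta"));
        pq.add(new Term("FULL_TEXT", "apache"));
        pq.setSlop(100); // allow up to 100 position moves between the terms
        System.out.println(pq.toString("FULL_TEXT")); // "jakarta apache"~100
    }
}
```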
Re: StopWord elimination pls. HELP
Miro Max writes: String cont = rs.getString(x); d.add(Field.Text(cont, cont)); writer.addDocument(d); to get results from a database into a lucene index. but when i check println(d) i can see the german stopwords too. how can i eliminate this? Stopwords in an analyzer don't make the stopwords disappear from the document, they only prevent them from being indexed. So you will always see stopwords in the document (before indexing and, if the field is stored, when the document is retrieved from the index). A meaningful check whether stopwords are recognized would be to search for a stopword. You shouldn't find anything... HTH Morus
Re: How extract a Field.Text(String, String) field to process it with a Stylesheet?
Otis Gospodnetic writes: That's likely because you used an Analyzer that stripped the XML tags from the original text. If you want to preserve the original text, use an Analyzer that doesn't throw your XML away. You can write your own Analyzer that doesn't discard anything, for instance. An analyzer doesn't change the stored content, only the indexed tokens. So if something threw away the tags (or just the special characters) it must have happened before Field.Text(String, String) was called. This of course wouldn't be surprising, since indexing xml often means to extract the text from an xml document and index that text. Morus
Re: WildCardQuery
Robinson Raju writes: The way i have done it is: if there is a wildcard, use WildCardQuery, else other. Here searchFields is an array which contains the column names; searchString is the value to be searched. if ((searchString.indexOf(IOOSConstants.ASTERISK) > -1) || (searchString.indexOf(IOOSConstants.QUESTION_MARK) > -1)) { WildcardQuery wQuery = new WildcardQuery(new Term( searchFields[0], searchString)); booleanQuery.add(wQuery, true, false); if (searchFields.length > 1) { WildcardQuery wQuery2 = new WildcardQuery(new Term( searchFields[1], searchString)); booleanQuery.add(wQuery2, true, false); } } else { Query query = MultiFieldQueryParser.parse(searchString, searchFields, flags, analyzer); booleanQuery.add(query, true, false); } Query queryfilter = MultiFieldQueryParser.parse(filterString, filterFields, flags, analyzer); QueryFilter queryFilter = new QueryFilter(queryfilter); hits = parallelMultiSearcher.search(booleanQuery, queryFilter); In the meanwhile, i thought i would tokenize the string based on space if the input contains spaces and then add the tokens one by one into booleanQuery. But this gave a StringIndexOutOfBoundsException. So am still trying... Thanks for your help. Would appreciate greatly if you could give me more pointers. Did you look at the output of query.toString(defaultfield)? That's usually the best way to see if a constructed query is what you expect it to be. Why isn't creating wildcard queries left to the query parser? Morus
Re: BooleanQuery - Too Many Clases on date range.
Chris Fraschetti writes: So i decided to move my epoch date to the 20040608 date format which fixed my boolean query problem in regards to my current data size (approx 600,000) but now as soon as I do a query like ... a* I get the boolean error again. Google obviously can handle this query, and I'm pretty sure jguru.com can handle it too.. any ideas? With or without a date range specified i still get the TooManyClauses error. I tried cranking the maxclauses up to Integer.MaxInt, but java gave me an out of memory error. Is this b/c the boolean search tried to allocate that many clauses by default or because my query actually needed that many clauses? boolean search allocates clauses for all tokens having the prefix or matching the wildcard expression. Why does it work on small indexes but not large? Because there are fewer tokens starting with a. Is there any way to have the parser create as many clauses as it can and then search with what it has? w/o recompiling the source? You need to create your own version of Wildcard- and Prefix-Query that takes a maximum term number and ignores further clauses. And you need a variant of the query parser that uses these queries. This can be done, even without recompiling lucene, but you will have to do some programming at the level of lucene queries. Shouldn't be hard, since you can use the sources as a starting point. I guess this does not exist because the lucene developers decided to prefer a query error rather than incomplete results. Morus
Re: different analyzer all produce the same index?
sergiu gordea writes: Daan Hoogland wrote: H all, I try to create different indices using different Analyzer-classes. I tried standard, german, russian, and cjk. They all produce exactly the same index file (md5-wise). There are over 280 pages so I expected at least some differences. Take a look in the lucene source code... Maybe you will find the answer ... I asume that all the pages you indexed were written in English, therefore is normal that german, russian and cjk analyzers to create identic indexex, but htey should be different than english one (StandardAnalyzer) german analyzer definitely won't leave english text as it is, since it does algorithmic stemming. E.g. your text gets tak a look in the luc sourc cod mayb you will find the answ i asum tha all the pag you indexed wer writt in english therefor is normal tha germa russia and cjk analyx to crea identic indexex but htey should be diff tha english one standardanalyx while std analyzer does not stem at all and gives take a look in the lucene source code maybe you will find the answer i asume that all the pages you indexed were written in english therefore is normal that german russian and cjk analyzers to create identic indexex but htey should be different than english one standardanalyzer I'd rather suspect some problem with the indexing code. So my advice is to check what the analyzer produces. Morus
Re: Seraching in Keyword Field
Bernhard Messer wrote Hi, try that query: MyKeywordField:(ABC) Why should that help? foo:(bla) and foo:bla create the same query: java -classpath lucene-1.4.1/lucene-1.4.1.jar org.apache.lucene.queryParser.QueryParser 'foo:(bla)' foo:bla java -classpath lucene-1.4.1/lucene-1.4.1.jar org.apache.lucene.queryParser.QueryParser 'foo:bla' foo:bla As often, the necessary step is to look at what query parser produced, using query.toString(). I guess SimpleAnalyzer lowercases the term and prevents entries 'ABC' from being found. Using an appropriate PerFieldAnalyzerWrapper might help. Morus
Re: list of removed stop words
Chris Fraschetti writes: Is there a way to via the parser or the query retrieve a list of the stop words removed by the analyzer? or should i just check my query against .STOPWORDS and do it myself? Query parser does not provide that info. Of course you might consider adding this inside query parser. Doing the check yourself outside QP means that you have to parse a second time... Morus
Re: online and offline Directory
Ernesto De Santis writes: Hi Aviran Thanks for the response. I forgot important information for you to understand my issue. My process does something like this: The index has contents from different sources, identified by a special field 'source'. So the index has documents with source: S1 or source: S2 ... etc. When I reindex the source S1, I first delete all documents with source: S1, otherwise the index would contain repeated content. Then I add the new index result. In the middle of the process the IndexSearcher uses an incomplete index. Is it possible to do this like a database transaction? It's not like a database transaction, but any index reader/searcher that was opened before the changes won't see them until it's closed and reopened. AFAIK that also applies to deletions, though I never checked that. So you have two options: a) use a second index for indexing, move the indexes after the indexing is done and make sure index reader/searcher are closed and reopened after the move. b) use one index and make sure that you do not open any index reader/searcher during the update. Searches may only use already opened reader/searcher. I guess it depends on index size, update frequency and so on, which scenario is easier to handle. Given that the index isn't too large and update frequency is rather low, I'd use a second index. But you'll need to copy that index and should consider the time and disk IO needed for that. Morus
Re: Strange search results with wildcard - Bug?
Ulrich Mayring writes: Daniel Naber wrote: AND always refers to the terms on both sides, +/- only refers to the term on the right. So a AND b -> +a +b is correct. *slap forehead* - you're right. Wasn't there something about operator precedence way back when ;-) Yes. January. And it's still in bugzilla. :-( But it would not make a difference in this case, since AND has higher precedence, so a OR b AND c is a OR (b AND c) which is correctly done as a (+b +c) in boolean queries. a +b +c is different, since it won't find documents containing only a. Occurrences of a only modify the score in this case. Morus
Re: Strange search results with wildcard - Bug?
Ulrich Mayring writes: Hi all, first, here's how to reproduce the problem: Go to http://www.denic.de/en/special/index.jsp and enter obscure service in the search field. You'll get 132 hits. Now enter obscure service* - and you only get 1 hit. The above website is running Lucene 1.3rc3, but I was able to reproduce this locally with 1.4.1. Here are my local results with controlled pseudo documents, perhaps you can see a pattern: searching for 00700* gets two documents: 007001 action and 007002 handle searching for handle gets two documents: 007002 handle and 011010 handle searching for 00700* handle gets two documents: 007002 handle and 011010 handle But where is 007001 action? searching for handle 00700* gets two documents: 007001 action and 007002 handle But where is 011010 handle? We're using the MultiFieldQueryParser and the Snowball Stemmers, if that makes any difference. Your number/handle samples look ok to me if the default operator is AND. Note that wildcard expressions are not analyzed, so if service is stemmed to anything different from service, it's not surprising that service* doesn't find it. I think you should look at a) what's the analyzed form of your terms and b) how does the rewritten query look (there's a rewrite method for query that expands wildcard queries into basic queries). HTH Morus
Re: Strange search results with wildcard - Bug?
Ulrich Mayring writes: Will do, thank you very much. However, how do I get at the analyzed form of my terms? Instantiate the analyzer, create a token stream feeding your input, loop over the tokens, output the results. Morus
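Those steps as a sketch against the 1.4-era API (analyzer, field name, and input are placeholders; substitute your Snowball analyzer):

```java
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerDebug {
    public static void main(String[] args) throws IOException {
        StandardAnalyzer analyzer = new StandardAnalyzer(); // or your analyzer
        TokenStream ts = analyzer.tokenStream("contents",
                new StringReader("obscure service"));
        // print each token the analyzer produces for the input
        for (Token t = ts.next(); t != null; t = ts.next())
            System.out.println(t.termText());
    }
}
```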
Re: Combining Lucene and database functionality
Marco Schmidt writes: I'm trying to find out whether Lucene is an option for a project of mine. I have texts which also have a date and a list of numbers associated with each of them. These numbers are ID values which connect the article to certain categories. So a particular article X might belong to categories 17, 49 and 112. A search for all articles containing foo bar and belonging to categories 100 to 140 should return X (because it also contains foo bar). Is it possible to do this with Lucene and if it is, how? I've read about the concept of fields in Lucene, but it seems to me that you can only store text in them, not integers, let alone lists of integers. None of the tutorials I've seen deals with more complex queries like that. Basically what I want to accomplish could be done nicely with databases with full text search capability, if that full text search wasn't so awful. Where's the problem? 100 is a text as well as an integer (one has to keep in mind that treating it as text changes the sort order, which may require leading 0s to compensate). Lucene does not understand the words you index anyway. So if a document has a field `category' with content '017 049 112' and some `text' field with content 'bla fasel foo bar' and you do a range query 100 - 140 on category (search all documents containing any word that is alphanumerically sorted between 100 and 140) and an appropriate query on text, it will find what you want. There are some caveats like choosing an appropriate analyzer or considering the maximum number of terms the range query covers, but in principle there is no difference between a text field containing words and a category field containing categories. Morus
Re: range and content query
Chris Fraschetti writes: can someone assist me in building or deny the possibility of combining a range query and a standard query? say for instance i have two fields i'm searching on... one being a field with an epoch date associated with the entry, and the content so how can I make a query to select a range of those epochs, as well as search through the content? can it be done in one query, or do I have to perform a query upon a query, and if so, what might the syntax look like? if you create the query using the API, use a boolean query to combine the two basic queries. If you use query parser, use AND or OR. Note that range queries are expanded into boolean queries (OR combined) which may be a problem if the number of terms matching the range is too big. Depends on your date entries and especially how precise they are. Alternatively you might consider using a filter. HTH Morus
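Via the API the boolean combination might look like this (field names and the string-encoded date bounds are made up; the 1.4-era API, where BooleanQuery.add takes required/prohibited flags, is assumed):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.search.TermQuery;

public class RangePlusContent {
    public static void main(String[] args) {
        // range on the string-encoded (e.g. zero-padded) epoch field
        RangeQuery range = new RangeQuery(
                new Term("date_field", "0001000000"),
                new Term("date_field", "0002000000"),
                true); // inclusive bounds
        TermQuery content = new TermQuery(new Term("content_field", "lucene"));

        BooleanQuery bq = new BooleanQuery();
        bq.add(range, true, false);   // required, not prohibited
        bq.add(content, true, false); // required, not prohibited
        System.out.println(bq.toString("content_field"));
    }
}
```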
Re: range and content query
Chris Fraschetti writes: I've more or less figured out the query string required to get a range of docs.. say date:[0 TO 10] assuming my dates are from 1 to 10 (for the sake of this example) ... my query has results that I don't understand. if i do from 0 TO 10, then I only get results matching 0,1,10 ... if i do 0 TO 8, i get all results ... from 0 to 10... if i do 1 TO 5 ... then i get results 1,2,3,4,5,10 ... very strange. that's not strange. Lucene indexes strings and compares strings. Not numbers. So the order is 1 10 101 11 2 20 21 3 4 and so on It's up to you to format your numbers in a way that works, e.g. use leading '0' to get 001 002 003 004 010 011 020 021 ... I think there's a page in the wiki about these issues. here is how my query looks... query: +date_field:[1 TO 5] here is how the date was added... Document doc = new Document(); doc.add(Field.UnIndexed(arcpath_field, filename)); doc.add(Field.Keyword(date_field, date)); doc.add(Field.Text(content_field, content)); writer.addDocument(doc); I tried Field.Text for the date and also received the same results. Essentially I have a loop to add 11 strings... indexes 0 to 10... and add "doc0", "0", "some text" for each.. and the results i get are as explained above... any ideas? Here is my simple searching code.. i'm currently not searching for any text... i just want to test the range feature right now query_string = "+(" + DATE_FIELD + ":[" + start_date + " TO " + end_date + "])"; Searcher searcher = new IndexSearcher(index_path); QueryParser parser = new QueryParser(CONTENT_FIELD, new StandardAnalyzer()); parser.setOperator(QueryParser.DEFAULT_OPERATOR_OR); Query query = parser.parse(query_string); System.out.println("query: " + query.toString()); Hits hits = searcher.search(query); It's bad practice to create search strings that have to be decomposed by query parser again if you have the parts already at hand. At least in most cases.
I don't know the details of how and when query parser will call the analyzer and what standard analyzer does with numbers. What does query.toString() output? But the main problem seems to be your misunderstanding of searching numbers in lucene. They are just strings and are treated by their lexical representation, not their numeric value. Morus
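The lexicographic-vs-numeric point is easy to verify in plain Java (the padding width of 3 is arbitrary):

```java
import java.util.Arrays;

public class LexicographicOrder {
    public static void main(String[] args) {
        int[] numbers = { 1, 10, 2, 101, 11, 20 };

        // unpadded: string order differs from numeric order
        String[] plain = new String[numbers.length];
        for (int i = 0; i < numbers.length; i++)
            plain[i] = Integer.toString(numbers[i]);
        Arrays.sort(plain); // as Lucene orders terms
        System.out.println(Arrays.toString(plain));

        // zero-padded: string order equals numeric order
        String[] padded = new String[numbers.length];
        for (int i = 0; i < numbers.length; i++)
            padded[i] = String.format("%03d", numbers[i]);
        Arrays.sort(padded);
        System.out.println(Arrays.toString(padded));
    }
}
```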
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
David Spencer writes: could you put the current version of your code on that website as a java Weblog entry updated: http://searchmorph.com/weblog/index.php?id=23 thanks Great suggestion and thanks for that idiom - I should know such things by now. To clarify the issue, it's just a performance one, not other functionality...anyway I put in the code - and to be scientific I benchmarked it two times before the change and two times after - and the results were surprisingly the same both times (1:45 to 1:50 with an index that takes up 200MB). Probably there are cases where this will run faster, and the code seems more correct now so it's in. Ahh, I see, you check the field later. The logging made me think you index all fields you loop over, in which case one might get unwanted words into the ngram index. An interesting application of this might be an ngram-index enhanced version of the FuzzyQuery. While this introduces more complexity on the indexing side, it might be a large speedup for fuzzy searches. I was also thinking of reviewing the list to see if anyone had done a Jaro Winkler fuzzy query yet and doing that I went in another direction, and changed the ngram index and search to use a similarity that computes m * m / (n1 * n2) where m is the number of matches, n1 is the number of ngrams in the query and n2 is the number of ngrams in the word. (At least if I got that right; I'm not sure if I understand all parts of the similarity class correctly.) After removing the document boost in the ngram index based on the word frequency in the original index I find the results pretty good. My data is a number of encyclopedias and dictionaries and I only use the headwords for the ngram index. Term frequency doesn't seem relevant in this case. I still use the levenshtein distance to modify the score and sort according to score / distance, but in most cases this does not make a difference. So I'll probably drop the distance calculation completely.
I also see little difference between using 2- and 3-grams on the one hand and only using 2-grams on the other. So I'll presumably drop the 3-grams. I'm not sure if the similarity I use is useful in general, but I attached it to this message in case someone is interested. Note that you need to set the similarity for the index writer and searcher and thus have to reindex in case you want to give it a try. Morus
RE: QueryParser.parse() and Lucene1.4.1
Polina Litvak writes: Hi Daniel, I just downloaded the latest version of Lucene and tried the whole thing again: I ran my code first with lucene-1.3-final.jar, getting the query Field:(A AND -(B)) parsed into +Field:A -Field:B, and then I ran exactly the same code with lucene-1.4.1.jar and got the output parsed into Field:A Field:- Field:B. I also read Lucene's documentation (http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.85), and it does mention a change to the + and - operators: 13. Changed QueryParser.jj to allow '-' and '+' within tokens: http://issues.apache.org/bugzilla/show_bug.cgi?id=27491 (Morus Walter via Otis) This change is unlikely to introduce the behaviour you describe, since it affects '-' within words only, not at the start. So there is a change for a-b between 1.3 and 1.4: 1.3 gives a -b, 1.4 gives a b or one token a-b (depending on the analyzer) as it treats the - as part of a word. So is this behaviour a bug, or is Lucene 1.4 not backwards compatible? Your behaviour cannot be reproduced with the test code (as Daniel already said): java -cp lucene-1.3-final/lucene-1.3-final.jar org.apache.lucene.queryParser.QueryParser 'Field:(A AND -(B))' +Field:a -Field:b java -cp lucene-1.4-final/lucene-1.4-final.jar org.apache.lucene.queryParser.QueryParser 'Field:(A AND -(B))' +Field:a -Field:b java -cp lucene-1.4.1/lucene-1.4.1.jar org.apache.lucene.queryParser.QueryParser 'Field:(A AND -(B))' +Field:a -Field:b So either you have a different query or something in your code is responsible for the problem. Morus
Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Hi David, Based on this mail I wrote a ngram speller for Lucene. It runs in 2 phases. First you build a fast lookup index as mentioned above. Then to correct a word you do a query in this index based on the ngrams in the misspelled word. Let's see. [1] Source is attached and I'd like to contribute it to the sandbox, esp. if someone can validate that what it's doing is reasonable and useful. great :-) [4] Here's the source in HTML: http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152 Could you put the current version of your code on that website as plain java source as well? At least until it's in the Lucene sandbox. I created an ngram index on one of my indexes and think I found an issue in the indexing code: there is an option -f to specify the field on which the ngram index will be created, but there is no code to restrict the term enumeration to this field. So instead of

final TermEnum te = r.terms();

I'd suggest

final TermEnum te = r.terms(new Term(field, ""));

and a check within the loop over the terms whether the enumerated term still has field name `field', e.g.

Term t = te.term();
if (!t.field().equals(field)) {
    break;
}

Otherwise you loop over all terms in all fields. An interesting application of this might be an ngram-index enhanced version of the FuzzyQuery. While this introduces more complexity on the indexing side, it might be a large speedup for fuzzy searches. Morus
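To sketch the ngram decomposition step described above (my own illustration, not the NGramSpeller code): a misspelled word is broken into overlapping character n-grams, which then become the query terms against the lookup index.

```java
import java.util.ArrayList;
import java.util.List;

public class NGrams {
    // Split a word into overlapping character n-grams of length n.
    // e.g. "lucene" with n=2 -> [lu, uc, ce, en, ne]
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("lucene", 2)); // [lu, uc, ce, en, ne]
        System.out.println(ngrams("lucene", 3)); // [luc, uce, cen, ene]
    }
}
```

With grams like these as terms, an ordinary boolean OR query against the ngram field retrieves candidate corrections ranked roughly by how many grams they share with the misspelled word.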
Re: (n00b) Meaning of Hits.id (int)
Peter Pimley writes: My documents are not stored in their original form by Lucene, but in a separate database. My Lucene docs do however store the primary key, so that I can fetch the original version from the database to show the user (does that sound sane?) yes. I see that the 'Hits' class has an id(int) method, which sounds interesting. The javadoc says Returns the id for the nth document in this set. However, I can't find any mention anywhere else of document ids. Could anybody explain what this is? It's Lucene's internal id or document number, which allows you to access the document and its stored fields. See IndexSearcher.doc(int i) or IndexReader.document(int n); the docs just don't name the parameter 'id'. Morus
Re: *term search
sergiu gordea writes: Hi all, I want to discuss a little problem: Lucene doesn't support *Term-like queries. I know that this can bring a lot of results into memory and therefore it is restricted. That's not the reason for the restriction; that's possible with a* as well. The problem is that Lucene has to check all terms to see if they end with Term. That makes the performance pretty poor. A prefix allows restricting the search to words with that prefix efficiently, since the word list is ordered. So my question is if there is a simple solution for implementing the functionality mentioned above. Sure. Just follow the way wildcard query is implemented. Actually I'm not sure if the restriction you mention is in the wildcard query itself or only in the query parser. In the latter case, you might just create the query yourself. A better way for postfix queries is to create an additional search field where all words are reversed and search for mreT* on that field. It depends on the size of your index how important such an optimization is. Morus
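A minimal sketch of the reversed-field idea (my own code; index and field handling omitted): reverse every token at index time, and the expensive leading-wildcard search *Term on the normal field becomes an efficient prefix search mreT* on the reversed field.

```java
public class ReverseField {
    // Reverse each whitespace-separated token, so that a postfix
    // search "*term" can be run as a prefix search "mret*" against
    // a second field holding the reversed tokens.
    static String reverseTokens(String text) {
        StringBuilder out = new StringBuilder();
        for (String token : text.split("\\s+")) {
            if (out.length() > 0) out.append(' ');
            out.append(new StringBuilder(token).reverse());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseTokens("flat monitor")); // talf rotinom
    }
}
```

The prefix search on the reversed field can use the ordered word list just like any other prefix query, which is exactly what the plain *Term query cannot do.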
Re: Negative Boost
Daniel Naber writes: On Wednesday 04 August 2004 13:19, Terry Steichen wrote: I can't get negative boosts to work with QueryParser. Is it possible to do so? Isn't that the same as using a boost < 1, e.g. 0.1? That should be possible. No. Take `a^-1 OR b': a boost of -1 means that the score gets smaller if a document contains the term carrying that boost. So it's somewhat similar to NOT a, though less strict. A boost of 0.1 means that the score is increased less for an occurrence of a. Usually one just wants the latter, but it's not the same. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Negative Boost
Terry Steichen writes: I can't get negative boosts to work with QueryParser. Is it possible to do so? If you change QueryParser ;-) Morus
Re: Misbehaving query string
Bill Tschumy writes: I would think the following strings passed to the QueryParser should yield the same results: #1: +telescope AND !operate #2: (+telescope) AND (!operate) However the first string seems to give the correct results while the second gives zero hits. Am I misunderstanding something or is there a bug? The first query creates a boolean query with a required and a prohibited term. The second one creates a boolean query for the !operate term, containing only one prohibited clause, and combines this with a query for telescope, where both subqueries are required (don't ask me whether telescope makes a term query or a boolean query; I suspect the former). But Lucene doesn't match boolean queries containing only prohibited terms. So the !operate boolean query gives you an empty result, which leads to the empty result of the whole query. I don't know if there's a reason why the boolean query doesn't throw an exception in this case; silently not working doesn't seem a good way of handling this. Morus
Re: ArrayIndexOutOfBoundsException if stopword on left of bool clause w/ StandardAnalyzer
Claude Devarenne writes: My question is: should the queryParser catch that there is no term before trying to add a clause when using a StandardAnalyzer? Is this even possible? Should the burden be on the application to either catch the exception or parse the query before handing it out to the queryParser? Yes. Yes. No. There are fixes in bugzilla that would make the query parser read that query as title:bla and simply drop the stop word. See http://issues.apache.org/bugzilla/show_bug.cgi?id=9110 http://issues.apache.org/bugzilla/show_bug.cgi?id=25820 Morus
Re: Tool for analyzing analyzers
Hi Mark, I've had this running OK from the command line and in Eclipse on XP. I suspect it might be because you're running a different OS? The Classfinder tries to split the system property java.class.path on the ; character but I forgot different OSes have different separators. Let me know your setup details and I'll try to fix the classloader issue. I have the same problem and am running on Linux, which uses ':' to separate the class path... BTW: I tried to compile your sources but you left out the thinlet part. 2928 Sun Oct 12 19:47:56 CEST 2003 thinlet/AppletLauncher.class 2643 Sun Oct 12 19:47:56 CEST 2003 thinlet/FrameLauncher.class 74823 Sun Oct 12 19:47:56 CEST 2003 thinlet/Thinlet.class Was that intentional? Morus
Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses
Claude Devarenne writes: Hi, I have over 60,000 documents in my index, which is slightly over 1 GB in size. The documents range from the late seventies up to now. I have indexed dates as a keyword field using a string, because the dates are in YYYYMMDD format. When I do range queries things are OK as long as I don't exceed the built-in number of boolean clauses, so that's a range of 3 years, e.g. 1979 to 1981. The users are not only doing complex queries but also want to query over long ranges, e.g. [19790101 TO 19991231]. Given these requirements, I am thinking of doing a query without the date range, bringing the unique ids back from the hits and then doing a date query in the SQL database I have that contains the same data. Another alternative is to do the query without the date range in Lucene and then sort the results within the range. I still have to learn how to use the new sorting code and I confess I did not have time to look at it yet. Is there a simpler, easier way to do this? I think it would be worth taking a look at the sorting code. The idea of the sorting code is to have an array of the dates for each doc in memory and to access this array for sorting. Now sorting isn't the only thing one might use this array for; doing a range check is another. So you might extend the sorting code by a range selection. There is no code for this in Lucene and you have to create your own searcher, but it gives you a fast way to search and sort by date. I did this independently from the new sorting code (I just started a little too early) and it works quite well. The only drawback of this (and the new sorting code) is that it requires an array of field values that must be rebuilt each time the index changes. That shouldn't be a problem for 60,000 documents. Morus
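A hedged sketch of the range-selection idea (my own illustration, not the sorting code itself): with one int-encoded date per document held in an array, a range check reduces to a comparison per document, and the same array serves arbitrary ranges without hitting the boolean-clause limit.

```java
import java.util.BitSet;

public class DateRangeSelect {
    // dates[doc] holds the document's date encoded as an int, e.g. 19790101.
    // Returns the set of internal doc ids whose date lies in [lo, hi].
    static BitSet select(int[] dates, int lo, int hi) {
        BitSet bits = new BitSet(dates.length);
        for (int doc = 0; doc < dates.length; doc++) {
            if (dates[doc] >= lo && dates[doc] <= hi) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        int[] dates = {19790101, 19850615, 19991231, 20040101};
        // A twenty-year range is just two comparisons per document:
        System.out.println(select(dates, 19790101, 19991231)); // {0, 1, 2}
    }
}
```

The resulting BitSet is the same shape a Filter produces, which is why this plugs naturally into a custom searcher; the array itself has to be rebuilt whenever the index changes, as noted above.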
Re: Internal full content store within Lucene
Kevin Burton writes: How much interest is there for this? I have to do this for work and will certainly take the extra effort into making this a standard Lucene feature. Sounds interesting. How would you handle deletions? Morus
RE: multivalue fields
Alex McManus writes: Maybe your fields are too long so that only part of it gets indexed (look at IndexWriter.maxFieldLength). This is interesting, I've had a look at the JavaDoc and I think I understand. The maximum field length describes the maximum number of unique terms, not the maximum number of words/tokens. Therefore, even if I have a 4Gb field, I could quite safely have a maxFieldLength of, say, 100k words which should safely handle the maximum number of unique words, rather than 800 million which would be needed to handle every token. Is this correct? A short look at the source says no. maxFieldLength is handed to DocumentWriter, where one finds

TokenStream stream = analyzer.tokenStream(fieldName, reader);
try {
    for (Token t = stream.next(); t != null; t = stream.next()) {
        position += (t.getPositionIncrement() - 1);
        addPosition(fieldName, t.termText(), position++);
        if (++length > maxFieldLength) break;
    }
} finally {
    stream.close();
}

so it's the total number of tokens, not the number of unique terms. Is 100k a worrying maxFieldLength, in terms of how much memory this would consume? Depends on the size of your documents ;-) I use 25 without problems, but my documents are not as big (4 tokens). I just want to make sure not to lose any text for indexing. Does Lucene issue a warning if this limit is exceeded during indexing (it would be quite worrying if it was silently discarding terms)? No. I guess the idea behind this limit is that the relevant terms should occur in the first n words, and indexing the rest just increases index size. Morus
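To illustrate the consequence of that loop (a simplified stand-in, not Lucene's DocumentWriter): every token counts against maxFieldLength, duplicates included, and whatever exceeds the limit is silently dropped.

```java
import java.util.ArrayList;
import java.util.List;

public class MaxFieldLength {
    // Keep at most maxFieldLength tokens. Every token counts,
    // not just distinct ones -- duplicates use up the budget too.
    static List<String> truncate(String[] tokens, int maxFieldLength) {
        List<String> kept = new ArrayList<String>();
        for (String t : tokens) {
            if (kept.size() >= maxFieldLength) break; // rest is silently discarded
            kept.add(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] tokens = {"a", "a", "a", "b"};
        // Three repeated tokens exhaust a limit of 3, and "b" is lost
        // even though only two terms are unique:
        System.out.println(truncate(tokens, 3)); // [a, a, a]
    }
}
```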
RE: multivalue fields
Ryan Sonnek writes: using lucene 1.3-final, it appears to only search the first field with that name. here's the code i'm using to construct the index, and I'm using Luke to check that the index is created correctly. Everything looks fine, but my search returns empty. do i have to use a special query to work with multivalue fields? is there a testcase in the source that performs this kind of work that I could look at? Don't know what goes wrong on your side, but this works just fine. Maybe your fields are too long so that only part of it gets indexed (look at IndexWriter.maxFieldLength). A test program:

import org.apache.lucene.document.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.search.*;
import org.apache.lucene.queryParser.QueryParser;

class LuceneTest {
    static String[] docs = { "a c", "b d", "c e", "d f", };
    static String[] queries = { "a", "b", "c", "d", "b OR c" };

    public static void main(String argv[]) throws Exception {
        Directory dir = new RAMDirectory();
        String[] stop = {};
        Analyzer analyzer = new StandardAnalyzer(stop);
        IndexWriter writer = new IndexWriter(dir, analyzer, true);
        // index documents (2 fields "text" each)
        for (int i = 0; i < docs.length; i += 2) {
            Document doc = new Document();
            doc.add(Field.Text("text", docs[i]));
            doc.add(Field.Text("text", docs[i+1]));
            writer.addDocument(doc);
        }
        writer.close();
        Searcher searcher = new IndexSearcher(dir);
        for (int i = 0; i < queries.length; i++) {
            Query query = QueryParser.parse(queries[i], "text", analyzer);
            Hits hits = searcher.search(query);
            System.out.println("Query: " + query.toString("text"));
            System.out.println(hits.length() + " documents found");
            for (int j = 0; j < hits.length(); j++) {
                Document doc = hits.doc(j);
                System.out.println("\t" + hits.id(j) + ": " + doc.get("text") + "\t" + hits.score(j));
                //System.out.println(searcher.explain(query, hits.id(j)));
            }
        }
    }
}

shows that search takes place in both fields.
Query: a
1 documents found
	0: b d	0.5
Query: b
1 documents found
	0: b d	0.5
Query: c
2 documents found
	0: b d	0.2972674
	1: d f	0.2972674
Query: d
2 documents found
	0: b d	0.2972674
	1: d f	0.2972674
Query: b c
2 documents found
	0: b d	0.581694
	1: d f	0.0759574

But note that this affects scoring as concatenation would. So I think Otis' answer is a bit misleading. If you don't want the effects on scoring you AFAIK need to use different documents or fields. Morus
Re: query
Rosen Marinov writes: Short answer: it depends. Questions for you to answer: What field type and analyzer did you use during indexing? What analyzer was used with QueryParser? What does the generated Query.toString return? in both cases SimpleAnalyzer. QueryParser.parse("\"abc\"") throws an exception and I can't see what Query.toString returns in this case. what analyzer should i use if i want to execute the following queries: simple keyword search (+bush -president, etc.) range queries including characters in searching values The problem is that phrases are defined as

| <QUOTED: "\"" (~["\""])+ "\"">

in the query parser. So you cannot have a " inside a phrase (not even escaped). I guess that's a bug. It should read something like

| <QUOTED: "\"" (~["\""] | "\\\"")+ "\"">

(untested). But that shouldn't apply when parsing "abc", only when the phrase itself contains an (escaped) quote character. If you used SimpleAnalyzer (same for StandardAnalyzer) quotes got stripped anyway. Since you cannot search for things that didn't get indexed, searching for a title with quotes and the same title without quotes will be the same. The answer to your second question Is there a more sly way to get the doc exactly matching this title? (for info: my titles are unique) is to skip the query parser and create the query as a phrase query yourself. But this requires tokenization in the same way as was done when indexing; otherwise you might end up with no results. If you have a lot of exact title queries, it might be worth considering a keyword field (that means no tokenization) for this data (in that case, you won't have to care about tokenizers and can create the query as a single TermQuery). There's no support for keyword queries in the query parser though. HTH Morus
Re: How to order search results by Field value?
Erik Hatcher writes: Why not do the unique sequential number replacement at index time rather than query time? How would you do that? This requires knowing the ids that will be added in the future. Let's say you start with strings 'a' and 'b'. Later you add a document with 'aa'. How do you know that you should make 'a' 1 and 'b' 3 to be prepared for 'aa'? To me Eric's suggestion makes sense. The problem might be, however: you have to sort all values, while keeping the strings means that you sort only the hits. And you should be aware that you have to rebuild the array each time the index changes. Morus
RE: Query syntax on Keyword field question
Hi Chad, But I assume this fix won't come out for some time. Is there a way I can get this fix sooner? I'm up against a deadline and would very much like this functionality. Just get Lucene's sources, change the line and recompile. The difficult part is to get a copy of JavaCC 2 (3 won't do), but I think this can be found in the archives. And to go one more step with the KeywordAnalyzer that I wrote, changing this method to skip the escape:

protected boolean isTokenChar(char c) {
    if (c == '\\') {
        return false;
    } else {
        return true;
    }
}

The test then returns with a space: healthecare.domain.lucenesearch.KeywordAnalyzer: [HW-NCI_TOPICS] query.ToString = +category:HW -NCI_TOPICS +space junit.framework.ComparisonFailure: HW-NCI_TOPICS kept as-is Expected: +category:HW\-NCI_TOPICS +space Actual: +category:HW -NCI_TOPICS +space (note the space where the escape was). Sure. If \ isn't a token char, it ends the token. So you will have to look for a different way of implementing the analyzer. Shouldn't be that difficult since you have only one token. Maybe it should be the job of the query parser to remove the escape character (would make more sense to me at least) but that would be another change to the query parser... Morus
RE: Query syntax on Keyword field question
Chad Small writes: I'm getting this with 3.2: javacc-check: BUILD FAILED file:D:/applications/lucene-1.3-final/build.xml:97: ## JavaCC not found. JavaCC Home: /applications/javacc-3.2/bin JavaCC JAR: D:\applications\javacc-3.2\bin\bin\lib\javacc.jar Please download and install JavaCC from: http://javacc.dev.java.net Then, create a build.properties file either in your home directory, or within the Lucene directory and set the javacc.home property to the path where JavaCC is installed. For example, if you installed JavaCC in /usr/local/java/javacc-3.2, then set the javacc.home property to: javacc.home=/usr/local/java/javacc-3.2 If you get an error like the one below, then you have not installed things correctly. Please check all your paths and try again. java.lang.NoClassDefFoundError: org.javacc.parser.Main ## even though I put a build.properties file in my root lucene directory with this in it: javacc.home=/applications/javacc-3.2/bin I never tried JavaCC 3.2 but I thought there were issues with the query parser and/or standard analyzer. Seems I'm wrong or outdated. In your case the problem seems to be the installation of JavaCC. I guess the /bin directory should not be part of javacc.home. Morus
RE: Query syntax on Keyword field question
Chad Small writes: Here is my attempt at a KeywordAnalyzer - although it is not working? Excuse the length of the message, but I wanted to give actual code. With this output:

Analyzing HW-NCI_TOPICS
org.apache.lucene.analysis.WhitespaceAnalyzer: [HW-NCI_TOPICS]
org.apache.lucene.analysis.SimpleAnalyzer: [hw] [nci] [topics]
org.apache.lucene.analysis.StopAnalyzer: [hw] [nci] [topics]
org.apache.lucene.analysis.standard.StandardAnalyzer: [hw] [nci] [topics]
healthecare.domain.lucenesearch.KeywordAnalyzer: [HW-NCI_TOPICS]
query.ToString = category:HW -nci topics +space
junit.framework.ComparisonFailure: HW-NCI_TOPICS kept as-is Expected: +category:HW-NCI_TOPICS +space Actual: category:HW -nci topics +space

Well, the query parser does not allow `-' within words currently. So before your analyzer is called, the query parser reads one word HW, a `-' operator, and one word NCI_TOPICS. The latter is analyzed as nci topics because it's not in field category anymore, I guess. I suggested changing this; see http://issues.apache.org/bugzilla/show_bug.cgi?id=27491 Either you escape the - using category:HW\-NCI_TOPICS in your query (untested, and I don't know where the escape character will be removed) or you apply my suggested change. Another option for using keywords with the query parser might be adding a keyword syntax to the query parser, something like category:key(HW-NCI_TOPICS) or category=HW-NCI_TOPICS. HTH Morus
Re: Problem with search results
Doug Cutting writes: Morus Walter wrote: Now I think this can be fixed in the query parser alone by simply allowing '-' within words. That is, change

<#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >

to

<#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" ) >

As a result, the query parser will read '-' within words (such as tft-monitor or Sysh1-1) as one word, which will be tokenized by the used analyzer and end up in a term query or phrase query depending on whether it creates one or more tokens. Other characters which are also candidates for this sort of treatment include /, @, ., ', and +. _TERM_START_CHAR is

<#_TERM_START_CHAR: ( ~[ " ", "\t", "\n", "\r", "+", "-", "!", "(", ")", ":", "^", "[", "]", "\"", "{", "}", "~", "*", "?" ] ) >

so /, @, . and ' are already allowed in terms. (:, ^, ~, * and ? cannot be added, parentheses don't make sense.) So I end up with

<#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" | "+" ) >

The regression tests show no error, so I entered that in bugzilla. Morus
Re: Storing numbers
[EMAIL PROTECTED] writes: Hi! I want to store numbers (id) in my index:

long id = 1069421083284;
doc.add(Field.UnStored("in", String.valueOf(id)));

But searching for id:1069421083284 doesn't return any hits. If your field is named 'in' you shouldn't search in 'id'. Right? Well, did I misunderstand something? UnStored means the number is stored but not indexed (analyzed), isn't it? Anyway, Field.Text doesn't work either. Well, indexing and analyzing are different things. UnStored means the number is not stored (as the name says) but indexed. And IIRC it's analyzed before indexing; that shouldn't make a difference for a single number. What I'd use in this case is an unstored keyword (given that you really don't want to have the id returned from Lucene, which is the consequence of not storing). I'm not sure if there's a method to create such a field, but you can do it by setting the flags directly. HTH Morus
Re: Best Practices for indexing in Web application
Michael Steiger writes: Depends on your application, but if you can, it's better to keep the IndexSearcher open until the index changes. Otherwise you will have to open all the index files for each search. Good tip. So I have to synchronize (logically) my search routine with any updates, and if the index changes I have to close the Searcher and reopen it. Right. The hard part is that you shouldn't close the searcher while there is still access to that searcher. E.g. if you have a scenario - do search - index changes - access search results you cannot close the searcher until you have accessed all search results. But that can be done with a little bit of reference counting. Morus
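A hedged sketch of that reference counting (my own illustration, not a Lucene class): the owner holds one reference and each in-flight search another; the searcher is closed only when the last reference is released.

```java
public class RefCountedSearcher {
    interface Searcher { void close(); }  // stand-in for IndexSearcher

    private final Searcher searcher;
    private int refs = 1;  // the owner's reference

    RefCountedSearcher(Searcher searcher) { this.searcher = searcher; }

    // Called at the start of a search (and held while reading results).
    synchronized Searcher acquire() {
        refs++;
        return searcher;
    }

    // Called when a search has finished reading all its results.
    synchronized void release() {
        if (--refs == 0) searcher.close();
    }

    // Called by the owner once the index has changed and a fresh
    // searcher has been opened; drops the owner's reference.
    synchronized void retire() {
        release();
    }

    public static void main(String[] args) {
        final boolean[] closed = { false };
        RefCountedSearcher rc = new RefCountedSearcher(new Searcher() {
            public void close() { closed[0] = true; }
        });
        rc.acquire();                  // a search begins
        rc.retire();                   // index changes; owner lets go
        System.out.println(closed[0]); // false: results still being read
        rc.release();                  // the search finishes
        System.out.println(closed[0]); // true: now it was safe to close
    }
}
```

The scenario from the mail maps directly onto this: the search acquires before querying, the update retires the old searcher after reopening, and the close only actually happens once the results have been fully read.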
Re: Best Practices for indexing in Web application
Michael Steiger writes: I am using an IndexSearcher for querying the index, but for deletions I need to use the IndexReader. I now know that I can have Readers and a Writer open concurrently, but IndexReader.delete can only be used if no Writer is open. You should be aware that an IndexSearcher uses a read-only IndexReader, so you can't ignore it in your considerations. I want to open the IndexSearcher only while searching and close it afterwards. Depends on your application, but if you can, it's better to keep the IndexSearcher open until the index changes. Otherwise you will have to open all the index files for each search. Morus
Re: java.io.tmpdir as lock dir .... once again
Otis Gospodnetic writes: This looks nice. However, what happens if you have two Java processes that work on the same index, and give it different lock directories? They'll mess up the index. Is that different from having two Java processes using different java.io.tmpdir? I had that problem (one running in Tomcat and one from the command line). I don't think that making the need to choose the same lock directory more explicit will increase the problems. Morus
Re: Problem with search results
Otis Gospodnetic writes: And if you do not use QueryParser, then things work? If so, then this is likely caused by the fact that your Term contains a 'special' character, '-'. Actually I was going to suggest a fix for '-' within words in the query parser. There was a suggested fix that changed both StandardAnalyzer and QueryParser, which was rejected, I guess because of the StandardAnalyzer change. Now I think this can be fixed in the query parser alone by simply allowing '-' within words. That is, change

<#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >

to

<#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" ) >

As a result, the query parser will read '-' within words (such as tft-monitor or Sysh1-1) as one word, which will be tokenized by the used analyzer and end up in a term query or phrase query depending on whether it creates one or more tokens. So with StandardAnalyzer a query tft-monitor would give a phrase query "tft monitor" and Sysh1-1 a term query for Sysh1-1. Searching tft-monitor as a phrase "tft monitor" is not exact, but it is the best approximation possible once you have indexed tft-monitor as the tokens tft and monitor. The effect of '-' not occurring within a word is not changed, so tft -monitor will still search for 'tft AND NOT monitor'. Is that a change that would be acceptable? I didn't find the time to look at the regression tests though. Morus
Re: Re:can't delete from an index using IndexReader.delete()
Dhruba Borthakur writes: Hi folks, I am using the latest and greatest Lucene jar file and am facing a problem with deleting documents from the index. Browsing the mail archive, I found that the following email (June 2003) listed the exact problem that I am encountering. In short: I am using Field.Text("id", value) to mark a document. Then I use reader.delete(new Term("id", value)) to remove the document: this call returns 0 and fails to delete the document. The attached sample program shows this behaviour. You don't tell us what your ids look like, but Field.Text("id", value) tokenizes value, that is, it splits value into whatever the analyzer considers to be a token, and creates a term for each token, whereas new Term("id", value) creates one term containing the whole of value. So I guess your ids are considered several tokens by the analyzer you use and therefore they won't be matched by the term you construct for the delete. Using keyword fields instead of text fields for the id should help. Morus
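To illustrate the mismatch (using a crude stand-in for an analyzer, not Lucene code): the tokenized field stores several small terms, so the single exact term handed to delete() never matches any of them.

```java
import java.util.Arrays;
import java.util.List;

public class TokenizedIdDemo {
    // Crude stand-in for an analyzer that lowercases and splits on
    // non-alphanumerics, roughly what StandardAnalyzer would do to
    // an id like "DOC-2003-42". The id value is hypothetical.
    static List<String> analyze(String value) {
        return Arrays.asList(value.toLowerCase().split("[^a-z0-9]+"));
    }

    public static void main(String[] args) {
        String id = "DOC-2003-42";
        List<String> indexedTerms = analyze(id);
        System.out.println(indexedTerms);              // [doc, 2003, 42]
        // delete(new Term("id", id)) looks for ONE term equal to the
        // whole id -- which was never indexed as such:
        System.out.println(indexedTerms.contains(id)); // false -> delete() returns 0
    }
}
```

A keyword field stores the id as exactly one untokenized term, which is why the single-Term delete then works.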
open files under linux
Rasik Pandey writes: As a side note, regarding the Too many open files issue, has anyone noticed that this could be related to the JVM? For instance, I have a coworker who tried to run a number of optimized indexes in a JVM instance and received the Too many open files error. With the same number of available file descriptors (on Linux, ulimit = unlimited), he split the indices over two JVM instances and his problem disappeared. He also tested the problem by increasing the available memory to the JVM instance, via the -Xmx parameter, with all indices running in one JVM instance, and again the problem disappeared. I think the issue deserves more testing to pinpoint the exact problem, but I was just wondering if anyone has already experienced anything similar or if this information could be of use to anyone, in which case we should probably start a new thread dedicated to this issue. The limit is per process. Two JVMs make two processes. (There's a per-system limit too, but it's much higher; I think you find it in /proc/sys/fs/file-max and its default value depends on the amount of memory the system has.) AFAIK there's no way of setting openfiles to unlimited; at least neither bash nor tcsh accepts that. But it should not be a problem to set it to very high values. And you should be able to increase the system-wide limit by writing to /proc/sys/fs/file-max as long as you have enough memory. I never used this, though. Morus
Re: Limiting hit count
[EMAIL PROTECTED] writes: On Friday 13 February 2004 12:18, Julien Nioche wrote: If you want to limit the set of Documents you're querying, you should consider using Filter objects and send one to the searcher along with your Query. Hm, hard to find information about Filters... I actually only want the first hit:

public BitSet bits(IndexReader reader) throws IOException {
    BitSet bs = new BitSet(1);
    bs.set(1);
    return bs;
}

...doesn't work (i.e. returns nothing rather than all hits). Well, that means that you only want the document with document id 1, given that it matches the query. A filter provides a means to restrict a *query* to certain documents, not results. And it won't have influence on the performance (except for the time it takes to create the filter, and that it slows things down a little bit). As far as results are concerned, Lucene's Hits object will only hold a limited number of results (IIRC 200) and repeat the query if you access more (look at the search implementation for details), as Julien already stated. What's the reason for your question? Usually Lucene executes queries very fast; I typically have a few ms. So there's little reason to speed this up. Accessing results is much slower, especially if there are a lot of results and you access them all. E.g. query: 1 ms, reading three fields for 50 results: 22 ms. The index is smaller than the machine's memory (~3/4 GB index size, 1 GB RAM). Morus
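A hedged illustration of what those bits mean (my own sketch): the BitSet is indexed by internal document id and marks which documents are allowed to match at all, so the snippet above admits only the document with id 1.

```java
import java.util.BitSet;

public class FilterBitsDemo {
    public static void main(String[] args) {
        int numDocs = 5;

        // The filter from the question: only bit 1 is set.
        // (BitSet grows on demand, so the initial size of 1 is harmless.)
        BitSet bs = new BitSet(1);
        bs.set(1);

        // A filter restricts which documents MAY match the query;
        // every document whose bit is clear is excluded up front.
        for (int doc = 0; doc < numDocs; doc++) {
            System.out.println("doc " + doc + " allowed: " + bs.get(doc));
        }
        // Only doc 1 is allowed. If doc 1 doesn't match the query,
        // the search returns nothing -- which is what was observed.
    }
}
```

So the filter cannot express "give me only the first hit"; it can only say which documents are eligible before scoring happens.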
Re: a search like Google
Nicolas Maisonneuve writes: hy, i have an index with the fields: title, author, content. i would like to make the same search type as Google (a form with a textfield). When the user searches for i love lucene (it's not a phrase query, just the text in the textfield), i would like to search in all the index fields but with a specific weight boost for each field. In this example title weight=2, author=1, content=1, the results would be (i suppose the default operator is AND): +(title:i^2 author:i content:i) +(title:love^2 author:love content:love) +(title:lucene^2 author:lucene content:lucene) but must i modify the QueryParser or is there a different way to do this? (because i modified the QueryParser and it works, but if there is a cleaner way to do this, i'll take it!) If you want to use the query parser, you can parse the query with different default fields, set boost factors on the resulting queries and join them with a boolean query. This will give you (+title:i +title:love +title:lucene)^2 (+author:i +author:love +author:lucene) (+content:i +content:love +content:lucene) I don't know if there are subtle differences between your query and this one, but it should be basically the same. Apart from the boost factors, that's AFAIK what the multi field query parser does. Maybe it would be useful to extend the multi field query parser to handle different boost factors. If you just want to allow search terms and none of the other constructs the query parser handles, I would use David Spencer's suggestion though. Morus
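A hedged sketch of the first, per-term expansion (plain string building for illustration only; the field names and boosts are the ones from the example above): each user term becomes one required clause that searches all fields, with the title boosted.

```java
public class MultiFieldBoost {
    // Expand user terms into a query string of the form
    // +(title:t^2 author:t content:t) per term. Boosts of 1 are omitted,
    // matching the example in the mail.
    static String expand(String[] terms, String[] fields, float[] boosts) {
        StringBuilder q = new StringBuilder();
        for (String term : terms) {
            q.append("+(");
            for (int f = 0; f < fields.length; f++) {
                if (f > 0) q.append(' ');
                q.append(fields[f]).append(':').append(term);
                if (boosts[f] != 1.0f) q.append('^').append((int) boosts[f]);
            }
            q.append(") ");
        }
        return q.toString().trim();
    }

    public static void main(String[] args) {
        String[] terms = {"i", "love", "lucene"};
        String[] fields = {"title", "author", "content"};
        float[] boosts = {2f, 1f, 1f};
        System.out.println(expand(terms, fields, boosts));
        // +(title:i^2 author:i content:i)
        //   +(title:love^2 author:love content:love)
        //   +(title:lucene^2 author:lucene content:lucene)
    }
}
```

The alternative described in the reply boosts whole per-field subqueries instead of individual terms; both require every term to occur in at least one field, they just attach the boost at different levels.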
Re: Date Range support
tom wa writes: From: Erik Hatcher On Jan 29, 2004, at 5:08 AM, tom wa wrote: I'm trying to create an index which can also be searched with date ranges. My first attempt using the Lucene date format ran into trouble after my index grew and I couldn't search over more than a few days. The suggestion seemed to be to use strings of the format yyyyMMdd. Using that format worked great until I remembered that my search needs to be able to support different timezones. Adding the hour to my field causes the same problem as above, and my queries stop working when using a range of about 2 months. When you say you couldn't search and that it stopped working, do you mean it was just unacceptably slow? (Sorry it's taken me a while to reply.) It wasn't slow; my timeout is far greater than the time it takes to come back with no hits. A small example of a query would be (date: [200306081900 TO 200306201200]) AND (text: sometext) and this will return zero hits. The index contains about 1000 items for each 24hr period, and the total number of documents was about 150k. I had the same results when using Lucene's built-in date format too. If you think it should be able to cope with what I am trying to do then I'll take another look. An alternative to using date ranges or date filters is to use an approach similar to the recently introduced sort on an integer field (CVS only, so far). That is: - create an array of the dates of all documents - extend the low-level search so that it uses this array and an upper and lower limit to do an additional selection (that's similar to what the filter does). The advantage over a filter is that you can use the same array for arbitrary date ranges, while a filter is specific to one date range. OTOH the array needs to be newly created whenever the index changes. The cost depends on the number of different dates and the array size, of course. 
I did some tests and found that it takes less than .1 seconds on a P4 2400 MHz to create such an array for ~ 10 documents, ~ 1 different dates. So it depends a bit on how often your index changes whether that's a good way. Another disadvantage is that you will have to dig a little deeper into Lucene's search classes. And memory usage might become a problem once you exceed a few million documents. Morus
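The array-based selection Morus sketches can be illustrated without Lucene internals. A minimal sketch, assuming dates are packed into a long per document in yyyyMMddHHmm form (the encoding and names are assumptions, not the thread's actual code):

```java
import java.util.ArrayList;
import java.util.List;

public class DateRangeSelect {
    // dates[doc] holds the date of document 'doc' encoded as yyyyMMddHHmm.
    // One linear pass selects every document whose date falls inside [lo, hi].
    // The same array serves arbitrary ranges, unlike a per-range filter,
    // but it must be rebuilt whenever the index changes.
    public static List<Integer> select(long[] dates, long lo, long hi) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < dates.length; doc++) {
            if (dates[doc] >= lo && dates[doc] <= hi) hits.add(doc);
        }
        return hits;
    }

    public static void main(String[] args) {
        long[] dates = {200306081900L, 200306100000L, 200306201300L};
        // tom's example range: 2003-06-08 19:00 to 2003-06-20 12:00
        System.out.println(select(dates, 200306081900L, 200306201200L)); // [0, 1]
    }
}
```

In a real integration this check would run inside the low-level hit collection, intersecting the date test with the query's matches rather than scanning all documents.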
Re: What is the status of Query Parser AND / OR ?
Daniel B. Davis writes: There was a lot of correspondence during December about this. Is there any further resolution? There's a patch and I hope it will find its way into the Lucene sources. See: http://issues.apache.org/bugzilla/show_bug.cgi?id=25820 Seems I missed the mail about Otis' latest comment. Sorry about that, I'll take a look at these issues ASAP. Morus
Re: Query madness with NOTs...
Otis Gospodnetic writes: Redirecting to lucene-user --- Jim Hargrave [EMAIL PROTECTED] wrote: Can anyone tell me why these two queries would produce different results: +A -B A -(-B) A and +A are not the same thing when you have multiple terms in a query. Hmm. As far as I understand boolean queries, 'a -b' and '+a -b' should be the same (while 'a b -c' and '+a +b -c' are different, of course). 'a -(-b)', on the other hand, contains a nested boolean query only searching for '-b'. Lucene cannot handle this type of query. I'm not sure what happens in this case, but AFAIK you should never use a boolean query containing only prohibited terms in a query. If I test this, I don't get any results for 'a -(-b)', and the same result for 'a' and 'a +(-b)'. The query parser patch I added yesterday to bugzilla drops such queries. Also, we are having a hard time understanding why the query parser takes this query: A AND NOT B and returns this: +A +(-B). Shouldn't this be +A -B? 'a AND NOT b' IS parsed to '+a -b' by Lucene's standard query parser. I don't know where you found +a +(-b); +a +(-b) would be wrong in the above sense. Morus
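The clause semantics discussed here can be modeled in a few lines. This is a toy model, not Lucene's actual scorer: '+' marks a required clause, '-' a prohibited one, and a document must additionally match at least one non-prohibited clause. Under this model, a query containing only prohibited clauses can never match, which mirrors the reply's point:

```java
import java.util.Set;

public class BoolSemantics {
    // Each clause is a string like "+a", "-b", or "c" (occur prefix + term).
    // Required terms must all be present, prohibited terms must be absent,
    // and at least one non-prohibited clause must match.
    static boolean matches(Set<String> doc, String... clauses) {
        boolean anyMatch = false;
        for (String c : clauses) {
            char occur = (c.charAt(0) == '+' || c.charAt(0) == '-') ? c.charAt(0) : ' ';
            String term = (occur == ' ') ? c : c.substring(1);
            boolean present = doc.contains(term);
            if (occur == '+' && !present) return false;
            if (occur == '-' && present) return false;
            if (occur != '-' && present) anyMatch = true;
        }
        return anyMatch;
    }

    public static void main(String[] args) {
        System.out.println(matches(Set.of("a"), "+a", "-b"));      // true
        System.out.println(matches(Set.of("a", "b"), "+a", "-b")); // false
        // Only prohibited clauses: anyMatch can never become true.
        System.out.println(matches(Set.of("a"), "-b"));            // false
    }
}
```

This also shows why 'a -b' and '+a -b' coincide for a single positive term: with one non-prohibited clause, "optional" and "required" collapse into the same condition.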
Re: Lucene search result no stable
Ardor Wei writes: What might be the problem? How to solve it? Any suggestion or idea will be appreciated. The only problem with locking I've seen so far is that you have to make sure that the temp dir is the same for all applications. Lucene 1.3 stores its lock in the directory defined by the system property java.io.tmpdir. I had one component running under Tomcat and one from the shell, and they used different temp dirs, which is fatal in this case. Apart from this, it depends pretty much on your environment. I'm using Lucene on Linux on local filesystems. Other operating systems or network filesystems may influence locking. Morus
Re: Query Term Questions
Erik Hatcher writes: I've not been able to get negative boosting to work at all. Maybe there's a problem with my syntax. If, for example, I do a search with green beret^10, it works just fine. But green beret^-2 gives me a ParseException showing a lexical error. Have you tried it without using QueryParser, boosting a Query using setBoost on it? QueryParser is a double-edged sword, and it looks like it only allows numeric characters (plus . followed by numeric characters). So QueryParser has the problem with negative boosts, but not Query itself. He said he wants to have one term less important than the others (at least that's what I understood). That's done with positive boost factors smaller than 1.0 (e.g. 0.5 or 0.1), which might be called 'negative boosting' (just as braking is a form of negative acceleration). If you use negative boost factors you would actually decrease the score of a match (not just increase it less) and risk ending up with a negative score. I don't think that would be a good idea. Morus
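The difference between a fractional boost and a truly negative one can be shown with a toy scoring function (each matching term simply contributes its boost; Lucene's real scoring also involves tf, idf, and norms):

```java
import java.util.Map;
import java.util.Set;

public class BoostDemo {
    // Toy scoring: each term present in the document adds its boost factor.
    // A boost in (0, 1) de-emphasizes a term but keeps the score positive;
    // a negative boost subtracts and can drive the score below zero.
    static double score(Set<String> doc, Map<String, Double> termBoosts) {
        double s = 0;
        for (Map.Entry<String, Double> e : termBoosts.entrySet())
            if (doc.contains(e.getKey())) s += e.getValue();
        return s;
    }

    public static void main(String[] args) {
        Set<String> doc = Set.of("green", "beret");
        // de-emphasize 'beret' with a boost < 1.0: still a positive score
        System.out.println(score(doc, Map.of("green", 1.0, "beret", 0.5)));  // 1.5
        // a negative boost actively penalizes the match
        System.out.println(score(doc, Map.of("green", 1.0, "beret", -2.0))); // -1.0
    }
}
```

In real Lucene code the fractional boost would be set programmatically, e.g. termQuery.setBoost(0.5f), since QueryParser rejects ^-2 syntax anyway.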
QueryParser and stopwords
Hi, I'm currently trying to get rid of query parser problems with stopwords (depending on the query, there are ArrayIndexOutOfBoundsExceptions, e.g. for stop AND nonstop where stop is a stopword and nonstop is not). While this isn't hard to fix (I'll enter a bug and patch in bugzilla), there's one issue left that I'm not sure how to deal with: what should the query parser return for a query string containing only stopwords? And when I think about this, there's another one: stop AND NOT nonstop creates a boolean query containing only prohibited terms, which AFAIK cannot be used in a search. How to deal with this? Currently it returns an empty BooleanQuery. I think it would be more useful to return null in this case. Morus
RE: Indexing of deep structured XML
Goulish, Michael writes: To really preserve the relationships in arbitrarily structured XML, you pretty much need to use a database that directly supports an XML query language like XQuery or XPath. If searching within regions is enough (something e.g. sgrep (http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html) or OpenText/PAT does), I think this can be done on top of Lucene. Basically you need to index region start and region end markers. In order to search for a term within a region, you can use TermPositions to loop over all matches of the term and all start and end markers of the region, and check whether a match falls within the region. Of course the search logic for region search is quite different from Lucene's document queries. There are two types of results (match points and regions), and the basic operations include match points/regions in a region, regions containing match points/regions, and joins and intersections of match points or regions. I don't know if and how this could be integrated with Lucene's normal queries, but of course one could get a list of matching documents from the results of region searches. If you (ab)use Lucene's token position to store the character position of the token, you could also extract the region's text from a stored copy. I'm currently doing some experiments with this kind of query using Lucene and find it performs quite well. You won't be able to distinguish between parents and other ancestors, though, and there won't be any support for searching siblings. Morus
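The core "match point in region" operation Morus describes can be sketched over plain position lists (in Lucene these would come from TermPositions for the term and for the region markers; here they are just arrays, and the example positions are made up):

```java
import java.util.ArrayList;
import java.util.List;

public class RegionSearch {
    // Regions are [start, end] token-position pairs, e.g. derived from
    // indexed region-start and region-end marker tokens. Returns the term
    // positions that fall inside some region.
    static List<Integer> inRegion(int[] termPositions, int[][] regions) {
        List<Integer> hits = new ArrayList<>();
        for (int p : termPositions)
            for (int[] r : regions)
                if (r[0] <= p && p <= r[1]) { hits.add(p); break; }
        return hits;
    }

    public static void main(String[] args) {
        int[] positions = {3, 10, 25};         // occurrences of the search term
        int[][] regions = {{1, 5}, {20, 30}};  // e.g. two <title>...</title> spans
        System.out.println(inRegion(positions, regions)); // [3, 25]
    }
}
```

The nested loop is for clarity only; since both TermPositions streams come back in sorted order, a real implementation would merge the two lists in a single pass.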
Re: Ordening documents
Peter Keegan writes: What is the returned order for documents with identical scores? Have a look at the source of the lessThan method in org.apache.lucene.search.HitQueue: protected final boolean lessThan(Object a, Object b) { ScoreDoc hitA = (ScoreDoc)a; ScoreDoc hitB = (ScoreDoc)b; if (hitA.score == hitB.score) return hitA.doc > hitB.doc; else return hitA.score < hitB.score; } Sorting is done by this method. HTH Morus
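The ordering this lessThan produces (descending score, ties broken by ascending document id) can be reproduced with a plain comparator, no Lucene required. A small sketch with a stand-in ScoreDoc class:

```java
import java.util.Arrays;
import java.util.Comparator;

public class HitOrder {
    static final class ScoreDoc {
        final int doc;
        final float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
        public String toString() { return "doc" + doc; }
    }

    // Equivalent ordering to HitQueue.lessThan: higher scores rank first,
    // and documents with equal scores are ordered by ascending doc id.
    static ScoreDoc[] rank(ScoreDoc[] hits) {
        ScoreDoc[] out = hits.clone();
        Arrays.sort(out, Comparator
            .comparingDouble((ScoreDoc h) -> h.score).reversed()
            .thenComparingInt(h -> h.doc));
        return out;
    }

    public static void main(String[] args) {
        ScoreDoc[] hits = {
            new ScoreDoc(7, 0.5f), new ScoreDoc(2, 0.9f), new ScoreDoc(3, 0.5f)
        };
        System.out.println(Arrays.toString(rank(hits))); // [doc2, doc3, doc7]
    }
}
```

So for identical scores the tie-break is deterministic: the document with the smaller internal id comes first.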
Re: Philosophy(??) question
Scott Smith writes: I have some documents I'm indexing which have multiple languages in them (i.e., some fields in the document are always English; other fields may be in other languages). Now, I understand why a query against a certain field must use the same analyzer as was used when that field was indexed (stemming, stop words, etc.). It seems like different fields could use different analyzers and the world would still be a happy place. However, since the analyzer is passed in as part of the IndexWriter, that can't happen. Is there a way to do this (other than having multiple indexes, which is a problem when trying to do combined searches)? Or am I missing something more subtle? Sorry if I'm plowing old ground. AFAIK you need to write one analyzer that acts differently based on the 'fieldName' parameter in the tokenStream method. I haven't done that though. HTH Morus
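The dispatch-on-field-name idea can be illustrated without the Analyzer API: one entry point receives the field name (as tokenStream(fieldName, reader) does) and picks the per-field behavior. The field names and the trivial "analysis" below are made-up placeholders, not real Lucene analysis:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PerFieldTokenizer {
    // Choose a normalization per field, the way a single Analyzer can
    // branch on fieldName inside tokenStream(): the hypothetical English
    // field gets lowercasing, any other field keeps its tokens untouched.
    static List<String> tokenize(String fieldName, String text) {
        Function<String, String> norm = fieldName.equals("title_en")
            ? s -> s.toLowerCase(Locale.ROOT)
            : Function.identity();
        return Arrays.stream(text.split("\\s+")).map(norm).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("title_en", "Happy Place")); // [happy, place]
        System.out.println(tokenize("body_other", "Keep Case")); // [Keep, Case]
    }
}
```

In a real Analyzer subclass the branch would return a different TokenStream pipeline (different stemmer, stop list) per field, and the same analyzer instance would be used for both indexing and query parsing so the two stay consistent.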