Re: Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]
I had to moderate both Jonathan's and Jon's messages into the list. Please subscribe to the list and post from the address you subscribed with. I cannot always guarantee I'll catch moderation messages and send them through in a timely fashion. Erik

On Mar 1, 2005, at 6:18 AM, Jonathan O'Connor wrote:

Jon, I too found some problems with the German analyser recently. Here's what may help: 1. You can try reading Joerg Caumanns' paper "A Fast and Simple Stemming Algorithm for German Words". This paper describes the algorithm implemented by GermanAnalyser. 2. I guess German nouns are all capitalized, so maybe that's why. Although you would want to be indexing well-written German and not emails or text messages! 3. The German stemmer converts umlauts into some funny form (the code is a bit tricky, and I didn't spend any time looking at it), so maybe that's why you can't find umlauts properly. I think the main reason for this umlaut change is that many plurals are formed by umlauting, e.g. Haus, Haeuser (that ae is an umlaut). Finally, to really understand what's happening, get your hands on Luke. I just got it last week, and it's brilliant. It shows you everything about your indexes. You can also feed text to an Analyser and see what it makes of it. This will show you the real reason why your umlaut search is failing. Ciao, Jonathan O'Connor XCOM Dublin

Jon Humble [EMAIL PROTECTED] 01/03/2005 09:35 Please respond to Lucene Users List lucene-user@jakarta.apache.org To lucene-user@jakarta.apache.org cc Subject Questions about GermanAnalyzer/Stemmer [auf Viren geprueft]

Hello, We're using the GermanAnalyzer/Stemmer to index/search our (German) website. I have a few questions: (1) Why is the GermanAnalyzer case-sensitive? None of the other language indexers seem to be. What does this feature add? (2) With the GermanAnalyzer, wildcard searches containing extended German characters do not seem to work. So, a* is fine but anä* or ö* always find zero results.
(3) In a similar vein to (2), wildcard searches with escaped special characters fail to find results. So a search for co\-operative works, but a search for co\-op* fails. I will be grateful for any light that can be shed on these problems. With thanks, Jon.

Jon Humble BSc (Hons), Software Engineer eMail: [EMAIL PROTECTED] TecSphere Ltd Centre for Advanced Industry Coble Dene, Royal Quays Newcastle upon Tyne NE29 6DE United Kingdom Direct Dial: +44 (191) 270 31 06 Fax: +44 (191) 270 31 09 http://www.tecsphere.com

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
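The umlaut behaviour mentioned in point 3 above can be illustrated without Lucene. A rough plain-Java approximation of the substitution step the Caumanns algorithm applies before stemming (the helper name substituteUmlauts is invented here, and the real GermanStemmer does more than this); wildcard terms are not analyzed, so a query like ö* never matches the substituted terms in the index:

```java
public class UmlautDemo {
    // Approximation of the substitution applied before stemming: umlauts
    // fold to their base vowels and ß becomes ss. Indexed terms have this
    // applied; raw wildcard query terms containing ä/ö/ü do not, which is
    // one reason such wildcard searches find nothing.
    static String substituteUmlauts(String term) {
        return term
            .replace("ä", "a")
            .replace("ö", "o")
            .replace("ü", "u")
            .replace("ß", "ss");
    }

    public static void main(String[] args) {
        System.out.println(substituteUmlauts("häuser"));  // hauser
        System.out.println(substituteUmlauts("straße"));  // strasse
    }
}
```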
Re: Multiple indexes
It's hard to answer such a general question with anything very precise, so sorry if this doesn't hit the mark. Come back with more details and we'll gladly assist, though. First, certainly do not copy/paste code. Use standard reuse practices: perhaps the same program can build the two different indexes when passed different parameters, or two different programs can share common code packaged as a JAR. What specifically are the issues you're encountering? Erik

On Mar 1, 2005, at 8:06 PM, Ben wrote:

Hi My site has two types of documents with different structure. I would like to create an index for each type of document. What is the best way to implement this? I have been trying to implement this but found out that 90% of the code is the same. In the Lucene in Action book there is a case study on jGuru; it just mentions them using multiple indexes. I would like to do something like them. Any resources on the Internet that I can learn from? Thanks, Ben
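The reuse advice above can be sketched as follows. The interface and method names (DocumentMapper, buildIndex) are invented for illustration, and the actual Lucene IndexWriter calls are elided as comments; the point is that the 90% common code lives in one shared driver, parameterized by document type:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// One implementation per document type; everything else (walking the
// site, opening the writer, committing) is shared.
interface DocumentMapper {
    Map<String, String> toFields(String rawRecord);
}

class ArticleMapper implements DocumentMapper {
    public Map<String, String> toFields(String rawRecord) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("type", "article");
        fields.put("body", rawRecord);
        return fields;
    }
}

class SharedIndexer {
    // Shared driver: the same code builds either index, given a target
    // directory and a mapper. Returns the number of records processed.
    static int buildIndex(String indexDir, DocumentMapper mapper, String[] records) {
        int count = 0;
        for (String record : records) {
            Map<String, String> fields = mapper.toFields(record);
            // In real code: build a Lucene Document from `fields` and add
            // it via an IndexWriter opened on `indexDir`.
            count++;
        }
        return count;
    }
}
```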
Re: Boost doesn't works
Use the IndexSearcher.explain() feature to look at how Lucene is calculating the score. Erik

On Feb 28, 2005, at 3:32 AM, Claude Libois wrote: I use MultiFieldQueryParser (search only done on summary, title and content) with a FilteredQuery. Claude Libois [EMAIL PROTECTED] Technical associate - Unisys

- Original Message - From: Morus Walter [EMAIL PROTECTED] To: Lucene Users List lucene-user@jakarta.apache.org Sent: Monday, February 28, 2005 9:28 AM Subject: Re: Boost doesn't works

Claude Libois writes: Hello. I'm using Lucene for an application and I want to boost the title of my documents. For that I use the setBoost method, which is applied on the title field. However, when I look with Luke (1.6) I don't see any boost on this field, and when I do a search the score isn't changed. What's wrong? How do you search? I guess you cannot see a change unless you combine searches in different fields, since scores are normalized. Morus
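The explain() suggestion, sketched against the 1.4-era API (this assumes an open IndexSearcher named searcher and an already parsed Query named query; the Explanation breaks the score into its tf, idf, boost and norm factors, which shows whether a field boost actually reached the score):

```java
// Sketch only - needs a live index to run.
Hits hits = searcher.search(query);
for (int i = 0; i < Math.min(10, hits.length()); i++) {
    // explain() recomputes the score for one document and returns the
    // factor-by-factor breakdown.
    Explanation explanation = searcher.explain(query, hits.id(i));
    System.out.println(hits.score(i));
    System.out.println(explanation.toString());
}
```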
Re: Fast access to a random page of the search results.
On Feb 28, 2005, at 6:00 AM, Stanislav Jordanov wrote: my private investigation already left me sceptical about the outcome of this issue, but I've decided to post it as a final resort. What did you do in your private investigation? Suppose I have an index of about 5,000,000 docs and I am running single-term queries against it, including queries which return say 1,000,000 or even more hits. The hits are sorted by some column and I am happy with the query execution time (i.e. the time spent in the IndexSearcher.search(...) method). Now comes the problem: it is a product requirement that the client is allowed to quickly access (by scrolling) a random page of the result set. Put in different words, the app must quickly (in less than a second) respond to requests like: Give me the results from No 567100 to No 567200 (remember the results are sorted, thus ordered). Sorted by descending relevance (the default), or in some other way? If a search is fast enough, as you report, then you can simply start your access to Hits at the appropriate spot. For the current systems I'm working on, this is the approach I've used - start iterating hits at (pageNumber - 1) * numberOfItemsPerPage. Is that approach insufficient? Erik I took a look at Lucene's internals, which only left me with the suspicion that this is an impossible task. Would anyone, please, prove my suspicion wrong? Regards Stanislav
Re: Fast access to a random page of the search results.
On Feb 28, 2005, at 10:39 AM, Stanislav Jordanov wrote: What did you do in your private investigation? 1. empirical tests with an index of nearly 75,000 docs (I am attaching the test source) Only certain (.txt?) attachments are allowed to come through on the mailing list. Sorted by descending relevance (the default), or in some other way? In some other way - sorted by some column (asc or desc - doesn't matter) Using IndexSearcher.search(query, sort)? If a search is fast enough, as you report, then you can simply start your access to Hits at the appropriate spot. For the current systems I'm working on, this is the approach I've used - start iterating hits at (pageNumber - 1) * numberOfItemsPerPage. Is that approach insufficient? I'm afraid this is not sufficient; either I am doing something wrong, or it is not that simple. Following is a log from my test session; it appears that IndexSearcher.search(...) finishes rather fast compared to the time it takes to fetch the last document from the Hits object. I assume you are only accessing the documents you wish to display rather than all of them up to where you need. Also keep in mind that accessing a Document is when the document is pulled from the index. If you have a large amount of data in a document it will take a corresponding amount of time to load it. You may need to restructure what you store in a document to reduce the load times. Or perhaps you need to investigate the (is it in the codebase already?) patch to load fields lazily upon demand instead.
Erik

The log starts here:

pa
Found 74222 document(s) that matched query 'pa'
Sorting by sfile_name
query executed in 16ms
Last doc accessed in 375ms

us
Found 74222 document(s) that matched query 'us'
Sorting by sfile_name
query executed in 31ms
Last doc accessed in 219ms

1
Found 74222 document(s) that matched query '1'
Sorting by sfile_name
query executed in 15ms
Last doc accessed in 235ms

5
Found 74222 document(s) that matched query '5'
Sorting by sfile_name
query executed in 422ms
Last doc accessed in 219ms

6
Found 72759 document(s) that matched query '6'
Sorting by sfile_name
query executed in 344ms
Last doc accessed in 250ms
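For reference, the paging approach discussed in this thread only touches the documents in the requested window; a sketch of the arithmetic (the Hits access itself is left as a comment, since it needs a live index):

```java
public class PageWindow {
    // Returns {start, endExclusive} for a 1-based pageNumber, clamped to
    // the total hit count, so the last page is never over-read.
    static int[] window(int pageNumber, int pageSize, int hitCount) {
        int start = (pageNumber - 1) * pageSize;
        int end = Math.min(start + pageSize, hitCount);
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        int[] w = window(5672, 100, 1000000);
        // In real code only these documents are loaded from the index:
        // for (int i = w[0]; i < w[1]; i++) { Document d = hits.doc(i); ... }
        System.out.println(w[0] + ".." + w[1]); // 567100..567200
    }
}
```

Fetching the *last* document of a huge Hits object, as in the log above, forces exactly the per-document loading this window avoids.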
Re: Sorting date stored in milliseconds time
Just an idea off the top of my head: you could create a custom sort, or alternatively you could store the date as separate fields such as year, month, day, time, and provide a multi-field sort. Erik

On Feb 25, 2005, at 11:36 PM, Ben wrote: Hi I store my date in milliseconds; how can I do a sort on it? SortField has INT, FLOAT and STRING. Do I need to create a new sort class to sort the long value? Thanks Ben
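A related trick for the milliseconds case, without writing a custom SortField: index the long as a zero-padded string, so that lexicographic order equals numeric order and the built-in STRING sort works. The helper name below is invented:

```java
public class SortableLong {
    // Pads a non-negative millisecond timestamp to a fixed 19 digits
    // (enough for Long.MAX_VALUE) so that String comparison orders the
    // values the same way numeric comparison does.
    static String toSortableString(long millis) {
        return String.format("%019d", millis);
    }

    public static void main(String[] args) {
        String a = toSortableString(999L);
        String b = toSortableString(1000L);
        // Without padding, "999" > "1000" lexicographically; with padding
        // the order is numeric:
        System.out.println(a.compareTo(b) < 0); // true
    }
}
```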
Re: help with boolean expression
On Feb 25, 2005, at 4:19 PM, Omar Didi wrote: I have a problem understanding how Lucene would interpret this boolean expression: A AND B OR C. It neither returns the same count as when I enter (A AND B) OR C nor A AND (B OR C). If anyone knows how it is interpreted I would be thankful. Output the toString() of the returned Query instances to see how QueryParser interpreted things. Erik
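The toString() suggestion as a 1.4-era sketch (this assumes an Analyzer instance named analyzer; the printed forms are deliberately not shown here, since they depend on the QueryParser version in use):

```java
// Parse the ambiguous expression and both explicit groupings, then
// compare the printed forms to see which grouping the parser chose.
Query q1 = QueryParser.parse("A AND B OR C", "contents", analyzer);
Query q2 = QueryParser.parse("(A AND B) OR C", "contents", analyzer);
Query q3 = QueryParser.parse("A AND (B OR C)", "contents", analyzer);
System.out.println(q1.toString("contents"));
System.out.println(q2.toString("contents"));
System.out.println(q3.toString("contents"));
```

When in doubt, writing the parentheses explicitly in the query string is the reliable fix.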
Re: sorted search
Sorting by String uses up lots more RAM than a numeric sort. If you use a numeric (yet lexicographically orderable) date format (e.g. YYYYMMDD) you'll most likely see better performance. Erik

On Feb 24, 2005, at 1:01 PM, Yura Smolsky wrote: Hello, lucene-user. I have an index with many documents, more than 40 million. Each document has a DateField (it is the time stamp of the document). I need the most recent results only. I use a single instance of IndexSearcher. When I perform a sorted search on this index: Sort sort = new Sort(); sort.setSort(new SortField[] { new SortField("modified", SortField.STRING, true) }); Hits hits = searcher.search(QueryParser.parse("good", "content", new StandardAnalyzer()), sort); then search speed is not good. Today I have tried a search without sorting by modified, but with sort by relevance. Speed was much better! I think that sort by DateField is very slow. Maybe I do something wrong with this kind of sorted search? Can you give me advice about this? Thanks. Yura Smolsky.
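The orderable-date-format suggestion as a sketch: produce the field value at index time with day granularity, which both sorts correctly and creates far fewer distinct terms than a millisecond timestamp (the helper name is invented):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DateKey {
    // yyyyMMdd sorts correctly as both a String and an int, and day
    // granularity keeps the number of distinct terms (and therefore the
    // sort cache) small compared to raw timestamps.
    static String toDateKey(long millis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(millis));
    }

    public static void main(String[] args) {
        System.out.println(toDateKey(0L)); // 19700101
    }
}
```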
Re: Using the highlighter from the sandbox with a prefix query.
On Feb 21, 2005, at 10:20 AM, Michael Celona wrote: I am using query = searcher.rewrite( query ); and it is throwing java.lang.UnsupportedOperationException. Am I able to use the searcher rewrite method like this? What's the full stack trace? Erik
Re: Using the highlighter from the sandbox with a prefix query.
On Feb 21, 2005, at 10:53 AM, Michael Celona wrote: That's the only stack I get. One thing to mention: I am using a MultiSearcher to rewrite the queries. I tried... query = searcher_last.rewrite( query ); query = searcher_cur.rewrite( query ); using IndexSearcher and I don't get an error... However, I'm not able to highlight wildcard queries. I use Highlighter for lucenebook.com and have two indexes that I search with MultiSearcher. Here's how I highlight:

IndexReader reader = readers[indexIndex];
QueryScorer scorer = new QueryScorer(query.rewrite(reader));
SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<span class=\"highlight\">", "</span>");
Highlighter highlighter = new Highlighter(formatter, scorer);

I get the appropriate IndexReader for the document being highlighted. You can get the index _index_ this way:

int indexIndex = searcher.subSearcher(hits.id(position));

Hope this helps. Erik
Re: More Analyzer Question
The problem is your KeywordSynonymAnalyzer is not truly a keyword analyzer, in that it is tokenizing the field into parts. So Document 1 has [test] and [mario] as tokens that come from the LowerCaseTokenizer. Look at Lucene's svn repository under contrib/analyzers and you'll see a KeywordTokenizer and corresponding KeywordAnalyzer you can use. Erik

On Feb 18, 2005, at 5:44 PM, Luke Shannon wrote: I have created an Analyzer that I think should just be converting to lower case and adding synonyms in the index (it is at the end of the email). The problem is, after running it I get one more result than I was expecting (Document 1 should not be there): Running testNameCombination1, expecting: 1 result The query: +(type:138) +(name:mario*) returned 2

Start Listing documents:
Document: 0 contains: Name: Text<name:mario test> Desc: Text<desc:this is test from mario>
Document: 1 contains: Name: Text<name:test mario> Desc: Text<desc:retro>
End Listing documents

Those same 2 documents in Luke look like this:
Document 0 Text<name:mario test> Text<desc:this is test from mario>
Document 1 Text<name:test mario> Text<desc:retro>

That looks correct to me. The query shouldn't match Document 1. The analyzer used on this field is below and is applied like so:

//set the default
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new SynonymAnalyzer(new FBSynonymEngine()));
//the analyzer for the name field (only converts to lower case and adds synonyms)
analyzer.addAnalyzer("name", new KeywordSynonymAnalyzer(new FBSynonymEngine()));

Any help would be appreciated.
Thanks, Luke

import org.apache.lucene.analysis.*;
import java.io.Reader;

public class KeywordSynonymAnalyzer extends Analyzer {
    private SynonymEngine engine;

    public KeywordSynonymAnalyzer(SynonymEngine engine) {
        this.engine = engine;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new SynonymFilter(new LowerCaseTokenizer(reader), engine);
        return result;
    }
}

Luke Shannon | Software Developer FutureBrand Toronto 207 Queen's Quay, Suite 400 Toronto, ON, M5J 1A7 416 642 7935 (office)
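The mismatch described above can be simulated without Lucene: LowerCaseTokenizer splits on non-letters, while a KeywordTokenizer-style analyzer keeps the entire field value as one token. Both helpers below are rough, invented approximations of those two behaviours:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizerDemo {
    // Roughly what LowerCaseTokenizer does: lowercase, split on non-letters.
    static List<String> lowerCaseTokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("[^\\p{L}]+"));
    }

    // Roughly what KeywordTokenizer (plus a lowercase filter) does:
    // one token containing the whole value.
    static List<String> keywordTokenize(String text) {
        return Arrays.asList(text.toLowerCase());
    }

    public static void main(String[] args) {
        // "test mario" yields a standalone [mario] token under the
        // splitting tokenizer, which is why the prefix query name:mario*
        // matched Document 1:
        System.out.println(lowerCaseTokenize("test mario")); // [test, mario]
        // As a single keyword token, nothing starts with "mario":
        System.out.println(keywordTokenize("test mario"));   // [test mario]
    }
}
```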
Re: Lucene in the Humanities
On Feb 19, 2005, at 3:52 AM, Paul Elschot wrote: By lowercasing the query text and searching in title_lc? Well sure, but how about this query: title:Something AND anotherField:someOtherValue QueryParser, as-is, won't be able to do field-name swapping. I could certainly apply that technique on all the structured queries that I build up with the API, but with QueryParser it is trickier. I'm definitely open for suggestions on improving how case is handled. Overriding this (1.4.3 QueryParser.jj, line 286) might work: protected Query getFieldQuery(String field, String queryText) throws ParseException { ... } It will be called by the parser for both parts of the query above, so one could change the field depending on the requested type of search and the field name in the query. But that wouldn't work for any other type of query: title:somethingFuzzy~ Though now that I think more about it, a simple s/title:/title_orig:/ before parsing would work, and of course make the default field dynamic. I need to evaluate how many fields would need to be done this way - it'd be several. Thanks for the food for thought! The only drawback now is that I'm duplicating indexes, but that is only an issue in how long it takes to rebuild the index from scratch (currently about 20 minutes or so on a good day - when the machine isn't swamped). Once the users get the hang of this, you might end up having to quadruple the index, or more. Why would that be? They want a case sensitive/insensitive switch. How would it expand beyond that? Erik
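The override quoted above could look roughly like this, using the field names from this thread (the class name and the caseSensitive switch are invented; as noted, it would not cover fuzzy or wildcard queries, which go through other parser hooks):

```java
// Sketch against the 1.4.x QueryParser API.
public class CaseSwitchingQueryParser extends QueryParser {
    private boolean caseSensitive;

    public CaseSwitchingQueryParser(String field, Analyzer analyzer,
                                    boolean caseSensitive) {
        super(field, analyzer);
        this.caseSensitive = caseSensitive;
    }

    protected Query getFieldQuery(String field, String queryText)
            throws ParseException {
        // Map the user-facing field name onto the case-variant field.
        if ("title".equals(field)) {
            field = caseSensitive ? "title_orig" : "title_lc";
        }
        return super.getFieldQuery(field, queryText);
    }
}
```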
JavaLobby Lucene presentation
I recorded a Meet Lucene presentation at JavaLobby. It is a multimedia Flash video that shows slides with my voice recorded over them and spans just over 20 minutes (you can jump to specific slides). Check it out here: http://www.javalobby.org/members-only/eps/meet-lucene/index.html?source=archives It's tailored as a high-level overview, and a quick one at that. It'll certainly be too basic for most everyone on this list, but maybe your manager would enjoy it :) It's awkward to record this type of thing, and it sounds dry to me as I ended up having to script what I was going to say and read it rather than ad-lib like I would in a face-to-face presentation. Ah's and um's don't work well in an audio-only track. I'd love to hear feedback on it (perhaps best through the JavaLobby forum associated with the presentation). Erik
Re: reuse of TokenStream
I'm confused about how you're reusing a TokenStream object. General Lucene usage would not involve a developer dealing with it directly. Could you share an example of what you're up to? I'm not sure if this is related, but a technique I'm using is to index the same Document instance into two different IndexWriter instances (each uses a different Analyzer) - and this is working fine. Erik

On Feb 17, 2005, at 6:04 AM, Harald Kirsch wrote: Hi, is it thread safe to reuse the same TokenStream object for several fields of a document, or does the IndexWriter try to parallelise tokenization of the fields of a single document? Similar question: is it safe to reuse the same TokenStream object for several documents if I use IndexWriter.addDocument() in a loop? Or does addDocument only put the work into a queue where tasks are taken out for parallel indexing by several threads? Thanks, Harald.

-- Harald Kirsch | [EMAIL PROTECTED] | +44 (0) 1223/49-2593 BioMed Information Extraction: http://www.ebi.ac.uk/Rebholz-srv/whatizit
Lucene in the Humanties
It's about time I actually did something real with Lucene :) I have been working with the Applied Research in Patacriticism group at the University of Virginia for a few months and am finally ready to present what I've been doing. The primary focus of my group is working with the Rossetti Archive - poems, artwork, interpretations, collections, and so on of Dante Gabriel Rossetti. I was initially brought on to build a collection and exhibit system, though I got detoured a bit as I got involved in applying Lucene to the archive to replace their existing search system. The existing system used an old version of Tamino with XPath queries. Tamino is not at fault here, at least not entirely, because our data is in a very complicated set of XML files with a lot of non-normalized and legacy metadata - getting at things via XPath is challenging and practically impossible in many cases. My work is now presentable at http://www.rossettiarchive.org/rose (rose is for ROsetti SEarch). This system is implicitly designed for academics who are delving into Rossetti's work, so it may not be all that interesting for most of you. Have fun and send me any interesting things you discover, especially any issues you may encounter. Here are some numbers to give you a sense of what is going on underneath... There are currently 4,983 XML files, totaling about 110MB. Without getting into a lot of details of the confusing domain, there are basically 3 types of XML files (works, pictures, and transcripts). It is important that there be case-sensitive and case-insensitive searches. To accomplish that, a custom analyzer is used in two different modes, one applying a LowerCaseFilter and one not, with the same documents written to two different indexes. There is one particular type of XML file that gets indexed as two different types of documents (a specialized summary/header type).
In this first set of indexes, it is basically a one-to-one mapping of XML file to Lucene Document (with one type being indexed twice in different ways) - all said, there are 5539 documents in each of the two main indexes. The transcript type gets sliced into another set of original-case and lowercased indexes, with each document in that index representing a document division (a div element in the XML). There are 12326 documents in each of these div-level indexes. All said, the 4 indexes built total about 3GB in size - I'm storing several fields in order to hit-highlight. Only one of these indexes is being hit at a time - which index is used depends on what parameters you use when querying. Lucene brought the search times into a usable, and impressive to the scholars, state. The previous search solution often timed the browser out! Search results now are in the milliseconds range. The amount of data is tiny compared to most usages of Lucene, but things are getting interesting in other ways. There has been little tuning in terms of ranking quality so far, but this is the next area of work. There is one document type that is more important than the others, and it is being boosted during indexing. There is now a growing interest in tinkering with all the new knobs and dials that are now possible. Putting in similar and more-like-this features is desired and will be relatively straightforward to implement. I'm currently using a catch-all aggregate-field technique for a default field for QueryParser searching, though multi-field expansion is desirable instead. So, I've got my homework to do and catch up on all the goodness that has been mentioned in this list recently regarding all of these techniques. An area where I'd like to solicit more help from the community relates to something akin to personalization. The scholars would like to be able to tune results based on the role (such as art historian) of the person searching the site.
This would involve some type of training or continual learning process, so that someone searching feeds back preferences implicitly for their queries by visiting the actual documents that are of interest. Now that the scholars have seen what is possible (I showed them the cool SearchMorph comparison page searching Wikipedia for rossetti), they want more and more! So - here's where I'm soliciting feedback - who's doing these types of things in the realm of Humanities? Where should we go from here in terms of researching and applying the types of features dreamed about here? How would you recommend implementing these types of features? I'd be happy to share more about what I've done under the covers. As you may be able to tell, the web UI is Tapestry for the search and results pages (though you won't be able to tell from the URL's you'll see :). The UI was designed primarily by one of our very graphical/CSS savvy post doc research associates, and was designed with the research scholar in mind. I continue
Re: Lius
Rida, Please add your project to the Lucene PoweredBy page on the wiki. Also - I moderated in your messages - so please subscribe to the list to send to it in the future. Erik

On Feb 17, 2005, at 5:13 PM, Rida Benjelloun wrote: Hi, I've just released an indexing framework based on Lucene which is named LIUS. LIUS is written in Java and it adds to Lucene indexing capabilities for many file formats, such as: MS Word, MS Excel, MS PowerPoint, RTF, PDF, XML, HTML, TXT, the OpenOffice suite and JavaBeans. All the indexing process is based on a configuration file. You can visit these links for more information about LIUS; documentation is available in English and French: www.bibl.ulaval.ca/lius/index.en.html www.sourceforge.net/projects/lius
Re: Lucene in the Humanities
And before too many replies happen on this thread, I've corrected the spelling mistake in the subject! :O
Re: Lucene in the Humanities
On Feb 18, 2005, at 3:25 PM, Luke Shannon wrote: Nice work Erik. I would like to spend more time playing with it, but I saw a few things I really liked. When a specific query turns up no results you prompt the client to perform a free-form search. Less savvy search users will benefit from this strategy. That's merely an artifact of all searches going to the results page, which just shows the free-form search on it. I also like the display of information when you select a result. Everything is at your fingertips without clutter. For comparison, the older site's search is here: http://jefferson.village.virginia.edu:2020/search.html (don't bother trying it - it's SLOOOW) And also for comparison, here is an older look: http://jefferson.village.virginia.edu:8090/styler/servlet/SaxonServlet?source=http://jefferson.village.virginia.edu:2020/tamino/files/1-1847.s244.raw.xml&style=http://jefferson.village.virginia.edu:2020/tamino/rossetti.xsl&clear-stylesheet-cache=yes Dig that URL! The new look and URL is here: http://www.rossettiarchive.org/docs/1-1847.s244.raw.html I did get this error when a name search failed to turn up results and I clicked 'help' in the free-form search row (the second row): Page 'help-freeform.html' not found in application namespace. I've corrected this and it'll be corrected in my next deployment :) So nice to have a community of testers! Thanks. Erik
Re: Lucene in the Humanities
On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote: Erik, Just curious: it would seem easier to use multiple fields for the original case and lowercase searching. Is there any particular reason you analyzed the documents to multiple indexes instead of multiple fields? I considered that approach, however to expose QueryParser I'd have to get tricky. If I have title_orig and title_lc fields, how would I allow freeform queries of title:something? Erik

P.S. It's fun to see the types of queries folks have already tried since I sent this e-mail (repeated queries are possibly someone paging):

INFO: Query = +title:dog +archivetype:rap : hits = 3
INFO: Query = +title:dog +archivetype:rap : hits = 3
INFO: Query = +title:dog +archivetype:rap : hits = 3
INFO: Query = rosetti : hits = 3
INFO: Query = +year:[ TO 1911] +(archivetype:radheader OR archivetype:rap) : hits = 2182
INFO: Query = advil : hits = 0
INFO: Query = test : hits = 24
INFO: Query = td : hits = 1
INFO: Query = td : hits = 1
INFO: Query = woman : hits = 363
INFO: Query = woman : hits = 363
INFO: Query = hello : hits = 0
INFO: Query = +rosetta +archivetype:rap : hits = 0
INFO: Query = +year:[ TO 1911] +(archivetype:radheader OR archivetype:rap) : hits = 2182
INFO: Query = poem : hits = 316
INFO: Query = crisis : hits = 7
INFO: Query = crisis at every moment : hits = 1
INFO: Query = toy : hits = 41
INFO: Query = title:echer : hits = 0
INFO: Query = senori : hits = 0
INFO: Query = +dear +sirs : hits = 11
INFO: Query = title:more : hits = 0
INFO: Query = more : hits = 365
INFO: Query = title:rossetti : hits = 329
INFO: Query = +blessed +damozel : hits = 103
INFO: Query = title:test : hits = 0
INFO: Query = +test +archivetype:radheader : hits = 3
INFO: Query = crisis at every moment : hits = 1
INFO: Query = rome : hits = 70
INFO: Query = fdshjkfjkhkfad : hits = 0
INFO: Query = stone : hits = 153
INFO: Query = +title:shakespeare +archivetype:radheader : hits = 1
INFO: Query = title:xx i ll : hits = 0
INFO: Query = +dog +cat : hits = 6
INFO: Query = +year:[1280 TO 1305] +archivetype:radheader : hits = 0
INFO: Query = guru : hits = 0
INFO: Query = philosophy : hits = 14
INFO: Query = title:install : hits = 0
INFO: Query = +title:install +archivetype:radheader : hits = 0
INFO: Query = help freeform.html : hits = 0
INFO: Query = help freeform.html : hits = 0
INFO: Query = install : hits = 1
INFO: Query = life : hits = 554
INFO: Query = life : hits = 554
INFO: Query = life : hits = 554
Re: Lucene in the Humanities
On Feb 18, 2005, at 6:37 PM, Paul Elschot wrote: On Friday 18 February 2005 21:55, Erik Hatcher wrote: On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote: Erik, Just curious: it would seem easier to use multiple fields for the original case and lowercase searching. Is there any particular reason you analyzed the documents to multiple indexes instead of multiple fields? I considered that approach, however to expose QueryParser I'd have to get tricky. If I have title_orig and title_lc fields, how would I allow freeform queries of title:something? By lowercasing the querytext and searching in title_lc ? Well sure, but how about this query: title:Something AND anotherField:someOtherValue QueryParser, as-is, won't be able to do field-name swapping. I could certainly apply that technique on all the structured queries that I build up with the API, but with QueryParser it is trickier. I'm definitely open for suggestions on improving how case is handled. The only drawback now is that I'm duplicating indexes, but that is only an issue in how long it takes to rebuild the index from scratch (currently about 20 minutes or so on a good day - when the machine isn't swamped). Erik
Re: Query Question
On Feb 17, 2005, at 5:51 PM, Luke Shannon wrote: My manager is now totally stuck about being able to query data with * in it. He's gonna have to wait a bit longer; you've got a slightly tricky situation on your hands. WildcardQuery(new Term("name", "*home\**")); The \* is the problem. WildcardQuery doesn't deal with escaping like you're trying. Your query is essentially this now: home\* Where backslash has no special meaning at all... you're literally looking for all terms that start with home followed by a backslash. Two asterisks at the end really collapse into a single one logically. Any theories as to why it would not match: Document (relevant fields): Keyword<type:203> Keyword<name:marcipan + home*> Is the \ escaping both * characters? So, again, no escaping is being done here. You're a bit stuck in this situation because * (and ?) are special to WildcardQuery, and it does no escaping. Two options I can think of: - Build your own clone of WildcardQuery that does escaping, or perhaps change the wildcard characters to something you do not index and use those instead. - Replace asterisks in the terms indexed with some other non-wildcard character, then replace it in your queries as appropriate. Erik
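The second option above (replace literal asterisks with a non-wildcard sentinel, applied both at index time and at query time) can be sketched like this; the sentinel character is an arbitrary choice and must be something that never occurs in the data:

```java
public class WildcardEscape {
    // Any character absent from the data and not special to WildcardQuery
    // will do as a sentinel.
    static final char SENTINEL = '\u0001';

    // Applied to field values before indexing AND to user query terms, so
    // a literal '*' in the data becomes an ordinary indexed character,
    // while a trailing '*' the user appends still acts as a wildcard.
    static String encodeLiteralAsterisks(String text) {
        return text.replace('*', SENTINEL);
    }

    public static void main(String[] args) {
        String indexedTerm = encodeLiteralAsterisks("home*");
        String prefixQueryTerm = indexedTerm + "*"; // the real wildcard
        System.out.println(indexedTerm.indexOf('*') == -1); // true
        System.out.println(prefixQueryTerm.endsWith("*"));  // true
    }
}
```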
Re: ParallelMultiSearcher Question
If you close a Searcher that goes through a RemoteSearchable, you'll close the remote index. I learned this by experimentation for Lucene in Action and added a warning there: http://www.lucenebook.com/search?query=RemoteSearchable+close On Feb 17, 2005, at 8:27 PM, Youngho Cho wrote: Hello, Is there any pointer on how to close an index, and on how the server deals with index updates, when using ParallelMultiSearcher with the built-in RemoteSearchable? Need your help. Thanks, Youngho - Original Message - From: Youngho Cho [EMAIL PROTECTED] To: Lucene Users List lucene-user@jakarta.apache.org Sent: Thursday, February 17, 2005 6:29 PM Subject: ParallelMultiSearcher Question Hello, I would like to use ParallelMultiSearcher with a few RemoteSearchables. If one of the remote servers is down, can I close() the ParallelMultiSearcher and make a new ParallelMultiSearcher with the other live RemoteSearchables? Thanks. Youngho
Re: big index and multi threaded IndexSearcher
Are you using multiple IndexSearcher instances? Or only one, shared across multiple threads? If using a single shared IndexSearcher instance doesn't help, it may be beneficial to port your code to Java and try it there. I'm just now getting into PyLucene myself - building a demo for a Unix User's Group presentation I'm giving. Erik On Feb 16, 2005, at 3:04 PM, Yura Smolsky wrote: Hello. I use PyLucene, the Python port of Lucene. I have a problem using a big index (50GB) with IndexSearcher from many threads. I use IndexSearcher from PyLucene's PythonThread. It's really a wrapper around a Java/libgcj thread that Python is tricked into thinking is one of its own. The core of the problem: when I have many threads (more than 5) I receive this exception: File /usr/lib/python2.4/site-packages/PyLucene.py, line 2241, in search def search(*args): return _PyLucene.Searcher_search(*args) ValueError: java.lang.OutOfMemoryError No stacktrace available When I decrease the number of threads to 3 or even 1, then search works. How can having many threads cause this exception? I have 2GB of memory, and with one thread the process takes around 1200-1300MB. Andi Vajda suggested that there may be overhead involved in having multiple threads against a given index. Does anyone here have experience in handling big indexes with many threads? Any ideas are appreciated. Yura Smolsky.
Re: Fieldinformation from Index
On Feb 15, 2005, at 11:45 AM, Karl Koch wrote: 2) I need to know which Analyzer was used to index a field. One important rule, as we all know, is to use the same analyzer for indexing and searching a field. Is this information stored in the index, or is it fully the responsibility of the application developer? The analyzer is not stored in the index, nor is its name. I believe this was discussed in the past, though. It's not a rule that the same analyzer be used for both indexing and searching, and there are cases where it makes sense to use different ones. The analyzers must be compatible, though. Erik
Re: DateFilter on UnStored field
On Feb 14, 2005, at 6:27 AM, Sanyi wrote: However, DateFilter will not work on fields indexed as 2004-11-05. DateFilter only works on fields that were indexed using the DateField. Well, can you post a short example here? Where I currently type xxx.UnStored(...), can I simply type xxx.DateField(...)? Does it take strings like 2004-11-05? DateField has a utility method to return a String: DateField.timeToString(file.lastModified()) You'd use that String to pass to Field.UnStored. I recommend, though, that you use a different format, such as the YYYY-MM-DD format you're using. One option is to use a QueryFilter instead, filtering with a RangeQuery. I've read somewhere that classic range filtering can easily exceed the maximum number of boolean query clauses. I need to filter a very large range of dates with day accuracy and I don't want to increase the max. clause count to very high values. So, I decided to use DateFilter, which has no such problems AFAIK. Right! Lucene's latest codebase (though not the 1.4.x release) includes RangeFilter, which would do the trick for you. If you want to stick with Lucene 1.4.x, that's fine... just grab the code for that filter and use it as a custom filter - it's compatible with 1.4.x. How much impact does DateFilter have on search times? It depends on whether you instantiate a new filter for each search. Building a filter requires scanning through the terms in the index to build a BitSet for the documents that fall in that range. Filters are best used over multiple searches. Erik
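The reason a fixed-width YYYY-MM-DD form works for range filtering is that Lucene compares terms lexicographically, and zero-padded date strings sort in chronological order. A plain-Java illustration (class and method names are mine, not Lucene API):

```java
// Zero-padded YYYY-MM-DD strings compare lexicographically in the same
// order as the dates they represent, which is exactly what a term-range
// filter relies on. This class just demonstrates that property.
public class DateOrder {
    public static boolean inRange(String date, String lower, String upper) {
        // String.compareTo is a plain lexicographic comparison.
        return date.compareTo(lower) >= 0 && date.compareTo(upper) <= 0;
    }
}
```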
Re: What does [] do to a query and what's up with lucene.apache.org?
Jim, The Lucene website is transitioning to the new top-level space. I have checked out the current site to the new lucene.apache.org area and set up redirects from the old Jakarta URLs. The source code, though, is not an official part of the website. Thanks to our conversion to Subversion, though, the source is browsable starting here: http://svn.apache.org/repos/asf/lucene/java/trunk The HTML of the website will need link adjustments to get everything back in shape. The brackets are documented here: http://lucene.apache.org/queryparsersyntax.html Erik On Feb 14, 2005, at 10:31 AM, Jim Lynch wrote: First I'm getting a The requested URL could not be retrieved --- - While trying to retrieve the URL: http://lucene.apache.org/src/test/org/apache/lucene/queryParser/TestQueryParser.java The following error was encountered: Unable to determine IP address from host name for lucene.apache.org. Guess the system is down. I'm getting this error: org.apache.lucene.queryParser.ParseException: Encountered "is" at line 1, column 15. Was expecting: "]" ... when I tried to parse the following string: [this is a test]. I can't find any documentation that tells me what the brackets do to a query. I had a user that was used to another search engine that used [] to do proximity or near searches and tried it on this one. Actually I'd like to see the documentation for what the parser does. All that is mentioned in the javadoc is + - and (). Obviously there are more special characters. Thanks, Jim.
Re: Are wildcard searches supposed to work with fields that are saved, indexed and not tokenized?
On Feb 14, 2005, at 12:40 PM, Jim Lynch wrote: I was trying to write some documentation on how to use the tool and issued a search for: contact:DENNIS MORROW Is that literally the QueryParser string you entered? If so, that most likely parses to: contact:DENNIS OR defaultField:MORROW And now I get 648 hits, but in some of them the contact doesn't even remotely resemble the search pattern. For instance, here is what the contact fields contain for some of these hits: Contact: GENERIC CONTACT Contact: Andre Gardinalli Contact: Brett Morrow (that's especially interesting) Contact: KEN PATTERSON And of course there are some with Dennis' name too. Any idea why this is happening? I'm using the QueryParser.parse method. I'm not sure you'll be able to do this with QueryParser with spaces in an untokenized field. First try it with an API-created WildcardQuery to be sure it works the way you expect. Erik
Re: Newbie questions
On Feb 14, 2005, at 2:40 PM, Paul Jans wrote: Hi again, So is SqlDirectory recommended for use in a cluster to work around the accessibility problem, or are people using NFS or a standalone server instead? Neither. As far as I know, Berkeley DB is the only viable DB implementation currently. NFS has notoriously had issues with Lucene and file locking. Search the archives for more details on this. Erik
Re: Numbers in Index
On Feb 14, 2005, at 4:32 PM, Miro Max wrote: Actually I'm using StandardAnalyzer during my index process, but when I browse the index with Luke there are also numbers inside. Which analyzer should I use to eliminate these from my index, or should I specify this in my stopword list? Don't use a stop word list to remove numbers. You could do a couple of things: use SimpleAnalyzer, or write a custom analyzer which uses the parts of StandardAnalyzer and applies a number-removal filter at the end. Erik
Re: DateFilter on UnStored field
Following up on PA's reply. Yes, DateFilter works on *indexed* values, so whether a field is stored or not is irrelevant. However, DateFilter will not work on fields indexed as 2004-11-05. DateFilter only works on fields that were indexed using the DateField. One option is to use a QueryFilter instead, filtering with a RangeQuery. Erik On Feb 13, 2005, at 7:09 AM, Sanyi wrote: Hi! Does DateFilter work on fields indexed as UnStored? Can I filter an UnStored field with values like 2004-11-05 ? Regards, Sanyi
Re: Multiple Keywords/Keyphrases fields
The real question to answer is what types of queries you're planning on making. Rather than look at it from indexing forward, consider it from searching backwards. How will users query using those keyword phrases? Erik On Feb 12, 2005, at 3:08 PM, Owen Densmore wrote: I'm getting a bit more serious about the final form of our lucene index. Each document has DocNumber, Authors, Title, Abstract, and Keywords. By Keywords, I mean a comma separated list, each entry having possibly many terms in a phrase like: temporal infomax, finite state automata, Markov chains, conditional entropy, neural information processing I presume I should be using a field Keywords which has many entries or instances per document (one per comma separated phrase). But I'm not sure of the right way to handle all this. My assumption is that I should analyze them individually, just as we do for free text (the Abstract, for example), thus in the example above having 5 entries of the nature doc.add(Field.Text(Keywords, finite state automata)); etc, analyzing them because these are author-supplied strings with no canonical form. For guidance, I looked in the archive and found the attached email, but I didn't see the answer. (I'm not concerned about the dups; I presume that is equivalent to a boost of some sort.) Does this seem right? Thanks once again. Owen From: [EMAIL PROTECTED] [EMAIL PROTECTED] Subject: Multiple equal Fields? Date: Tue, 17 Feb 2004 12:47:58 +0100 Hi! What happens if I do this: doc.add(Field.Text(foo, bar)); doc.add(Field.Text(foo, blah)); Is there a field foo with value blah, or are there two foos (actually not possible), or is there one foo with the values bar and blah? And what does happen in this case: doc.add(Field.Text(foo, bar)); doc.add(Field.Text(foo, bar)); doc.add(Field.Text(foo, bar)); Does lucene store this only once?
Timo
Re: Negative Match
On Feb 11, 2005, at 9:52 AM, Luke Shannon wrote: Hey Erik; The problem with that approach is I get document that don't have a kcfileupload field. This makes sense because these documents don't match the prohibited clause, but doesn't fit with the requirements of the system. Ok, so instead of using the dummy field with a single dummy value, use a dummy field to list the field names. Field.Keyword(fields,kcfileupload), but only for the documents that should have it, of course. Then use a query like (using QueryParser syntax, but do it with the API as you have since QueryParser doesn't support leading wildcards): +fields:kcfileupload -kcfileupload:*jpg* Again, your approach is risky with term expansion. Get more than 1,024 unique kcfileupload values and you'll see! Erik What I like best about this approach is it doesn't require a filter. The system I integrate with is presently designed to accept a query object. I wasn't looking forward to having to add the possibility that queries might require filters. I may have to still do this, but for now I would like to try this and see how it goes. Thanks, Luke - Original Message - From: Erik Hatcher [EMAIL PROTECTED] To: Lucene Users List lucene-user@jakarta.apache.org Sent: Thursday, February 10, 2005 7:23 PM Subject: Re: Negative Match On Feb 10, 2005, at 4:06 PM, Luke Shannon wrote: I think I found a pretty good way to do a negative match. In this query I am looking for all the Documents that have a kcfileupload field with any value except for jpg. Query negativeMatch = new WildcardQuery(new Term(kcfileupload, *jpg*)); BooleanQuery typeNegAll = new BooleanQuery(); Query allResults = new WildcardQuery(new Term(kcfileupload, *)); IndexSearcher searcher = new IndexSearcher(fsDir); BooleanClause clause = new BooleanClause(negativeMatch, false, true); typeNegAll.add(allResults, true, false); typeNegAll.add(clause); Hits hits = searcher.search(typeNegAll); With the little testing I have done this *seems* to work. 
Does anyone see a problem with this approach? Sure, do you realize what WildcardQuery does under the covers? It literally expands to a BooleanQuery for all terms that match the pattern. There is an adjustable limit built-in of 1,024 clauses to BooleanQuery. You obviously have not hit that limit ... yet! You're better off using the advice offered on this thread previously: create a single dummy field with a fixed value for all documents. Combine a TermQuery for that dummy value with a prohibited clause like your negativeMatch above. Erik
Re: Multiple Fields with same name
On Feb 10, 2005, at 11:48 PM, Ramon Aseniero wrote: If I store multiple fields with same name for example Author with 3 values bob,jane,bill once I retrieve the doc are the values in the same order? Did you try it? :) Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Newbie questions
On Feb 10, 2005, at 5:00 PM, Paul Jans wrote: A couple of newbie questions. I've searched the archives and read the Javadoc but I'm still having trouble figuring these out. Don't forget to get your copy of Lucene in Action too :) 1. What's the best way to index and handle queries like the following: Find me all users with (a CS degree and a GPA 3.0) or (a Math degree and a GPA 3.5). Some suggestions: index degree as a Keyword field. Pad GPA, so that all of them are of the form #.# (or #.## maybe). Numerics need to be lexicographically ordered, and thus padded. With the right analyzer (see the AnalysisParalysis page on the wiki) you could use this type of query with QueryParser: degree:cs AND gpa:[3.0 TO 9.9] 2. What are the best practices for using Lucene in a clustered J2EE environment? A standalone index/search server or storing the index in the database or something else ? There is a LuceneRAR project that is still in its infancy here: https://lucenerar.dev.java.net/ You can also store a Lucene index in Berkeley DB (look at the /contrib/db area of the source code repository). However, most projects do fine with cruder techniques, such as sharing the Lucene index on a common drive and ensuring that locking is configured to use the common drive also. Erik
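Why the padding advice matters: term ranges compare strings, not numbers, so unpadded numerics misorder ("10.0" sorts before "9.9" as a string). Formatting every GPA to the same #.# width restores numeric order. A plain-Java sketch (the class and method names are mine for illustration):

```java
import java.util.Locale;

// Fixed-width formatting makes lexicographic order match numeric order,
// which is what Lucene's term-based range queries rely on. This assumes
// GPAs have a single integer digit (i.e. max 9.9), per the #.# scheme.
public class GpaPad {
    public static String pad(double gpa) {
        return String.format(Locale.US, "%.1f", gpa);
    }
}
```

Without a fixed width the scheme breaks down as soon as widths differ, e.g. the string "10.0" compares less than "9.9".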
Re: Newbie questions
On Feb 11, 2005, at 1:36 PM, Erik Hatcher wrote: Find me all users with (a CS degree and a GPA 3.0) or (a Math degree and a GPA 3.5). Some suggestions: index degree as a Keyword field. Pad GPA, so that all of them are of the form #.# (or #.## maybe). Numerics need to be lexicographically ordered, and thus padded. With the right analyzer (see the AnalysisParalysis page on the wiki) you could use this type of query with QueryParser: degree:cs AND gpa:[3.0 TO 9.9] oops, to be completely technically correct, use curly brackets to get an exclusive (greater-than) rather than inclusive (greater-than-or-equal) range: degree:cs AND gpa:{3.0 TO 9.9} (I'll assume GPA's only go to 4.0 or 5.0 :) Erik
Re: Multiple Fields with same name
On Feb 11, 2005, at 3:51 PM, Ramon Aseniero wrote: I have not tried it -- Are there examples in the Lucene book? (I just bought the book and can't find anything related to my problem.) No, this particular item is not covered in the book. My initial response was a succinct way of making a point. A lot of times it is worth investing in giving something a try with a little bit of code, and doing this with Lucene is trivial. I don't want to discourage anyone from asking questions, but rather encourage us all to do a little tinkering to find out things for ourselves and then ask if our assumptions don't come out as expected. In fact, I'd have to mock up an example to find out myself for sure, but my hunch is that Lucene would maintain the order, as it probably doesn't make sense algorithmically to do anything but keep the order. Erik Thanks, Ramon -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Friday, February 11, 2005 7:34 AM To: Lucene Users List Subject: Re: Multiple Fields with same name On Feb 10, 2005, at 11:48 PM, Ramon Aseniero wrote: If I store multiple fields with same name for example Author with 3 values bob,jane,bill once I retrieve the doc are the values in the same order? Did you try it? :) Erik
Re: wildcards, stemming and searching
How would you deal with a query like a*z though? I suspect, however, that you only care about suffix queries and stemming those. If that's the case, then you could subclass getWildcardQuery and do internal stemming (remove the trailing wildcard, run it through the analyzer directly there, and return a modified WildcardQuery instance). With wildcard queries though, this is risky. Prefixes won't necessarily stem to what the full word would stem to. Erik On Feb 9, 2005, at 6:26 PM, aaz wrote: Hi, We are not using QueryParser and have some custom Query construction. We have an index that indexes various documents. Each document is Analyzed and indexed via StandardTokenizer() - StandardFilter() - LowercaseFilter() - StopFilter() - PorterStemFilter() We also want to support wildcard queries, hence on an inbound query we need to deal with * in the value side of the comparison. We also need to analyze the value side of the query against the same analyzer with which the index was built. This leads to some problems and we would like your opinion on a solution. User queries: somefield = united* After the analyzer hits united*, we get back unit. Hence we cannot detect that the user requested a wildcard. Let's say we come up with some solution to escape the * char before the Analyzer hits it. For example somefield = united* - unitedXXWILDCARDXX After analysis this then becomes unitedxxwildcardxx, which we can then turn into a WildcardQuery united* The problem here is that the term united will never exist in the index due to the stemming, which did not stem properly due to our escape mechanism. How can I solve this problem?
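The core of Erik's suggestion is to detect and strip the trailing wildcard *before* the term reaches the analyzer, rather than escaping it and letting the stemmer mangle it. The string handling is trivial; wiring it into a QueryParser/getWildcardQuery subclass is omitted, and the class name here is my own:

```java
// Detect a trailing '*' and strip it before analysis, remembering that the
// user asked for a wildcard. "united*" then gets analyzed as "united"
// (which stems normally) instead of as an escaped token the stemmer breaks.
public class SuffixWildcard {
    public static boolean hasTrailingWildcard(String term) {
        return term.endsWith("*");
    }

    public static String stripTrailingWildcard(String term) {
        return hasTrailingWildcard(term)
                ? term.substring(0, term.length() - 1)
                : term;
    }
}
```

Note Erik's caveat still applies: the stem of the prefix "united" (e.g. "unit" under the Porter stemmer) is not necessarily a prefix of the stems of all words starting with "united", so a wildcard query built this way can miss or over-match terms.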
Re: Negative Match
On Feb 10, 2005, at 4:06 PM, Luke Shannon wrote: I think I found a pretty good way to do a negative match. In this query I am looking for all the Documents that have a kcfileupload field with any value except for jpg. Query negativeMatch = new WildcardQuery(new Term(kcfileupload, *jpg*)); BooleanQuery typeNegAll = new BooleanQuery(); Query allResults = new WildcardQuery(new Term(kcfileupload, *)); IndexSearcher searcher = new IndexSearcher(fsDir); BooleanClause clause = new BooleanClause(negativeMatch, false, true); typeNegAll.add(allResults, true, false); typeNegAll.add(clause); Hits hits = searcher.search(typeNegAll); With the little testing I have done this *seems* to work. Does anyone see a problem with this approach? Sure do you realize what WildcardQuery does under the covers? It literally expands to a BooleanQuery for all terms that match the pattern. There is an adjustable limit built-in of 1,024 clauses to BooleanQuery. You obviously have not hit that limit ... yet! You're better off using the advice offered on this thread previously create a single dummy field with a fixed value for all documents. Combine a TermQuery for that dummy value with a prohibited clause like y our negativeMatch above. Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Problem searching Field.Keyword field
The only caveat to your VerbatimAnalyzer is that it will still split strings that are over 255 characters. CharTokenizer does that. Granted, though, keyword fields probably don't make much sense to be that long. As mentioned yesterday - I added the LIA KeywordAnalyzer into the contrib area of Subversion. I had built one like yours also, but the one I contributed reads the entire input stream into a StringBuffer, ensuring it does not get split like CharTokenizer would. Erik On Feb 9, 2005, at 4:40 AM, Miles Barr wrote: On Tue, 2005-02-08 at 12:19 -0500, Steven Rowe wrote: Why is there no KeywordAnalyzer? That is, an analyzer which doesn't mess with its input in any way, but just returns it as-is? I realize that under most circumstances, it would probably be more code to use it than just constructing a TermQuery, but having it would regularize query handling, and simplify new users' experience. And for the purposes of the PerFieldAnalyzerWrapper, it could be helpful. It's fairly straightforward to write one. Here's the one I put together for PerFieldAnalyzerWrapper situations:

package org.apache.lucene.analysis;

import java.io.Reader;

public class VerbatimAnalyzer extends Analyzer {
    public VerbatimAnalyzer() {
        super();
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new VerbatimTokenizer(reader);
        return result;
    }

    /**
     * This tokenizer assumes that the entire input is just one token.
     */
    public static class VerbatimTokenizer extends CharTokenizer {
        public VerbatimTokenizer(Reader reader) {
            super(reader);
        }

        protected boolean isTokenChar(char c) {
            return true;
        }
    }
}

-- Miles Barr [EMAIL PROTECTED] Runtime Collective Ltd.
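The fix Erik describes for the 255-character split is to buffer the entire stream before emitting a token, instead of extending CharTokenizer. Just the buffering step can be shown in plain Java (hooking it into a Tokenizer subclass is left out, and the class name is mine):

```java
import java.io.IOException;
import java.io.Reader;

// Read the whole Reader into one String so the field value can be emitted
// as a single token, regardless of length. This is the buffering approach
// a KeywordAnalyzer-style tokenizer uses to avoid CharTokenizer's
// 255-character token limit.
public class ReadAll {
    public static String readFully(Reader reader) {
        StringBuffer sb = new StringBuffer();
        char[] buf = new char[256];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return sb.toString();
    }
}
```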
Re: sounds like spellcheck
On Feb 9, 2005, at 7:23 AM, Aad Nales wrote: In my Clipper days I could build an index on English words using a technique that was called soundex. Searching in that index resulted in hits of words that sounded the same. From what i remember this technique only worked for English. Has it ever been generalized? I do not know how Soundex/Metaphone/Double Metaphone work with non-English languages, but these algorithms are in Jakarta Commons Codec. I used the Metaphone algorithm as a custom analyzer example in Lucene in Action. You'll see it in the source code distribution under src/lia/analysis/codec. I did a couple of variations, one that adds the metaphoned version as a token in the same position and one that simply replaces it in the token stream. I even envisioned this sounds-like feature being used for children. I was mulling over this idea while having lunch with my son one day last spring (he was 5 at the time). I asked him how to spell cool cat and he replied c-o-l c-a-t. I tried it out with the metaphone algorithm and it matches! http://www.lucenebook.com/search?query=cool+cat Erik What i am trying to solve is this. A customer is looking for a solution to spelling mistakes made by children (upto 10) when typing in queries. The site is Dutch. Common mistakes are 'sgool' when searching for 'school'. The 'normal' spellcheckers and suggestors typically generate a list where the 'sounds like' candidates' are too far away from the result. So what I am thinking about doing is this: 1. create a parser that takes a word and creates a soundindex entry. 2. create list of 'correctly' spelled words either based on the index of the website or on some kind of dictionary. 2a. perhaps create a n-gram index based on these words 3. 
accept a query, figure out that a spelling mistake has been made 3a. find alternatives by parsing the query and searching the 'sounds like' index, and then calculate and order the results. Steps 2 and 3 have been discussed at length in this forum and have even made it to the sandbox. What I am left with is 1. My thinking is processing a series of replacement statements that go like: -- g sounds like ch if the immediate predecessor is an s. o sounds like oo if the immediate predecessor is a consonant -- But before I take this to the next step I am wondering if anybody has created or thought up alternative solutions? Cheers, Aad
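For readers unfamiliar with the soundex technique Aad mentions, here is a minimal classic (English) Soundex sketch: keep the first letter, map the remaining consonants to digit classes, drop vowels, collapse repeats, and pad to four characters. This simplified version ignores the H/W adjacency rule of full Soundex and assumes an alphabetic input word; Jakarta Commons Codec ships production Soundex and Metaphone implementations.

```java
import java.util.Locale;

// Simplified classic Soundex: similar-sounding English words map to the
// same 4-character code, e.g. Robert and Rupert both encode to R163.
public class SimpleSoundex {
    // Digit class for each letter a..z ('0' = vowel/ignored).
    private static final String CODES = "01230120022455012623010202";

    public static String encode(String word) {
        String w = word.toUpperCase(Locale.ENGLISH);
        StringBuilder out = new StringBuilder();
        out.append(w.charAt(0));                      // keep first letter
        char last = CODES.charAt(w.charAt(0) - 'A');
        for (int i = 1; i < w.length() && out.length() < 4; i++) {
            char c = w.charAt(i);
            if (c < 'A' || c > 'Z') continue;
            char code = CODES.charAt(c - 'A');
            if (code != '0' && code != last) out.append(code);
            last = code;                              // vowels separate duplicates
        }
        while (out.length() < 4) out.append('0');     // pad to 4 chars
        return out.toString();
    }
}
```

As the thread notes, these digit classes are tuned to English phonetics, which is exactly why a Dutch site would need its own replacement rules.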
Re: Configurable indexing of an RDBMS, has it been done before?
On Feb 9, 2005, at 4:51 AM, mark harwood wrote: A GUI plugin for Squirrel SQL ( http://squirrel-sql.sourceforge.net/ ) would make a great way of configuring the mapping. That would be slick! 1) Should we build this mapper into Luke instead? We would have to lift a LOT of the DB handling smarts from Squirrel. Luke however is doing a lot with Analyzer configuration which would certainly be useful code in any mapping tool (can we lift those and use in Squirrel?). The dilemma with Luke is that it's not ASL'd (because of the thinlet integration). Anyone up for a Swing conversion project? :) It would be quite cool if Lucene had a built-in UI tool (like, or actually, Luke). Luke itself is ASL'd and I believe Andrzej has said he'd gladly donate it to Lucene's codebase, but the Thinlet LGPL is an issue. 2) What should the XML for the batch-driven configuration look like? Is it ANT tasks or a custom framework? Don't concern yourselves with Ant at the moment. Anything that is easily callable from Java can be made into an Ant task. In fact, the minimum requirement for an Ant task is a public void execute() method. Whatever Java infrastructure you come up with, I'll gladly create the Ant task wrapper for it when it's ready. 3) If our mapping understands the make-up of the rdbms and the Lucene index, should we introduce a higher-level software layer for searching which sits over the rdbms and Lucene and abstracts them to some extent? This layer would know where to go to retrieve field values or construct filters, ie it understands whether to retrieve the title field for display from a database column or a Lucene stored field, and whether the price $100 search criteria is resolved by a Lucene query or an RDBMS query to produce a Lucene filter. It seems like currently every DB+Lucene integration project struggles with designing a solution to manage this divide and handcodes the solution. Wow... that is getting pretty clever. I like it!
I don't personally have a need for relational database indexing, but I support this effort to make a generalized mapping facility. Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Configurable indexing of an RDBMS, has it been done before?
I agree that it is a worthwhile contribution. Some suggestions... allow the configuration to specify field boost values and analyzer(s). If analyzers are specified per-field, then wrap them automatically with a PerFieldAnalyzerWrapper. Also, having a facility to aggregate fields into a contents-like field would be nice - though maybe this would be covered implicitly as part of the SQL mapping, with one of the columns being an aggregate column. Perhaps the configuration aspect of it (XML mapping of expressions to field details) could be generalized to work with an object graph as well as SQL result sets. OGNL (www.ognl.org) makes for good expression language glue, and I can see it being used for mappings - for example the name field could be mapped to company.president.name, where company is an object (or Map) with a president property, and so on. Erik On Feb 8, 2005, at 2:42 AM, Aad Nales wrote: If that is a general thought then I will plan for some time to put this in action. Cheers, Aad David Spencer wrote: Nice, very similar to what I was thinking of, where the most significant difference is probably just that I was thinking of a batch indexer, not one embedded in a web container. Probably a worthwhile contribution to the sandbox.
Re: Problem searching Field.Keyword field
The problem is that QueryParser analyzes all pieces of a query expression regardless of whether you indexed them as a Field.Keyword or not. If you need to use QueryParser and still support keyword fields, you'll want to plug in an analyzer specific to that field using PerFieldAnalyzerWrapper. You'll see this demonstrated in the Lucene in Action source code. Here's a quick pointer to where we cover it in the book: http://www.lucenebook.com/search?query=KeywordAnalyzer On Feb 8, 2005, at 9:26 AM, Mike Miller wrote: Thanks for the quick response. Sorry for my lack of understanding, but I am learning! Won't the query parser still handle this query? My limited understanding was that the search call provides the 'all' field as default field for query terms in the case where fields aren't specified. Using the current code, searches like author:Mike and title:Lucene work fine. -Original Message- From: Miles Barr [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 08, 2005 8:08 AM To: Lucene Users List Subject: Re: Problem searching Field.Keyword field You're using the query parser with the standard analyser. You should construct a term query manually instead. -- Miles Barr [EMAIL PROTECTED] Runtime Collective Ltd.
Re: Does anyone have a copy of the highligher code?
On Feb 8, 2005, at 9:50 AM, Jim Lynch wrote: Our firewall prevents me from using cvs to check out anything. Does anyone have a jar file or a set of class files publicly available? The Lucene in Action source code - http://www.lucenebook.com - contains JAR files, including the Highlighter, for lots of Lucene add-on goodies. Also, Lucene just converted to using Subversion, which is much more firewall friendly. Try this after you have installed the svn client: svn co http://svn.apache.org/repos/asf/lucene/java/trunk Erik
Re: Starts With x and Ends With x Queries
On Feb 8, 2005, at 10:37 AM, sergiu gordea wrote: Hi Erik, I'm not changing any functionality. WildcardQuery will still support leading wildcard characters, QueryParser will still disallow them. All I'm going to change is the javadoc that makes it sound like WildcardQuery does not support leading wildcard characters. Erik From what I was reading in the mailing list, there are more Lucene users that would like to be able to construct suffix queries. They are very useful for the German language, because it has many long compound words created by concatenation of other simple words. This is one of the requirements of our system. Therefore I needed to patch Lucene to make QueryParser allow suffix queries. Now I will need to update the Lucene library to the latest version, and I will need to patch it again. Do you think it will be possible in the future to have a field in QueryParser, boolean ALLOW_SUFFIX_QUERIES? I have no objections to that type of switch. Please submit a patch to QueryParser.jj that implements this as an option, with the default to disallow suffix queries, along with a test case, and I'd be happy to apply it. Erik
Re: Problem searching Field.Keyword field
Kelvin - I respectfully disagree - could you elaborate on why this is not an appropriate use of Field.Keyword? If the category is "How To", Field.Text would split this (depending on the Analyzer) into "how" and "to". If the user is selecting a category from a drop-down, though, you shouldn't be using QueryParser on it, but instead aggregating a TermQuery(new Term("category", "How To")) into a BooleanQuery with the rest of it. The rest may be other API-created clauses and likely a piece from QueryParser. Erik On Feb 8, 2005, at 11:28 AM, Kelvin Tan wrote: As I posted previously, Field.Keyword is appropriate in only certain situations. For your use-case, I believe Field.Text is more suitable. k On Tue, 8 Feb 2005 10:02:19 -0600, Mike Miller wrote: This may or may not be correct, but I am indexing it as a keyword because I provide a (required) radio button on the add screen for the user to determine which category the document should be assigned. Then in the search, provide a dropdown that can be used in the advanced search so that they can search only for a specific category of documents (like HowTo, Troubleshooting, etc). -Original Message- From: Kelvin Tan [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 08, 2005 9:32 AM To: Lucene Users List Subject: RE: Problem searching Field.Keyword field Mike, is there a reason why you're indexing category as keyword not text? k On Tue, 8 Feb 2005 08:26:13 -0600, Mike Miller wrote: Thanks for the quick response. Sorry for my lack of understanding, but I am learning! Won't the query parser still handle this query? My limited understanding was that the search call provides the 'all' field as default field for query terms in the case where fields aren't specified. Using the current code, searches like author:Mike and title:Lucene work fine.
-Original Message- From: Miles Barr [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 08, 2005 8:08 AM To: Lucene Users List Subject: Re: Problem searching Field.Keyword field You're using the query parser with the standard analyser. You should construct a term query manually instead. -- Miles Barr [EMAIL PROTECTED] Runtime Collective Ltd.
Re: Problem searching Field.Keyword field
On Feb 8, 2005, at 12:19 PM, Steven Rowe wrote: Why is there no KeywordAnalyzer? That is, an analyzer which doesn't mess with its input in any way, but just returns it as-is? I realize that under most circumstances, it would probably be more code to use it than just constructing a TermQuery, but having it would regularize query handling and simplify new users' experience. And for the purposes of the PerFieldAnalyzerWrapper, it could be helpful. It's long been on my TODO list. I just adapted (changed the package names) the Lucene in Action KeywordAnalyzer and added it to the new contrib area: http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/KeywordAnalyzer.java In the next official release of Lucene, the contrib (formerly known as the Sandbox) components will be packaged along with the Lucene core. I'm still working on this packaging build process as I migrate the Sandbox over to contrib. Erik
Re: Starts With x and Ends With x Queries
On Feb 7, 2005, at 2:07 AM, sergiu gordea wrote: Hi Erik, "In order to prevent extremely slow WildcardQueries, a Wildcard term must not start with one of the wildcards * or ?." I don't read that as saying you cannot use an initial wildcard character, but rather as: if you use a leading wildcard character, you risk performance issues. I'm going to change "must" to "should". Will this change be available in the next release of Lucene? How do you plan to implement this? Will this be available as an attribute of QueryParser? I'm not changing any functionality. WildcardQuery will still support leading wildcard characters, QueryParser will still disallow them. All I'm going to change is the javadoc that makes it sound like WildcardQuery does not support leading wildcard characters. Erik
Re: Similarity coord,lengthNorm
On Feb 7, 2005, at 8:53 AM, Michael Celona wrote: Would fixing the lengthNorm to 1 fix this problem? Yes, it would eliminate the length of a field as a factor. Your best bet is to set up a test harness where you can try out various tweaks to Similarity, but setting the length normalization factor to 1.0 may be all you need to do, as coord() takes care of the other factor you're after. Erik Michael -Original Message- From: Michael Celona [mailto:[EMAIL PROTECTED] Sent: Monday, February 07, 2005 8:48 AM To: Lucene Users List Subject: Similarity coord,lengthNorm I have varying-length text fields which I am searching on. I would like relevancy to be dictated predominantly by the number of terms in my query that match. Right now I am seeing a high relevancy for a single word matching in a small document, even though all the terms in my query don't match. Does anyone have an example of a custom Similarity subclass which overrides the coord and lengthNorm methods? Thanks. Michael
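To make the trade-off concrete, here is a small back-of-the-envelope sketch. It assumes the default factors are lengthNorm = 1/sqrt(numTerms) and coord = overlap/maxOverlap (worth double-checking against the Similarity javadocs for your Lucene version); the class and the sample numbers are purely illustrative:

```java
public class SimilaritySketch {
    // Default-style length normalization: shorter fields score higher
    static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Custom length normalization: field length no longer matters
    static float customLengthNorm(int numTerms) {
        return 1.0f;
    }

    // Coordination factor: fraction of query terms that matched
    static float coord(int overlap, int maxOverlap) {
        return (float) overlap / maxOverlap;
    }

    public static void main(String[] args) {
        // A 5-term field matching 1 of 3 query terms vs. a 500-term field matching 3 of 3
        float shortDoc = coord(1, 3) * defaultLengthNorm(5);
        float longDoc  = coord(3, 3) * defaultLengthNorm(500);
        System.out.println(shortDoc > longDoc); // true: short doc outranks despite fewer matches

        // With lengthNorm fixed at 1, coord() dominates and match count wins
        float shortDoc2 = coord(1, 3) * customLengthNorm(5);
        float longDoc2  = coord(3, 3) * customLengthNorm(500);
        System.out.println(shortDoc2 < longDoc2); // true
    }
}
```

This is only the two factors in isolation (real scoring multiplies in tf, idf, and boosts), but it shows why fixing lengthNorm at 1.0 makes the number of matching terms the dominant signal.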
Re: Query Analyzer
On Feb 7, 2005, at 11:29 AM, Ravi wrote: How do I set the analyzer when I build the query in my code instead of using a query parser? You don't. All terms you use for any Query subclasses you instantiate must match exactly the terms in the index. If you need an analyzer to do this then you're responsible for doing it yourself, just as QueryParser does underneath. I do this myself in my current application like this:

private Query createPhraseQuery(String fieldName, String string, boolean lowercase) {
    RossettiAnalyzer analyzer = new RossettiAnalyzer(lowercase);
    TokenStream stream = analyzer.tokenStream(fieldName, new StringReader(string));

    PhraseQuery pq = new PhraseQuery();
    Token token;
    try {
        while ((token = stream.next()) != null) {
            pq.add(new Term(fieldName, token.termText()));
        }
    } catch (IOException ignored) {
        // ignore - shouldn't get an IOException on a StringReader
    }

    if (pq.getTerms().length == 1) {
        // optimize single term phrase to TermQuery
        return new TermQuery(pq.getTerms()[0]);
    }

    return pq;
}

Hope that helps. Erik
Fwd: SearchBean?
I want to double-check with the user community now that I've run this past the lucene-dev list. Anyone using SearchBean from the Sandbox? If so, please speak up and let me know what it offers that the sort feature does not. If this is now essentially deprecated, I'd like to remove it. Thanks, Erik Begin forwarded message: From: Erik Hatcher [EMAIL PROTECTED] Date: February 6, 2005 10:02:37 AM EST To: Lucene List lucene-dev@jakarta.apache.org Subject: SearchBean? Reply-To: Lucene Developers List lucene-dev@jakarta.apache.org Is the SearchBean code in the Sandbox still useful now that we have sorting in Lucene 1.4? If so, what does it offer that the core does not provide now? As I'm cleaning up the sandbox and migrating it to a contrib area, I'm evaluating the pieces and making sure it makes sense to keep or if it is no longer useful or should be reorganized in some way. Erik
Re: Starts With x and Ends With x Queries
On Feb 4, 2005, at 9:37 PM, Chris Hostetter wrote: If you want to start doing suffix queries (ie: all names ending with s, or all names ending with Smith), one approach would be to use WildcardQuery, which as Erik mentioned, will allow you to use a query Term that starts with a *. ie... Query q3 = new WildcardQuery(new Term("name", "*s")); Query q4 = new WildcardQuery(new Term("name", "*Smith")); (NOTE: Erik says you can do this, but the docs for WildcardQuery say you can't. I'll assume the docs are wrong and Erik is correct.) I assume you mean this comment in WildcardQuery's javadocs: "In order to prevent extremely slow WildcardQueries, a Wildcard term must not start with one of the wildcards * or ?." I don't read that as saying you cannot use an initial wildcard character, but rather as: if you use a leading wildcard character, you risk performance issues. I'm going to change "must" to "should". And yes, WildcardQuery itself supports a leading wildcard character exactly as you have shown. Which leads me to my point: if you denormalize your data so that you store both the Term you want and the *reverse* of the term you want, then a suffix query is just a prefix query on a reversed field -- by sacrificing space, you can get all the speed efficiencies of a PrefixQuery when doing a SuffixQuery... D1 name:Adam Smith rname:htimS madA age:13 state:CA ... D2 name:Joe Bob rname:boB oeJ age:42 state:WA ... D3 name:John Adams rname:smadA nhoJ age:35 state:NV ... D4 name:Sue Smith rname:htimS euS age:33 state:CA ... Query q1 = new PrefixQuery(new Term("name", "J")); Query q2 = new PrefixQuery(new Term("name", "Sue")); Query q3 = new PrefixQuery(new Term("rname", "s")); Query q4 = new PrefixQuery(new Term("rname", "htimS")); (If anyone sees a flaw in my theory, please chime in.) This trick has been mentioned on this list before, and is a good one.
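The reversed-field trick is easy to sanity-check outside Lucene: a suffix match on name is exactly a prefix match on the reversed string. A minimal sketch (plain Java, no Lucene; the class and method names are illustrative):

```java
import java.util.*;

public class SuffixAsPrefixDemo {
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // A suffix query on the name field is a prefix query on the reversed field:
    // name ends with "Smith"  <=>  rname starts with "htimS"
    static boolean endsWithViaReversedField(String name, String suffix) {
        return reverse(name).startsWith(reverse(suffix));
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Adam Smith", "Joe Bob", "John Adams", "Sue Smith");
        List<String> hits = new ArrayList<>();
        for (String name : names) {
            if (endsWithViaReversedField(name, "Smith")) {
                hits.add(name);
            }
        }
        System.out.println(hits); // [Adam Smith, Sue Smith]
    }
}
```

In Lucene terms, the loop body is what a PrefixQuery on the rname field does efficiently over the sorted term dictionary, instead of the full term scan a leading-wildcard WildcardQuery requires.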
I'll go one step further and mention another technique I found in the book Managing Gigabytes, making wildcard/substring queries drastically more efficient to search (though also impacting index size). Take the term "cat". It would be indexed with all rotated variations, with an end-of-word marker added: cat$ at$c t$ca $cat The query for *at* would be preprocessed and rotated such that the wildcards collapse at the end, searching for at* as a PrefixQuery. A wildcard in the middle of a string, like c*t, would become a prefix query for t$c*. Has anyone tried this technique with Lucene? Erik
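The rotated-term (permuterm) technique is also easy to prototype outside Lucene. A minimal sketch, with illustrative names, that generates the rotations and rewrites a single-wildcard query X*Y into the prefix Y$X:

```java
import java.util.*;

public class PermutermDemo {
    static final char END = '$';

    // All rotations of term + "$": cat -> [cat$, at$c, t$ca, $cat]
    static List<String> rotations(String term) {
        String t = term + END;
        List<String> out = new ArrayList<>();
        for (int i = 0; i < t.length(); i++) {
            out.add(t.substring(i) + t.substring(0, i));
        }
        return out;
    }

    // Rewrite a single-wildcard query X*Y into the prefix Y$X over the rotations
    // (a contains-query *X* simply collapses to the prefix X)
    static String rewrite(String query) {
        int star = query.indexOf('*');
        String before = query.substring(0, star);
        String after = query.substring(star + 1);
        return after + END + before;
    }

    public static void main(String[] args) {
        List<String> rots = rotations("cat");
        System.out.println(rots); // [cat$, at$c, t$ca, $cat]

        // c*t -> prefix "t$c", which matches the rotation "t$ca"
        String prefix = rewrite("c*t");
        System.out.println(rots.stream().anyMatch(r -> r.startsWith(prefix))); // true

        // *at* -> prefix "at", which matches the rotation "at$c"
        System.out.println(rots.stream().anyMatch(r -> r.startsWith("at"))); // true
    }
}
```

Every rotated variant would be a separate indexed term, so each wildcard query becomes one PrefixQuery over a sorted term dictionary - the space-for-speed trade Erik describes.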
Re: PHP-Lucene Integration
Eventually you can just do PHP within the servlet container http://www.jcp.org/en/jsr/detail?id=223 and have your cake and eat it too! :) Erik On Feb 6, 2005, at 12:10 PM, Owen Densmore wrote: I'm building a lucene project for a client who uses php for their dynamic web pages. It would be possible to add servlets to their environment easily enough (they use apache) but I'd like to have minimal impact on their IT group. There appears to be a php java extension that lets php call back and forth to java classes, but I thought I'd ask here if anyone has had success using lucene from php. Note: I looked in the Lucene in Action search page, and yup, I bought the book and love it! No examples there, though. The list archives mention that using java lucene from php is the way to go, without saying how. There's mention of a lucene server and a php interface to that. And some similar comments. But I'm a bit surprised there's not a bit more in terms of use of the official java extension to php. Thanks for the great package! Owen
Re: Some questions about index...
On Feb 5, 2005, at 10:04 AM, Karl Koch wrote: 1) Can I store all the information of the text file, but also apply an analyzer? E.g. I use the StopAnalyzer. After finding the document, I want to extract the original text from the index as well. Does this require that I store the information twice in two different fields (one indexed and one unindexed)? You should use a single stored, tokenized, and indexed field for this purpose. Be cautious of how you construct the Field object to achieve this. 2) I would like to extract information from the index which can be found in a boolean way. I know that Lucene is a VSM which provides Boolean operators. This however does not change its functioning. For example, I have a field which contains an ID number and I want to use the search like a database operation (e.g. to find the document with id=1). I can solve the problem by searching with the query id:1. However, this does not ensure that I will only get one result. Usually the first result is the document I want. But it could happen that this sometimes does not work. Why wouldn't it work? For ID-type fields, use a Field.Keyword (stored, indexed, but not tokenized). Search for a specific ID using a TermQuery (don't use QueryParser for this, please). If the ID values are unique, you'll either get zero or one result. What happens if I should get no results? I guess if I search for id=5 and 5 did not exist, I would probably get 50, 51, ... just because they contain 5. Did somebody work with this and can suggest a stable solution? No, this would not be the case, unless you're analyzing the ID field with some strange character-by-character analyzer or doing a wildcard *5* type query. A good solution for these two questions would help me avoid a database which would need to replicate most of the data which I already have in my Lucene index...
You're on the right track, and avoiding a database when it is overkill or duplicative is commendable :) Erik
Re: Document numbers and ids
On Feb 4, 2005, at 9:49 AM, Simeon Koptelov wrote: The LiA says that I can use Sort.INDEXORDER when indexing order is relevant, and gives an example where documents' ids (obtained from Hits.id()) are increasing from top to bottom of the result set. Are those ids the same thing as document numbers? Yes, id is the same as document number. If they are the same, how can it be that they are preserved during the indexing process? LiA says that documents are renumbered when merging segments. By renumbered, it means it squeezes out holes left by deletes. The actual order does not change and thus does not affect a Sort.INDEXORDER sort. Documents are stored in the index in the order that they were indexed - nothing changes this order. Document ids are not permanent if deletes occur followed by an optimize. Erik
Re: Document numbers and ids
On Feb 4, 2005, at 12:24 PM, Simeon Koptelov wrote: By renumbered, it means it squeezes out holes left by deletes. The actual order does not change and thus does not affect a Sort.INDEXORDER sort. Documents are stored in the index in the order that they were indexed - nothing changes this order. Document ids are not permanent if deletes occur followed by an optimize. Thanks for the clarification, Erik. Could you answer one more question: can I control the assignment of document numbers during indexing? No, you cannot control Lucene's document id scheme - it is basically for internal use. Maybe I should explain why I'm asking. I'm searching for documents, but for most (almost all) of them I don't really care about their content. I only want to know a particular numeric field from the document (the id of the document's category). I also need to know how many docs in each category were found, so I can't index categories instead of docs. The result set can be pretty big (30K) and all must be handled in an inner loop. So I want to use a HitCollector and assign intervals of ids to categories of documents. Following this way, there's no need to actually retrieve the document in the inner loop. Am I on the right way? You should explore the use of IndexReader. Index your documents with a category id field, and use the methods on IndexReader to find all unique categories (TermEnum). Erik
Re: Starts With x and Ends With x Queries
It matches both because you're tokenizing the name field. In both documents, the name field has a "testing" term in it (it gets lowercased also). A PrefixQuery matches terms that start with the prefix. Use an untokenized field type (Field.Keyword) if you want to keep the entire original string as-is for searching purposes - however, you'd have issues with case-sensitivity in your example. Also keep in mind that QueryParser only allows a trailing asterisk, creating a PrefixQuery. However, if you use a WildcardQuery directly, you can use an asterisk as the starting character (at the risk of performance). Erik On Feb 4, 2005, at 7:50 PM, Luke Shannon wrote: Hello; I have these two documents: Textsort:9 Keywordmodified:0e1as4og8 Textprogress_ref:1099927045180 Textname:FutureBrand Testing Textdesc:Demo Textanouncement:We are testing our project Textcategory:Category 1 Textolfaithfull:stillhere Textposter:hello Texturgent:yes Textprovider:Mo Textsort:1 TextAuthor:cbalom TextCreator:PScript5.dll Version 5.2.2 Keywordmodified:0e1bgsfk0 Keywordmodified:0e1bgsfk0 TextProducer:Acrobat Distiller 5.0.5 (Windows) Textprogress_ref:1099957931806 Textname:testing stuff Textdesc:testing Textcategory:Category 1 Textolfaithfull:stillhere Textposter:hello TextTitle:Microsoft Word - FINAL-FutureBrand Creates, Launches 'Air Canada' Brand Ide. Textprovider:Ray Textkcfileupload:aircanada3.pdf I would like to be able to match name fields that start with testing (specifically) and those that end with it. I thought the below code would parse to a PrefixQuery that would satisfy my starting requirement (maybe I don't understand what this query is for). But this matches both. Query query = QueryParser.parse("testing*", "name", new StandardAnalyzer()); Has anyone done this before? Any tips? Thanks, Luke
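A quick way to see why both documents match is to simulate (very roughly) what the analyzer does to the name field. Whitespace-split plus lowercase is enough for this example, though StandardAnalyzer does considerably more; the class name is illustrative:

```java
import java.util.*;

public class PrefixMatchDemo {
    // Rough stand-in for analysis of the name field: split on whitespace, lowercase.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            terms.add(t.toLowerCase());
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> doc1 = analyze("FutureBrand Testing"); // [futurebrand, testing]
        List<String> doc2 = analyze("testing stuff");       // [testing, stuff]

        // A PrefixQuery for testing* matches any *term* starting with "testing" -
        // it does not care where in the field that term appeared.
        String prefix = "testing";
        System.out.println(doc1.stream().anyMatch(t -> t.startsWith(prefix))); // true
        System.out.println(doc2.stream().anyMatch(t -> t.startsWith(prefix))); // true
    }
}
```

Both field values contribute the lowercase term "testing" to the index, so the prefix query cannot distinguish "starts with testing" from "contains testing" unless the field is indexed untokenized.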
Re: Has anyone tried indexing xml files: DigesterXMLHandler.java file before?
You're missing the Commons Digester JAR, which is in the lib directory of the LIA download. Check the build.xml file for the build details of how the compile classpath is set. You'll likely need some other JARs at runtime too. Erik On Feb 3, 2005, at 2:12 AM, jac jac wrote: Hi, I just tried to compile DigesterXMLHandler.java from the LIA code which I have gotten from the src directory. I placed it into my own directory... I couldn't seem to be able to compile DigesterXMLHandler.java: It keeps prompting: DigesterXMLHandler.java:9: package org.apache.commons.digester does not exist import org.apache.commons.digester.Digester; ^ DigesterXMLHandler.java:19: cannot resolve symbol symbol : class Digester location: class lia.handlingtypes.xml.DigesterXMLHandler private Digester dig; ^ DigesterXMLHandler.java:25: cannot resolve symbol symbol : class Digester location: class lia.handlingtypes.xml.DigesterXMLHandler dig = new Digester(); I have set the classpath... May I know how we run the file in order to get my index folder? So sorry, I really can't work out the way to run it... is there any documentation around...? Thanks very much!
Re: Subversion conversion
We can work the 1.x and 2.0 lines of code however we need to. We can branch (a branch or tag in Subversion is inexpensive and a constant-time operation). How we want to manage both versions of Lucene is open for discussion. Nothing about Subversion changes how we manage this from how we'd do it with CVS. Currently the 1.x and 2.x lines of code are one and the same. Once they diverge in 2.0, it will depend on who steps up to maintain 1.x, but I suspect there will be a strong interest in keeping it alive by some, though we would of course encourage everyone using 1.x to upgrade to 1.9 and remove deprecation warnings. Erik On Feb 3, 2005, at 4:33 AM, Miles Barr wrote: On Wed, 2005-02-02 at 22:11 -0500, Erik Hatcher wrote: I've seen both of these types of procedures followed on Apache projects. It really just depends. Lucene's codebase is not being modified frequently, so it is not necessary to branch and merge back. Rather we simply develop off of the trunk (HEAD) and when we're ready for a release we'll just do it from the trunk. Actually we'd most likely tag and build from that tag just to be clean about it. What consequences does this have for the 1.9/2.0 releases? i.e. after 2.0 the deprecated API will be removed; does this mean 1.x will no longer be supported after 2.0? The typical scenario being a bug is found that affects 1.x and 2.x; it's patched in 2.x (i.e. the trunk) but we can't patch the last 1.x release. The other scenario being a bug is found in the 1.x code, but the fix cannot be applied. -- Miles Barr [EMAIL PROTECTED] Runtime Collective Ltd.
Re: Right way to make analyzer
On Feb 3, 2005, at 9:26 AM, Owen Densmore wrote: Is this the right way to make a Porter analyzer using the standard tokenizer? I'm not sure about the order of the filters. Owen

class MyAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new PorterStemFilter(
            new StopFilter(
                new LowerCaseFilter(
                    new StandardFilter(
                        new StandardTokenizer(reader))),
                StopAnalyzer.ENGLISH_STOP_WORDS));
    }
}

Yes, that is correct. Analysis starts with a tokenizer, and chains the output of that to the next filter and so on. I strongly recommend, as you start tinkering with custom analysis, using a little bit of code to see how your analyzer works on some text. The Lucene Intro article I wrote for java.net has some code you can borrow to do this, as does Lucene in Action's source code. Also, Luke has this capability - which is a tool I also highly recommend. Erik
Re: REPLACE USING ANALYZERS
On Feb 2, 2005, at 4:12 AM, Karthik N S wrote: Hi Guys. Apologies. I would like to know if there are any analyzers out there which can give me the required o/p as shown below. Sure: string.replaceAll("~", "") :) 1) I/p = +~shoes -~nike O/p = +shoes -nike 2) I/p = +(+~shoes -~nike) O/p = +(+shoes -nike) 3) I/p = +~shoes -~nike O/p = +shoes -nike [ Note: I am using the JavaScript tool available from the Lucene contributors' site to build advanced search with a synonym factor ] Thx in advance WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK]
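Erik's one-liner really is all it takes if the goal is just to strip the tildes before handing the string to QueryParser - no analyzer needed. A trivial sketch (the class name is illustrative):

```java
public class TildeStripDemo {
    // Remove every "~" marker; the rest of the query syntax is untouched.
    static String strip(String query) {
        return query.replaceAll("~", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("+~shoes -~nike"));    // +shoes -nike
        System.out.println(strip("+(+~shoes -~nike)")); // +(+shoes -nike)
    }
}
```

An analyzer is the wrong layer for this: the "~" prefix is query *syntax*, so rewriting the query string before parsing is simpler and keeps the analyzer focused on terms.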
Re: enquiries - pls help, thanks
On Feb 2, 2005, at 2:40 AM, jac jac wrote: May I know whether Lucene currently supports indexing of xml documents? That's a loaded question. Lucene supports it by being able to index text, sure. But Lucene does not include an XML parser and the facility to automatically turn an XML file into a Lucene document, nor would you want that. For example - in my current project, I'm parsing XML documents, and indexing pieces of them individually as Lucene Documents - in fact I'm doing that in all kinds of various ways too. The demo applications that you've tried are not designed for anything but a very very basic demonstration of how to use Lucene - these example applications were never intended to be used as-is for anything other than some code you could borrow and learn from to build your own custom solutions. If you want a quick jump on processing XML with Lucene, try out the code that comes with Lucene in Action (grab it from www.lucenebook.com). When you get the code, run this: $ ant ExtensionFileHandler Buildfile: build.xml ... ExtensionFileHandler: [echo] [echo] This example demonstrates the file extension document handler. [echo] Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are [echo] all handled by the framework. The contents of the Lucene Document [echo] built for the specified file is displayed. [echo] [input] Press return to continue... [input] File: [src/lia/handlingtypes/data/HTML.html] src/lia/handlingtypes/data/addressbook.xml [echo] Running lia.handlingtypes.framework.ExtensionFileHandler... [java] log4j:WARN No appenders could be found for logger (org.apache.commons.digester.Digester.sax). [java] log4j:WARN Please initialize the log4j system properly. [java] DocumentKeywordtype:business Keywordname:SAMOFIX d.o.o. 
Keywordaddress:Ilica 47-2 Keywordcity:Zagreb Keywordprovince: Keywordpostalcode:1 Keywordcountry:Croatia Keywordtelephone:+385 1 123 4567 BUILD SUCCESSFUL Total time: 18 seconds Note that I typed in the path to an XML file where it asks for [input]. Now dig into the source tree and borrow what you need from src/lia/handlingtypes Erik I tried building an index to index all my directories in webapps via: java org.apache.lucene.demo.IndexFiles /homedir/tomcat/webapps then I tried using the following command to search: java org.apache.lucene.demo.SearchFiles and I typed in my query. I was able to see the files which direct me to the path which holds my data. However, when I do java org.apache.lucene.demo.IndexHTML -create -index /homedir/index .. and I went to my website, I realised it can't search for the data I wanted. I want to search data within XML documents... May I know if the current demo version allows indexing of XML documents? Why is it that after I do java org.apache.lucene.demo.IndexHTML -create -index /homedir/index .. the data I wanted can't be searched? thanks a lot! jac
Re: which HTML parser is better?
On Feb 2, 2005, at 6:17 AM, Karl Koch wrote: Hello, I have been following this thread and have another question. Is there a piece of source code (which is preferably very short and simple (KISS)) which allows me to remove all HTML tags from HTML content? HTML 3.2 would be enough... also no frames, CSS, etc. I do not need to have the HTML structure tree or any other structure, but need a facility to clean up HTML into its normal underlying content before indexing that content as a whole. The code in the Lucene Sandbox for parsing HTML with JTidy (under contributions/ant) for the index task does what you ask. Erik
Re: Compile lucene
On Feb 2, 2005, at 2:26 PM, Helen Butler wrote: Hi, I'm trying to compile Lucene but am encountering the following error on typing ant from the root of lucene-1.4.3: C:\lucene-1.4.3> ant Buildfile: build.xml init: compile-core: BUILD FAILED C:\lucene-1.4.3\build.xml:140: srcdir C:\lucene-1.4.3\src\java does not exist! I've installed a JDK and Ant successfully and set the following CLASSPATH: C:\lucene-1.4.3\lucene-demos-1.4.3.jar;C:\lucene-1.4.3\lucene-1.4.3.jar First rule of using Ant: don't use a CLASSPATH. It is unnecessary, not to mention you put JAR files in there that you appear to be trying to build. Do you have the source code distribution of Lucene? It appears not, or you'd have src/java available. Erik
Subversion conversion
The conversion to Subversion is complete. The new repository is available to users read-only at: http://svn.apache.org/repos/asf/lucene/java/trunk Besides /trunk, there is also /branches and /tags. /tags contains all the CVS tags made so that you could grab a snapshot of a previous version. /trunk is analogous to CVS HEAD. You can learn more about the Apache repository configuration here and how to use the command-line client to check out the repository: http://www.apache.org/dev/version-control.html Learn about Subversion, including the complete O'Reilly Subversion book in electronic form for free here: http://subversion.tigris.org For committers, check out the repository using https and your Apache username/password. The Lucene sandbox has been integrated into our single Subversion repository, under /java/trunk/sandbox: http://svn.apache.org/repos/asf/lucene/java/trunk/sandbox/ The Lucene CVS repositories have been locked for read-only. If there are any issues with this conversion, let me know and I'll bring them to the Apache infrastructure group. Erik
Re: Subversion conversion
I've seen both of these types of procedures followed on Apache projects. It really just depends. Lucene's codebase is not being modified frequently, so it is not necessary to branch and merge back. Rather we simply develop off of the trunk (HEAD) and when we're ready for a release we'll just do it from the trunk. Actually we'd most likely tag and build from that tag just to be clean about it. Erik On Feb 2, 2005, at 7:49 PM, Chakra Yadavalli wrote: Hello ALL, It might not be the right place for it, but as we are talking about SCM, I have a quick question. First, I haven't used CVS/SVN on any project. I am a ClearCase/PVCS guy. I just would like to know WHICH CONFIGURATION MANAGEMENT PLAN DO YOU FOLLOW IN LUCENE DEVELOPMENT. PLAN A: DEVELOP IN TRUNK AND BRANCH OFF ON RELEASE Recently I had a discussion with a friend about developing in the TRUNK (which is the /main in ClearCase speak), which my friend claims is done in the APACHE/Open Source projects. The main advantage he pointed out was that merging could be avoided if you are developing in the TRUNK. And when there is a release, they create a new branch (say LUCENE_1.5 branch) and label them. That branch will be used for maintenance and any code deltas will be merged back to TRUNK as needed. PLAN B: BRANCH OFF BEFORE PLANNED RELEASE AND MERGE BACK TO MAIN/TRUNK As I am from a private workspace/isolated development school of thought promoted by ClearCase, I am used to creating a branch at the project/release initiation and developing in that branch (say /main/dev). Similarly, we have /main/int for making changes when the project goes to the integration phase, and a /main/acp branch for acceptance. In this school, the /main will always have fewer versions of files and the difference between any two consecutive versions is the NET CHANGE of that SCM element (either file or dir) between two releases (say LUCENE 1.4 and 1.5). Thanks in advance for your time.
Chakra Yadavalli http://jroller.com/page/cyblogue -Original Message- From: aurora [mailto:[EMAIL PROTECTED] Sent: Wednesday, February 02, 2005 4:25 PM To: lucene-user@jakarta.apache.org Subject: Re: Subversion conversion Subversion rocks! I have just setup the Windows svn client TortoiseSVN with my favourite file manager Total Commander 6.5. The svn status and commands are readily integrated with the file manager. Offline diff and revert are two things I really like from svn. -- Visit my weblog: http://www.jroller.com/page/cyblogue
Re: Can I sort search results by score and docID at one time?
On Feb 1, 2005, at 4:21 AM, Jingkang Zhang wrote: Lucene supports sort by score or docID. Now I want to sort search results by score and docID, or by two fields at one time, like the SQL command "order by score, docID". How can I do it? Sorting by multiple fields (including score and document id) is supported. Here's an example: new Sort(new SortField[]{ new SortField("category"), SortField.FIELD_SCORE, new SortField("pubmonth", SortField.INT, true) })
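Here is a self-contained sketch of that multi-field sort against a throwaway RAMDirectory index, using the Lucene 1.4-era API. The field names ("category", "pubmonth", "id") and the data are made up for illustration; it assumes the Lucene 1.4 jar is on the classpath.

```java
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class MultiFieldSortDemo {
    // Returns hit ids sorted by category asc, then score, then pubmonth desc.
    static String sortedIds() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
        String[][] rows = { {"1", "b", "200502"}, {"2", "a", "200501"}, {"3", "a", "200503"} };
        for (int i = 0; i < rows.length; i++) {
            Document doc = new Document();
            doc.add(Field.Keyword("id", rows[i][0]));        // untokenized, stored
            doc.add(Field.Keyword("category", rows[i][1]));
            doc.add(Field.Keyword("pubmonth", rows[i][2]));
            doc.add(Field.UnStored("contents", "lucene"));   // identical, so equal scores
            writer.addDocument(doc);
        }
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        Sort sort = new Sort(new SortField[] {
            new SortField("category"),                       // String sort, ascending
            SortField.FIELD_SCORE,                           // then relevance
            new SortField("pubmonth", SortField.INT, true)   // then int, descending
        });
        Hits hits = searcher.search(new TermQuery(new Term("contents", "lucene")), sort);
        StringBuffer ids = new StringBuffer();
        for (int i = 0; i < hits.length(); i++) {
            ids.append(hits.doc(i).get("id"));
        }
        searcher.close();
        return ids.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sortedIds());  // both "a" docs first, newer pubmonth leading
    }
}
```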
Re: Duplicate Hits
On Feb 1, 2005, at 9:01 AM, Jerry Jalenak wrote: Is there a way to eliminate duplicate hits being returned from the index? Sure, don't put duplicate documents in the index :) Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Duplicate Hits
On Feb 1, 2005, at 9:49 AM, Jerry Jalenak wrote: Given Erik's response of 'don't put duplicate documents in the index', how can I accomplish this in the IndexWriter? As John said - you'll have to come up with some way of knowing whether you should index or not. For example, when dealing with filesystem files, the Ant index task (in the sandbox) checks last modified date and only indexes new files. Using a unique id on your data (primary key from a DB, URL from web pages, etc) is generally what people use for this. Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
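The delete-before-add pattern Erik describes can be sketched like this with the Lucene 1.4 API. The "uid" field name is an assumption; the key point is that the id is indexed as an untokenized Field.Keyword so it can be deleted by exact Term.

```java
import java.io.IOException;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class AddOrReplace {
    // Delete any document carrying this uid, then add the new version.
    static void addOrReplace(Directory dir, String uid, String text) throws IOException {
        IndexReader reader = IndexReader.open(dir);
        reader.delete(new Term("uid", uid));   // no-op if the uid isn't there yet
        reader.close();
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), false);
        Document doc = new Document();
        doc.add(Field.Keyword("uid", uid));    // stored, indexed, NOT analyzed
        doc.add(Field.UnStored("contents", text));
        writer.addDocument(doc);
        writer.close();
    }

    static int countForUid(Directory dir, String uid) throws IOException {
        IndexSearcher searcher = new IndexSearcher(dir);
        Hits hits = searcher.search(new TermQuery(new Term("uid", uid)));
        int n = hits.length();
        searcher.close();
        return n;
    }

    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        new IndexWriter(dir, new WhitespaceAnalyzer(), true).close();  // create empty index
        addOrReplace(dir, "doc-1", "first version");
        addOrReplace(dir, "doc-1", "second version");   // replaces, doesn't duplicate
        System.out.println(countForUid(dir, "doc-1"));
    }
}
```

Note the reader is closed before the writer opens; as Erik says later in this thread, a reader and writer can coexist, but deletes and adds cannot be interleaved on the same index without this hand-off.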
Re: User Rights Management in Lucene
On Feb 1, 2005, at 10:01 AM, Verma Atul (extern) wrote: Hi, I'm new to Lucene and want to know whether Lucene has the capability of displaying search results based on the user's rights. For example: suppose there are some resources, like: Resource 1, Resource 2, Resource 3, Resource 4. And there are say 2 users, with User 1 having access to Resource 1, Resource 2 and Resource 4; and User 2 having access to Resource 1 and Resource 3. So when User 1 searches the database, he should get results from Resources 1, 2 and 4, but when User 2 searches the database, he should get results from Resources 1 and 3. Lucene in Action has a SecurityFilterTest example (grab the source code distribution). You can see a glimpse of this here: http://www.lucenebook.com/search?query=security So yes, it's possible to index a username or roles alongside each document and apply that criteria to any search a user makes, such that a user only gets documents allowed. How complex this gets depends on how you need the permissions to work - the LIA example is rudimentary and simply associates an owner with each document, and users are only allowed to see the documents they own. Erik
Re: Duplicate Hits
On Feb 1, 2005, at 10:51 AM, Jerry Jalenak wrote: OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I really don't want to 'batch' them up if I can avoid it. And I also don't think I can keep an IndexReader open to the index at the same time I have an IndexWriter open. I may have to try and deal with this issue through some sort of filter on the query side, provided it doesn't impact performance too much. You can use an IndexReader and IndexWriter at the same time (the caveat is that you cannot delete with the IndexReader at the same time you're writing with an IndexWriter). Is there no other identifying information, though, on the incoming documents - a date stamp? Identifier? Or something unique you can go on? Erik
Re: Query Format
How are you indexing your documents? If you're using QueryParser with the default operator set to OR (which is the default), then you've already provided the expression you need :) Erik On Feb 1, 2005, at 6:29 PM, Hetan Shah wrote: Hello All, What should my query look like if I want to search all or any of the following keywords: Sun Linux Red Hat Advance Server. Replies are much appreciated. -H
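A short sketch of what that looks like with QueryParser's default OR operator (Lucene 1.4 API; the "contents" field name is an assumption, and the quotes keep the multi-word product names together as phrases):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class OrQueryDemo {
    // With the default OR operator every term is optional: a document
    // matching any one of them is a hit; matching more ranks higher.
    static String parsed() throws Exception {
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        Query q = parser.parse("Sun Linux \"Red Hat\" \"Advance Server\"");
        return q.toString("contents");  // show the analyzed, parsed form
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parsed());
    }
}
```

StandardAnalyzer lowercases the terms, so the parsed query comes out as `sun linux "red hat" "advance server"` - four optional clauses, two of them phrases.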
Re: Results
On Feb 1, 2005, at 7:36 PM, Hetan Shah wrote: Another question for the day: How to make sure that the results shown are the only ones containing the keywords specified? E.g. the results for the query Red AND HAT AND Linux should be documents which have all three keywords, and not documents that have only one or two of them. Huh? You would never get documents returned that only had two of those terms given that AND'd query. Erik
Re: Search results excerpt similar to Google
On Jan 28, 2005, at 1:46 AM, Jason Polites wrote: I think they do a proximity result based on keyword matches. So... if you search for lucene and the document returned has this word at the very start and the very end of the document, then you will see the two sentences (sequences of words) surrounding the two keyword matches, one from the start of the document and one from the end. There is a Highlighter package in the Lucene sandbox. Highlighting looks like this: http://www.lucenebook.com/search?query=highlighter How you determine which words from the result you include in the summary is up to you. The problem with this is that in Lucene-land you have to store the content of the document inside the index verbatim (so you can get arbitrary portions of it out). This means your index will be larger than it really needs to be. You do not have to store the content in the index; it just happens to be convenient for most situations. Content could be stored anywhere. Getting the text and reanalyzing it for Highlighter is all that is required. Storing in the index has some performance benefits in the CVS version of Lucene, as you can store term position offset information and avoid having to re-analyze for highlighting. Erik I usually just store the first 255 characters in the index and use this as a summary. It's not as good as Google, but it seems to work ok. - Original Message - From: Ben [EMAIL PROTECTED] To: Lucene lucene-user@jakarta.apache.org Sent: Friday, January 28, 2005 5:08 PM Subject: Search results excerpt similar to Google Hi Is it hard to implement a function that displays the search results excerpts similar to Google? Is it just string manipulation or is there some logic behind it? I like their excerpts.
Thanks
Re: lucene query (sql kind)
Ross - I'm really perplexed by your message. You create HTML from a database so that you can index it with Lucene, yet wish you could simply index the data in your database tied to a primary key directly, right? Well, you're in luck - you already can do this! What are you using for indexing? It sounds like you borrowed the Lucene demo and have just run with that directly. Erik On Jan 28, 2005, at 11:02 AM, Ross Rankin wrote: I agree. My site is all dynamic pages created from the database. Right now, I have to have a process create dummy pages, index them with Lucene, then translate the Lucene results into meaningful links. It actually works better than it sounds, however it could be easier. If I could just give Lucene a query result (i.e. a list of rows) and then have Lucene send me back say the primary key of the rows that match and the other Lucene goodness: ranking, number of hits, etc. Could be pretty powerful and simplify the deployment for database driven applications. [Note: this opinion and $3.00 will get you a coffee at Starbucks] Ross -Original Message- From: PA [mailto:[EMAIL PROTECTED] Sent: Friday, January 28, 2005 6:44 AM To: Lucene Users List Subject: Re: lucene query (sql kind) On Jan 28, 2005, at 12:40, sunil goyal wrote: I want to run dynamic queries against the lucene index. Is there any native syntax available for Lucene so that I can query by first generating the query in, say, an XML or SQL-like format (cache this query) and then use this query over the lucene index. Talking of which, did anyone contemplate the possibility of a *gasp* JDBC *gasp* adaptor of sorts for Lucene? Cheers -- PA, Onnay Equitursay http://alt.textdrive.com/
Re: LuceneReader.delete (term t) Failure ?
How did you index the uid field? Field.Keyword? If not, that may be the problem, in that the field was analyzed. For a key field like this, it needs to be unanalyzed/untokenized. Erik On Jan 27, 2005, at 6:21 PM, [EMAIL PROTECTED] wrote: Hi, I am trying to delete a document from a Lucene index using: Term aTerm = new Term( "uid", path ); aReader.delete( aTerm ); aReader.close(); If the variable path=xxx/foo.txt then I am able to delete the document. However, if the path variable has - in the string, the delete method does not work, e.g. path=xxx-yyy/foo.txt // Does Not work!! Can I get around this problem? I cannot substitute the minus character with '.' as it has other implications. Is this a bug? I am using the Lucene 1.4-final version. Thanks for the help Atul
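Erik's diagnosis can be demonstrated in a few lines (a sketch against the Lucene 1.4 API): an analyzed uid is split into several terms, so a delete by the full path matches nothing, while a Field.Keyword uid is kept as one term and the delete succeeds.

```java
import java.io.IOException;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.RAMDirectory;

public class KeywordDeleteDemo {
    // Index one doc whose uid is "xxx-yyy/foo.txt", then try to delete it by term.
    static int deleted(boolean asKeyword) throws IOException {
        String path = "xxx-yyy/foo.txt";
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
        Document doc = new Document();
        // Field.Keyword keeps the value as a single untokenized term;
        // Field.Text runs it through the analyzer, splitting on '-', '/', '.'.
        doc.add(asKeyword ? Field.Keyword("uid", path) : Field.Text("uid", path));
        writer.addDocument(doc);
        writer.close();

        IndexReader reader = IndexReader.open(dir);
        int n = reader.delete(new Term("uid", path));  // returns # of docs deleted
        reader.close();
        return n;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(deleted(true));   // 1 - the whole path is one term
        System.out.println(deleted(false));  // 0 - no single term equals the full path
    }
}
```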
Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there
I've often said that there is a business to be had in packaging up Lucene (and now Nutch) into a cute little box with user-friendly management software to search your intranet. SearchBlox is already there (except they don't include the box). I really hope that an application like SearchBlox/Zilverline can be created as part of the Lucene project itself, replacing the sad demos that currently ship with Lucene. I've got so many things on my plate that I don't foresee myself getting to this as soon as I'd like, but I would most definitely support and contribute what time I could to such an effort. If the web UI used Tapestry, I'd be very inclined to dig in hardcore; any other web UI technology would likely turn me off. One of these days I'll Tapestry-ify Nutch just for grins and submit it as a replacement for the JSPs. And I'm even more sold on it if Mac Minis are involved! :) Erik On Jan 27, 2005, at 7:16 PM, David Spencer wrote: This reminds me, has anyone ever discussed something similar: - rackmount server (or, for coolness factor, that Mac mini) - web i/f for config/control - of course the server would have the following s/w: -- web server -- lucene / nutch Part of the work here I think is having a decent web i/f to configure the thing and to customize the L&F of the search results. jian chen wrote: Hi, I was searching using google and just found that there was a new feature called google mini. Initially I thought it was another free service for small companies. Then I realized that it costs quite some money ($4,995) for the hardware and software. (I guess the proprietary software costs a whole lot more than the actual hardware.) The nice feature is that you can only index up to 50,000 documents at this price. If you need to index more, sorry, send in the check...
It seems to me that any small biz will be ripped off if they install this google mini thing, compared to using Lucene to implement an easy-to-use search application, which could search up to whatever number of documents you could imagine. I hope the Lucene project gets exposed more to the enterprise, so that people know that they have not only cheaper but, more importantly, BETTER alternatives. Jian
Re: LuceneReader.delete (term t) Failure ?
Could you work up a self-contained RAMDirectory-using example that demonstrates this issue? Erik On Jan 27, 2005, at 9:10 PM, [EMAIL PROTECTED] wrote: Erik, I am using the keyword field doc.add(Field.Keyword("uid", pathRelToArea)); anything else I can check on? thanks atul PS we worked together for the Darden project
Re: Search on heterogenous index
On Jan 26, 2005, at 5:44 AM, Simeon Koptelov wrote: Heterogenous Documents/indices are OK - check out the second hit: http://www.lucenebook.com/search?query=heterogenous+different Thanks, I'll consider buying Lucene in Action. Our master plan is working! :) Just kidding. I have on my TODO list to aggregate more Lucene-related content (like the javadocs, Lucene's own documentation, perhaps a crawl of the wiki and the Lucene resources) into our search engine so that it becomes a richer resource and seems less like a marketing ploy. Though the highlighted snippets do have enough information to be useful in some cases, which is nice. I will start dedicating a few minutes a day to blog some useful content. By all means, if you have other suggestions for our site, let us know at [EMAIL PROTECTED] Erik
Re: Suggestions for documentation or LIA
On Jan 26, 2005, at 10:25 AM, Ian Soboroff wrote: Erik Hatcher [EMAIL PROTECTED] writes: By all means, if you have other suggestions for our site, let us know at [EMAIL PROTECTED] One of the things I would like to see, but which isn't in the Lucene site, the documentation, or Lucene in Action, is a complete description of how the retrieval algorithm works. That is, how the HitCollector, Scorers, Similarity, etc. all fit together. I'm involved in a project which to some degree is looking at poking deeply into this part of the Lucene code. We have a nice (non-Lucene) framework for working with more different kinds of similarity functions (beyond tf-idf) which should also be expandable to include query expansion, relevance feedback, and the like. I used to think that integrating it would be as simple as hacking in Similarity, but I'm beginning to think it might need broader changes. I could obviously hook in our whole retrieval setup by just diving for an IndexReader and doing it all by hand, but then I would have to redo the incremental search and possibly the rich query structure, which would be a loss. So anyway, I got LIA hoping for a good explanation (not a good Explanation) of this bit, but it wasn't there. Hacking Similarity wasn't covered in LIA for one simple reason - Lucene's built-in scoring mechanism really is good enough for almost all projects. The book was written for developers of those projects. Personally, I've not had to hack Similarity, though I've toyed with it in prototypes and am using a minor tweak (turning off length normalization for the title field) for the lucenebook.com book indexing. There are some hints on the Lucene site, but nothing complete. If I muddle it out before anything gets contributed, I'll try to write something up, but don't expect anything too soon... And maybe you'd contribute what you write to LIA 2nd edition :) Erik
Re: text highlighting
Also, there are some examples in the Lucene in Action source code (grab it from http://www.lucenebook.com) (see HighlightIt.java). Erik On Jan 26, 2005, at 5:52 PM, markharw00d wrote: Michael Celona wrote: Does any have a working example of the highlighter class found in the sandbox? There are several in the accompanying Junit test: http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/ contributions/highlighter/src/test/org/apache/lucene/search/highlight/ Cheers Mark - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: reading fields selectively
I'm not sure what the status of the selective field access feature is, but from what you've written it sounds like you are not aware of the existing ability to turn off storage for a field. For an "all" field, you probably do not want to access its actual value, so simply do not store it in the index. Field.UnStored or Field.Text with a java.io.Reader will do the trick. If you use unstored fields for text, and only have a Field.Keyword id field, you will minimize the size of your index, and the Document objects obtained from Hits will be tiny. Erik On Jan 25, 2005, at 3:38 AM, sergiu gordea wrote: Hi to all lucene developers, The read-fields-selectively feature would be very useful for me. Do you plan to include it in the next Lucene releases? I can patch Lucene, but I will need to do it each time I upgrade my version, and probably I would need to run the unit tests, and this is just duplicated effort. I'm working on an application that uses Lucene only to index information that we store in the database and in external files. We perform the search with Lucene to get the IDs of our database records. The ID keyword field is the only one that we need to read from the index. Each document may index a few txt, pdf, doc, html, ppt, or xls files, and some other database fields, so the size of the Lucene documents may be quite big. Writing the ID as the first field in the index, and having the possibility to read only the ID from the index, would be a great performance improvement in our case (speed and memory usage). Another frequently met situation is to have an index with an ALL field, in order to perform the search easily, and a few other separate fields, needed to get information from the index and to apply special constraints (i.e. for extended search functionality). Also in this case, the information from the ALL field won't be read, but Lucene will load it into memory, and the memory usage will be at least twice as big.
Thanks for understanding, Sergiu mark harwood wrote: There is no API for this, but I recall somebody talking about adding support for this a few months back See http://marc.theaimsgroup.com/?l=lucene-devm=109485996612177w=2 This implementation was working on a version of Lucene before compression was introduced so things may have changed a little. Cheers, Mark ___ ALL-NEW Yahoo! Messenger - all new features - even more fun! http://uk.messenger.yahoo.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
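The setup Erik suggests - an unstored catch-all field plus a small stored keyword id - looks roughly like this (a sketch against the Lucene 1.4 API; field names are invented). The "all" field is searchable but contributes nothing to the Document you get back from Hits.

```java
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class UnstoredFieldDemo {
    static String search() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Keyword("id", "42"));   // stored - retrievable from Hits
        doc.add(Field.UnStored("all", "big blob of extracted pdf doc ppt text"));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        Hits hits = searcher.search(new TermQuery(new Term("all", "pdf")));
        Document hit = hits.doc(0);
        // "all" was indexed but never stored, so the retrieved Document
        // carries only the tiny id field; get("all") comes back null.
        String result = hit.get("id") + "/" + hit.get("all");
        searcher.close();
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(search());  // 42/null
    }
}
```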
Re: multiple filters
There is a ChainedFilter in the jakarta-lucene-sandbox CVS repository allowing you to AND/OR/XOR and more with multiple filters. I covered it in LIA: http://www.lucenebook.com/search?query=ChainedFilter And the source code you can download has some code that demonstrates it. Erik On Jan 25, 2005, at 6:57 PM, aaz wrote: Hello, Every document in my index has 2 date-related fields, created_date and modified_date, stored via DateField.dateToString(). Users want to be able to search via between-like queries such as: (where modified_date > X AND modified_date < Y AND created_date >= X AND created_date <= Y) Now I tried using RangeQuerys for this but quickly ran into the TooManyClauses exception issue. The next thing I am looking at is the use of DateFilters to pass in with the query at searcher.search(). However the interface only supports one filter. Is it possible to pass multiple filters, as would be needed for my example above? thanks
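If pulling the sandbox ChainedFilter into your build isn't convenient, ANDing the bit sets of several filters yourself is only a few lines. The AndFilter below is my own sketch, not the sandbox API; the demo combines two QueryFilters for determinism, but two DateFilters would slot in the same way for the date-range case above.

```java
import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

// Intersects the bit sets of several filters: a doc survives only if every filter allows it.
class AndFilter extends Filter {
    private final Filter[] filters;
    AndFilter(Filter[] filters) { this.filters = filters; }
    public BitSet bits(IndexReader reader) throws IOException {
        BitSet result = (BitSet) filters[0].bits(reader).clone();
        for (int i = 1; i < filters.length; i++) {
            result.and(filters[i].bits(reader));
        }
        return result;
    }
}

public class AndFilterDemo {
    static int count() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
        String[][] rows = { {"new", "small"}, {"new", "large"}, {"old", "small"} };
        for (int i = 0; i < rows.length; i++) {
            Document doc = new Document();
            doc.add(Field.Keyword("age", rows[i][0]));
            doc.add(Field.Keyword("size", rows[i][1]));
            doc.add(Field.UnStored("contents", "widget"));
            writer.addDocument(doc);
        }
        writer.close();

        Filter both = new AndFilter(new Filter[] {
            new QueryFilter(new TermQuery(new Term("age", "new"))),
            new QueryFilter(new TermQuery(new Term("size", "small")))
        });
        IndexSearcher searcher = new IndexSearcher(dir);
        Hits hits = searcher.search(new TermQuery(new Term("contents", "widget")), both);
        int n = hits.length();
        searcher.close();
        return n;  // only the {"new","small"} doc passes both filters
    }

    public static void main(String[] args) throws Exception {
        System.out.println(count());
    }
}
```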
Re: Stemming
On Jan 24, 2005, at 7:24 AM, Kevin L. Cobb wrote: Do stemming algorithms take into consideration abbreviations too? No, they don't. Adding abbreviations, aliases, synonyms, etc is not stemming. And, the next logical question, if stemming does not take care of abbreviations, are there any solutions that include abbreviations inside or outside of Lucene? Nothing built into Lucene does this, but the infrastructure allows it to be added in the form of a custom analysis step. There are two basic approaches, adding aliases at indexing time, or adding them at query time by expanding the query. I created some example analyzers in Lucene in Action (grab the source code from the site linked below) that demonstrate how this can be done using WordNet (and mock) synonym lookup. You could extrapolate this into looking up abbreviations and adding them into the token stream. http://www.lucenebook.com/search?query=synonyms Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Filtering w/ Multiple Terms
On Jan 24, 2005, at 12:26 PM, Jerry Jalenak wrote: I spent some time reading the Lucene in Action book this weekend (great job, btw) Thanks! public class AccountFilter extends Filter I see where the AccountFilter is setting the corresponding 'bits', but I end up without any 'hits': Entering AccountFilter... Entering AccountFilter... Entering AccountFilter... Setting bit on Setting bit on Setting bit on Setting bit on Setting bit on Leaving AccountFilter... Leaving AccountFilter... Leaving AccountFilter... ... Found 0 matching documents in 1000 ms Can anyone tell me what I've done wrong? A filter constrains which documents will be consulted during a search, but the Query needs to match some documents that are turned on by the filter bits. I'm guessing that your Query did not match any of the documents you turned on. Erik
Re: Filtering w/ Multiple Terms
As Paul suggested, output the Lucene document numbers from your Hits, and also output which bit you're setting in your filter. Do those sets overlap? Erik On Jan 24, 2005, at 2:13 PM, Jerry Jalenak wrote: Paul / Erik - I use the ParallelMultiSearcher to search three indexes concurrently - hence the three entries into AccountFilter. If I remove the filter from my query, and simply enter the query on the command line, I get two hits back. In other words, I can enter this: smith AND (account:0011) and get hits back. When I add the filter back in (which should take care of the account:0011 part of the query), and enter only smith as my query, I get 0 hits. Jerry Jalenak Senior Programmer / Analyst, Web Publishing LabOne, Inc. 10101 Renner Blvd. Lenexa, KS 66219 (913) 577-1496 [EMAIL PROTECTED]
Re: Sort Performance Problems across large dataset
On Jan 24, 2005, at 7:01 PM, Peter Hollas wrote: I am working on a publicly accessible Struts-based... Well there's the problem right there :)) (just kidding) To sort the result set into alphabetical order, we added the species names as a separate keyword field, and sorted using it whilst querying. This solution works fine, but is unacceptable since a query that returns thousands of results can take upwards of 30 seconds to sort them. 30 seconds... wow. My question is whether it is possible to somehow return the names in alphabetical order without using a String SortField. My last resort will be to perform a monthly index rebuild and return results by index order (about a day to re-index!). But ideally there might be a way to modify the Lucene API to incorporate a scoring system that scores by lexical order. What about assigning a numeric value field to each document, with the number indicating the alphabetical ordering? Off the top of my head, I'm not sure how this could be done, but perhaps some clever hashing algorithm could do it. Or consider each character position one digit in base 26 (or 27, to include a space) and construct a number from that? (Though that would be an enormous number and probably too large - sorry, my off-the-cuff estimating skills are not what they should be.) Certainly sorting by a numeric value is far less resource-intensive than by String, so perhaps that is worth a try. At the very least, give each document a random number and try sorting by that field (the value of the field can be Integer.toString()) to see how it compares performance-wise. Erik
Re: Limit and offset
Random-accessing a start point in Hits will probably be sufficient for what you want to do. I do this for all the web applications I've built with Lucene, and performance has been more than acceptable. Erik On Jan 23, 2005, at 9:37 AM, Kristian Hellquist wrote: Hi! I want to retrieve a selected range of the hits I get when I search the index, similar to a SQL clause: SELECT foo FROM bar OFFSET 10 LIMIT 10 How should I do this and get good performance? Or is it just so simple that I use the method Hits.doc(int)? Thanks! Kristian
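It really is that simple - Hits fetches documents lazily, so looping from the offset to offset+limit only pulls the page you display. A sketch (Lucene 1.4 API; the index contents here are invented):

```java
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class PagingDemo {
    // OFFSET/LIMIT over Hits: only the documents on the requested page are fetched.
    static String page(Hits hits, int offset, int limit) throws Exception {
        StringBuffer ids = new StringBuffer();
        int end = Math.min(offset + limit, hits.length());
        for (int i = offset; i < end; i++) {
            if (ids.length() > 0) ids.append(' ');
            ids.append(hits.doc(i).get("id"));
        }
        return ids.toString();
    }

    static String thirdAndFourth() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
        for (int i = 0; i < 5; i++) {
            Document doc = new Document();
            doc.add(Field.Keyword("id", String.valueOf(i)));
            doc.add(Field.UnStored("contents", "foo"));  // identical docs, equal scores
            writer.addDocument(doc);
        }
        writer.close();
        IndexSearcher searcher = new IndexSearcher(dir);
        Hits hits = searcher.search(new TermQuery(new Term("contents", "foo")));
        String result = page(hits, 2, 2);  // OFFSET 2 LIMIT 2
        searcher.close();
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(thirdAndFourth());
    }
}
```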
Re: Document 'Context' Relation to each other
On Jan 21, 2005, at 10:47 PM, Paul Smith wrote: As a log4j developer, I've been toying with the idea of what Lucene could do for me, maybe as an excuse to play around with Lucene. First off, let me thank you for your work with log4j! I've been using it at lucenebook.com with the SMTPAppender (once I learned that I needed a custom trigger to release e-mails when I wanted, not just on errors) and it's been working great. Now, I could provide a Field to the LoggingEvent Document that has a sequence #, and once a user has chosen an appropriate matching event, do another search for the documents with a sequence # between +/- the context size. My question is, is that going to be an efficient way to do this? The sequence # would be treated as text, wouldn't it? Would the range search on an int be the most efficient way to do this? I know from the Hits documentation that one can retrieve the Document ID of a matching entry. What is the contract on this Document ID? Is each Document added to the index given an increasing number? Can one search an index by Document ID? Could one search for Document IDs between a range? (Hope you can see where I'm going here.) You wouldn't even need the sequence number. You'll certainly be adding the documents to the index in the proper sequence already (right?). It is easy to random-access documents if you know Lucene's document ids. Here's the pseudo-code: - construct an IndexReader - open an IndexSearcher using the IndexReader - search, getting Hits back - for a hit whose context you want to see, get hits.id(hit#) - subtract the context size from the id, and grab documents using reader.document(id) You don't search for a document by id, but rather jump right to it with IndexReader. Many thanks for an excellent API, and kudos to Erik & Otis for a great eBook btw. Thanks! Erik
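Erik's pseudo-code, fleshed out as a runnable sketch (Lucene 1.4 API; the log messages and field names are made up). Events are added in log order, so Lucene's internal doc ids follow the sequence and the context window is just the neighboring ids:

```java
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class LogContextDemo {
    static String contextOfFirstHit(int contextSize) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
        // Events indexed in log order, so doc ids follow the sequence.
        String[] events = { "startup ok", "connect ok", "connect error", "retry ok", "shutdown ok" };
        for (int i = 0; i < events.length; i++) {
            Document doc = new Document();
            doc.add(Field.Keyword("msg", events[i]));    // stored for display
            doc.add(Field.UnStored("text", events[i]));  // analyzed for searching
            writer.addDocument(doc);
        }
        writer.close();

        IndexReader reader = IndexReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Hits hits = searcher.search(new TermQuery(new Term("text", "error")));
        int id = hits.id(0);                             // Lucene's internal doc id
        int from = Math.max(0, id - contextSize);
        int to = Math.min(reader.maxDoc() - 1, id + contextSize);
        StringBuffer context = new StringBuffer();
        for (int i = from; i <= to; i++) {               // neighbors by doc id
            if (context.length() > 0) context.append(" | ");
            context.append(reader.document(i).get("msg"));
        }
        searcher.close();
        reader.close();
        return context.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(contextOfFirstHit(1));
    }
}
```

Note the caveat implicit in Erik's answer: doc ids stay sequential only as long as no documents are deleted and the index isn't merged with others, so this works best for append-only logs.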
Re: Search Chinese in Unicode !!!
On Jan 21, 2005, at 4:49 AM, Eric Chow wrote: How do I create an index with Chinese (UTF-8 encoded) HTML and search it with Lucene?

Indexing and searching Chinese is basically no different than using English with Lucene. We covered a bit about it in Lucene in Action: http://www.lucenebook.com/search?query=chinese And a screenshot here: http://www.blogscene.org/erik/LuceneInAction/i18n.html

The main issues in dealing with Chinese, and of course other languages, are encoding concerns when reading in the text for both indexing and querying, and analysis (as you can see from the screenshot). Lucene itself works fine with Unicode, and you're free to index anything. Erik
Re: Filtering w/ Multiple Terms
On Jan 20, 2005, at 5:02 PM, Jerry Jalenak wrote: In looking at the examples for filtering of hits, it looks like I can only specify a single term, i.e. Filter f = new QueryFilter(new TermQuery(new Term("acct", "acct1"))); I need to specify more than one term in my filter. Short of using something like ChainFilter, how are others handling this?

You can make the Query for a QueryFilter as complex as you want. If you want to filter on multiple terms, construct a BooleanQuery with nested TermQuery clauses, combined in either an AND or OR fashion. Erik
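A sketch of the OR case using the Lucene 1.4-style BooleanQuery.add(query, required, prohibited) overload (the field and term values are invented for illustration):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

// Filter to documents whose acct field is acct1 OR acct2:
// each clause is optional (required=false) and not prohibited.
BooleanQuery acctQuery = new BooleanQuery();
acctQuery.add(new TermQuery(new Term("acct", "acct1")), false, false);
acctQuery.add(new TermQuery(new Term("acct", "acct2")), false, false);
Filter f = new QueryFilter(acctQuery);

// For AND semantics, mark each clause required instead:
//   acctQuery.add(new TermQuery(new Term("acct", "acct1")), true, false);
```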
Re: closing an IndexSearcher
On Jan 19, 2005, at 12:14 PM, Cocula Remi wrote: Hi, I noticed that after closing an IndexSearcher, queries on this Searcher will still run. My question is: why doesn't closing an IndexSearcher always close it?

IndexSearcher.close is:

public void close() throws IOException {
    if (closeReader)
        reader.close();
}

However, you opened it with a String:

searcher = new IndexSearcher("c:\\tmp\\index");

which should close the underlying IndexReader. Maybe this was a bug that has since been fixed in CVS (which is the code I'm referencing)? Erik
Re: LuceneRAR project announcement
On Jan 19, 2005, at 2:27 PM, Joseph Ottinger wrote: After babbling endlessly about an RDBMS directory and my lack of success with it, I've created a project on java.net to create a Lucene JCA component, to allow J2EE components to interact with a Lucene service. It's at https://lucenerar.dev.java.net/ currently.

Could you elaborate on some use cases? What drove you to consider JCA rather than some other technique? I'm curious why it is important to get all J2EE with it rather than working with Lucene much more naturally at a lower level of abstraction.

I briefly browsed the source tree from java.net and saw this comment in your Hits.java: "This method loads a LuceneRAR hits object with its equivalent from the Apache Lucene Hits object. It basically walks the Lucene Hits object, copying values as it goes, so it may not be as light or fast as its Apache equivalent." I'll say! For large result sets, which are more often the norm than the exception for a search, you are going to take a huge performance hit doing something like this, not to mention possibly even killing the process as you run out of RAM.

This brings me back to my first questions - abstractions around Lucene tend to leak heavily. While it sounds clever to wrap layers around Hits, the fact of the matter is that searches often return an enormous number of results and only the first few are generally needed. Lucene's Hits takes this into account and fetches on demand from the index.

Admittedly, for Java Development with Ant, I implemented a stateless session bean that walked the hits and packaged them up to send across the wire. This was naive and only worked because I never tried it with a large number of hits. These days I push back from J2EE in the larger "let's add acronyms because we can" sense and opt for much lighter weight, simpler solutions. JCA sounds like an unnecessary abstraction around Lucene - though I'm open to be convinced otherwise.
Erik
Re: [newbie] Confused about PrefixQuery
On Jan 19, 2005, at 3:16 PM, Jerry Jalenak wrote: The text files have two control lines at the beginning of them - CC and AN.

That's quite a complex example to ask a user list to decipher. Simplifying the example, besides making it easier for us to understand, would likely shed light on the problem.

Everything (I think) indexes correctly.

To be sure, try Luke out and see what got indexed exactly. You can also use Luke as an ad-hoc search tool rather than writing your own.

When I search against this index, though, I get some weird results, especially when using an '*' at the end of my criteria.

The results you got definitely are weird given the query, and in my initial glance through your code I did not see the issue pop out. Luke will likely shed much more light on the matter. Erik
Re: LuceneRAR project announcement
On Jan 19, 2005, at 3:30 PM, Joseph Ottinger wrote: On Wed, 19 Jan 2005, Erik Hatcher wrote: On Jan 19, 2005, at 2:27 PM, Joseph Ottinger wrote: After babbling endlessly about an RDBMS directory and my lack of success with it, I've created a project on java.net to create a Lucene JCA component, to allow J2EE components to interact with a Lucene service. It's at https://lucenerar.dev.java.net/ currently. Could you elaborate on some use cases?

Sure, and I'll pick the one that's been driving me along: I have a set of J2EE servers, all of which can generate new content for search, and all of which will be performing searches. They're on separate machines. Sharing directories isn't my idea of doing J2EE correctly.

"Doing J2EE correctly" is a funny phrase. If sharing directories works and gets the job done right, on time, under budget, can be adjusted later if needed, and has been reasonably well tested, then you've done it right. And since it's in Java and not on a cell phone, it's basically J2EE. Also, what about using Lucene over RMI using the built-in RemoteSearchable facility?

Therefore, I chose to represent Lucene as an enterprise service, one communicated to via a remote service instead, so that every module can communicate with Lucene without realising the communication layer... for the most part.

And this is where I think the abstraction leaks. The Nutch project has a very scalable enterprise approach to this type of remote service also.

Plus, I no longer violate my purist's sensibilities.

Ah, now we get to the real rationale! :) I'm not giving you, personally, a hard time, really ... but rather this purist approach, where purist means fitting into the acronyms under the J2EE umbrella. I've been there myself, read the specs, and cringed when I saw file system access from a session bean, and so on.

The Hits object could CERTAINLY use optimization - callbacks into the connector would probably be acceptable, for example.

Gotcha.
Yes, callbacks would be the right approach with this type of abstraction.

JCA sounds like an unnecessary abstraction around Lucene - though I'm open to be convinced otherwise.

I'm more than happy to talk about it. If I can fulfill my needs with no code, hey, that's great!

Would RemoteSearchable get you closer to no code?

I just haven't been able to successfully do so yet, and everyone to whom I've spoken who says that they HAVE managed... well, they've almost invariably done so by lowering the bar a great deal in order to accept what Lucene requires.

I'm definitely a skeptic when it comes to generic layers on top of Lucene, though there is definitely a yearning for easier management of the lower-level details. I'll definitely follow your work with LuceneRAR closely and will do what I can to help out in this forum. So take my feedback as constructive criticism, but keep up the good work! Erik
Re: StandardAnalyzer unit tests?
On Jan 17, 2005, at 4:51 AM, Chris Lamprecht wrote: I submitted a testcase -- http://issues.apache.org/bugzilla/show_bug.cgi?id=33134 I reviewed and applied your contributed unit test. Thanks! Erik
Re: IndexSearcher and number of occurence
On Jan 13, 2005, at 5:03 AM, Bertrand VENZAL wrote: Hi all, I'm quite new to this mailing list. I'm having difficulty finding the number of occurrences of a word in a document. I need to use IndexSearcher because of the query, but the score returned is not what I'm looking for. I found the class TermDocs in the mailing list, but it seems to work only with IndexReader. If anyone can give a hand with this one, I will appreciate it...

Perhaps this technique is what you're looking for: set the field(s) you're interested in capturing frequency on to be vectored. You'll see that flag as additional overloaded methods on Field. You'll still need to use an IndexReader, but that is no problem. Construct an IndexReader and use it to construct the IndexSearcher that you'll also use. Here are some snippets of code:

// During indexing, the subject field was added like this:
doc.add(Field.UnStored("subject", subject, true));
...
// now during searching...
IndexReader reader = IndexReader.open(directory);
...
// from your Hits, get the document id
int id = hits.id(i);
TermFreqVector vector = reader.getTermFreqVector(id, "subject");

Now read up on the TermFreqVector API to get at the frequency of a specific term. Erik
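Continuing the snippet above, reading a specific term's frequency out of the vector looks roughly like this. "lucene" is just an example term; getTerms() and getTermFrequencies() return parallel arrays.

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

// Look up how often a given word occurs in one document's subject field.
TermFreqVector vector = reader.getTermFreqVector(id, "subject");
if (vector != null) {                          // null if the field wasn't vectored
    String[] terms = vector.getTerms();        // term texts, sorted
    int[] freqs = vector.getTermFrequencies(); // parallel to terms
    for (int i = 0; i < terms.length; i++) {
        if (terms[i].equals("lucene")) {
            System.out.println("frequency: " + freqs[i]);
        }
    }
}
```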
Re: Réf. : Re: IndexSearcher and number of occurence
On Jan 13, 2005, at 10:17 AM, Bertrand VENZAL wrote: Hi, thanks for your quick answer. I understood what you meant by using the IndexSearcher to get the TermFreqVector. But you use an int as an id to find the term frequency, so I suppose that it is the position number in the IndexReader vector. My problem is: during the indexing phase I can store the id, but if a document is deleted and recreated later on (like in an update), this will change my vector and all the ids previously set will no longer be correct. Am I right on this point, or am I missing something?

Yes, the document id (the one Lucene uses) is not to be relied on long-term. But in the example you'd get it from Hits immediately after a search, and thus it would be accurate and usable. You do not need to store the id during indexing - Lucene maintains it and gives it to you from Hits. Erik
Re: Searching a document for a keyword
On Jan 12, 2005, at 4:13 AM, Swati Singhal wrote: I have a txt file which contains the paths to jpg files. These jpg files are organized into folders. My search is limited to searching only this txt file. So when I search based on a folder name, a match is found in the txt file, but I want it to return the entire line as a search result and not the document name (which is the txt file). How can I do that using Lucene? I have already built the index by giving the txt file as input. If this is not possible, please tell me a way to parse jpg files to form an index.

First let me re-phrase what I think you want. You want to be able to search on a folder name and retrieve back the JPG filenames that are in that folder. Correct? You're using the text file simply as a way to get text into Lucene? Does this text file have any other relevance here?

If you have a folder of JPG images and all you're after is their filenames, with the results granularity being a JPG image file name, write a simple file system crawler that recurses your directory tree and indexes a single document for each JPG, with a field for the filename. What type of field should the filename field be? That depends on how you want to search. You could make it a Field.Keyword(), which would require exact (TermQuery) or PrefixQuery queries to work. The Indexer example from Lucene in Action makes a great starting place for this crawler - you'd have to adapt it to recognize .jpg extensions and adjust it to index only the filename, not the contents (though the contents may contain text and be worth indexing also). Erik
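The crawler itself reduces to a recursive walk that keeps only .jpg names; each returned name would then become one Lucene document with a Field.Keyword filename field. A sketch (class and method names are invented for illustration):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Sketch of a file system crawler collecting JPG filenames for indexing.
public class JpgCrawler {
    // Case-insensitive extension check: the testable core of the crawler.
    public static boolean isJpg(String name) {
        String lower = name.toLowerCase();
        return lower.endsWith(".jpg") || lower.endsWith(".jpeg");
    }

    // Recurse the directory tree, collecting matching filenames.
    public static List<String> collect(File dir) {
        List<String> names = new ArrayList<String>();
        File[] entries = dir.listFiles();
        if (entries == null) return names;  // not a directory, or unreadable
        for (File entry : entries) {
            if (entry.isDirectory()) {
                names.addAll(collect(entry));
            } else if (isJpg(entry.getName())) {
                names.add(entry.getName());
            }
        }
        return names;
    }
}
```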
Re: QUERYPARSIN BOOSTING
On Jan 12, 2005, at 5:30 AM, Karthik N S wrote: If somebody has been closely watching Google, it boosts websites for paid-category sites based on search words.

Do you have an example of this? My understanding is Google *separates* the display of sponsored sites and ad links (like the one a friend of mine registered for me on my name). Separating is different from boosting. Erik