Re: WordNet code updated, now with query expansion -- Re: SYNONYM + GOOGLE
On Wednesday 12 January 2005 01:47, David Spencer wrote:

> Amusingly then, documents with the terms liberal wienerwurst match big dog! :)

There's something like frequency information in WordNet; it could probably be used to ignore the uncommon meanings.

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Searching a document for a keyword
Hi,

I'm new to Lucene and also to this forum.

I have a txt file that contains the paths to jpg files. These jpg files are organized into folders. My search is limited to this txt file only.

When I search on a folder name, a match is found in the txt file, but I want the search to return the entire matching line as the result, not the document name (which is the txt file).

How can I do that using Lucene? I have already built the index by giving the txt file as input.

If this is not possible, please tell me a way to parse jpg files to form an index file.

Thanks,
Swati
Re: Searching a document for a keyword
On Jan 12, 2005, at 4:13 AM, Swati Singhal wrote:

> I have a txt file, which contains the path to jpg files. These jpg files are organized into folders. My search is limited to searching only this txt file.
>
> So when I search based on a folder name, a match is found in the txt file, but I want it to return me the entire line as a search result and not the document name (which is the txt file).
>
> How can I do that using Lucene? I have already built the index by giving the txt file as an input to build the index.
>
> If this is not possible, please tell me a way to parse jpg files to form an index file.

First let me re-phrase what I think you want. You want to be able to search on a folder name and retrieve the JPG filenames that are in that folder. Correct? You're using the text file simply as a way to get text into Lucene? Does this text file have any other relevance here?

If you have a folder of JPG images, all you're after is their filenames, and you want each search result to be a single JPG file name, write a simple file system crawler that recurses your directory tree and indexes a single document for each JPG, with a field for the filename. What type of field should the filename field be? That depends on how you want to search. You could make it a Field.Keyword(), which would require exact-match (TermQuery) or PrefixQuery searches to work.

The Indexer example from Lucene in Action makes a great starting place for this crawler - you'd have to adapt it to recognize .jpg extensions and adjust it to index only the filename, not the contents (though the contents may contain text and be worth indexing also).

	Erik
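The crawler Erik describes could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: it uses plain Java (java.nio) rather than the Lucene API, and the names JpgEntry, JpgCrawler, crawl and filesInFolder are hypothetical. Each JpgEntry stands in for the single document you would index per JPG, and the exact-match folder lookup only mimics how an untokenized keyword field behaves.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for "one document per JPG": the folder and the
// filename play the role of the document's fields.
class JpgEntry {
    final String folder;
    final String filename;

    JpgEntry(String folder, String filename) {
        this.folder = folder;
        this.filename = filename;
    }
}

class JpgCrawler {
    // Recurse the tree under root and emit one entry per .jpg file,
    // recording only names, never file contents.
    static List<JpgEntry> crawl(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString()
                              .toLowerCase(Locale.ROOT).endsWith(".jpg"))
                .sorted()
                .map(p -> new JpgEntry(folderName(p), p.getFileName().toString()))
                .collect(Collectors.toList());
        }
    }

    // Name of the directory that directly contains the file ("" if none).
    static String folderName(Path file) {
        Path parent = file.getParent();
        if (parent == null || parent.getFileName() == null) {
            return "";
        }
        return parent.getFileName().toString();
    }

    // Exact-match lookup on the folder name (assumption: this mimics an
    // untokenized keyword field, which only matches the whole value).
    static List<String> filesInFolder(List<JpgEntry> entries, String folder) {
        List<String> names = new ArrayList<>();
        for (JpgEntry e : entries) {
            if (e.folder.equals(folder)) {
                names.add(e.filename);
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (JpgEntry e : crawl(root)) {
            System.out.println(e.folder + "\t" + e.filename);
        }
    }
}
```

In a real indexer, the body of crawl() is where you would build and add one Lucene document per file instead of a JpgEntry; whether the folder field should be a keyword or a tokenized field depends on how you want to search, as Erik notes.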
RE: QUERYPARSIN BOOSTING
Hi Guys

Apologies...

If anybody has been closely watching Google: it boosts websites for paid category sites based on the search words.

Can this (boosting the full website) be achieved in Lucene's search, based on the search word? If so, please explain, with examples.

with regards
karthik

-----Original Message-----
From: Chuck Williams [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 11, 2005 2:00 PM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: RE: QUERYPARSIN BOOSTING

Karthik,

I don't think the boost in your example does much since you are using an AND query, i.e. all hits will have to contain both vendor:nike and contents:shoes. If you used an OR, then the boost would put nike products above (non-nike) shoes, unless there was some other factor that caused the score of contents:shoes to be 10x greater than that of vendor:nike. It's a good idea to look at the results of explain() when analyzing what's happening with scoring, tuning your boosts and your Similarity.

Chuck

-----Original Message-----
From: Nader Henein [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 11, 2005 12:21 AM
To: Lucene Users List
Subject: Re: QUERYPARSIN BOOSTING

From the text on the Lucene Jakarta Site:
http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

Lucene provides the relevance level of matching documents based on the terms found. To boost a term use the caret, ^, symbol with a boost factor (a number) at the end of the term you are searching. The higher the boost factor, the more relevant the term will be.

Boosting allows you to control the relevance of a document by boosting its term. For example, if you are searching for

    jakarta apache

and you want the term jakarta to be more relevant, boost it using the ^ symbol along with the boost factor next to the term. You would type:

    jakarta^4 apache

This will make documents with the term jakarta appear more relevant. You can also boost Phrase Terms as in the example:

    "jakarta apache"^4 "jakarta lucene"

By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2)

Regards.

Nader Henein

Karthik N S wrote:

> Hi Guys
>
> Apologies...
>
> This question may have been asked a million times on this forum; I need some clarifications.
>
> 1) FieldType = keyword, name = vendor
> 2) FieldType = text, name = contents
>
> Questions:
>
> 1) How do I construct a query that makes hits available for the VENDOR appear first?
> 2) If boosting is to be applied, how?
> 3) Is the query constructed below correct?
>
>    +Contents:shoes +((vendor:nike)^10)
>
> Please advise.
> Thx in advance.
>
> WITH WARM REGARDS
> HAVE A NICE DAY
> [ N.S.KARTHIK]
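Chuck's point about AND versus OR can be illustrated with a toy model. This is a minimal sketch, not Lucene's scoring: it assumes a document's score is simply the sum of the boosts of the clauses it matches, and it ignores tf, idf, norms and coord. The class name BoostDemo and the vendor:adidas and contents:shirt values are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BoostDemo {
    // One query clause: a field:term string plus its boost factor.
    static final class Clause {
        final String term;
        final double boost;

        Clause(String term, double boost) {
            this.term = term;
            this.boost = boost;
        }
    }

    // Toy score (NOT Lucene's formula): the sum of the boosts of the clauses
    // the document matches. Returns 0 when the document is not a hit.
    static double score(Set<String> doc, List<Clause> clauses, boolean requireAll) {
        double total = 0.0;
        int matched = 0;
        for (Clause c : clauses) {
            if (doc.contains(c.term)) {
                total += c.boost;
                matched++;
            }
        }
        // AND: every clause must match. OR: at least one clause must match.
        boolean hit = requireAll ? matched == clauses.size() : matched > 0;
        return hit ? total : 0.0;
    }

    static Set<String> doc(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    public static void main(String[] args) {
        List<Clause> query = Arrays.asList(
            new Clause("contents:shoes", 1.0),
            new Clause("vendor:nike", 10.0));

        Set<String> nikeShoes = doc("contents:shoes", "vendor:nike");
        Set<String> otherShoes = doc("contents:shoes", "vendor:adidas");
        Set<String> nikeShirt = doc("contents:shirt", "vendor:nike");

        for (boolean requireAll : new boolean[] {true, false}) {
            System.out.println(requireAll ? "AND query" : "OR query");
            System.out.println("  nike shoes:  " + score(nikeShoes, query, requireAll));
            System.out.println("  other shoes: " + score(otherShoes, query, requireAll));
            System.out.println("  nike shirt:  " + score(nikeShirt, query, requireAll));
        }
    }
}
```

In this toy model every AND hit matches both clauses and so receives the same total, which is why the boost cannot reorder the hits. With OR, documents matching only vendor:nike outscore documents matching only contents:shoes. Real scores depend on the Similarity in use, so explain() remains the way to check what actually happens.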
Re: QUERYPARSIN BOOSTING
On Jan 12, 2005, at 5:30 AM, Karthik N S wrote:

> If anybody has been closely watching Google: it boosts websites for paid category sites based on the search words.

Do you have an example of this? My understanding is Google *separates* the display of sponsored sites and ad links (like the one a friend of mine registered for me on my name). Separating is different than boosting.

	Erik
HELP! Directory is NOT getting closed!
*sigh* Yet again, I apologize. I'm generating altogether too much traffic here lately!

I'm stuck. I have a custom Directory, and I *need* a callback point so I can clean up. There's a method for this: Directory.close(), which I've overridden. It never gets called!

According to IndexWriter.java, line 246 (in 1.4.3's codebase), if closeDir is set, it's supposed to close the directory. That's fine - but that leads me to believe that for some reason, closeDir is *not* set. Why? Under what circumstances would this not be true, and under what circumstances would you NOT want to close the Directory?

This is absolutely slaughtering my attempt at a Directory, because I need a single unit-of-work, and I need a place to commit it, when it's done. If I commit it inside the directory's innards, then the UOW gets corrupted (and looks like it's more than one atomic action, which is EXACTLY what I don't need.)

---
Joseph B. Ottinger             http://enigmastation.com
IT Consultant                  [EMAIL PROTECTED]
Re: HELP! Directory is NOT getting closed!
Joseph Ottinger writes:

> According to IndexWriter.java, line 246 (in 1.4.3's codebase), if closeDir is set, it's supposed to close the directory. That's fine - but that leads me to believe that for some reason, closeDir is *not* set.
>
> Why? Under what circumstances would this not be true, and under what circumstances would you NOT want to close the Directory?

From the sources you can see that it is true only if the directory is created by the IndexWriter itself. If you provide a directory to the IndexWriter, you have to close it yourself.

HTH
	Morus
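The ownership rule Morus describes can be sketched in isolation. This is a minimal illustration, not Lucene's source: ToyDirectory and ToyWriter are hypothetical stand-ins for a custom Directory and for IndexWriter, and the only thing modelled is the closeDir flag being set when the writer creates the directory itself.

```java
// Hypothetical stand-in for a custom Directory with a cleanup hook.
class ToyDirectory {
    private boolean closed = false;

    // In a real custom Directory this is where the unit of work
    // could be committed.
    void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}

// Hypothetical stand-in for the writer. It closes the directory only when
// it created the directory itself, mirroring the closeDir behaviour
// described in the thread.
class ToyWriter {
    private final ToyDirectory dir;
    private final boolean closeDir;

    // Writer creates its own directory, so it owns it and will close it.
    // The path is unused in this toy version.
    ToyWriter(String path) {
        this.dir = new ToyDirectory();
        this.closeDir = true;
    }

    // Caller supplies the directory, so the caller stays responsible for it.
    ToyWriter(ToyDirectory dir) {
        this.dir = dir;
        this.closeDir = false;
    }

    ToyDirectory directory() {
        return dir;
    }

    void close() {
        if (closeDir) {
            dir.close();
        }
    }
}

class OwnershipDemo {
    public static void main(String[] args) {
        ToyDirectory mine = new ToyDirectory();
        ToyWriter writer = new ToyWriter(mine);
        try {
            // ... add documents here ...
        } finally {
            writer.close(); // does not close a caller-supplied directory
            mine.close();   // so the caller closes it explicitly
        }
        System.out.println("directory closed: " + mine.isClosed());
    }
}
```

The try/finally in main shows one way to get the single callback point Joseph is after: close the writer first, then close the directory you supplied, so the commit happens exactly once.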
Re: HELP! Directory is NOT getting closed!
On Wed, 12 Jan 2005, Morus Walter wrote:

> Joseph Ottinger writes:
>
> > According to IndexWriter.java, line 246 (in 1.4.3's codebase), if closeDir is set, it's supposed to close the directory. That's fine - but that leads me to believe that for some reason, closeDir is *not* set.
> >
> > Why? Under what circumstances would this not be true, and under what circumstances would you NOT want to close the Directory?
>
> From the sources you can see that it is true only if the directory is created by the IndexWriter itself. If you provide a directory to the IndexWriter, you have to close it yourself.

ARGH! (I've been saying that a lot lately!) Okay, I was looking at the sources but missed that. Thank you very much. *sigh*

---
Joseph B. Ottinger             http://enigmastation.com
IT Consultant                  [EMAIL PROTECTED]
Performance hits using MultiSearcher?
I am pretty new to Lucene. In my situation, there will be one, most likely fairly large, index, and over time a trickle of smaller indexes being created that could eventually number into the hundreds.

Does using MultiSearcher to search against all these separate indexes impose a performance hit as compared to merging the smaller indexes into the original larger one? How long could a typical index merge take, just arbitrarily?

Thanks,
Ashley
RE: QUERYPARSIN BOOSTING
Google has natural results on the left and sponsored results on the right. I do not believe the natural results are affected by paid keywords at all. What you seem to be describing is the behavior of the sponsored results, which I believe are explicitly attached to certain keywords.

The same approach would work in Lucene. Create a field to hold purchased keywords (any keywords you want to associate with the result). Then you can include this field in your search with a high boost (see DistributingMultiFieldQueryParser, http://issues.apache.org/bugzilla/show_bug.cgi?id=32674).

Google prefers certain results over others for certain keywords based on various factors of the keyword purchase and the site (amount paid for the keyword, Page Rank of the site, tenure of the listing, popularity of the listing, etc.). You could emulate this in various ways, using a combination of document/field boosting and perhaps replication of the term in the field (to increase its tf), or even perhaps multiple fields that are boosted at different levels. I'm not sure of the best approach to this part -- you could experiment a little.

Chuck
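The purchased-keywords idea can be tried out with a toy model before touching a real index. This is a rough sketch under loud assumptions: the score is just the sum over fields of boost times raw term count, which is not Lucene's formula (Lucene dampens tf and applies idf and norms). The field name "purchased" and the class SponsoredFieldDemo are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class SponsoredFieldDemo {
    // Count whitespace-separated tokens equal to the term (case-insensitive).
    static int termFrequency(String text, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.equals(wanted)) {
                count++;
            }
        }
        return count;
    }

    // Toy score (NOT Lucene's formula): sum over fields of boost * raw tf.
    static double score(Map<String, String> doc, String term,
                        Map<String, Double> fieldBoosts) {
        double total = 0.0;
        for (Map.Entry<String, Double> e : fieldBoosts.entrySet()) {
            String text = doc.getOrDefault(e.getKey(), "");
            total += e.getValue() * termFrequency(text, term);
        }
        return total;
    }

    // Build a two-field document; "purchased" is a hypothetical field name.
    static Map<String, String> doc(String contents, String purchased) {
        Map<String, String> d = new HashMap<>();
        d.put("contents", contents);
        d.put("purchased", purchased);
        return d;
    }

    static Map<String, Double> boosts(double contents, double purchased) {
        Map<String, Double> b = new HashMap<>();
        b.put("contents", contents);
        b.put("purchased", purchased);
        return b;
    }

    public static void main(String[] args) {
        Map<String, Double> b = boosts(1.0, 5.0);
        Map<String, String> organic = doc("running shoes and more shoes", "");
        Map<String, String> sponsored = doc("sports store", "shoes");
        Map<String, String> replicated = doc("sports store", "shoes shoes");

        System.out.println("organic:    " + score(organic, "shoes", b));
        System.out.println("sponsored:  " + score(sponsored, "shoes", b));
        System.out.println("replicated: " + score(replicated, "shoes", b));
    }
}
```

With a boost of 5 on the purchased field, a single purchased keyword outweighs two ordinary mentions in contents, and repeating the keyword in the field raises the score further. How strongly replication helps in a real index depends on the Similarity, so it is something to experiment with, as Chuck suggests.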