Re: Lucene internal document number?
"Hi, I have a short question regarding Lucene's internal document numbers: can you give me an idea where they are written into the index and how they are generated?"

I am not 100% sure about the technical design, only speaking from my experience with Lucene: the numbers depend on when a document was indexed. The older the document, the smaller the number. All documents are numbered from 0 to n-1, where n is the number of documents the current reader sees. There are never any gaps in this numbering. There is, to my knowledge, no explicit point where these numbers are written into the index. Think of positions in a list - they are not part of the list itself. You have to take into account that these numbers may change for documents after any deletions in the index. Regards, Karsten

-- Dr.-Ing. Karsten Konrad, Head of Artificial Intelligence Lab, XtraMind Technologies GmbH, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Phone +49 (681) 3 02-51 13, Fax +49 (681) 3 02-51 09, [EMAIL PROTECTED], www.xtramind.com. Visit us! DMS | Hall 2, Booth 2705 | September 7-9, 2004 | Messe Essen | www.dmsexpo.de

-Original Message- From: B. Grimm [Eastbeam GmbH] [mailto:[EMAIL PROTECTED] Sent: Friday, August 6, 2004 13:42 To: [EMAIL PROTECTED] Subject: Lucene internal document number?

Hi there, I looked around through the source but I don't get it. I also read the FAQ, and I know that the numbers are incremental for each index, start at 0, and change when optimizing and so on. I looked at the doc writers in Lucene, but I don't see the point where the numbers are assigned and written (I assume by using writeVInt() or something like that). It would be very kind if anyone could tell me which line in which file I have to look at. Thanks in advance and kind regards from Berlin, Germany. Bastian

-- With kind regards, Bastian Grimm
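To make the numbering tangible, here is a minimal sketch that walks the internal numbers with an IndexReader; the index path and the "title" field are invented for the example:

    import org.apache.lucene.index.IndexReader;

    public class DocNumbers {
        public static void main(String[] args) throws Exception {
            // Internal document numbers are just positions: 0 .. maxDoc()-1.
            IndexReader reader = IndexReader.open("/path/to/index"); // hypothetical path
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (!reader.isDeleted(i)) { // deleted slots keep their number until a merge
                    System.out.println(i + ": " + reader.document(i).get("title")); // "title" is assumed
                }
            }
            reader.close();
        }
    }

After deleting, say, document 3 and optimizing, the documents that used to be numbered 4..n-1 all shift down by one - which is why stored document numbers go stale.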
Re: How to access information from a part of the index
Hi, why don't you just use two indexes? Alternatively, you probably do not have to split the index at all: if you have two or more subsets, just use filters that match only the subset you are interested in. Counting the documents that contain a certain term in one of the subsets then becomes a search over the filtered index plus counting the number of results. Filters are quite efficient. Hope this helps, Karsten

-- Dr.-Ing. Karsten Konrad, Head of Artificial Intelligence Lab, XtraMind Technologies GmbH, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Phone +49 (681) 3 02-51 13, Fax +49 (681) 3 02-51 09, [EMAIL PROTECTED], www.xtramind.com

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Friday, July 9, 2004 11:22 To: [EMAIL PROTECTED] Subject: How to access information from a part of the index

Hello, for my thesis I have to use a Lucene index for a text categorization program. For that I need to split the index in two, so that I have a learning set and a validation set. The problem is that I don't know how to ask Lucene to give me, for example, the number of documents IN ONLY ONE of these subsets containing a specific term. For example, I would like to get the number of documents containing the term "hello" in a subset of the documents. This subset is a set of document numbers ({5,3}, say, where the complete index contains documents {0,1,2,3,4,5}). How can I do this in an efficient way? I tried to get all documents containing the term and then verify which documents belong to my subset. However, it turns out that this is very slow. Thanks in advance, Claude Libois
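To illustrate the counting part, a minimal sketch: the subset is given as a BitSet over internal document numbers, and the count is a single walk over the term's postings. The class and method names are invented:

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    public class SubsetTermCounter {
        // Counts how many documents of the subset contain the term.
        public static int count(IndexReader reader, Term term, BitSet subset)
                throws IOException {
            int n = 0;
            TermDocs docs = reader.termDocs(term); // postings list of the term
            try {
                while (docs.next()) {
                    if (subset.get(docs.doc())) {
                        n++;
                    }
                }
            } finally {
                docs.close();
            }
            return n;
        }
    }

If a subset is defined by a query rather than an explicit document list, a QueryFilter's bits(reader) yields the BitSet directly.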
Re: clustering results
Hi (danger: shameless advertising below), our partner, Brox IT-Solutions, is using our - XtraMind Technologies GmbH - clustering for implementing meta-search clustering of search results à la Vivisimo. Check out: http://www.anyfinder.de/ The clustering is done on the snippets coming from the search engines, but the original version, which we still use in our own products, is based on modified Lucene indexes, as these can efficiently handle lots of information about texts and terms. Our clustering engine not only clusters search results but also performs trend recognition for competitive intelligence and similar tasks - though not too many people require such specialized features. Brox's price models for this engine may be interesting for those who find other products too expensive; it also works with all existing search engines, not only Lucene.

-- Dr.-Ing. Karsten Konrad, Head of Artificial Intelligence Lab, XtraMind Technologies GmbH, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Phone: +49 (681) 3025113, Fax: +49 (681) 3025109, [EMAIL PROTECTED], www.xtramind.com

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Sunday, April 11, 2004 19:03 To: Lucene Users List Subject: Re: clustering results

I got all excited reading the subject line "clustering results", but this isn't really clustering, is it? This is more sorting. Does anyone know of any work within Lucene (or another indexer) to do actual subject clustering (i.e. like Vivisimo @ http://vivisimo.com/ or Kartoo @ http://www.kartoo.com/)? It would be pretty awesome if Lucene had such an ability. I know there aren't a whole lot of clustering options, and the commercial products are very expensive. Anyhow, just curious. A brief definition of clustering: automatically organizing search or database query results into meaningful hierarchical folders ... transforming long lists of search results into categorized information without any clumsy pre-processing of the source documents. I'm not sure how it would be done...? Based off of top term frequencies for a document? -K

Quoting Michael A. Schoen [EMAIL PROTECTED]: So as Venu pointed out, sorting doesn't seem to help the problem. If we have to walk the result set, access docs and dedupe using brute force, we're better off w/ the standard order by relevance. If you've got an example of this type of clustering done in a more efficient way, that'd be great. Any other ideas?

- Original Message - From: Erik Hatcher [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Saturday, April 10, 2004 12:35 AM Subject: Re: clustering results

On Apr 9, 2004, at 8:16 PM, Michael A. Schoen wrote: I have an index of urls, and need to display the top 10 results for a given query, but want to display only 1 result per domain. It seems that using either Hits or a HitCollector, I'll need to access the doc, grab the domain field (I'll have it parsed ahead of time) and only take/display documents that are unique. A significant percentage of the time I expect I may have to access thousands of results before I find 10 in unique domains. Is there a faster approach that won't require accessing thousands of documents?

I have examples of this that I can post when I have more time, but a quick pointer... check out the overloaded IndexSearcher.search() methods which accept a Sort. You can do really really interesting slicing and dicing, I think, using it.
Try this one on for size:

    example.displayHits(allBooks, new Sort(new SortField[] {
        new SortField("category"),
        SortField.FIELD_SCORE,
        new SortField("pubmonth", SortField.INT, true)
    }));

Be clever indexing the piece you want to group on - I think you may find this the solution you're looking for. Erik
Re: Paid support for Lucene
"and eHatcher Solutions would be happy to as well :))" I don't think that one can be in much better hands here :) Anyway, for mid-size to larger projects around the use of any search engine in Germany, I can recommend Brox IT-Solutions (http://www.brox.de/). They use a nice flexible framework where you can apply Lucene plus other optional search engines (I think they now have some 10 engines to choose from, with many tools such as summarization that work with all these engines). With the help of this framework, integrating Lucene into an existing setup, or enhancing or replacing other search engines, can be done without programming one's leg off. I know them because they use my clustering algorithm when doing meta-searches. See http://searchdemo.brox.de/ (search for Lucene - the clustering is geared towards German, though!) Regards,

-- Dr.-Ing. Karsten Konrad, Head of Artificial Intelligence Lab, XtraMind Technologies GmbH, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Phone: +49 (681) 3025113, Fax: +49 (681) 3025109, [EMAIL PROTECTED], www.xtramind.com

-Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Thursday, January 29, 2004 19:46 To: Lucene Users List Subject: Re: Paid support for Lucene

and eHatcher Solutions would be happy to as well :))

On Jan 29, 2004, at 12:16 PM, Ryan Ackley wrote: I know of two: http://superlinksoftware.com http://jboss.org

- Original Message - From: Boris Goldowsky [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, January 29, 2004 12:04 PM Subject: Paid support for Lucene

Strangely, the web site does not seem to list any vendors who provide incident support for Lucene. That can't be right, can it? Can anyone point me to organizations that would be willing to provide support for Lucene issues? Thanks, Boris -- Boris Goldowsky [EMAIL PROTECTED] www.goldowsky.com/consulting
Re: Copy Directory to Directory function (backup)
Hi, an elegant method is to create an empty directory and merge the index to be copied into it, using IndexWriter's addIndexes() method. This way, you do not have to deal with files at all. Regards, Karsten

-Original Message- From: Nicolas Maisonneuve [mailto:[EMAIL PROTECTED] Sent: Thursday, January 15, 2004 13:28 To: [EMAIL PROTECTED] Subject: Copy Directory to Directory function (backup)

Hi, I would like to back up an index.

1) My first idea was to make a system copy of all the files, but in the FSDirectory class there is no public method to find out where the directory is located. A simple method like

    public File getDirectoryFile() {
        return directory;
    }

would be great.

2) So I decided to create a copy(Directory source, Directory target) method. I have seen the openFile() and createFile() methods, but I don't know how to use them (see my function below; it throws an exception):

    private void copy(Directory source, Directory target) throws IOException {
        String[] files = source.list();
        for (int i = 0; i < files.length; i++) {
            InputStream in = source.openFile(files[i]);
            OutputStream out = target.createFile(files[i]);
            byte c;
            while ((c = in.readByte()) != -1) {
                out.writeByte(c);
            }
            in.close();
            out.close();
        }
    }

Could someone help me please? Nico
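A minimal sketch of the merge-based backup Karsten describes, with hypothetical paths; addIndexes() copies the source index into the fresh directory, so no file handling is needed:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class IndexBackup {
        public static void main(String[] args) throws Exception {
            Directory source = FSDirectory.getDirectory("/path/index", false); // existing index
            Directory backup = FSDirectory.getDirectory("/path/backup", true); // create empty
            IndexWriter writer = new IndexWriter(backup, new StandardAnalyzer(), true);
            writer.addIndexes(new Directory[] { source }); // merge the source into the backup
            writer.optimize();
            writer.close();
        }
    }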
Re: Probabilistic Model in Lucene - possible?
Hi, "I would highly appreciate it if the experts here (especially Karsten or Chong) look at my idea and tell me if this would be possible." Sorry, I have no idea about how to use a probabilistic approach with Lucene, but if anyone does so, I would like to know, too. I am currently puzzled by a related question: I would like to know if there are any approaches to get a confidence value for relevance rather than a ranking. I.e., it would be nice to have a ranking weight whose value has some kind of semantics, such that we could compare results from different queries. Can probabilistic approaches do anything like this? Any help appreciated, Karsten

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Wednesday, December 3, 2003 15:13 To: [EMAIL PROTECTED] Subject: Probabilistic Model in Lucene - possible?

Hello group, from the very inspiring conversations with Karsten I know that Lucene is based on a vector space model. I am just wondering if it would be possible to turn this into a probabilistic model approach. Of course I do know that I cannot change the underlying indexing and searching principles. However, it would be possible to change the index term weights to either 1.0 (relevant) or 0.0 (non-relevant). For the similarity I would need to implement another similarity algorithm. I would highly appreciate it if the experts here (especially Karsten or Chong) look at my idea and tell me if this would be possible. If yes, how much effort would need to go into that? I am sure there are many other issues which I have not considered... Kind Regards, Ralf
Re: Document Similarity
Hi, "Do they produce the same ranking results?" No; Lucene's operations on query weight and length normalization are not equivalent to a vanilla cosine in vector space. "I guess the 2nd approach will be more precise but slow." Query similarity will indeed be faster, but it may actually not be worse. A straightforward cosine without IDF weighting of terms (as Lucene does) will almost certainly be less precise if you have documents of different lengths - word occurrence probabilities vary greatly with text length, and the cosine of two independent longer texts will often be greater than that of two texts that actually share a topic but are short, just because of randomly co-occurring non-content words. If, on the other hand, you choose the right TF/IDF weighting of terms, the cosine in this warped vector space could be (a) equivalent to the one Lucene computes - this requires some work - or (b) might even get better on average. However, the last time I counted, there were about 250 different TF/IDF formulas around in IR publications, machine learning, computational linguistics and so on. Performance depends on domain and language. But if I were you, I would just start playing and have fun with the stuff... Karsten

-Original Message- From: Jing Su [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 2, 2003 18:12 To: [EMAIL PROTECTED] Subject: Document Similarity

Hi, I have read some posts in the user/developer archives about Lucene-based document similarity comparison. In summary, two approaches are mentioned: 1 - turn one document into a query; 2 - treat each document as a vector, then rank according to their distance (cosine). Do they produce the same ranking results? Is there any other way to do it? I guess the 2nd approach will be more precise but slow. Thanks. Jing
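For reference, the vanilla cosine discussed above fits in a few lines of plain Java (no Lucene API involved); the term weights here are raw frequencies, and any of those ~250 TF/IDF variants can be swapped in by changing the values stored in the maps:

    import java.util.Map;

    public class Cosine {
        // Plain cosine over term -> weight maps (raw TF here; swap in any
        // TF/IDF weighting by changing the stored values).
        public static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0, normA = 0, normB = 0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                double wa = e.getValue();
                normA += wa * wa;
                Double wb = b.get(e.getKey());
                if (wb != null) dot += wa * wb; // shared dimension contributes
            }
            for (double wb : b.values()) normB += wb * wb;
            return (normA == 0 || normB == 0)
                    ? 0
                    : dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }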
Re: Real Boolean Model in Lucene?
Hi, "My Question: Does Lucene use TF/IDF for getting this? (which would mean it does not use the boolean model for the boolean query...)" Lucene indeed uses TF/IDF with length normalization for fields and documents. However, Lucene is downward compatible with the Boolean Model, where documents are represented as 0/1-vectors in vector space. Ranking just adds weights to the elements of the result set, so the underlying interpretation of a query result can still be that of a propositional/Boolean model: if a document appears in the result, its tokens evaluate the query (which actually is a propositional formula formed over words and phrases) to true. The representation of documents is more complex in Lucene than required for the Boolean Model, and as a result, Lucene can efficiently handle phrases and proximity searches, but these seem to be compatible extensions - if you can do it in the Boolean Model, you can do it in Lucene :) One place where Lucene is not 100% compatible with a basic Boolean Model is that full negation is a bit tricky - you cannot simply ask for all documents that do not contain a certain term, unless you also have some term that appears in all documents. Not a big deal, really. If TF/IDF weighting is a problem for you, implementing the Similarity interface allows you to remove all references to length normalization and document frequencies. Regards,

With kind regards from Saarbrücken -- Dr.-Ing. Karsten Konrad, Head of Artificial Intelligence Lab, XtraMind Technologies GmbH, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Phone: +49 (681) 3025113, Fax: +49 (681) 3025109, [EMAIL PROTECTED], www.xtramind.com

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Monday, December 1, 2003 13:11 To: [EMAIL PROTECTED] Subject: Real Boolean Model in Lucene?

Hi, is it possible to use a real boolean model in Lucene for searching? When one uses the QueryParser with a boolean query (i.e. dog AND horse), one does get a list of documents from the Hits object. However, these documents have a ranking (score). My question: does Lucene use TF/IDF for getting this? (Which would mean it does not use the boolean model for the boolean query...) How can one use a boolean model search, where the outcome is all scores = 1? Example? Cheers, Ralph
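A sketch of the workaround for full negation: assuming every document was indexed with a constant marker field, e.g. doc.add(Field.Keyword("all", "yes")) - the field name and value are invented - "all documents not containing foo" becomes a required marker clause plus a prohibited term clause:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class PureNegation {
        public static BooleanQuery allDocsWithout(String field, String text) {
            BooleanQuery query = new BooleanQuery();
            // required: the marker term present in every document
            query.add(new TermQuery(new Term("all", "yes")), true, false);
            // prohibited: the term to negate
            query.add(new TermQuery(new Term(field, text)), false, true);
            return query;
        }
    }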
Re: Re: Real Boolean Model in Lucene?
Hello Ralf, "According to your description, Lucene basically maps the boolean query into the vector space and measures the cosine similarity towards other documents in the vector space." "If I understood you correctly, you mean: if a document is found by Lucene based on a boolean query, it is relevant (boolean true). If it is not returned, it was boolean false. The score sits on top of that and can be used for ranking. If I wanted to use a true boolean model, I would therefore just need to ignore the score of the Hits document. Did I understand correctly?" Yes, I think that this is indeed pretty close to some theoretical foundation: the Boolean Model explains which documents fit a query, while some appropriate similarity function in vector space (Lucene's is good!) yields the ranking. Now, hell would be the place for me where I would have to prove that Lucene's ranking is exactly equivalent to some transformation of vector space followed by using the *cosine* for the ranking. It can't be, really, as Lucene sometimes returns results > 1.0, and only some ruthless normalisation keeps them within 0.0 to 1.0. In other words, there still are some rough corners in Lucene where a good theorist could find some work. Could we leave this topic aside until some suicid.. err, I mean enthusiastic fellow tries to work out a really good theory? Regards, Karsten

-Original Message- From: Ralf B [mailto:[EMAIL PROTECTED] Sent: Monday, December 1, 2003 14:28 To: Lucene Users List Subject: Re: Re: Real Boolean Model in Lucene?

Hi Karsten, I want to thank you for your qualified answer, as well as your answer from the 14th of November, where you agreed with me that Lucene is basically a VSM implementation. Sometimes it is difficult to make the link between the clean theory and its implementation. According to your description, Lucene basically maps the boolean query into the vector space and measures the cosine similarity towards other documents in the vector space. If I understood you correctly, you mean: if a document is found by Lucene based on a boolean query, it is relevant (boolean true). If it is not returned, it was boolean false. The score sits on top of that and can be used for ranking. If I would like to use a true boolean model, I would therefore just need to ignore the score of the Hits document. Did I understand correctly? I agree that nobody really wants to do that. My question intended to find out more about the implemented theory within Lucene. Cheers, Ralph
Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
Hi, it actually is quite nice, and it can be used in production for such things as have been discussed lately in this group. If you want to play it safe: the iterator breaks at dots after numbers (e.g. "15. March"), so the precision of the algorithm can be increased if you never break after a number. The implementation is fast. Regards, Karsten

With kind regards from Saarbrücken -- Dr.-Ing. Karsten Konrad, Head of Artificial Intelligence Lab, XtraMind Technologies GmbH, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Phone: +49 (681) 3025113, Fax: +49 (681) 3025109, [EMAIL PROTECTED], www.xtramind.com

-Original Message- From: Philippe Laflamme [mailto:[EMAIL PROTECTED] Sent: Monday, November 17, 2003 15:39 To: Lucene Users List Subject: RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

There is already an implementation in the Java API for sentence boundary detection. The BreakIterator in the java.text package has this to say about sentence splitting: "Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks such as quotation marks and parentheses." http://java.sun.com/j2se/1.4.1/docs/api/java/text/BreakIterator.html The whole i18n Java API is based on the ICU framework from IBM: http://oss.software.ibm.com/icu/index.html It supports many languages. I personally do not have any experience with the BreakIterator in Java. Has anyone used it in a production environment? I'd be very interested to learn more about its efficiency. Regards, Phil

-Original Message- From: Chong, Herb [mailto:[EMAIL PROTECTED] Sent: November 17, 2003 08:53 To: Lucene Users List Subject: RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

I have a program written in Icon that does basic sentence splitting. With about 5 heuristics and one small lookup table, I can get well over 90% accuracy doing sentence boundary detection on email. For well-edited English text, like newswires, I can manage closer to 99%. This is all that is needed for significantly improving a search engine's performance when the query engine respects sentence boundaries. Incidentally, the GATE Information Extraction framework cites some references indicating that, for named entity feature extraction, their system can exceed the ability of trained humans to detect and classify named entities if only one person does the detection. Collaborating humans are still better, but no one has the time in practical applications. You probably know, since you know about Markov chains, that within-sentence term correlation, and hence the language model, is different than across sentences. Linguists have known this for a very long time. It isn't hard to put this capability into a search engine, but it absolutely breaks down unless there is sentence boundary information stored for use at query time. Herb

-Original Message- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] Sent: Friday, November 14, 2003 5:54 PM To: Lucene Users List Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

Well ... Sure, nothing can replace a human mind. But believe it or not, there are studies which show that even human experts can significantly differ in their opinions on what the key-phrases for a given text are. So the results are never clear-cut with humans either... In this sense, a heuristic tool for sentence splitting and key-phrase detection can go a long way.
For example, the application I mentioned uses quite a few heuristic rules (+ Markov chains as heavier ammunition :-), and it comes up with the following phrases for your email discussion (the text quoted below): (lang=EN): NLP, trainable rule-based tagging, natural language processing, apache, NLP expert. Now, this set of key-phrases does reflect the main noun phrases in the text... which means I have a practical and tangible benefit from NLP. QED ;-) Best regards, Andrzej
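For anyone wanting to try the JDK class mentioned above, a minimal sentence-splitting example; note how Karsten's caveat shows up - whether the period in "15. March" breaks the sentence should be checked for your input and locale:

    import java.text.BreakIterator;
    import java.util.Locale;

    public class SentenceSplit {
        public static void main(String[] args) {
            String text = "Dr. Smith arrived on 15. March. He gave a talk.";
            BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
            it.setText(text);
            int start = it.first();
            // walk the boundary positions and print each sentence in brackets
            for (int end = it.next(); end != BreakIterator.DONE;
                    start = end, end = it.next()) {
                System.out.println("[" + text.substring(start, end).trim() + "]");
            }
        }
    }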
Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
"Rules of linguistics? Is there such a thing? :)" Yes, there are. How can you expect communication (the goal of the game that natural language is about) to work if the game has no rules? Anyway, Herb is right, sentence boundaries do carry meaning, and the linguistic rule could be phrased as: constituents (concepts) mentioned together in one sentence have a closer relation than those that are not. I was wondering whether we could, while indexing, make use of this by increasing the position counter by a large number, let's say 1000, whenever we encounter a sentence separator (note, this is not trivial; not every '.' ends a sentence etc. etc. etc.). Thus, searching for

    "income tax"~100 "tax gain"~100 "income tax gain"~100 "income tax gain"

would find "income tax gain" as usual, but would boost all texts where the phrases involved appear within sentence boundaries - I assume that a sentence with 100 words would be pretty unlikely, but still within the 1000-word separation introduced by increasing the position. No linguistics necessary, actually, but it is an application of a linguistic rule! "Sure. But my take on this is that pigs will fly before NLP turns into a predictable science :)" You mean like physics (new models every 10 years), biology (same), medicine (er.. cancer research, anyone?), chemistry (the result could be verified in 8 of 10 experiments...). What does predictability mean to you? What sciences besides mathematics do give you 100% certainty? But I guess you are in flame mode anyway now :) Regards, Karsten

-Original Message- From: petite_abeille [mailto:[EMAIL PROTECTED] Sent: Friday, November 14, 2003 20:04 To: Lucene Users List Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

On Nov 14, 2003, at 19:50, Chong, Herb wrote: if you are handling inter correlation properly, then terms can't cross sentence boundaries. Could you not break down your document along sentence boundaries? If you manage to figure out what a sentence is, that is. if you are not paying attention to sentence boundaries, then you are not following rules of linguistics. Rules of linguistics? Is there such a thing? :) PA.
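A sketch of the position-gap idea as a token filter. It assumes a Lucene version where Token.setPositionIncrement() exists, and an upstream tokenizer that leaves sentence-final punctuation attached to the token - most analyzers strip it, so the boundary test below is only illustrative:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class SentenceGapFilter extends TokenFilter {
        private static final int GAP = 1000; // the large position jump at sentence ends
        private boolean sentenceEnded = false;

        public SentenceGapFilter(TokenStream input) {
            super(input);
        }

        public Token next() throws IOException {
            Token t = input.next();
            if (t == null) return null;
            if (sentenceEnded) {
                // open a gap so proximity queries with slop < GAP cannot cross it
                t.setPositionIncrement(t.getPositionIncrement() + GAP);
            }
            String text = t.termText();
            char last = text.charAt(text.length() - 1);
            // naive boundary heuristic - see the caveats in the mail above
            sentenceEnded = (last == '.' || last == '!' || last == '?');
            return t;
        }
    }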
Sentence dependencies (was: inter-term relation)
Hello, "There are many cases where linguistically separate sentences do have strong dependencies; in the web world, simple things like list items may be very closely related. Put another way: it may not be trivially easy to detect sentence boundaries, nor is it certain that what is a boundary from a language viewpoint really is a hard boundary from a semantic perspective. And are there not varying levels of separation between sentences, not just one?" There is a computational linguistic theory that deals with such questions, Rhetorical Structure Theory; see http://www.sil.org/~mannb/rst/. Basically, each text is seen as a hierarchical structure formed from a few rhetorical relations. Interestingly, some relations are not too hard to guess once your text is semi-structured already (the relation between a paragraph header and its paragraph is a rhetorical one, for instance; an HTML list is a sequence of sentences connected by the list relation, and so forth). Applying such theories to Lucene would require quite a lot of work while analysing the texts, but I see no reason why Lucene could not be convinced to work on such structures and boost the relation of terms more if they appear within closer RST-structure connections. Regards, Karsten

With kind regards from Saarbrücken -- Dr.-Ing. Karsten Konrad, Head of Artificial Intelligence Lab, XtraMind Technologies GmbH, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Phone: +49 (681) 3025113, Fax: +49 (681) 3025109, [EMAIL PROTECTED], www.xtramind.com

-Original Message- From: Tatu Saloranta [mailto:[EMAIL PROTECTED] Sent: Saturday, November 15, 2003 02:15 To: Lucene Users List Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

On Friday 14 November 2003 11:50, Chong, Herb wrote: if you are handling inter correlation properly, then terms can't cross sentence boundaries. if you are not paying attention to sentence boundaries, then you are not following rules of linguistics. Isn't that quite a strict interpretation, however? There are many cases where linguistically separate sentences do have strong dependencies; in the web world, simple things like list items may be very closely related. Put another way: it may not be trivially easy to detect sentence boundaries, nor is it certain that what is a boundary from a language viewpoint really is a hard boundary from a semantic perspective. And are there not varying levels of separation between sentences (sentences close to each other often are related, back references being common), not just one? As for storing boundaries in the index: am I naive if I suggest just using marker tokens to mark boundaries (sentence, paragraph, section)? Code that uses that information would obviously need to know the details of the marking used, but would it be infeasible to use such in-band information? -+ Tatu +-
Re: Slow response time with datefilter
"Not only is the query slow, but it seems to be slower the more results it returns. Any suggestions?" "If you have a lot of terms in that range, you can see that there is obviously some cycles spinning to do the work needed." If the number of different date terms causes this effect, why not round the date to the nearest (or next) midnight while indexing? Then filtering for the last 15 days only requires walking over 15-17 different date terms. If you don't do this, the number of different terms will be the same as the number of documents you indexed, which explains the slowdown when you have more results. Regards, Karsten

-Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Saturday, November 15, 2003 17:31 To: Lucene Users List Subject: Re: Slow response time with datefilter

On Friday, November 14, 2003, at 07:16 PM, Dror Matalon wrote: We're seeing slow response times when we apply a datefilter. A search that takes 7 msec with no datefilter takes 368 msec when I filter on the last fifteen days, and 632 msec on the last 30 days. Initially we saved the date doing

    document.add(Field.Keyword("dtstamp", dtstamp));

and then changed to doing

    document.add(Field.Keyword("dtstamp", DateField.dateToString(dtstamp)));

where dtstamp is a java.util.Date.

Both of the above lines of code are equivalent. This is where having open source is handy :)

    public static final Field Keyword(String name, Date value) {
        return new Field(name, DateField.dateToString(value), true, true, false);
    }

We search doing the following:

    days_ago_value = Long.parseLong(days); // could throw NumberFormatException
    days_ago_value = new java.util.Date().getTime() - (days_ago_value * 86400000L);
    hits = indexSearcher.search(query, DateFilter.After("dtstamp", days_ago_value));

DateFilter itself walks all the terms in the range you provide before executing the query. If you have a lot of terms in that range, you can see that there is obviously some cycles spinning to do the work needed. "Not only is the query slow, but it seems to be slower the more results it returns. Any suggestions?" If this date range is pretty static, you could (in Lucene's CVS codebase) wrap the DateFilter with a CachingWrapperFilter. Or you could construct a long-lived instance of an equivalent QueryFilter and reuse it across multiple queries. You would likely see dramatic differences using either of these approaches. Erik
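A sketch of the rounding at indexing time, reusing doc and dtstamp from the code in the question:

    import java.util.Calendar;
    import java.util.Date;
    import org.apache.lucene.document.DateField;
    import org.apache.lucene.document.Field;

    // Round the timestamp down to midnight so that a 15-day range
    // covers ~15 distinct index terms instead of one term per document.
    Calendar cal = Calendar.getInstance();
    cal.setTime(dtstamp);
    cal.set(Calendar.HOUR_OF_DAY, 0);
    cal.set(Calendar.MINUTE, 0);
    cal.set(Calendar.SECOND, 0);
    cal.set(Calendar.MILLISECOND, 0);
    Date midnight = cal.getTime();
    doc.add(Field.Keyword("dtstamp", DateField.dateToString(midnight)));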
Re: Vector Space Model in Lucene?
Hi, "vector space is only one of several important ones." What are these several other important ones? While Lucene does not give you an explicit vector space representation - you cannot efficiently access the vector of one document - the index's basic representation is a reduction of each document to its terms and frequencies, hence a mapping into a vector space, and hence a vector space model. The relative term weights (TF/IDF) warp the space and the vectors, but all of Lucene's search operations are nevertheless operations on a vector space model (ok, maybe phrase search is a bit different, as it requires an extension by position information). E.g., searching for a term means finding all vectors that have a certain common dimension, and ranking means weighting these relative to their angle in vector space. KK

With kind regards from Saarbrücken -- Dr.-Ing. Karsten Konrad, Head of Artificial Intelligence Lab, XtraMind Technologies GmbH, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Phone: +49 (681) 3025113, Fax: +49 (681) 3025109, [EMAIL PROTECTED], www.xtramind.com

-Original Message- From: Chong, Herb [mailto:[EMAIL PROTECTED] Sent: Friday, November 14, 2003 14:35 To: Lucene Users List Subject: RE: Vector Space Model in Lucene?

does it matter? vector space is only one of several important ones. Herb

-Original Message- From: Leo Galambos [mailto:[EMAIL PROTECTED] Sent: Friday, November 14, 2003 4:00 AM To: Lucene Users List Subject: Re: Vector Space Model in Lucene?

Really? And what model is used/implemented by Lucene? THX Leo
Re: Negative boosting?
I have done negative boosts, and they do work; you must construct your query terms accordingly. I found the results somewhat unintuitive - the mixture of negative and positive boosts (mostly 1.0), TF/IDF, and document length normalization will quite often make documents turn up as relevant that you did not expect. Regards, Karsten

With kind regards from Saarbrücken -- Dr.-Ing. Karsten Konrad, Head of Artificial Intelligence Lab, XtraMind Technologies GmbH, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Phone: +49 (681) 3025113, Fax: +49 (681) 3025109, [EMAIL PROTECTED], www.xtramind.com

-Original Message- From: Terry Steichen [mailto:[EMAIL PROTECTED] Sent: Thursday, September 11, 2003 16:05 To: Lucene Users Group Subject: Negative boosting?

I've often found the use of query-based boosting to be very beneficial. This is particularly so when it's easy to identify the term that I want to stand out as a primary selector. However, I've come across quite a few other cases where it would be easier (and more logical) to apply a negative boost - to de-emphasize the match when the term is present. Is it possible to apply a negative boost (it doesn't seem to work), and if not, would it break anything significant if that were added? Regards, Terry
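For concreteness, a query-side sketch: an optional clause whose boost is negative pulls matching documents down instead of excluding them. The field and term names are invented, and, as noted above, the interaction with TF/IDF and length normalization can be unintuitive:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    BooleanQuery query = new BooleanQuery();
    // required main term
    query.add(new TermQuery(new Term("contents", "java")), true, false);
    // optional de-emphasizing term with a negative boost
    TermQuery penalty = new TermQuery(new Term("contents", "coffee"));
    penalty.setBoost(-0.5f);
    query.add(penalty, false, false);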
Re: Exceptions while Updating an Index
Hi, it is very easy to provoke the errors you describe when you open many alternating writers and readers on Windows. You can circumvent the problem by using fewer writer and reader objects: e.g., first delete all documents to be updated, then write all the updated documents. Or use a second index only for the writing and merge it into the first after you have deleted the updated documents there. Regards, Karsten

-Original Message- From: Wilton, Reece [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 27, 2003 23:18 To: Lucene Users List Subject: Exceptions while Updating an Index

Hi, I am getting exceptions because Lucene can't rename files. Here are a couple of the exceptions that I'm getting:

    java.io.IOException: couldn't rename _6lr.tmp to _6lr.del
    java.io.IOException: couldn't rename segments.new to segments

I am able to index many documents successfully on my Windows machine. The problem occurs during the updating process, which goes like this:

    for (each XML file I want to index) {
        // create new document
        parse the XML file
        populate a new Lucene document with the fields from my XML file

        // remove old document from index
        open an index reader
        delete the term from the index   // this successfully deletes the one document
        close the index reader

        // add new document to index
        open an index writer
        add the document to the index writer
        close the index writer
    }

Any ideas on how to stop these exceptions from occurring? No other process is reading or writing to the index while this process is running. Thanks, Reece
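A sketch of the batched variant of the poster's loop; the "id" field, ids, docs and indexPath are hypothetical stand-ins for the per-file values:

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class BatchUpdater {
        // Phase 1: all deletions with one reader. Phase 2: all additions
        // with one writer. No alternating reader/writer per document.
        public static void update(String indexPath, String[] ids, Document[] docs)
                throws IOException {
            IndexReader reader = IndexReader.open(indexPath);
            for (int i = 0; i < ids.length; i++) {
                reader.delete(new Term("id", ids[i])); // drop the stale version
            }
            reader.close(); // release the lock before opening the writer

            IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
            for (int i = 0; i < docs.length; i++) {
                writer.addDocument(docs[i]); // add the fresh version
            }
            writer.close();
        }
    }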
Mysterious bugs...
Hi, after indexing 238000 documents on a Linux box, we get the following error:

    java.lang.IllegalStateException: docs out of order
        at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
        at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
        at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
        at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
        at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:341)
        at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:250)

Another error message we sometimes see (not reproducible) is:

    IOException: No buffer space available

Does anybody know the cause of these problems? Thanks! Karsten
Re: Analyzers, Queries: three questions
Hi, "1) How can I search untokenized fields? Do I have to pass my query through a NullAnalyzer?" No, the contents of an untokenized (i.e., keyword) field are stored as one Lucene token. Hence, you must build such a token from your query and construct a TermQuery to be able to search for it. In general, tokenize those fields over which you search, unless you want to treat field contents as identifiers (e.g., unique document names and the like). "2) How can I pass the value of a field through an Analyzer before storing it?" A text field is automatically analyzed and tokenized by the given analyzer; you do not have to do it manually. However, you can preprocess your text in any way you want before that happens - simply apply your operations to the content you index, but make sure that you use a compatible analyzer when searching. "3) How can I fine-tune my query, e.g. by saying that for searching within the contents field I want to pass the query through an Analyzer, but for searching within the title field I don't want the Analyzer pass?" Unfortunately, you cannot give different analyzers for different fields. You could process your query after parsing by traversing and manipulating the query object; this requires some programming, though - with the power of Lucene's default query language, you might end up with a lot of work here. Regards, Karsten

-Original Message- From: Ulrich Mayring [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 11, 2003 11:50 To: [EMAIL PROTECTED] Subject: Analyzers, Queries: three questions

Hi folks, I'm using the Snowball analyzer to index my documents. As an example I took the Tomcat documentation, which includes a document with the title "Workers HowTo". I put this string in a field called "title", within which I later do my query (of course again with the same SnowballAnalyzer). At first I indexed the field as a Keyword (== not tokenized), and Lucene later couldn't find it when I searched for "Workers HowTo". I found out that tokenization apparently includes application of the Analyzer, so if I put my query through an Analyzer, then the field to search must be tokenized. Hence my first question:

1) How can I search untokenized fields? Do I have to pass my query through a NullAnalyzer?

Next I made the title field a Text field, so it is tokenized. Now Lucene finds the document, but with a low score of 0.27. Sure enough, browsing the index showed me that the value of the title field is stored unanalyzed, i.e. "Workers HowTo" - exactly as retrieved from the document. On the other hand, after parsing, the query is actually transformed to (title:worker title:howto). This does of course not give an exact match, hence, I guess, the low score - and my next questions:

2) How can I pass the value of a field through an Analyzer before storing it?

3) How can I fine-tune my query, e.g. by saying that for searching within the contents field I want to pass the query through an Analyzer, but for searching within the title field I don't want the Analyzer pass? And I want a hit if either field provides it. Currently I'm using the MultiFieldQueryParser, but that only allows one Analyzer for all the fields.

Thank you very much in advance for any pointers, Ulrich
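To make answer 1 concrete, a minimal TermQuery against a keyword field; the index path is invented, and the term text must match the stored value verbatim, since no analyzer is involved:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class KeywordSearch {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/path/to/index"); // hypothetical
            // one token, exactly as stored by Field.Keyword("title", "Workers HowTo")
            TermQuery query = new TermQuery(new Term("title", "Workers HowTo"));
            Hits hits = searcher.search(query);
            System.out.println(hits.length() + " matching document(s)");
            searcher.close();
        }
    }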
Re: Re: Analyzers, Queries: three questions
Hi, field contents indexed with Field.Text are stored verbatim in the index - thus, you get back the original text when you access it using stringValue(). This has nothing to do with how the text is indexed, i.e., how it is tokenized and stored in the inverted index. You probably have a token "workers" and one "howto", both pointing to this text (that's why it is called an inverted index: the words point to the text). Your analyzer does this tokenization for you. If you search using the query parser, you can only do this on indexed fields, e.g., those indexed with Field.Text or Field.UnStored. If you store a text as a keyword, you must construct a TermQuery and search with it. Thus, you would actually search for the term (title, "Workers HowTo"). Regards, Karsten

-Original Message- From: Ulrich Mayring [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 11, 2003 13:36 To: [EMAIL PROTECTED] Subject: Re: Re: Analyzers, Queries: three questions

Karsten Konrad wrote: "2) How can I pass the value of a field through an Analyzer before storing it?" "A text field is automatically analyzed and tokenized by the given analyzer; you do not have to do it manually." Well, but if I browse my index, I see all the terms stored in their original form. I use this code:

    doc.add(Field.Text("title", "Workers HowTo"));
    ...
    // Build and execute Query, so that only the above document is found
    Document d = hits.doc(0);
    Field field = d.getField("title");
    System.out.println(field.name() + "," + field.stringValue());

This outputs "title,Workers HowTo" - the untokenized, unanalyzed form. So, what's wrong here? cheers, Ulrich
Re: DBDirectory available for download
Thanks! Do you already have some numbers on how it compares to the file system implementation, i.e., how fast are indexing and searching? Regards, Karsten

-Original Message- From: Anthony Eden [mailto:[EMAIL PROTECTED] Sent: Monday, June 2, 2003 22:23 To: Lucene Users List Subject: DBDirectory available for download

Version 1.0 of the DBDirectory library, which implements a Directory that can store indexes in a database, is now available for download. There are two versions: Tar GZIP: http://www.anthonyeden.com/download/lucene-dbdirectory-1.0.tar.gz ZIP: http://www.anthonyeden.com/download/lucene-dbdirectory-1.0.zip The source code is included. Please read the README file for instructions on using DBDirectory. I have only tested it with MySQL but would be happy to add other database scripts if anyone would like to submit them. Please post any questions here on the mailing list. Otis, is there anything left to do to get this into the sandbox? Additionally, how will I maintain the code if it is in the sandbox? Will I get write access to the part of the CVS repository which would house DBDirectory? I currently have all of the code in my private CVS. Sincerely, Anthony Eden
Re: Search for similar terms
Hi, the expensive part of the algorithm is the comparison of two terms using the Levenshtein edit distance, which is done for all terms - with possibly horrible performance on large indexes. With

    TermEnum termEnum = reader.terms(new Term(field, start));

you get a term enumerator that starts at the given "start" prefix. Use this to compute a term enumerator that starts near the term(s) you are looking for. In the termCompare method, you should make sure that the prefix is the same and that the lengths of the terms to compare are not too different. Like, e.g.:

    if (field == term.field() && target.startsWith(start)) {
        int targetlen = target.length();
        if (Math.abs(textlen - targetlen) < 5) {
            int dist = editDistance(text, target, textlen, targetlen);
            distance = 1 - ((double) dist / (double) Math.min(textlen, targetlen));
        }
    }

The modification I propose here has some downsides - if a typo occurs at the beginning of a word, you will not get a proper result. I am not sure about this, but I think that term enumeration could be much more efficient for purposes like this if the terms(Term t) method only enumerated terms of the same field as t. As far as I understand this comment, the enumeration goes over all terms after t:

    /** Returns an enumeration of all terms after a given term.
        The enumeration is ordered by Term.compareTo(). Each term is greater
        than all that precede it in the enumeration. */
    public abstract TermEnum terms(Term t) throws IOException;

I haven't found a way to stop the enumeration once I am sure that the input term cannot match any more :) Regards, Karsten

-Original Message- From: Eric Jain [mailto:[EMAIL PROTECTED] Sent: Monday, June 2, 2003 13:17 To: Karsten Konrad Cc: Lucene Users List Subject: Re: Search for similar terms

"have a look at the FuzzyTermEnum class in Lucene." The FuzzyTermEnum class is truly useful... if I could get it to be a bit faster. By faster I mean something in the order of one second for a half-gigabyte index; currently the best I get is five seconds. What I am trying to accomplish: if a query does not yield any results, choose and display, out of all similar terms, the one which occurs most often in the index. What I have tried so far: - Required the first three characters to match exactly, excluding them from the similarity search (time reduced from 15s to 5s). - Increased FUZZY_THRESHOLD to 1.75 (no significant effect on time). - Only executed termCompare for terms with a higher frequency than the best matching term seen so far (no effect). Observations: - Time seems to be independent of the frequency of a term. Any further ideas would be greatly appreciated! Also (dear committers...), it would be great if FuzzyTermEnum could be subclassed, rather than having to resort to copy & paste (the class is final). -- Eric Jain
Re: Search for similar terms
Hi, please have a look at the FuzzyTermEnum class in Lucene. There is an impressive implementation of the Levenshtein distance there that you can use; simply set the fuzzy distance higher than 0.5 (0.75 seems to work fine) and modify the termCompare method such that the last term produced is always the one you consider best, i.e., the one with the smallest edit distance but the highest IDF. You can greatly speed up the computation by making sure in your termCompare method that you only compare terms by Levenshtein that share a common prefix of a few characters, say 3 or 4. Thus, it will repair "notebok" into "notebook", but not "nitebook" into "notebook". Most spelling errors seem to appear towards the end of a word, so the restriction is not unreasonable. I use a similar method for auto-expanding dubious terms on large indexes (> 1 GB), and the performance is still quite good. Regards, Karsten

-Original Message- From: Dario Dentale [mailto:[EMAIL PROTECTED] Sent: Friday, May 30, 2003 19:05 To: Lucene Users List Subject: Re: Search for similar terms

Thanks for the answer. I was searching for a solution not based on a dictionary, but on the list of terms (with their frequencies) contained in the Lucene index. In this way (I think) I can obtain more significant results, I can use this method for multiple languages (without the respective dictionaries and without knowing which language is used in the query string), and especially for out-of-dictionary terms (e.g., on an e-commerce site you can find "Nikon coolpix", which is not in any dictionary). I was searching for some algorithm that can calculate a similarity coefficient between two terms; multiplying it with the frequency in the indexed documents would give a score. Do you think that this is the wrong way? Regards, Dario

- Original Message - From: [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Friday, May 30, 2003 3:51 PM Subject: Re: Search for similar terms

Perform the Lucene search. If you get no or few hits, send the query term to a spell checker, like ispell. Echo the alternative spelling(s) to the user. DaveB

From: Dario Dentale [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: 05/30/03 05:15 AM Subject: Search for similar terms

Hi, does anybody know the best way to implement in Lucene a functionality (that Google has) like this: Search text -> "notebok" Answer -> "Did you mean: notebook?" Thanks, Dario
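Putting the advice together, a sketch of a frequency-weighted "did you mean" lookup: enumerate only terms sharing a short prefix with the misspelled word, and keep the close-enough candidate with the highest document frequency. The prefix length and distance cutoff are arbitrary choices, and a plain Levenshtein is included to keep the sketch self-contained (FuzzyTermEnum's version is more elaborate):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class DidYouMean {
        // Suggest the most frequent term within edit distance 2 that shares
        // a 3-character prefix with the (presumably misspelled) word.
        public static String suggest(IndexReader reader, String field, String word)
                throws IOException {
            String prefix = word.substring(0, Math.min(3, word.length()));
            TermEnum terms = reader.terms(new Term(field, prefix));
            String best = null;
            int bestFreq = 0;
            try {
                do {
                    Term t = terms.term();
                    if (t == null || !field.equals(t.field())
                            || !t.text().startsWith(prefix)) {
                        break; // terms are sorted, so we left the prefix range
                    }
                    if (editDistance(word, t.text()) <= 2 && terms.docFreq() > bestFreq) {
                        best = t.text();
                        bestFreq = terms.docFreq();
                    }
                } while (terms.next());
            } finally {
                terms.close();
            }
            return best;
        }

        // plain dynamic-programming Levenshtein distance
        private static int editDistance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            }
            return d[a.length()][b.length()];
        }
    }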