RE: Solr statistics of top searches and results returned
Hi,

In our specific implementation this is not really an issue, but I can imagine it could impact performance. I guess a new thread could be spawned for this, which would take care of any performance issues. Thanks for pointing it out. I'll post a message when I have coded the change.

Regards,
Patrick

-----Original Message-----
From: rswart [mailto:rjsw...@gmail.com]
Sent: Tuesday 26 May 2009 16:42
To: solr-user@lucene.apache.org
Subject: RE: Solr statistics of top searches and results returned

If this is not done in an async way, wouldn't it have a serious performance impact?

--
View this message in context: http://www.nabble.com/Solr-statistics-of-top-searches-and-results-returned-tp23621779p23724277.html
Sent from the Solr - User mailing list archive at Nabble.com.
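A minimal sketch of the "spawn a new thread" idea raised in this message, written in Python rather than as a Solr component. The class name, the `sink` callable and the queue size are all hypothetical; the point is only that the request thread hands the query to a bounded queue and returns at once, while a single background worker does the slow write to the statistics store.

```python
import queue
import threading


class AsyncQueryLogger:
    """Hand search queries to a background thread so searches never wait."""

    def __init__(self, sink, max_pending=1000):
        # `sink` is a hypothetical callable that stores one query,
        # for example by posting it to a second Solr core.
        self._sink = sink
        self._pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0  # approximate count of queries lost under load
        self.failed = 0   # count of queries the sink could not store
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def log(self, query):
        """Called on the request thread; never blocks."""
        try:
            self._pending.put_nowait(query)
        except queue.Full:
            # Losing one statistic is preferable to slowing down a search.
            self.dropped += 1

    def _drain(self):
        # Runs forever on the background thread.
        while True:
            query = self._pending.get()
            try:
                self._sink(query)
            except Exception:
                self.failed += 1
            finally:
                self._pending.task_done()

    def flush(self):
        """Block until everything queued so far has been handled."""
        self._pending.join()


stored = []
logger = AsyncQueryLogger(stored.append)
for q in ["solr statistics", "top searches", "solr statistics"]:
    logger.log(q)
logger.flush()
print(stored)
```

The bounded queue is a design choice, not something taken from the plugin: it caps memory use and makes the trade-off explicit, since under sustained overload statistics are dropped instead of delaying searches.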
RE: Solr statistics of top searches and results returned
Hi all,

I created a script that uses a Solr search component, which hooks into the main Solr core and catches the searches being done. It then tokenizes each search and sends both the tokenized and the original query to another Solr core. I have not written a factory for this, but if required it shouldn't be hard to modify the script and add database support to it.

You can find the source here:
http://www.ipros.nl/uploads/Stats-component.zip

It includes a README and a schema.xml that should be used. Please let me know your thoughts.

Best,
Patrick

-----Original Message-----
From: Umar Shah [mailto:u...@wisdomtap.com]
Sent: Friday 22 May 2009 10:03
To: solr-user@lucene.apache.org
Subject: Re: Solr statistics of top searches and results returned

Hi,

Good feature to have. Maintaining the top N would also require storing all the search queries done so far and keeping them updated (or at least those in some time window). Having pluggable persistent storage for all-time search queries would be great. Tell me, how can I help?

-umar

On Fri, May 22, 2009 at 12:21 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

On Fri, May 22, 2009 at 3:22 AM, Grant Ingersoll gsing...@apache.org wrote:

> I think you will want some type of persistence mechanism, otherwise you will end up consuming a lot of resources keeping track of all the query strings, unless I'm missing something. Either a Lucene index (Solr core) or the option of embedding a DB. Ideally, it would be pluggable such that people could choose their storage mechanism. Most people do this kind of thing offline via log analysis, as logs can grow quite large quite quickly.

For a general case, yes. But I was thinking more of a top 'n' queries as a running statistic.

--
Regards,
Shalin Shekhar Mangar.
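The two ideas in this message (tokenize each search, and keep a running "top n") can be sketched in a few lines of Python. This is not the code from the zip file; `tokenize`, `SearchStats` and the sample queries are hypothetical, and the in-memory counters stand in for whatever storage (a second core, a database) is actually used.

```python
import re
from collections import Counter


def tokenize(query):
    """Lowercase the query and split it on non-word characters."""
    return [token for token in re.split(r"\W+", query.lower()) if token]


class SearchStats:
    """Running counts of whole queries and of individual query terms."""

    def __init__(self):
        self.queries = Counter()
        self.terms = Counter()

    def record(self, query):
        # Normalise so that "Solr Stats" and "solr stats" count as one query.
        self.queries[query.strip().lower()] += 1
        self.terms.update(tokenize(query))

    def top_queries(self, n):
        return self.queries.most_common(n)

    def top_terms(self, n):
        return self.terms.most_common(n)


stats = SearchStats()
for q in ["solr stats", "Solr Stats", "lucene index"]:
    stats.record(q)
print(stats.top_queries(1))
print(stats.top_terms(3))
```

As Grant's quoted concern suggests, counters like these grow with every distinct query string, so a real implementation would need either persistence or some bound (a time window or a cap on tracked entries).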
RE: Solr statistics of top searches and results returned
Hi,

At the moment Solr does not have such functionality. I have, however, written a plugin for Solr that uses a second Solr core to store and index the searches. If you're interested, send me an email and I'll get you the source for the plugin.

Regards,
Patrick

-----Original Message-----
From: solrpowr [mailto:solrp...@hotmail.com]
Sent: Tuesday 19 May 2009 20:21
To: solr-user@lucene.apache.org
Subject: Solr statistics of top searches and results returned

Hi,

Besides my own offline processing via logs, does Solr have the functionality to give me statistics such as top searches, how many results were returned for these searches, and/or how long it took on average to get these results?

Thanks,
Bob

--
View this message in context: http://www.nabble.com/Solr-statistics-of-top-searches-and-results-returned-tp23621779p23621779.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: Solr statistics of top searches and results returned
Hi Shalin,

Let me investigate. I think the challenge will be in storing/managing these statistics. I'll get back to the list when I have thought of something.

Rgrds,
Patrick

-----Original Message-----
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
Sent: Wednesday 20 May 2009 10:33
To: solr-user@lucene.apache.org
Subject: Re: Solr statistics of top searches and results returned

On Wed, May 20, 2009 at 1:31 PM, Plaatje, Patrick patrick.plaa...@getronics.com wrote:

> At the moment Solr does not have such functionality. I have written a plugin for Solr though which uses a second Solr core to store/index the searches. If you're interested, send me an email and I'll get you the source for the plugin.

Patrick, this will be a useful addition. However, instead of doing this with another core, we can keep running statistics which can be shown on the statistics page itself. What do you think?

A related approach for showing slow queries was discussed recently. There's an issue open which has more details:
https://issues.apache.org/jira/browse/SOLR-1101

--
Regards,
Shalin Shekhar Mangar.
Getting request object within search component
Hi All,

I developed my own custom search component, in which I need to get the requester's IP address. But I can't seem to find a request object from which I can get this string. Ideas, anyone?

Best,
Patrick
RE: Change in config file (synonym.txt) requires container restart?
Hi,

I'm wondering whether you could implement a custom filter that reads the file in real time (you might even keep the created synonym map in memory for a predefined time). This would not need a restart of the container.

Best,
Patrick

-----Original Message-----
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
Sent: Friday 19 December 2008 7:30
To: solr-user@lucene.apache.org
Subject: Re: Change in config file (synonym.txt) requires container restart?

Please note that a core reload will also stop Solr from serving any search requests while it reloads.

On Fri, Dec 19, 2008 at 8:24 AM, Sagar Khetkade sagar.khetk...@hotmail.com wrote:

> But I am using CommonsHttpSolrServer for the Solr server configuration, as it accepts the URL. So how can I reload the core here?
>
> -Sagar
>
> Date: Thu, 18 Dec 2008 07:55:02 -0500
> From: markrmil...@gmail.com
> To: solr-user@lucene.apache.org
> Subject: Re: Change in config file (synonym.txt) requires container restart?
>
> Sagar Khetkade wrote:
>
> > Hi,
> >
> > I am using the SolrJ client to connect to a Solr 1.3 server, and the whole POC (I am doing a feasibility study) resides in a Tomcat web server. Whenever I change the synonym.txt file to add a synonym, I have to restart the Tomcat server for the change to take effect. The synonym filter factory I am using is in both the index and query analyzers in schema.xml. Please tell me whether this approach is good, or whether there is another way to make the change take effect while searching without restarting the Tomcat server.
> >
> > Thanks and Regards,
> > Sagar Khetkade
>
> You can also reload the core.
>
> - Mark

--
Regards,
Shalin Shekhar Mangar.
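The suggestion at the top of this message (read the synonym file at query time, but keep the parsed map in memory for a predefined time) might look roughly like the Python sketch below. It is not a Solr filter; `parse_synonyms`, `ReloadingSynonyms` and the 60-second default are hypothetical, and only the two common rule forms of the synonym file (comma-separated groups and `a => b` mappings, with `#` comments) are handled.

```python
import time


def parse_synonyms(text):
    """Parse comma-separated groups and 'a => b' rules into a dict."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            left, right = line.split("=>", 1)
            targets = [t.strip().lower() for t in right.split(",") if t.strip()]
            for source in left.split(","):
                source = source.strip().lower()
                if source:
                    mapping.setdefault(source, []).extend(targets)
        else:
            group = [t.strip().lower() for t in line.split(",") if t.strip()]
            for term in group:
                mapping.setdefault(term, []).extend(group)
    return mapping


class ReloadingSynonyms:
    """Synonym lookup that re-reads its file once the cached map is too old."""

    def __init__(self, path, ttl_seconds=60.0, clock=time.monotonic):
        self._path = path
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._mapping = {}
        self._loaded_at = None

    def lookup(self, term):
        self._reload_if_stale()
        # Unknown terms map to themselves.
        return self._mapping.get(term.lower(), [term])

    def _reload_if_stale(self):
        now = self._clock()
        if self._loaded_at is not None and now - self._loaded_at < self._ttl:
            return  # cached map is still considered fresh
        with open(self._path, encoding="utf-8") as handle:
            self._mapping = parse_synonyms(handle.read())
        self._loaded_at = now


print(parse_synonyms("# sample rules\ntv, television\ni-pod => ipod"))
```

With this shape, an edit to the file becomes visible at most `ttl_seconds` after it is saved, with no restart or core reload; the cost is one file read per expiry interval rather than per request.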
RE: php client. json communication
Or have a look at the wiki, which is probably a better way to start:
http://wiki.apache.org/solr/SolPHP

Best,
Patrick

--
Just trying to help
http://www.ipros.nl/
--

-----Original Message-----
From: KishoreVeleti CoreObjects [mailto:kisho...@coreobjects.com]
Sent: Tuesday 16 December 2008 15:14
To: solr-user@lucene.apache.org
Subject: Re: php client. json communication

Check out this link:
http://www.ibm.com/developerworks/library/os-php-apachesolr/index.html

If any of you have used it, can you share your experiences?

Thanks,
Kishore Veleti A.V.K.

Julian Davchev wrote:

> Hi,
>
> I am about to integrate Solr for indexing and searching my documents/data. It's a PHP application, but I see that should be no problem as Solr works with XML by default. Is there any ready PHP lib that will ease the whole communication with Solr and, if possible, send/receive JSON data? I looked through the list archive and there seem to be few discussions about PHP. Also, from the manual it seems that it can only return a JSON response, while the request should always be XML.
>
> Cheers,

--
View this message in context: http://www.nabble.com/php-client.-json-communication-tp21033573p21033806.html
Sent from the Solr - User mailing list archive at Nabble.com.
Using DIH, getting exception
Hi All,

I'm trying to use the Data Import Handler with the data config below (snippet):

<dataSource type="JdbcDataSource" name="mySource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://myhost/myDB" user="username" password="password"/>
<document name="post">

The variables are all good (username, password, etc.), but I'm getting the following exception. Any thoughts?

org.apache.solr.handler.dataimport.DataImportHandlerException: No dataSource :null available for entity :item Processing Document #

Best,
Patrick
RE: checkout 1.4 snapshot
Hi,

You can find the SVN repository here:
http://www.apache.org/dev/version-control.html#anon-svn

I'm not sure whether this represents the 1.4 version, but since it is the trunk, it is the latest version.

Best,
Patrick

-----Original Message-----
From: roberto [mailto:miles.c...@gmail.com]
Sent: Tuesday 16 December 2008 22:13
To: solr-user@lucene.apache.org
Subject: checkout 1.4 snapshot

Hello,

Could someone tell me how I can check out the 1.4 snapshot?

Thanks,

--
Without love, we are birds with broken wings.
Morrie
RE: checkout 1.4 snapshot
Sorry all,

Wrong URL in my previous post. The right URL is:
http://svn.apache.org/repos/asf/lucene/solr/

Best,
Patrick

-----Original Message-----
From: Plaatje, Patrick [mailto:patrick.plaa...@getronics.com]
Sent: Tuesday 16 December 2008 22:19
To: solr-user@lucene.apache.org
Subject: RE: checkout 1.4 snapshot

Hi,

You can find the SVN repository here:
http://www.apache.org/dev/version-control.html#anon-svn

I'm not sure whether this represents the 1.4 version, but since it is the trunk, it is the latest version.

Best,
Patrick

-----Original Message-----
From: roberto [mailto:miles.c...@gmail.com]
Sent: Tuesday 16 December 2008 22:13
To: solr-user@lucene.apache.org
Subject: checkout 1.4 snapshot

Hello,

Could someone tell me how I can check out the 1.4 snapshot?

Thanks,

--
Without love, we are birds with broken wings.
Morrie
RE: php client. json communication
Glad that's sorted. On the other issue (directly accessing Solr from any client), I think I saw a discussion on the list earlier, but I don't know what the result was. Browse through the archives and look for something about security (I think).

Best,
Patrick

-----Original Message-----
From: Julian Davchev [mailto:j...@drun.net]
Sent: Tuesday 16 December 2008 23:02
To: solr-user@lucene.apache.org
Subject: Re: php client. json communication

I think I've got it now. A search request is actually just a simple URL with a few params; no JSON, XML or other fancy stuff is needed. I was concerned about this because I need to use Solr from JavaScript directly, bypassing the application and searching directly.
RE: php client. json communication
Hi Julian,

I'm a bit confused. The indexing is indeed done through XML, but when searching it is possible to get JSON results by using the wt=json parameter. Have a look here:
http://wiki.apache.org/solr/SolJSON

Best,
Patrick

-----Original Message-----
From: Julian Davchev [mailto:j...@drun.net]
Sent: Tuesday 16 December 2008 22:39
To: solr-user@lucene.apache.org
Subject: Re: php client. json communication

Hi,

1. Thanks for the links, I looked at both. Still, I think that Solr, or at least those PHP clients, doesn't support JSON as input. It's clear that it's possible to get a JSON response, but search is only possible via XML queries.
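To illustrate the point about wt=json: a search is an ordinary HTTP GET with URL parameters, and only the response format changes. The Python sketch below builds such a URL and parses a response body. The base URL, the helper names and the sample body are assumptions made for illustration; the sample only mirrors the general shape of a Solr JSON response (a `response` object holding `numFound` and `docs`).

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, query, rows=10):
    """Build a search URL that asks Solr for a JSON response."""
    params = {"q": query, "wt": "json", "rows": rows}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def extract_docs(body):
    """Return (numFound, docs) from a JSON response body."""
    payload = json.loads(body)
    response = payload["response"]
    return response["numFound"], response["docs"]


# Hypothetical response body, shaped like what wt=json returns.
sample_body = (
    '{"responseHeader": {"status": 0, "QTime": 1}, '
    '"response": {"numFound": 1, "start": 0, "docs": [{"id": "42"}]}}'
)

print(build_select_url("http://localhost:8983/solr", "title:solr"))
print(extract_docs(sample_body))
```

Nothing in the request itself is XML or JSON; the query travels in the URL, which is why the same request works from PHP, JavaScript or a browser address bar.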
RE: Keyword extraction
Hi Aleksander,

With your help and the other comments, we're now at a point where a MoreLikeThis list is returned, showing 10 related records. However, the executed query returns no keywords whatsoever. Is the query string still wrong, or is something else required? The query string we're currently executing is:

http://suempnr3:8080/solr/select/?q=amsterdam&mlt.fl=text&mlt.displayTerms=list&mlt=true

Best,
Patrick

-----Original Message-----
From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED]]
Sent: Wednesday 26 November 2008 15:07
To: solr-user@lucene.apache.org
Subject: Re: Keyword extraction

Ah, yes, that is important. In Lucene, MLT checks whether the term vector is stored; if it is not, it can still perform the query, but in a much less efficient way: Lucene re-analyzes the document (and the variable DEFAULT_MAX_NUM_TOKENS_PARSED is used to limit the number of tokens that will be parsed). (I don't want to go into details on this since I haven't really dug through the code :p) But when the field isn't stored either, it is rather difficult to re-analyze the document ;)

On a general note, if you really want to understand how MLT works, take a look at the wiki or read this thorough blog post:
http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/

Regards,
Aleksander

On Wed, 26 Nov 2008 14:41:52 +0100, Plaatje, Patrick [EMAIL PROTECTED] wrote:

> Hi Aleksander,
>
> This was a typo on my end, the original query included a semicolon instead of an equal sign. But I think it has to do with my field not being stored and not being identified as termVectors=true. I'm recreating the index now, and will see if this fixes the problem.
>
> Best,
> Patrick
RE: Keyword extraction
Hi All,

As an addition to my previous post: no interestingTerms are returned when I execute the following URL:

http://localhost:8080/solr/select/?q=id=18477975&mlt.fl=text&mlt.interestingTerms=list&mlt=true&mlt.match.include=true

I do get a moreLikeThis list though. Any thoughts?

Best,
Patrick
RE: Keyword extraction
Hi Aleksander,

Thanks for clearing this up. I am confident that this is an approach worth exploring for me, as I'm just starting to grasp the matter. Do you know why I'm not getting any results with the query posted earlier, then? It gives me only the following:

<lst name="moreLikeThis">
<result name="18477975" numFound="0" start="0"/>
</lst>

instead of delivering details of the interestingTerms.

Thanks in advance,
Patrick

-----Original Message-----
From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED]]
Sent: Wednesday 26 November 2008 13:03
To: solr-user@lucene.apache.org
Subject: Re: Keyword extraction

I do not agree with you at all. The concept of MoreLikeThis is based on the fundamental idea of TF-IDF weighting, and not on term frequency alone. Please take a look at:
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/similar/MoreLikeThis.html

As you can see, it is possible to use cut-off thresholds to significantly reduce the number of unimportant terms and to generate highly suitable queries based on the tf-idf weight of each term. As you point out, high-frequency terms alone tend to be useless for querying, but taking the document frequency into account drastically increases the importance of the term!

In Solr, use parameters to manipulate your desired results:
http://wiki.apache.org/solr/MoreLikeThis#head-6460069f297626f2a982f1e22ec5d1519c456b2c

For instance:
mlt.mintf - Minimum Term Frequency - the frequency below which terms will be ignored in the source doc.
mlt.mindf - Minimum Document Frequency - the frequency at which words will be ignored which do not occur in at least this many docs.

You can also set thresholds for term length etc. Hope this gives you a better idea of things.

- Aleks

On Wed, 26 Nov 2008 12:38:38 +0100, Scurtu Vitalie [EMAIL PROTECTED] wrote:

> Dear Patrick,
>
> I had the same problem with the MoreLikeThis function. After briefly reading and analyzing the source code of the moreLikeThis function in Solr, I concluded the following: MoreLikeThis uses term vectors to rank all the terms of a document by frequency. According to this ranking, it artificially generates queries and searches for documents. So moreLikeThis retrieves related documents by artificially generating queries based on the most frequent terms.
>
> There's a big problem with the most frequent terms of a document. The most frequent words are usually meaningless: so-called function words or, as people in Information Retrieval like to call them, stopwords. However, even ignoring the technical problems of the moreLikeThis implementation, this approach is very dangerous, since queries are generated artificially from a given document. Writing queries for retrieving a document is a human task, and it assumes some knowledge (the user knows what document he wants).
>
> I advise using other approaches, depending on your expectations. For example, you can extract similar documents just by searching for documents with a similar title (more like this doesn't work in this case).
>
> I hope it helps.
>
> Best Regards,
> Vitalie Scurtu
>
> --- On Wed, 11/26/08, Plaatje, Patrick [EMAIL PROTECTED] wrote:
>
> From: Plaatje, Patrick [EMAIL PROTECTED]
> Subject: RE: Keyword extraction
> To: solr-user@lucene.apache.org
> Date: Wednesday, November 26, 2008, 10:52 AM
>
> Hi All,
>
> As an addition to my previous post: no interestingTerms are returned when I execute the following URL:
>
> http://localhost:8080/solr/select/?q=id=18477975&mlt.fl=text&mlt.interestingTerms=list&mlt=true&mlt.match.include=true
>
> I do get a moreLikeThis list though. Any thoughts?
>
> Best,
> Patrick

--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no
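The disagreement in this message (raw term frequency versus TF-IDF with cut-offs) can be made concrete with a toy calculation in Python. This is a simplified sketch, not Lucene's MoreLikeThis: the function name, the tiny corpus and the smoothed idf formula `ln(N / (df + 1)) + 1` are assumptions chosen for illustration, and `min_tf` / `min_df` only play the role that mlt.mintf and mlt.mindf are described as playing above.

```python
import math
from collections import Counter


def interesting_terms(doc_tokens, corpus, min_tf=2, min_df=2, top=5):
    """Rank the terms of one document by a simplified tf * idf score."""
    term_freq = Counter(doc_tokens)
    num_docs = len(corpus)
    scored = []
    for term, freq in term_freq.items():
        if freq < min_tf:
            continue  # too rare in the source document (cf. mlt.mintf)
        doc_freq = sum(1 for doc in corpus if term in doc)
        if doc_freq < min_df:
            continue  # occurs in too few documents (cf. mlt.mindf)
        # Smoothed idf: terms found in almost every document score low.
        idf = math.log(num_docs / (doc_freq + 1)) + 1.0
        scored.append((term, freq * idf))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top]


# Hypothetical corpus: each document is represented by its set of terms.
corpus = [
    {"the", "solr", "index", "search"},
    {"the", "lucene", "index"},
    {"the", "solr", "facet"},
    {"the", "weather", "report"},
]
doc = "the solr index the solr search the".split()

for term, score in interesting_terms(doc, corpus):
    print(term, round(score, 3))
```

In this toy data "the" is the most frequent term in the document, yet "solr" ranks above it, because "the" appears in every document and so gets a low idf. That is the effect Aleksander describes: document frequency offsets raw term frequency, which addresses the stopword concern raised in the quoted message.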
RE: Keyword extraction
Hi Aleksander,

This was a typo on my end, the original query included a semicolon instead of an equal sign. But I think it has to do with my field not being stored and not being identified as termVectors=true. I'm recreating the index now, and will see if this fixes the problem.

Best,
Patrick

-----Original Message-----
From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED]]
Sent: Wednesday 26 November 2008 14:37
To: solr-user@lucene.apache.org
Subject: Re: Keyword extraction

Hi there!

Well, first of all, I think you have an error in your query, if I'm not mistaken. You say
http://localhost:8080/solr/select/?q=id=18477975...
but since you are referring to the field called id, you must say:
http://localhost:8080/solr/select/?q=id:18477975...
(use a colon instead of the equals sign). I think that will do the trick. If not, try adding debugQuery=on at the end of your request URL to see debug output on how the query is parsed and if/how any documents are matched against your query.

Hope this helps.

Cheers,
Aleksander

On Wed, 26 Nov 2008 13:08:30 +0100, Plaatje, Patrick [EMAIL PROTECTED] wrote:

> Hi Aleksander,
>
> Thanks for clearing this up. I am confident that this is an approach worth exploring for me, as I'm just starting to grasp the matter. Do you know why I'm not getting any results with the query posted earlier, then? It gives me only the following:
>
> <lst name="moreLikeThis">
> <result name="18477975" numFound="0" start="0"/>
> </lst>
>
> instead of delivering details of the interestingTerms.
>
> Thanks in advance,
> Patrick

--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no
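Since the failing URLs in this thread came down to how the query string was put together (`q=id=...` instead of `q=id:...`), here is a small Python sketch that assembles the request from separate parameters. The helper name and default field are hypothetical, and the parameter names are simply the ones used in the thread; whether a given Solr handler honours each of them is not confirmed here.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, doc_id, field="text", debug=True):
    """Assemble a MoreLikeThis request from individual parameters."""
    params = [
        ("q", f"id:{doc_id}"),  # field:value, with a colon
        ("mlt", "true"),
        ("mlt.fl", field),
        ("mlt.interestingTerms", "list"),
        ("mlt.match.include", "true"),
    ]
    if debug:
        # Asks Solr to report how the query was parsed and matched.
        params.append(("debugQuery", "on"))
    # urlencode joins the pairs with "&" and escapes the colon as %3A.
    return base_url.rstrip("/") + "/select/?" + urlencode(params)


print(build_mlt_url("http://localhost:8080/solr", 18477975))
```

Building the URL from pairs avoids both mistakes seen above: every parameter is separated by `&`, and the field/value separator inside `q` cannot be confused with the `=` that separates parameter names from values.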
Keyword extraction
Hi all,

I'm struggling with a question I recently got from a colleague: is it possible to extract keywords from indexed content? In my opinion it should be possible to find out for which words the indexed content ranks highest (in Lucene or Solr), but I have no clue where to begin. Does anyone have suggestions?

Best,
Patrick