RE: Solr statistics of top searches and results returned

2009-05-26 Thread Plaatje, Patrick
Hi,

In our specific implementation this is not really an issue, but I can imagine 
it could impact performance. I guess a new thread could be spawned, which would 
take care of any performance issues. Thanks for pointing it out. I'll post a 
message when I have coded the change.

Regards,

Patrick
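
The asynchronous variant discussed in this message could look roughly like the sketch below: the search path only puts the query on an in-memory queue, and a background thread does the slow write. It is written in Python for brevity and is not the code of the plugin; `QueryLogger` and the `sink` callback are hypothetical names, and the bounded queue with drop-on-full is an assumption about what is acceptable for statistics.

```python
import queue
import threading


class QueryLogger:
    """Hands queries to a background thread so the search path never blocks."""

    def __init__(self, sink, max_pending=10000):
        self._sink = sink  # hypothetical: callable that stores one query
        self._pending = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, query):
        """Called on the search path; drops the entry if the queue is full."""
        try:
            self._pending.put_nowait(query)
            return True
        except queue.Full:
            return False

    def _drain(self):
        while True:
            query = self._pending.get()
            if query is None:  # sentinel: stop the worker
                break
            self._sink(query)

    def close(self):
        """Flush everything queued so far, then stop the worker."""
        self._pending.put(None)
        self._worker.join()


stored = []
logger = QueryLogger(stored.append)
for q in ["solr statistics", "top searches", "solr statistics"]:
    logger.log(q)
logger.close()
```

Dropping entries when the queue is full trades completeness of the statistics for a search path that never waits on the second core.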


-Original Message-
From: rswart [mailto:rjsw...@gmail.com] 
Sent: Tuesday, 26 May 2009 16:42
To: solr-user@lucene.apache.org
Subject: RE: Solr statistics of top searches and results returned


If this is not done in an async way, wouldn't this have a serious performance 
impact? 

 


--
View this message in context: 
http://www.nabble.com/Solr-statistics-of-top-searches-and-results-returned-tp23621779p23724277.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Solr statistics of top searches and results returned

2009-05-25 Thread Plaatje, Patrick
Hi all,

I created a script that uses a Solr Search Component, which hooks into the main 
Solr core and catches the searches being done. It then tokenizes the 
search and sends both the tokenized and the original query to another 
Solr core. I have not written a factory for this, but if required, it shouldn't 
be hard to modify the script and code database support into it.

You can find the source here:

http://www.ipros.nl/uploads/Stats-component.zip

It includes a README, and a schema.xml that should be used.

Please let me know your thoughts.

Best,

Patrick



 

-Original Message-
From: Umar Shah [mailto:u...@wisdomtap.com] 
Sent: Friday, 22 May 2009 10:03
To: solr-user@lucene.apache.org
Subject: Re: Solr statistics of top searches and results returned

Hi,

Good feature to have.
Maintaining the top N would also require storing all the search queries done so 
far and keeping them updated (or at least those in some time window).

Having pluggable persistent storage for all-time search queries would be great.

Tell me how I can help.

-umar

On Fri, May 22, 2009 at 12:21 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:
 On Fri, May 22, 2009 at 3:22 AM, Grant Ingersoll gsing...@apache.org wrote:


 I think you will want some type of persistence mechanism otherwise 
 you will end up consuming a lot of resources keeping track of all the 
 query strings, unless I'm missing something.  Either a Lucene index 
 (Solr core) or the option of embedding a DB.  Ideally, it would be 
 pluggable such that people could choose their storage mechanism.  
 Most people do this kind of thing offline via log analysis as logs can grow 
 quite large quite quickly.


 For a general case, yes. But I was thinking more of a top 'n' queries 
 as a running statistic.

 --
 Regards,
 Shalin Shekhar Mangar.



RE: Solr statistics of top searches and results returned

2009-05-20 Thread Plaatje, Patrick
Hi,

At the moment Solr does not have such functionality. I have written a plugin 
for Solr though which uses a second Solr core to store/index the searches. If 
you're interested, send me an email and I'll get you the source for the plugin.

Regards,

Patrick

-Original Message-
From: solrpowr [mailto:solrp...@hotmail.com] 
Sent: Tuesday, 19 May 2009 20:21
To: solr-user@lucene.apache.org
Subject: Solr statistics of top searches and results returned


Hi,

Besides my own offline processing via logs, does Solr have the functionality to 
give me statistics such as top searches, how many results were returned for 
these searches, and/or how long it took on average to get these results?


Thanks,
Bob
--
View this message in context: 
http://www.nabble.com/Solr-statistics-of-top-searches-and-results-returned-tp23621779p23621779.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Solr statistics of top searches and results returned

2009-05-20 Thread Plaatje, Patrick
Hi Shalin,

Let me investigate. I think the challenge will be in storing and managing these 
statistics. I'll get back to the list when I have thought of something.

Rgrds,

Patrick

-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] 
Sent: Wednesday, 20 May 2009 10:33
To: solr-user@lucene.apache.org
Subject: Re: Solr statistics of top searches and results returned

On Wed, May 20, 2009 at 1:31 PM, Plaatje, Patrick  
patrick.plaa...@getronics.com wrote:


 At the moment Solr does not have such functionality. I have written a 
 plugin for Solr though which uses a second Solr core to store/index 
 the searches. If you're interested, send me an email and I'll get you 
 the source for the plugin.


Patrick, this will be a useful addition. However instead of doing this with 
another core, we can keep running statistics which can be shown on the 
statistics page itself. What do you think?

A related approach for showing slow queries was discussed recently. There's an 
issue open which has more details:

https://issues.apache.org/jira/browse/SOLR-1101

--
Regards,
Shalin Shekhar Mangar.
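
The "running statistics" idea above can be illustrated with a small in-memory counter. This is only a sketch in Python, not Solr code: `TopQueries` is a hypothetical name, and lower-casing and trimming the query is an assumed normalization.

```python
from collections import Counter


class TopQueries:
    """Keeps a running count per normalized query string."""

    def __init__(self):
        self._counts = Counter()

    def record(self, query):
        # Normalize so "Solr" and " solr " count as the same search.
        self._counts[query.strip().lower()] += 1

    def top(self, n):
        # most_common() sorts by count, highest first.
        return self._counts.most_common(n)


stats = TopQueries()
for q in ["solr", "lucene", "Solr ", "tomcat", "solr", "lucene"]:
    stats.record(q)
top_two = stats.top(2)
```

The counter still grows with the number of distinct queries, so a long-running server would need some bound or persistence on top of this.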


Getting request object within search component

2008-12-24 Thread Plaatje, Patrick
Hi All,

I developed my own custom search component, in which I need to get the
requester's IP address. But I can't seem to find a request object from
which I can get this string. Ideas, anyone?

Best,

Patrick


RE: Change in config file (synonym.txt) requires container restart?

2008-12-19 Thread Plaatje, Patrick
Hi,

I'm wondering whether you could implement a custom filter which reads the
file in real time (you might even keep the created synonym map in memory for
a predefined time). This would not need a restart of the container.

Best,

Patrick
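
The "keep the map in memory for a predefined time" idea can be sketched as a time-limited cache. The Python below is an illustration only, not a Solr filter: `parse_synonyms`, `CachedSynonyms` and the `load_text` callback are hypothetical names, and the parser handles only simple comma-separated lines, not the full synonyms file syntax.

```python
import time


def parse_synonyms(text):
    """Parse lines like 'tv, television' into {term: set_of_equivalents}."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        terms = {t.strip() for t in line.split(",") if t.strip()}
        for term in terms:
            mapping.setdefault(term, set()).update(terms - {term})
    return mapping


class CachedSynonyms:
    """Re-reads the synonym source only when the cached map is older than ttl."""

    def __init__(self, load_text, ttl_seconds, clock=time.monotonic):
        self._load_text = load_text  # hypothetical: returns the file contents
        self._ttl = ttl_seconds
        self._clock = clock
        self._loaded_at = None
        self._mapping = {}

    def get(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._mapping = parse_synonyms(self._load_text())
            self._loaded_at = now
        return self._mapping


# Demonstration with an in-memory "file" and a fake clock.
contents = ["tv, television"]
fake_now = [0.0]
cache = CachedSynonyms(lambda: contents[0], ttl_seconds=60,
                       clock=lambda: fake_now[0])
first = cache.get()    # initial load
contents[0] = "tv, television, telly"
second = cache.get()   # within the TTL: still the cached map
fake_now[0] = 61.0
third = cache.get()    # TTL expired: reloaded
```

The injectable clock is there only so that the expiry can be exercised without sleeping.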

-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] 
Sent: Friday, 19 December 2008 7:30
To: solr-user@lucene.apache.org
Subject: Re: Change in config file (synonym.txt) requires container
restart?

Please note that a core reload will also stop Solr from serving any
search requests while it reloads.

On Fri, Dec 19, 2008 at 8:24 AM, Sagar Khetkade
sagar.khetk...@hotmail.com wrote:


 But I am using CommonsHttpSolrServer for the Solr server configuration, as 
 it accepts the URL. So how can I reload the core here?

 -Sagar

 Date: Thu, 18 Dec 2008 07:55:02 -0500
 From: markrmil...@gmail.com
 To: solr-user@lucene.apache.org
 Subject: Re: Change in config file (synonym.txt) requires container restart?

 Sagar Khetkade wrote:
 Hi,
 I am using the SolrJ client to connect to the Solr 1.3 server, and the 
 whole POC (a feasibility study) resides in a Tomcat web server. If I 
 change the synonym.txt file to add a synonym, I have to restart the 
 Tomcat server for the change to take effect. The synonym filter factory 
 I am using is in both the index and query analyzers in schema.xml. 
 Please tell me whether this approach is good, or whether there is 
 another way to make the change take effect while searching without 
 restarting the Tomcat server.
 Thanks and Regards,
 Sagar Khetkade

 You can also reload the core.
 - Mark




--
Regards,
Shalin Shekhar Mangar.


RE: php client. json communication

2008-12-16 Thread Plaatje, Patrick
Or have a look at the Wiki, probably a better way to start:

http://wiki.apache.org/solr/SolPHP

Best,

Patrick

--
Just trying to help 
http://www.ipros.nl/
--

-Original Message-
From: KishoreVeleti CoreObjects [mailto:kisho...@coreobjects.com] 
Sent: Tuesday, 16 December 2008 15:14
To: solr-user@lucene.apache.org
Subject: Re: php client. json communication


Check out this link
http://www.ibm.com/developerworks/library/os-php-apachesolr/index.html

If any of you have used it, can you share your experiences?

Thanks,
Kishore Veleti A.V.K.


Julian Davchev wrote:
 
 Hi,
 I am about to integrate Solr for indexing/searching my documents/data. 
 It's a PHP application, but I see it should be no problem as Solr works 
 with XML by default.
 Is there any ready-made PHP lib that will ease the whole communication 
 with Solr and, if possible, send/receive JSON data?
 
 I looked through the list archive and there seem to be few discussions 
 about PHP. Also, from the manual it seems that it can only return a JSON 
 response, while the request must always be XML.
 Cheers,
 
 

--
View this message in context:
http://www.nabble.com/php-client.-json-communication-tp21033573p21033806.html
Sent from the Solr - User mailing list archive at Nabble.com.



Using DIH, getting exception

2008-12-16 Thread Plaatje, Patrick
Hi All,

I'm trying to use the DataImportHandler with the data config below
(snippet):

<dataSource type="JdbcDataSource" name="mySource"
driver="com.mysql.jdbc.Driver" url="jdbc:mysql://myhost/myDB"
user="username" password="password"/>
<document name="post">

The values are all correct (username, password, etc.), but I'm getting
the following exception. Any thoughts?

org.apache.solr.handler.dataimport.DataImportHandlerException: No
dataSource :null available for entity :item Processing Document #


Best,

Patrick


RE: checkout 1.4 snapshot

2008-12-16 Thread Plaatje, Patrick
Hi,

You can find the SVN repository here:
http://www.apache.org/dev/version-control.html#anon-svn

I'm not sure whether this represents the 1.4 version, but as it is the trunk,
it's the latest version.

Best,

Patrick


-Original Message-
From: roberto [mailto:miles.c...@gmail.com] 
Sent: Tuesday, 16 December 2008 22:13
To: solr-user@lucene.apache.org
Subject: checkout 1.4 snapshot

Hello,

Could someone tell me how I can check out the 1.4 snapshot?

thanks,


--
Without love, we are birds with broken wings.
Morrie


RE: checkout 1.4 snapshot

2008-12-16 Thread Plaatje, Patrick
Sorry all,

Wrong url in the post, right url should be:

http://svn.apache.org/repos/asf/lucene/solr/

Best,

Patrick

 



RE: php client. json communication

2008-12-16 Thread Plaatje, Patrick
Glad that's sorted. On the other issue (directly accessing Solr from any
client), I think I saw a discussion on the list earlier, but I don't know
what the result was. Browse through the archives and look for something
about security (I think).

Best,

patrick 

-Original Message-
From: Julian Davchev [mailto:j...@drun.net] 
Sent: Tuesday, 16 December 2008 23:02
To: solr-user@lucene.apache.org
Subject: Re: php client. json communication

I think I've got it now. A search request is actually just a simple URL with a
few params; no JSON or XML or fancy stuff needed.
I was concerned about this because I need to use Solr from JavaScript
directly, bypassing the application and searching directly.



RE: php client. json communication

2008-12-16 Thread Plaatje, Patrick
Hi Julian,

I'm a bit confused. Indexing is indeed done through XML, but
when searching it is possible to get JSON results by using the wt=json
parameter. Have a look here:

http://wiki.apache.org/solr/SolJSON

Best,

Patrick
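
As an illustration of the point in this message, a search is an ordinary HTTP GET whose parameters include wt=json. The Python sketch below only builds such a URL and parses a hand-written sample body; `build_select_url` is a hypothetical helper, the base URL reuses the localhost:8080 address seen elsewhere in the thread, and the sample JSON is an assumed minimal response shape rather than output captured from a server.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, query, **params):
    """Return a /select URL that asks Solr for a JSON response."""
    all_params = {"q": query, "wt": "json"}
    all_params.update(params)
    return base_url.rstrip("/") + "/select?" + urlencode(all_params)


url = build_select_url("http://localhost:8080/solr", "title:solr", rows=5)

# Hand-written sample of the general response shape, not real server output.
sample_body = '{"response": {"numFound": 1, "start": 0, "docs": [{"id": "1"}]}}'
doc_ids = [doc["id"] for doc in json.loads(sample_body)["response"]["docs"]]
```

No XML is involved on the query side; XML only comes into play when posting documents for indexing, as the message says.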


-Original Message-
From: Julian Davchev [mailto:j...@drun.net] 
Sent: Tuesday, 16 December 2008 22:39
To: solr-user@lucene.apache.org
Subject: Re: php client. json communication

Hi,
1. Thanks for the links, I looked at both. Still, I think that Solr, or at
least those PHP clients, don't support JSON as input.
It's clear that it's possible to get a JSON response, but search is
only possible via XML queries.





RE: Keyword extraction

2008-11-27 Thread Plaatje, Patrick
Hi Aleksander,

With all your help and the other comments, we're now at a point where a 
MoreLikeThis list is returned, showing 10 related records. However, the 
query returns no keywords whatsoever. Is the query string still wrong, or is 
something else required?

The query string we're currently executing is:

http://suempnr3:8080/solr/select/?q=amsterdam&mlt.fl=text&mlt.displayTerms=list&mlt=true


Best,

Patrick 

-Original Message-
From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 26 November 2008 15:07
To: solr-user@lucene.apache.org
Subject: Re: Keyword extraction

Ah, yes, that is important. In Lucene, MLT will check whether the term vector is 
stored; if it is not, it can still perform the query, but in a much less 
efficient way: Lucene will re-analyze the document (and the variable 
DEFAULT_MAX_NUM_TOKENS_PARSED will be used to limit the number of tokens that 
will be parsed). (I don't want to go into details on this since I haven't 
really dug through the code :p) But when the field isn't stored either, it is 
rather difficult to re-analyze the document ;)

On a general note, if you want to really understand how the MLT works, take a 
look at the wiki or read this thorough blog post:  
http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/

Regards,
  Aleksander


RE: Keyword extraction

2008-11-26 Thread Plaatje, Patrick
Hi All,
 
As an addition to my previous post, no interestingTerms are returned
when I execute the following URL:
 
http://localhost:8080/solr/select/?q=id=18477975&mlt.fl=text&mlt.interestingTerms=list&mlt=true&mlt.match.include=true
 
I get a moreLikeThis list though, any thoughts?
 
Best,
 
Patrick


RE: Keyword extraction

2008-11-26 Thread Plaatje, Patrick
Hi Aleksander,

Thanks for clearing this up. I am confident that this is a route worth exploring 
for me, as I'm just starting to grasp the matter. Do you know why I'm not getting 
any results with the query posted earlier, then? It gives me only the following:

<lst name="moreLikeThis">
<result name="18477975" numFound="0" start="0"/>
</lst>

Instead of delivering details of the interestingTerms.

Thanks in advance

Patrick


-Original Message-
From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 26 November 2008 13:03
To: solr-user@lucene.apache.org
Subject: Re: Keyword extraction

I do not agree with you at all. The concept of MoreLikeThis is based on the 
fundamental idea of TF-IDF weighting, and not term frequency alone.
Please take a look at:  
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/similar/MoreLikeThis.html
As you can see, it is possible to use cut-off thresholds to significantly 
reduce the number of unimportant terms, and to generate highly suitable queries 
based on the tf-idf weight of each term. As you point out, high-frequency 
terms alone tend to be useless for querying, but taking the document 
frequency into account drastically increases the importance of the term!

In solr, use parameters to manipulate your desired results:  
http://wiki.apache.org/solr/MoreLikeThis#head-6460069f297626f2a982f1e22ec5d1519c456b2c
For instance:
mlt.mintf - Minimum Term Frequency - the frequency below which terms will be 
ignored in the source doc.
mlt.mindf - Minimum Document Frequency - the frequency at which words will be 
ignored which do not occur in at least this many docs.
You can also set thresholds for term length etc.

Hope this gives you a better idea of things.
- Aleks
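
The weighting described above can be illustrated with a toy calculation. The Python sketch below is not Lucene's MoreLikeThis code: `interesting_terms` is a hypothetical function, `min_tf` and `min_df` merely mirror the roles of mlt.mintf and mlt.mindf, and the idf formula is an assumption modeled on a classic log-based definition.

```python
import math
from collections import Counter


def interesting_terms(doc_tokens, corpus, min_tf=2, min_df=2, max_terms=5):
    """Rank the terms of one document by tf * idf, after threshold cut-offs."""
    num_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per document

    scored = []
    for term, tf in Counter(doc_tokens).items():
        df = doc_freq[term]
        if tf < min_tf or df < min_df:
            continue  # below a threshold: ignored
        idf = math.log(num_docs / (df + 1)) + 1.0  # assumed idf definition
        scored.append((term, tf * idf))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in scored[:max_terms]]


corpus = [
    ["the", "canal", "the", "amsterdam", "canal", "the"],
    ["the", "amsterdam", "museum", "the"],
    ["the", "weather", "the", "rain"],
    ["the", "canal", "boat"],
]
terms = interesting_terms(corpus[0], corpus)
```

In this toy corpus "the" occurs more often in the source document than "canal", yet "canal" ranks first because it appears in fewer documents, which is the effect of taking document frequency into account.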

On Wed, 26 Nov 2008 12:38:38 +0100, Scurtu Vitalie [EMAIL PROTECTED]
wrote:

 Dear Patrick, I had the same problem with the MoreLikeThis function.

 After briefly reading and analyzing the source code of the moreLikeThis 
 function in Solr, I concluded:

 MoreLikeThis uses term vectors to rank all the terms from a document 
 by their frequency. According to this ranking, it will start to generate 
 queries, artificially, and search for documents.

 So, moreLikeThis will retrieve related documents by artificially 
 generating queries based on the most frequent terms.

 There's a big problem with the most frequent terms from documents. The most 
 frequent words are usually meaningless, so-called function words, 
 or, as people from Information Retrieval like to call them, stopwords. 
 However, even ignoring the technical problems of implementing the 
 moreLikeThis function, this approach is very dangerous, since queries 
 are generated artificially based on a given document.
 Writing queries for retrieving a document is a human task, and it 
 assumes some knowledge (the user knows what document he wants).

 I advise using other approaches, depending on your expectations. For 
 example, you can extract similar documents just by searching for 
 documents with a similar title (more like this doesn't work in this case).

 I hope it helps,
 Best Regards,
 Vitalie Scurtu

--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no


RE: Keyword extraction

2008-11-26 Thread Plaatje, Patrick
Hi Aleksander,

This was a typo on my end; the original query included a colon instead of 
an equals sign. But I think it has to do with my field not being stored and not 
being defined with termVectors="true". I'm recreating the index now, and will 
see if this fixes the problem.

Best,

patrick

-Original Message-
From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 26 November 2008 14:37
To: solr-user@lucene.apache.org
Subject: Re: Keyword extraction

Hi there!
Well, first of all, I think you have an error in your query, if I'm not mistaken.
You say http://localhost:8080/solr/select/?q=id=18477975...
but since you are referring to the field called id, you must say:
http://localhost:8080/solr/select/?q=id:18477975...
(use colon instead of the equals sign).
I think that will do the trick.
If not, try adding the debugQuery=on at the end of your request url, to see 
debug output on how the query is parsed and if/how any documents are matched 
against your query.
Hope this helps.

Cheers,
  Aleksander
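
The correction above (a colon between field and value) and the debugQuery suggestion can be combined into one request URL. The Python sketch below only assembles and re-parses the URL; it reuses the host, id and mlt parameters quoted in this thread and does not contact a server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

params = [
    ("q", "id:18477975"),  # field:value, with a colon rather than "="
    ("mlt", "true"),
    ("mlt.fl", "text"),
    ("mlt.interestingTerms", "list"),
    ("mlt.match.include", "true"),
    ("debugQuery", "on"),  # adds debug output on how the query is parsed
]
mlt_url = "http://localhost:8080/solr/select/?" + urlencode(params)

# Round-trip check: the server would see the colon form of the query.
decoded = parse_qs(urlsplit(mlt_url).query)
```

Letting `urlencode` join the parameters also avoids hand-typed separators going missing between them.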




--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no


Keyword extraction

2008-11-25 Thread Plaatje, Patrick
Hi all,

I'm struggling with a question I recently got from a colleague: is it possible
to extract keywords from indexed content?

In my opinion it should be possible to find out for which words the
ranking of the indexed content is the highest (Lucene or Solr), but I have
no clue where to begin. Does anyone have suggestions?

Best,

Patrick