RE: MoreLikeThis - Odd results - what am I doing wrong?
Thanks David - I suppose it is an AWS question, and thank you for the pointers. As a further input to the MLT question: it does seem that the 3.6 behavior is different from 4.2, and the difference seems to lie in the raw query that is generated. I will do some more research and report back with details.
Re: MoreLikeThis - Odd results - what am I doing wrong?
OK - so I have my Solr instance running on AWS. Any suggestions on how to safely share the link? Right now, the whole Solr instance is totally open.
RE: MoreLikeThis - Odd results - what am I doing wrong?
Isn't this an AWS security groups question? You should probably post it on the AWS forums, but for the moment, here's the basic reading material - go set up your EC2 security groups and lock down your systems.

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html

If you just want to password-protect Solr, here are the instructions:

http://wiki.apache.org/solr/SolrSecurity

But I most certainly would not leave it open to the world even with a password (note that basic password authentication sends passwords in clear text if you're not using HTTPS; best to lock the thing down behind a firewall).

Dave
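To illustrate the clear-text point: HTTP Basic authentication only base64-encodes the credentials, so anyone who can observe a plain-HTTP request can recover them. A minimal sketch follows; the user name and password are made up, and both helper functions are hypothetical names used only for illustration.

```python
import base64
from typing import Tuple


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP 'Authorization' header for Basic auth."""
    raw = f"{user}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def recover_credentials(header_value: str) -> Tuple[str, str]:
    """Reverse the encoding, as any eavesdropper on plain HTTP could."""
    scheme, _, token = header_value.partition(" ")
    if scheme != "Basic":
        raise ValueError("not a Basic auth header")
    decoded = base64.b64decode(token).decode("utf-8")
    # Only the first ':' separates user from password.
    user, _, password = decoded.partition(":")
    return user, password


if __name__ == "__main__":
    header = basic_auth_header("solr-admin", "s3cret")  # made-up credentials
    print(header)
    print(recover_credentials(header))
```

No key is involved in the decoding step, which is why the advice is to combine a password with HTTPS or, better, to keep the instance behind a firewall or security group.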
MoreLikeThis - Odd results - what am I doing wrong?
I am running some experiments on MoreLikeThis and the results seem rather odd - I am doing something wrong but just cannot figure out what. Basically, the similarity results are decent - but not great.

*Issue 1 = Quality*
Toyota Camry: finds Altima (good), but the next one is Camry Hybrid, whereas it should have found Accord. I have normalized the data into a simi field which has only the attributes that I care about. Without the simi field, I could not get mlt.qf boosts to work well enough to return results.

*Issue 2*
Some fields do not work at all. For instance, text+simi (in mlt.fl) works whereas just simi does not. So there is some weirdness that I am just not understanding. Would be grateful for your guidance!

Here is the setup:

*1. SOLR Version*
solr-spec 4.2.0.2013.03.06.22.32.13
solr-impl 4.2.0 1453694 rmuir - 2013-03-06 22:32:13
lucene-spec 4.2.0
lucene-impl 4.2.0 1453694 - rmuir - 2013-03-06 22:25:29

*2. Machine Information*
Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM (1.6.0_23 19.0-b09)
Windows 7 Home 64 Bit with 4 GB RAM

*3. Sample Data*
I created this 'dummy' data of cars - the idea being that it would be sufficient and simple enough to generate similarity and understand how it works. There are 181 rows in the data set (I have attached it for reference in CSV format).

*4. SCHEMA*

*Field Definitions*
<field name="id" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
<field name="make" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
<field name="model" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
<field name="class" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
<field name="type" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
<field name="drive" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
<field name="comment" type="text_general" indexed="true" stored="true" termVectors="true" multiValued="true"/>
<field name="size" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>

*Copy Fields*
<copyField source="make" dest="make_en" /> <!-- Search -->
<copyField source="model" dest="model_en" /> <!-- Search -->
<copyField source="class" dest="class_en" /> <!-- Search -->
<copyField source="type" dest="type_en" /> <!-- Search -->
<copyField source="drive" dest="drive_en" /> <!-- Search -->
<copyField source="comment" dest="comment_en" /> <!-- Search -->
<copyField source="size" dest="size_en" /> <!-- Search -->
<copyField source="id" dest="text" /> <!-- Glob -->
<copyField source="make" dest="text" /> <!-- Glob -->
<copyField source="model" dest="text" /> <!-- Glob -->
<copyField source="class" dest="text" /> <!-- Glob -->
<copyField source="type" dest="text" /> <!-- Glob -->
<copyField source="drive" dest="text" /> <!-- Glob -->
<copyField source="comment" dest="text" /> <!-- Glob -->
<copyField source="size" dest="text" /> <!-- Glob -->
<copyField source="size" dest="text" /> <!-- Glob -->
<copyField source="class" dest="simi_en" /> <!-- similarity -->
<copyField source="type" dest="simi_en" /> <!-- similarity -->
<copyField source="drive" dest="simi_en" /> <!-- similarity -->
<copyField source="size" dest="simi_en" /> <!-- similarity -->

Note that the simi field ends up with values built from class, type, drive and size:
- Luxury SUV 4WD Large
- Standard Sedan Front Family

*5. MLT Setup*

a. mlt.fl = *text*, mlt.qf = *text*
Works, but results are obviously not good (make is not a good similarity indicator).
http://localhost:8983/solr/cars/select/?q=id:2&mlt=true&fl=text&mlt.fl=text&mlt.qf=text

b. mlt.fl = *simi*, mlt.qf = *simi*
Does not work at all (0 results).
http://localhost:8983/solr/cars/select/?q=id:2&mlt=true&fl=text&mlt.fl=simi&mlt.qf=simi

c. mlt.fl = *simi,text*, mlt.qf = *simi^10 text^.1*
Works with decent results in most cases.
http://localhost:8983/solr/cars/select/?q=id:2&mlt=true&fl=text&mlt.fl=simi,text&mlt.qf=simi^10%20text^.01
Works for getting similarity for Acura MDX (Luxury SUV 4WD Large). But for Toyota Camry, it finds hybrid family cars (Prius) ahead of Honda.

Attached data (CSV):
id,make,model,class,type,drive,comment,size,size_i
1,Acura ,ILX 2.0L,Luxury,Sedan,Front,,Mini,2
2,Acura ,MDX,Luxury,SUV,4wd,,Large,5
3,Acura ,RDX,Luxury,SUV,4wd,,Small,3
4,Acura ,RLX,Luxury,Sedan,AWD,,Large,5
5,Acura ,TL,Luxury,Sedan,Front,,Family,4
6,Acura ,TSX,Luxury,Sedan,Front,,Small,3
7,Acura ,ZDX,Luxury,SUV,4wd,,Large,5
8,Audi ,A3 2.0T,Luxury,Sedan,AWD,,Mini,2
9,Audi ,A4,Luxury,Sedan,AWD,,Small,3
10,Audi ,A5 2.0T,Luxury,Sedan,AWD,,Family,4
11,Audi ,A6 3.0T,Luxury,Sedan,AWD,,Family,4
12,Audi ,A7,Luxury,Sedan,AWD,,Large,5
13,Audi ,A8,Luxury,Sedan,AWD,,Largest,7
14,Audi ,Allroad,Luxury,Wagon,AWD,,Large,5
15,Audi ,Q5 2.0T,Luxury,SUV,4wd,,Large,5
16,Audi ,Q7,Luxury,SUV,4wd,,Largest,7
17,Audi ,R8,Luxury,Sports,RWD,,Largest,7
18,Audi ,S4,Luxury,Sports,AWD,,Small,3
19,Audi
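The MLT request URLs in this post need `&` between parameters and percent-encoding for the space and `^` in mlt.qf. One way to avoid hand-assembling them is sketched below with Python's standard library. `build_mlt_url` is a hypothetical helper, not part of Solr or any client library; the base URL, document id and field names are taken from the post.

```python
from urllib.parse import urlencode

# Base URL as given in the post (local Solr core named "cars").
BASE = "http://localhost:8983/solr/cars/select/"


def build_mlt_url(doc_id, mlt_fl, mlt_qf, fl="text", base=BASE):
    """Assemble a MoreLikeThis component request for one source document.

    mlt_fl: list of field names to mine for similarity terms.
    mlt_qf: list of 'field^boost' strings; Solr expects them space-separated.
    """
    params = [
        ("q", f"id:{doc_id}"),
        ("mlt", "true"),
        ("fl", fl),
        ("mlt.fl", ",".join(mlt_fl)),
        ("mlt.qf", " ".join(mlt_qf)),
    ]
    # urlencode inserts '&' and escapes ':', ',', '^' and spaces.
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    cases = [
        (["text"], ["text"]),                         # setup a
        (["simi"], ["simi"]),                         # setup b
        (["simi", "text"], ["simi^10", "text^.1"]),   # setup c
    ]
    for fields, boosts in cases:
        print(build_mlt_url(2, fields, boosts))
```

The escaped form (for example `id%3A2`) is equivalent to the readable one; Solr decodes it before parsing the query.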
Re: MoreLikeThis - Odd results - what am I doing wrong?
If you can bring up your Solr setup on a public machine, then I'm sure a lot of debugging can be done. Without that, I think what you should look at is the tf-idf scores of terms like camry. Usually idf is the deciding factor in which results show up at the top (tf should be 1 for your data). Enable debugQuery=true and look at the explain section to see how the score is being calculated. You should try giving different boosts to class, type, drive and size to control the results.
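To see why idf tends to decide the ordering on data like this, the sketch below applies the classic Lucene TF-IDF formulas (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), as in Lucene 4.x DefaultSimilarity) to a 181-document index. The document frequencies are made-up illustrative values, not measured from the car data, and coord, queryNorm, boosts and field norms are deliberately left out.

```python
import math

NUM_DOCS = 181  # size of the data set mentioned in the thread


def tf(freq):
    """Classic Lucene term-frequency factor."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs=NUM_DOCS):
    """Classic Lucene inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def term_weight(freq, doc_freq, num_docs=NUM_DOCS):
    """tf * idf^2: idf enters once via the query weight, once via the field weight."""
    return tf(freq) * idf(doc_freq, num_docs) ** 2


if __name__ == "__main__":
    # Hypothetical document frequencies, for illustration only.
    assumed_doc_freq = {"camry": 2, "family": 40, "sedan": 80, "front": 90}
    for term, df in assumed_doc_freq.items():
        print(f"{term:8s} df={df:3d} idf={idf(df):.3f} weight={term_weight(1, df):.3f}")
```

With tf fixed at 1, a term that occurs in only a couple of documents outweighs several common attribute terms combined, which is one plausible reason a document sharing a rare model name could rank above one sharing class, type and drive.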
Re: MoreLikeThis - Odd results - what am I doing wrong?
I did try the raw query against the *simi* field and it seems to return results in the expected order. For instance, Acura MDX has (Large, SUV, 4WD, Luxury) in the simi field. Running a query with those words against the simi field returns the expected models (X5, Audi Q5, etc.), and the subsequent documents have decreasing relevance. So the basic query mechanism seems to be fine. The issue just seems to be with the MoreLikeThis component and handler.

I can post the index on a public Solr instance - any suggestions (e.g. for hosting)?
Re: MoreLikeThis - Odd results - what am I doing wrong?
Add debugQuery=true&mlt=true and look at the scores for the MLT query, not a sample query. You can use Amazon EC2 to bring up your Solr; you should be able to get a micro instance on the free trial.
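A small sketch of adding debugQuery=true to an existing MLT request without retyping the URL, using only Python's standard library. `with_params` is a hypothetical helper, and the starting URL is the one from the original post with its `&` separators restored.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_params(url, extra):
    """Return url with the key/value pairs in `extra` added to (or replacing
    entries in) its query string. Existing parameter order is preserved."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update(extra)
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    mlt_url = ("http://localhost:8983/solr/cars/select/"
               "?q=id:2&mlt=true&fl=text&mlt.fl=simi,text&mlt.qf=simi^10 text^.1")
    print(with_params(mlt_url, {"debugQuery": "true"}))
```

Where the score explanation appears in the response may depend on the Solr version and response writer, so it is worth inspecting the whole debug section rather than assuming a fixed key.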