
RE: MoreLikeThis - Odd results - what am I doing wrong?

2013-04-03 Thread DC tech

Thanks David - I suppose it is an AWS question and thank you for the pointers. 

As further input on the MLT question: the 3.6 behavior does seem to differ 
from 4.2, and the issue seems to lie in the raw query that is generated. 
I will do some more research and report back with details. 


Re: MoreLikeThis - Odd results - what am I doing wrong?

2013-04-02 Thread DC tech

OK - so I have my SOLR instance running on AWS. 
Any suggestions on how to safely share the link?  Right now, the whole SOLR 
instance is totally open. 




RE: MoreLikeThis - Odd results - what am I doing wrong?

2013-04-02 Thread David Parks

Isn't this an AWS security groups question? You should probably post this 
question on the AWS forums, but for the moment, here's the basic reading 
material - go set up your EC2 security groups and lock down your systems.


http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html

If you just want to password protect Solr here are the instructions:

http://wiki.apache.org/solr/SolrSecurity

But I most certainly would not leave it open to the world even with a password 
(note that the basic password authentication sends passwords in clear text if 
you're not using HTTPS, best lock the thing down behind a firewall).

Dave



MoreLikeThis - Odd results - what am I doing wrong?

2013-03-31 Thread dc tech

I am running some experiments with MoreLikeThis and the results seem rather
odd - I am doing something wrong but just cannot figure out what.
Basically, the similarity results are decent - but not great.

*Issue 1 = Quality*
Toyota Camry: finds Altima (good), but the next one is Camry Hybrid
whereas it should have found Accord.
I have normalized the data into a simi field which has only the attributes
that I care about.
Without the simi field, I could not get mlt.qf boosts to work well enough
to return results.

*Issue 2*
Some fields do not work at all. For instance, text+simi (in mlt.fl) works
whereas just simi does not.
So there is some weirdness that I am just not understanding.

I would be grateful for your guidance!


Here is the setup:
*1. SOLR Version*
solr-spec 4.2.0.2013.03.06.22.32.13
solr-impl 4.2.0 1453694   rmuir - 2013-03-06 22:32:13
lucene-spec 4.2.0
lucene-impl 4.2.0 1453694 -  rmuir - 2013-03-06 22:25:29

*2. Machine Information*
Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM (1.6.0_23 19.0-b09)
Windows 7 Home 64 Bit with 4 GB RAM

*3. Sample Data *
I created this 'dummy' data of cars  - the idea being that these would be
sufficient and simple to generate similarity and understand how it would
work.
There are 181 rows in the data set (I have attached it for reference in CSV
format)

[image: Inline image 1]

*4. SCHEMA*
*Field Definitions*
   <field name="id" type="string" indexed="true" stored="true"
          termVectors="true" multiValued="false"/>
   <field name="make" type="string" indexed="true" stored="true"
          termVectors="true" multiValued="false"/>
   <field name="model" type="string" indexed="true" stored="true"
          termVectors="true" multiValued="false"/>
   <field name="class" type="string" indexed="true" stored="true"
          termVectors="true" multiValued="false"/>
   <field name="type" type="string" indexed="true" stored="true"
          termVectors="true" multiValued="false"/>
   <field name="drive" type="string" indexed="true" stored="true"
          termVectors="true" multiValued="false"/>
   <field name="comment" type="text_general" indexed="true" stored="true"
          termVectors="true" multiValued="true"/>
   <field name="size" type="string" indexed="true" stored="true"
          termVectors="true" multiValued="false"/>

*Copy Fields*
<copyField source="make" dest="make_en" />  <!-- Search -->
<copyField source="model" dest="model_en" />  <!-- Search -->
<copyField source="class" dest="class_en" />  <!-- Search -->
<copyField source="type" dest="type_en" />  <!-- Search -->
<copyField source="drive" dest="drive_en" />  <!-- Search -->
<copyField source="comment" dest="comment_en" />  <!-- Search -->
<copyField source="size" dest="size_en" />  <!-- Search -->
<copyField source="id" dest="text" />  <!-- Glob -->
<copyField source="make" dest="text" />  <!-- Glob -->
<copyField source="model" dest="text" />  <!-- Glob -->
<copyField source="class" dest="text" />  <!-- Glob -->
<copyField source="type" dest="text" />  <!-- Glob -->
<copyField source="drive" dest="text" />  <!-- Glob -->
<copyField source="comment" dest="text" />  <!-- Glob -->
<copyField source="size" dest="text" />  <!-- Glob -->
<copyField source="size" dest="text" />  <!-- Glob -->
<copyField source="class" dest="simi_en" />  <!-- similarity -->
<copyField source="type" dest="simi_en" />  <!-- similarity -->
<copyField source="drive" dest="simi_en" />  <!-- similarity -->
<copyField source="size" dest="simi_en" />  <!-- similarity -->

Note that the simi field ends up with values built from class, type, drive
and size:
- Luxury SUV 4WD Large
- Standard Sedan Front Family
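
To make the copyField behavior concrete, here is a minimal Python sketch of how the four similarity sources combine into one value per document. It only approximates what Solr does at index time; the helper name simi_terms and the two-row excerpt are illustrative assumptions, with the rows themselves copied from the sample CSV in this message.

```python
import csv
import io

# A few rows copied from the sample CSV in this message.
SAMPLE_CSV = """id,make,model,class,type,drive,comment,size,size_i
2,Acura ,MDX,Luxury,SUV,4wd,,Large,5
5,Acura ,TL,Luxury,Sedan,Front,,Family,4
"""

# Source fields of the four similarity copyField rules, in schema order.
SIMI_SOURCES = ["class", "type", "drive", "size"]


def simi_terms(row):
    """Return the values that the similarity copyField rules would collect."""
    return [row[name].strip() for name in SIMI_SOURCES if row[name].strip()]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    # Join with spaces only for display; this is not Solr's internal storage.
    print(row["id"], row["model"], "->", " ".join(simi_terms(row)))
```

The make and model columns are deliberately left out, which is the point of the simi field: only the attributes that should drive similarity are collected.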


*5. MLT Setup*
a. mlt.fl = *text*, mlt.qf = *text*: Works but results are obviously not good
(make is not a good similarity indicator)
http://localhost:8983/solr/cars/select/?q=id:2&mlt=true&fl=text&mlt.fl=text&mlt.qf=text

b. mlt.fl = *simi*, mlt.qf = *simi*: Does not work at all (0 results)
http://localhost:8983/solr/cars/select/?q=id:2&mlt=true&fl=text&mlt.fl=simi&mlt.qf=simi

c. mlt.fl = *simi,text*, mlt.qf = *simi^10 text^.1*: Works with decent results
in most cases
http://localhost:8983/solr/cars/select/?q=id:2&mlt=true&fl=text&mlt.fl=simi,text&mlt.qf=simi^10%20text^.01
Works for getting similarity for Acura MDX (Luxury SUV 4WD Large)
But for Toyota Camry - it finds hybrid family cars (Prius) ahead of Honda.
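
Since the three requests differ only in their mlt.fl and mlt.qf values, it can help to build them programmatically so each parameter is separated and encoded correctly. The following is a rough Python sketch under that assumption; the function name mlt_url and its debug flag are made up for illustration, while the base URL and parameter names come from the requests above.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Base URL taken from the examples above; adjust host and core as needed.
SELECT_URL = "http://localhost:8983/solr/cars/select/"


def mlt_url(doc_id, mlt_fields, mlt_qf, debug=False):
    """Build a select URL with the MoreLikeThis parameters properly separated."""
    params = [
        ("q", f"id:{doc_id}"),
        ("mlt", "true"),
        ("fl", "text"),
        ("mlt.fl", ",".join(mlt_fields)),
        ("mlt.qf", mlt_qf),
    ]
    if debug:
        # Optional: ask Solr to include scoring explanations in the response.
        params.append(("debugQuery", "true"))
    return SELECT_URL + "?" + urlencode(params)


url = mlt_url(2, ["simi", "text"], "simi^10 text^.01")
print(url)

# Round-trip check: decoding the query string gives back the raw values.
decoded = parse_qs(urlsplit(url).query)
print(decoded["mlt.qf"][0])
```

urlencode percent-encodes the colon, comma and caret and turns the space into a plus sign, so the printed URL looks different from the hand-written ones but decodes to the same parameter values.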


id,make,model,class,type,drive,comment,size,size_i
1,Acura ,ILX 2.0L,Luxury,Sedan,Front,,Mini,2
2,Acura ,MDX,Luxury,SUV,4wd,,Large,5
3,Acura ,RDX,Luxury,SUV,4wd,,Small,3
4,Acura ,RLX,Luxury,Sedan,AWD,,Large,5
5,Acura ,TL,Luxury,Sedan,Front,,Family,4
6,Acura ,TSX,Luxury,Sedan,Front,,Small,3
7,Acura ,ZDX,Luxury,SUV,4wd,,Large,5
8,Audi ,A3 2.0T,Luxury,Sedan,AWD,,Mini,2
9,Audi ,A4,Luxury,Sedan,AWD,,Small,3
10,Audi ,A5 2.0T,Luxury,Sedan,AWD,,Family,4
11,Audi ,A6 3.0T,Luxury,Sedan,AWD,,Family,4
12,Audi ,A7,Luxury,Sedan,AWD,,Large,5
13,Audi ,A8,Luxury,Sedan,AWD,,Largest,7
14,Audi ,Allroad,Luxury,Wagon,AWD,,Large,5
15,Audi ,Q5 2.0T,Luxury,SUV,4wd,,Large,5
16,Audi ,Q7,Luxury,SUV,4wd,,Largest,7
17,Audi ,R8,Luxury,Sports,RWD,,Largest,7
18,Audi ,S4,Luxury,Sports,AWD,,Small,3
19,Audi

Re: MoreLikeThis - Odd results - what am I doing wrong?

2013-03-31 Thread Gagandeep singh

If you can bring up your Solr setup on a public machine then I'm sure a lot
of debugging can be done. Without that, I think what you should look at is
the tf-idf scores of terms like camry. Usually idf is the deciding factor
in which results show up at the top (tf should be 1 for your data).
Enable debugQuery=true and look at the explain section to see how the score
is calculated.

You should try giving different boosts to class, type, drive, size to
control the results.
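
To illustrate why idf tends to dominate when tf is 1, here is a small Python sketch of the idf term as defined by Lucene's default TF-IDF similarity, idf = 1 + ln(numDocs / (docFreq + 1)). The document count of 181 comes from the sample data set; the document frequencies are hypothetical numbers chosen only for illustration, not values read from the actual index.

```python
import math


def classic_idf(doc_freq, num_docs):
    """idf as defined by Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


NUM_DOCS = 181  # size of the sample data set

# Hypothetical document frequencies, chosen only for illustration.
doc_freqs = {"camry": 2, "sedan": 60, "family": 40}

# Print rarest terms first: the rarer the term, the larger its idf.
for term, df in sorted(doc_freqs.items(), key=lambda kv: kv[1]):
    print(f"{term:8s} df={df:3d} idf={classic_idf(df, NUM_DOCS):.3f}")
```

A term that appears in only a couple of documents, such as a model name, gets a much larger idf than a common attribute, which would be consistent with Camry Hybrid outranking Accord for a Camry query when the model name is part of the compared field.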



Re: MoreLikeThis - Odd results - what am I doing wrong?

2013-03-31 Thread dc tech

I did try the raw query against the *simi* field and it seems to return
results in the expected order.
For instance, Acura MDX has (Large, SUV, 4WD, Luxury) in the simi field.
Running a query with those words against the simi field returns the
expected models (X5, Audi Q5, etc.), and the subsequent documents have
decreasing relevance. So the basic query mechanism seems to be fine.

The issue just seems to be with the MoreLikeThis component and handler.
I can post the index on a public Solr instance - any suggestions (including
for hosting)?
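
As a rough illustration of what "decreasing relevance" means here, the Python sketch below ranks a handful of cars by how many similarity attributes they share with the Acura MDX. This is a plain overlap count, not Lucene's scoring, and the helper names are made up; the rows are copied from the sample CSV in the original post.

```python
# (id, make, model, class, type, drive, size) copied from the sample CSV.
CARS = [
    (2, "Acura", "MDX", "Luxury", "SUV", "4wd", "Large"),
    (3, "Acura", "RDX", "Luxury", "SUV", "4wd", "Small"),
    (5, "Acura", "TL", "Luxury", "Sedan", "Front", "Family"),
    (7, "Acura", "ZDX", "Luxury", "SUV", "4wd", "Large"),
    (12, "Audi", "A7", "Luxury", "Sedan", "AWD", "Large"),
    (15, "Audi", "Q5 2.0T", "Luxury", "SUV", "4wd", "Large"),
    (16, "Audi", "Q7", "Luxury", "SUV", "4wd", "Largest"),
]


def attrs(car):
    """Lower-cased set of the class, type, drive and size values."""
    return {value.lower() for value in car[3:]}


def rank_by_overlap(target_id):
    """Rank all other cars by the number of attributes shared with the target."""
    target = next(car for car in CARS if car[0] == target_id)
    wanted = attrs(target)
    others = [car for car in CARS if car[0] != target_id]
    scored = [(len(wanted & attrs(car)), car) for car in others]
    # Stable sort: ties keep their original (id) order.
    scored.sort(key=lambda pair: -pair[0])
    return scored


for score, car in rank_by_overlap(2):
    print(score, car[1], car[2])
```

Cars matching all four attributes come first and the count falls off from there, which is the ordering a raw query on the simi terms would be expected to approximate.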



Re: MoreLikeThis - Odd results - what am I doing wrong?

2013-03-31 Thread Gagandeep singh

Set debugQuery=true&mlt=true and look at the scores for the MLT query, not a
sample query. You can use Amazon EC2 to bring up your Solr; you should be
able to get a micro instance on the free trial.


