Re: Coming back to search after some time... SOLR or Elastic for text search?

2020-01-15 Thread Dc Tech
Thank you Jan and Charlie. 

I should say that in terms of posting to the community regarding Elastic vs 
Solr - this is probably the most civil and helpful community that I have been a 
part of, and your answers have only reinforced that notion!

Thank you for your responses. I am glad to hear that both can do most of it, 
which was my gut feeling as well. 

Charlie, to your point - the team probably feels that Elastic is easier to get 
started with, hence the preference, as well as the hosting options (with the 
caveats you noted). I agree with you completely that tech is not the real issue. 

Jan, I agree with the points you made on team skills. On our previous 
proprietary engine, that was in fact the biggest issue - the engine was 
powerful enough and had good references. However, we were not able to exploit 
it to good effect.  

Thank you again. 

> 
> On Jan 15, 2020, at 5:10 AM, Jan Høydahl  wrote:
> 
> Hi,
> 
> Choosing the Solr community mailing list to ask advice on whether to choose 
> ES - you already know what to expect, no?
> More often than not the choice comes down to policy, standardization, what 
> skills you have in house, etc., rather than ticking off feature checkboxes.
> Sometimes company values also may drive a choice, i.e. Solr is 100% Apache 
> and not open core, which may matter if you plan to get involved in the 
> community, and contribute features or patches.
> 
> However, if I were in your shoes as an architect evaluating the tech stack, and 
> there was not a clear choice based on the above, I'd do what projects 
> normally do: ask yourself what you really need from the engine. Maybe you 
> have some features in your requirement list that make one a much better 
> choice over the other. Or maybe after that exercise you are still wondering 
> what to choose, in which case you just follow your gut feeling and make a 
> choice :)
> 
> Jan
> 
>> On 15 Jan 2020 at 10:07, Charlie Hull wrote:
>> 
>>> On 15/01/2020 04:02, Dc Tech wrote:
>>> I am a SOLR fan and had implemented it in our company over 10 years ago.
>>> I moved away from that role, and the new search team in the meanwhile
>>> implemented a proprietary (and expensive) nosql-style search engine.
>>> That project did not go well, and now I am back on the project and
>>> reviewing the technology stack.
>>> 
>>> Some of the team think that ElasticSearch could be a good option,
>>> especially since we can easily get hosted versions with AWS where we have
>>> all the contractual stuff sorted out.
>> You can, but you should be aware that:
>> 1. Amazon's hosted Elasticsearch isn't great, often lags behind the current 
>> version, doesn't allow plugins etc.
>> 2. Amazon and Elastic are currently engaged in legal battles over who is 
>> the most open-sourcey, who allegedly copied code that was 'open' but 
>> commercially licensed, and who gets to capture the hosted search 
>> market... not sure how this will pan out (Google for details)
>> 3. You can also buy fully hosted Solr from several places.
>>> While SOLR definitely seems more advanced (LTR, streaming expressions,
>>> graph, and all the knobs and dials for relevancy tuning), Elastic may be
>>> sufficient for our needs. It does not seem to have LTR out of the box, but
>>> the relevancy tuning knobs and dials seem similar to what SOLR has.
>> Yes, they're basically the same under the hood (unsurprising as they're both 
>> based on Lucene). If you need LTR there's an ES plugin for that (disclaimer, 
>> my new employer built and maintains it: 
>> https://github.com/o19s/elasticsearch-learning-to-rank). I've lost track of 
>> the number of times I've been asked 'Elasticsearch or Solr, which should I 
>> choose?' and my current thoughts are:
>> 
>> 1. Don't switch from one to the other for the sake of it.  Switching search 
>> engines rarely addresses underlying issues (content quality, team skills, 
>> relevance tuning methodology)
>> 2. Elasticsearch is easier to get started with, but at some point you'll 
>> need to learn how it all works
>> 3. Solr is harder to get started with, but you'll know more about how it all 
>> works earlier
>> 4. Both can be used for most search projects, most features are the same, 
>> both can scale.
>> 5. Lots of Elasticsearch projects (and developers) are focused on logs, 
>> which is often not really a 'search' project.
>> 
>>> 
>>> The corpus size is not a challenge - we have about one million documents,
>>> of which about half have full text, while the rest are simpler (e.g. company
>>> directory entries).
>>>

Coming back to search after some time... SOLR or Elastic for text search?

2020-01-14 Thread Dc Tech
I am a SOLR fan and had implemented it in our company over 10 years ago.
I moved away from that role, and the new search team in the meanwhile
implemented a proprietary (and expensive) nosql-style search engine.
That project did not go well, and now I am back on the project and reviewing
the technology stack.

Some of the team think that ElasticSearch could be a good option,
especially since we can easily get hosted versions with AWS where we have
all the contractual stuff sorted out.

While SOLR definitely seems more advanced (LTR, streaming expressions,
graph, and all the knobs and dials for relevancy tuning), Elastic may be
sufficient for our needs. It does not seem to have LTR out of the box, but
the relevancy tuning knobs and dials seem similar to what SOLR has.

The corpus size is not a challenge - we have about one million documents,
of which about half have full text, while the rest are simpler (e.g. company
directory entries).
The query volumes are also quite low (max 5/second at peak).
We have implemented the content ingestion and processing pipelines already
in Python and Spark, so most of the data will be pushed in using APIs.
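
Since the pipelines already push documents over HTTP, a minimal sketch of
that final step in Python might look like this (the core name 'docs' and the
field names are hypothetical, not from our actual setup):

    import requests

    # Hypothetical core name; adjust to the real collection.
    SOLR_UPDATE = "http://localhost:8983/solr/docs/update"

    def push_docs(docs):
        # Post a batch of documents as a JSON array; commitWithin lets
        # Solr fold them into a commit within 10 seconds.
        resp = requests.post(
            SOLR_UPDATE,
            params={"commitWithin": "10000"},
            json=docs,
        )
        resp.raise_for_status()

    push_docs([
        {"id": "doc-1", "title": "Annual report 2019", "body": "full text ..."},
        {"id": "doc-2", "title": "Employee directory entry"},
    ])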

I would really appreciate any guidance from the community!


Re: Nested documents vs. flattening document structure?

2018-03-06 Thread Dc Tech
Thank you, Erick.
That was my instinct as well.



On Tue, Mar 6, 2018 at 10:05 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Flattening the nested documents is usually preferred if at all
> possible. Nested documents do, indeed, have a series of restrictions
> that often make them harder to work with than flattened docs.
>
> Best,
> Erick
>
> On Tue, Mar 6, 2018 at 6:48 AM, Dc Tech <dctech1...@gmail.com> wrote:
> > We are evaluating using nested documents vs. simply flattening the
> document.
> >
> > Looking through the documentation, it is not very clear to me if the
> nested
> > documents are fully mature, and support the full richness  of SOLR
> > (streaming, mature faceting) etc...
> >
> > Any opinions or guidance on that?
> >
> >
> > For *flattening*, we are thinking of setting up three groups of fields:
> > 1. Fields for search - 3-4 groups of fields that glom together the
> document
> > fields in order of boosting priority (e.g. f1 has just the title , f2 has
> > title+authors)
> > 2. Fields for faceting if needed
> > 3. and Fields for  display (or the original document fields) e.g.
> > author_name|author_unique_id...
>


Nested documents vs. flattening document structure?

2018-03-06 Thread Dc Tech
We are evaluating using nested documents vs. simply flattening the document.

Looking through the documentation, it is not very clear to me if the nested
documents are fully mature, and support the full richness of SOLR
(streaming, mature faceting) etc...

Any opinions or guidance on that?


For *flattening*, we are thinking of setting up three groups of fields:
1. Fields for search - 3-4 groups of fields that glom together the document
fields in order of boosting priority (e.g. f1 has just the title, f2 has
title+authors) - see the sketch after this list
2. Fields for faceting if needed
3. Fields for display (or the original document fields), e.g.
author_name|author_unique_id...
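
A rough illustration of point 1, assuming hypothetical source field names
(title, author) in schema.xml; the glommed fields are built with copyField
and boosted at query time:

    <field name="f1" type="text_general" indexed="true" stored="false" multiValued="true"/>
    <field name="f2" type="text_general" indexed="true" stored="false" multiValued="true"/>
    <copyField source="title"  dest="f1"/>
    <copyField source="title"  dest="f2"/>
    <copyField source="author" dest="f2"/>

and then query with something like defType=edismax&qf=f1^10%20f2^3 so a
title-only match outweighs a title+author match.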


Re: Boost parameter with query function - how to pass in complex params?

2013-04-07 Thread dc tech
Yonik,
Many thanks.
The OR is still not working... here is the full URL
1. Honda or Toyota individually work
http://localhost:8983/solr/cars/select?fl=text,score&defType=edismax&q=suv&boost=query($boostq,1)&boostq=honda
http://localhost:8983/solr/cars/select?fl=text,score&defType=edismax&q=suv&boost=query($boostq,1)&boostq=toyota
I can see the scores increasing on the matching models.

2. But the OR does not work
http://localhost:8983/solr/cars/select?fl=text,score&defType=edismax&q=suv&boost=query($boostq,1)&boostq=toyota%20or%20honda
The scores stay at the baseline suggesting no match on the boostQ.


3. For reference, the bq parameter works fine.

From a use case perspective, the idea was to pass in user preferences into
the boostq, e.g. projects the user has worked on, etc., when matching documents.

On Sat, Apr 6, 2013 at 10:19 AM, Yonik Seeley yo...@lucidworks.com wrote:

 On Sat, Apr 6, 2013 at 9:42 AM, dc tech dctech1...@gmail.com wrote:
  See example below
  1. Search for SUVs and boost   Honda models
  q=suv&boost=query({! v='honda'},1)
 
  2. Search for SUVs and boost   Honda OR  toyota model
 
  a) Using OR in the query does NOT work
 q=suv&boost=query({! v='honda or toyota'},1)

 The or needs to be uppercase OR.

 It might also be easier to compose and read like this:
 q=suv
 boost=query($boostQ)
 boostQ=honda OR toyota

 Of course something simpler like this might also serve your primary goal:
 q=+suv (honda OR toyota)^10


 -Yonik
 http://lucidworks.com



FYI - Excel to generate schema and SolrConfig

2013-04-07 Thread dc tech
To minimize my own typing when setting up a SOLR schema or config, I created
a simple Excel workbook that generates much of the boilerplate for you.

Please feel free to use it if you find it useful.


solr_schema_shared.xlsx
Description: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet


Re: Boost parameter with query function - how to pass in complex params?

2013-04-07 Thread dc tech
Yonik:
Pasted the wrong URL as I was trying various things.

It did not work with OR:
http://localhost:8983/solr/cars/select?fl=text,score&defType=edismax&q=suv&boost=query($boostq,1)&boostq=toyota%20OR%20honda&debug=true

See dumps below.

Many thanks.


INPUT
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">882</int>
  <lst name="params">
    <str name="boostq">toyota OR honda</str>
    <str name="fl">text,score</str>
    <str name="q">suv</str>
    <str name="boost">query($boostq,1)</str>
    <str name="debug">true</str>
    <str name="defType">edismax</str>
  </lst>
</lst>

DEBUG:
<lst name="debug">
  <str name="rawquerystring">suv</str>
  <str name="querystring">suv</str>
  <str name="parsedquery">
    BoostedQuery(boost(+(text:suv),query(text:toyota text:honda,def=1.0)))
  </str>
  <str name="parsedquery_toString">
    boost(+(text:suv),query(text:toyota text:honda,def=1.0))
  </str>
  <lst name="explain">




On Sun, Apr 7, 2013 at 9:07 AM, Yonik Seeley yo...@lucidworks.com wrote:

 On Sun, Apr 7, 2013 at 8:39 AM, dc tech dctech1...@gmail.com wrote:
  Yonik,
  Many thanks.
  The OR is still not working... here is the full URL
  1. Honda or Toyota individually work
 
 http://localhost:8983/solr/cars/select?fl=text,score&defType=edismax&q=suv&boost=query($boostq,1)&boostq=honda
 
 http://localhost:8983/solr/cars/select?fl=text,score&defType=edismax&q=suv&boost=query($boostq,1)&boostq=toyota
  I can see the scores increasing on the matching models.
 
  2. But the OR does not work
 
 http://localhost:8983/solr/cars/select?fl=text,score&defType=edismax&q=suv&boost=query($boostq,1)&boostq=toyota%20or%20honda

 I still see a lowercase or in there that should be uppercase.

 You can also add debug=query to see exactly what query is generated.

 -Yonik
 http://lucidworks.com



Re: using edismax without velocity

2013-04-06 Thread DC tech
Definitely in the 4.x release. Did you try it and find a problem?
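
For reference, a minimal request that uses edismax directly, with no Velocity
involved (the field names here are hypothetical):

    http://localhost:8983/solr/select?defType=edismax&q=ipod&qf=name^10%20text&rows=10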



Boost parameter with query function - how to pass in complex params?

2013-04-06 Thread dc tech
See example below
1. Search for SUVs and boost   Honda models
q=suv&boost=query({! v='honda'},1)

2. Search for SUVs and boost   Honda OR  toyota model

a) Using OR in the query does NOT work
   q=suv&boost=query({! v='honda or toyota'},1)

b) Using two query functions and summing the boosts DOES work
Works:   q=suv&boost=sum(query({!v='honda'},1),query({!v='toyota'},1))

Any thoughts?


RE: MoreLikeThis - Odd results - what am I doing wrong?

2013-04-03 Thread DC tech
Thanks David - I suppose it is an AWS question and thank you for the pointers. 

As a further input to the MLT question - it does seem that 3.6 behavior is 
different from 4.2 - the issue seems to be more in terms of the raw query that 
is generated. 
I will do some more research and report back with details. 

David Parks davidpark...@yahoo.com wrote:

Isn't this an AWS security groups question? You should probably post this 
question on the AWS forums, but for the moment, here's the basic reading 
material - go set up your EC2 security groups and lock down your systems.

   
 http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html

If you just want to password protect Solr here are the instructions:

   http://wiki.apache.org/solr/SolrSecurity

But I most certainly would not leave it open to the world even with a password 
(note that the basic password authentication sends passwords in clear text if 
you're not using HTTPS, best lock the thing down behind a firewall).

Dave


-Original Message-
From: DC tech [mailto:dctech1...@gmail.com] 
Sent: Tuesday, April 02, 2013 1:02 PM
To: solr-user@lucene.apache.org
Subject: Re: MoreLikeThis - Odd results - what am I doing wrong?

OK - so I have my SOLR instance running on AWS. 
Any suggestions on how to safely share the link?  Right now, the whole SOLR 
instance is totally open. 



Gagandeep singh gagan.g...@gmail.com wrote:

Say debugQuery=true&mlt=true and see the scores for the MLT query, not 
a sample query. You can use Amazon EC2 to bring up your Solr; you 
should be able to get a micro instance for the free trial.


On Mon, Apr 1, 2013 at 5:10 AM, dc tech dctech1...@gmail.com wrote:

 I did try the raw query against the *simi* field and those seem to 
 return results in the order expected.
 For instance, Acura MDX has (large, SUV, 4WD, Luxury) in the simi field.
 Running a query with those words against the simi field returns the 
 expected models (X5, Audi Q5, etc) and then the subsequent documents 
 have decreasing relevance. So the basic query mechanism seems to be fine.

 The issue just seems to be with MoreLikeThis component and handler.
 I can post the index on a public SOLR instance - any suggestions? (or 
 for
 hosting)


 On Sun, Mar 31, 2013 at 1:54 PM, Gagandeep singh 
 gagan.g...@gmail.com
 wrote:

  If you can bring up your solr setup on a public machine then I'm sure a lot
  of debugging can be done. Without that, I think what you should look at is
  the tf-idf scores of the terms like camry etc. Usually idf is the
  deciding factor in which results show at the top (tf should be 1 for
  your data).
  Enable debugQuery=true and look at the explain section to see how the
  score is getting calculated.
 
  You should try giving different boosts to class, type, drive, size
  to control the results.
 
 
  On Sun, Mar 31, 2013 at 8:52 PM, dc tech dctech1...@gmail.com wrote:
 
  I am running some experiments on more like this and the results 
  seem rather odd - I am doing something wrong but just cannot figure out 
  what.
  Basically, the similarity results are decent - but not great.
 
  *Issue 1  = Quality*
  Toyota Camry : finds Altima (good) but then next one is Camry 
  Hybrid whereas it should have found Accord.
  I have normalized the data into a simi field which has only the 
  attributes that I care about.
  Without the simi field, I could not get mlt.qf boosts to work well
 enough
  to return results
 
  *Issue 2*
  Some fields do not work at all. For instance, text+simi (in 
  mlt.fl)
 works
  whereas just simi does not.
  So some weirdness that I am just not understanding.
 
  Would be grateful for your guidance !
 
 
  Here is the setup:
  *1. SOLR Version*
  solr-spec 4.2.0.2013.03.06.22.32.13
  solr-impl 4.2.0 1453694   rmuir - 2013-03-06 22:32:13
  lucene-spec 4.2.0
  lucene-impl 4.2.0 1453694 -  rmuir - 2013-03-06 22:25:29
 
  *2. Machine Information*
  Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM (1.6.0_23
  19.0-b09)
  Windows 7 Home 64 Bit with 4 GB RAM
 
  *3. Sample Data *
  I created this 'dummy' data of cars  - the idea being that these 
  would
 be
  sufficient and simple to generate similarity and understand how it 
  would work.
  There are 181 rows in the data set (I have attached it for 
  reference in CSV format)
 
  [image: Inline image 1]
 
  *4. SCHEMA*
  *Field Definitions*
 <field name="id" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
 <field name="make" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
 <field name="model" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
 <field name="class" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
 <field name="type" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
 <field name="drive" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
 field name

Re: MoreLikeThis - Odd results - what am I doing wrong?

2013-04-02 Thread DC tech
OK - so I have my SOLR instance running on AWS. 
Any suggestions on how to safely share the link?  Right now, the whole SOLR 
instance is totally open. 



Gagandeep singh gagan.g...@gmail.com wrote:

Say debugQuery=true&mlt=true and see the scores for the MLT query, not a
sample query. You can use Amazon EC2 to bring up your Solr; you should be
able to get a micro instance for the free trial.


On Mon, Apr 1, 2013 at 5:10 AM, dc tech dctech1...@gmail.com wrote:

 I did try the raw query against the *simi* field and those seem to return
 results in the order expected.
 For instance, Acura MDX has (large, SUV, 4WD, Luxury) in the simi field.
 Running a query with those words against the simi field returns the
 expected models (X5, Audi Q5, etc) and then the subsequent documents have
 decreasing relevance. So the basic query mechanism seems to be fine.

 The issue just seems to be with MoreLikeThis component and handler.
 I can post the index on a public SOLR instance - any suggestions? (or for
 hosting)


 On Sun, Mar 31, 2013 at 1:54 PM, Gagandeep singh gagan.g...@gmail.com
 wrote:

  If you can bring up your solr setup on a public machine then I'm sure a lot
  of debugging can be done. Without that, I think what you should look at is
  the tf-idf scores of the terms like camry etc. Usually idf is the
  deciding factor in which results show at the top (tf should be 1 for
  your data).
  Enable debugQuery=true and look at the explain section to see how the score
  is getting calculated.
 
  You should try giving different boosts to class, type, drive, size to
  control the results.
 
 
  On Sun, Mar 31, 2013 at 8:52 PM, dc tech dctech1...@gmail.com wrote:
 
  I am running some experiments on more like this and the results seem
  rather odd - I am doing something wrong but just cannot figure out what.
  Basically, the similarity results are decent - but not great.
 
  *Issue 1  = Quality*
  Toyota Camry : finds Altima (good) but then next one is Camry Hybrid
  whereas it should have found Accord.
  I have normalized the data into a simi field which has only the
  attributes that I care about.
  Without the simi field, I could not get mlt.qf boosts to work well
 enough
  to return results
 
  *Issue 2*
  Some fields do not work at all. For instance, text+simi (in mlt.fl)
 works
  whereas just simi does not.
  So some weirdness that I am just not understanding.
 
  Would be grateful for your guidance !
 
 
  Here is the setup:
  *1. SOLR Version*
  solr-spec 4.2.0.2013.03.06.22.32.13
  solr-impl 4.2.0 1453694   rmuir - 2013-03-06 22:32:13
  lucene-spec 4.2.0
  lucene-impl 4.2.0 1453694 -  rmuir - 2013-03-06 22:25:29
 
  *2. Machine Information*
  Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM (1.6.0_23
  19.0-b09)
  Windows 7 Home 64 Bit with 4 GB RAM
 
  *3. Sample Data *
  I created this 'dummy' data of cars  - the idea being that these would
 be
  sufficient and simple to generate similarity and understand how it would
  work.
  There are 181 rows in the data set (I have attached it for reference in
  CSV format)
 
  [image: Inline image 1]
 
  *4. SCHEMA*
  *Field Definitions*
 <field name="id" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
 <field name="make" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
 <field name="model" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
 <field name="class" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
 <field name="type" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
 <field name="drive" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
 <field name="comment" type="text_general" indexed="true" stored="true" termVectors="true" multiValued="true"/>
 <field name="size" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>

  *Copy Fields*
  <copyField source="make" dest="make_en"/>       <!-- Search -->
  <copyField source="model" dest="model_en"/>     <!-- Search -->
  <copyField source="class" dest="class_en"/>     <!-- Search -->
  <copyField source="type" dest="type_en"/>       <!-- Search -->
  <copyField source="drive" dest="drive_en"/>     <!-- Search -->
  <copyField source="comment" dest="comment_en"/> <!-- Search -->
  <copyField source="size" dest="size_en"/>       <!-- Search -->
  <copyField source="id" dest="text"/>            <!-- Glob -->
  <copyField source="make" dest="text"/>          <!-- Glob -->
  <copyField source="model" dest="text"/>         <!-- Glob -->
  <copyField source="class" dest="text"/>         <!-- Glob -->
  <copyField source="type" dest="text"/>          <!-- Glob -->
  <copyField source="drive" dest="text"/>         <!-- Glob -->
  <copyField source="comment" dest="text"/>       <!-- Glob -->
  <copyField source="size" dest="text"/>          <!-- Glob -->
  <copyField source="size" dest="text"/>          <!-- Glob -->
  *<copyField source="class" dest="simi_en"/>  <!-- similarity

MoreLikeThis - Odd results - what am I doing wrong?

2013-03-31 Thread dc tech
I am running some experiments on more like this and the results seem rather
odd - I am doing something wrong but just cannot figure out what.
Basically, the similarity results are decent - but not great.

*Issue 1  = Quality*
Toyota Camry : finds Altima (good) but then next one is Camry Hybrid
whereas it should have found Accord.
I have normalized the data into a simi field which has only the attributes
that I care about.
Without the simi field, I could not get mlt.qf boosts to work well enough
to return results

*Issue 2*
Some fields do not work at all. For instance, text+simi (in mlt.fl) works
whereas just simi does not.
So some weirdness that I am just not understanding.

Would be grateful for your guidance !


Here is the setup:
*1. SOLR Version*
solr-spec 4.2.0.2013.03.06.22.32.13
solr-impl 4.2.0 1453694   rmuir - 2013-03-06 22:32:13
lucene-spec 4.2.0
lucene-impl 4.2.0 1453694 -  rmuir - 2013-03-06 22:25:29

*2. Machine Information*
Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM (1.6.0_23 19.0-b09)
Windows 7 Home 64 Bit with 4 GB RAM

*3. Sample Data *
I created this 'dummy' data of cars  - the idea being that these would be
sufficient and simple to generate similarity and understand how it would
work.
There are 181 rows in the data set (I have attached it for reference in CSV
format)

[image: Inline image 1]

*4. SCHEMA*
*Field Definitions*
   <field name="id" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
   <field name="make" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
   <field name="model" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
   <field name="class" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
   <field name="type" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
   <field name="drive" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
   <field name="comment" type="text_general" indexed="true" stored="true" termVectors="true" multiValued="true"/>
   <field name="size" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>

*Copy Fields*
<copyField source="make" dest="make_en"/>       <!-- Search -->
<copyField source="model" dest="model_en"/>     <!-- Search -->
<copyField source="class" dest="class_en"/>     <!-- Search -->
<copyField source="type" dest="type_en"/>       <!-- Search -->
<copyField source="drive" dest="drive_en"/>     <!-- Search -->
<copyField source="comment" dest="comment_en"/> <!-- Search -->
<copyField source="size" dest="size_en"/>       <!-- Search -->
<copyField source="id" dest="text"/>            <!-- Glob -->
<copyField source="make" dest="text"/>          <!-- Glob -->
<copyField source="model" dest="text"/>         <!-- Glob -->
<copyField source="class" dest="text"/>         <!-- Glob -->
<copyField source="type" dest="text"/>          <!-- Glob -->
<copyField source="drive" dest="text"/>         <!-- Glob -->
<copyField source="comment" dest="text"/>       <!-- Glob -->
<copyField source="size" dest="text"/>          <!-- Glob -->
<copyField source="size" dest="text"/>          <!-- Glob -->
*<copyField source="class" dest="simi_en"/>  <!-- similarity -->*
*<copyField source="type" dest="simi_en"/>   <!-- similarity -->*
*<copyField source="drive" dest="simi_en"/>  <!-- similarity -->*
*<copyField source="size" dest="simi_en"/>   <!-- similarity -->*

Note that the simi field ends up with values like class, type, drive and
size:
- Luxury SUV 4WD Large
- Standard Sedan Front Family


*5. MLT Setup*
a. mlt.FL  = *text* QF=*text*  Works but results are obviously not good
(make is not a good similarity indicator)
http://localhost:8983/solr/cars/select/?q=id:2&mlt=true&fl=text&mlt.fl=text&mlt.qf=text

b. mlt.FL  = *simi* QF=*simi*  Does not work at all (0 results)
http://localhost:8983/solr/cars/select/?q=id:2&mlt=true&fl=text&mlt.fl=simi&mlt.qf=simi

c.  mlt.FL  = *simi,text * QF=*simi^10 text^.1*   Works with decent results
in most cases
http://localhost:8983/solr/cars/select/?q=id:2&mlt=true&fl=text&mlt.fl=simi,text&mlt.qf=simi^10%20text^.01
Works for getting similarity for Acura MDX (Luxury SUV 4WD Large)
But for Toyota Camry - it finds hybrid family cars (Prius) ahead of Honda.
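
One hedged thing to check for case (b): MoreLikeThis filters out weak terms by
default (mlt.mintf defaults to 2), and as noted elsewhere in this thread tf is
1 for this data, so every simi term may be getting dropped. A variant worth
trying, lowering those thresholds:

    http://localhost:8983/solr/cars/select/?q=id:2&mlt=true&fl=text&mlt.fl=simi&mlt.qf=simi&mlt.mintf=1&mlt.mindf=1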


id,make,model,class,type,drive,comment,size,size_i
1,Acura ,ILX 2.0L,Luxury,Sedan,Front,,Mini,2
2,Acura ,MDX,Luxury,SUV,4wd,,Large,5
3,Acura ,RDX,Luxury,SUV,4wd,,Small,3
4,Acura ,RLX,Luxury,Sedan,AWD,,Large,5
5,Acura ,TL,Luxury,Sedan,Front,,Family,4
6,Acura ,TSX,Luxury,Sedan,Front,,Small,3
7,Acura ,ZDX,Luxury,SUV,4wd,,Large,5
8,Audi ,A3 2.0T,Luxury,Sedan,AWD,,Mini,2
9,Audi ,A4,Luxury,Sedan,AWD,,Small,3
10,Audi ,A5 2.0T,Luxury,Sedan,AWD,,Family,4
11,Audi ,A6 3.0T,Luxury,Sedan,AWD,,Family,4
12,Audi ,A7,Luxury,Sedan,AWD,,Large,5
13,Audi ,A8,Luxury,Sedan,AWD,,Largest,7
14,Audi ,Allroad,Luxury,Wagon,AWD,,Large,5
15,Audi ,Q5 2.0T,Luxury,SUV,4wd,,Large,5
16,Audi ,Q7,Luxury,SUV,4wd,,Largest,7
17,Audi ,R8,Luxury,Sports,RWD,,Largest,7
18,Audi ,S4,Luxury,Sports,AWD,,Small,3
19,Audi 

Re: MoreLikeThis - Odd results - what am I doing wrong?

2013-03-31 Thread dc tech
I did try the raw query against the *simi* field and those seem to return
results in the order expected.
For instance, Acura MDX has (large, SUV, 4WD, Luxury) in the simi field.
Running a query with those words against the simi field returns the
expected models (X5, Audi Q5, etc) and then the subsequent documents have
decreasing relevance. So the basic query mechanism seems to be fine.

The issue just seems to be with MoreLikeThis component and handler.
I can post the index on a public SOLR instance - any suggestions? (or for
hosting)


On Sun, Mar 31, 2013 at 1:54 PM, Gagandeep singh gagan.g...@gmail.comwrote:

 If you can bring up your solr setup on a public machine then I'm sure a lot
 of debugging can be done. Without that, I think what you should look at is
 the tf-idf scores of the terms like camry etc. Usually idf is the
 deciding factor in which results show at the top (tf should be 1 for your
 data).
 Enable debugQuery=true and look at the explain section to see how the score
 is getting calculated.

 You should try giving different boosts to class, type, drive, size to
 control the results.


 On Sun, Mar 31, 2013 at 8:52 PM, dc tech dctech1...@gmail.com wrote:

 I am running some experiments on more like this and the results seem
 rather odd - I am doing something wrong but just cannot figure out what.
 Basically, the similarity results are decent - but not great.

 *Issue 1  = Quality*
 Toyota Camry : finds Altima (good) but then next one is Camry Hybrid
 whereas it should have found Accord.
 I have normalized the data into a simi field which has only the
 attributes that I care about.
 Without the simi field, I could not get mlt.qf boosts to work well enough
 to return results

 *Issue 2*
 Some fields do not work at all. For instance, text+simi (in mlt.fl) works
 whereas just simi does not.
 So some weirdness that I am just not understanding.

 Would be grateful for your guidance !


 Here is the setup:
 *1. SOLR Version*
 solr-spec 4.2.0.2013.03.06.22.32.13
 solr-impl 4.2.0 1453694   rmuir - 2013-03-06 22:32:13
 lucene-spec 4.2.0
 lucene-impl 4.2.0 1453694 -  rmuir - 2013-03-06 22:25:29

 *2. Machine Information*
 Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM (1.6.0_23
 19.0-b09)
 Windows 7 Home 64 Bit with 4 GB RAM

 *3. Sample Data *
 I created this 'dummy' data of cars  - the idea being that these would be
 sufficient and simple to generate similarity and understand how it would
 work.
 There are 181 rows in the data set (I have attached it for reference in
 CSV format)

 [image: Inline image 1]

 *4. SCHEMA*
 *Field Definitions*
 <field name="id" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
 <field name="make" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
 <field name="model" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
 <field name="class" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
 <field name="type" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
 <field name="drive" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
 <field name="comment" type="text_general" indexed="true" stored="true" termVectors="true" multiValued="true"/>
 <field name="size" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>

 *Copy Fields*
 <copyField source="make" dest="make_en"/>       <!-- Search -->
 <copyField source="model" dest="model_en"/>     <!-- Search -->
 <copyField source="class" dest="class_en"/>     <!-- Search -->
 <copyField source="type" dest="type_en"/>       <!-- Search -->
 <copyField source="drive" dest="drive_en"/>     <!-- Search -->
 <copyField source="comment" dest="comment_en"/> <!-- Search -->
 <copyField source="size" dest="size_en"/>       <!-- Search -->
 <copyField source="id" dest="text"/>            <!-- Glob -->
 <copyField source="make" dest="text"/>          <!-- Glob -->
 <copyField source="model" dest="text"/>         <!-- Glob -->
 <copyField source="class" dest="text"/>         <!-- Glob -->
 <copyField source="type" dest="text"/>          <!-- Glob -->
 <copyField source="drive" dest="text"/>         <!-- Glob -->
 <copyField source="comment" dest="text"/>       <!-- Glob -->
 <copyField source="size" dest="text"/>          <!-- Glob -->
 <copyField source="size" dest="text"/>          <!-- Glob -->
 *<copyField source="class" dest="simi_en"/>  <!-- similarity -->*
 *<copyField source="type" dest="simi_en"/>   <!-- similarity -->*
 *<copyField source="drive" dest="simi_en"/>  <!-- similarity -->*
 *<copyField source="size" dest="simi_en"/>   <!-- similarity -->*

 Note that the simi field ends up with values like class, type, drive
 and size:
 - Luxury SUV 4WD Large
 - Standard Sedan Front Family


 *5. MLT Setup*
 a. mlt.FL  = *text* QF=*text*  Works but results are obviously not good
 (make is not a good similarity indicator)

 http://localhost:8983/solr/cars/select/?q=id:2&mlt=true&fl=text&mlt.fl=text&mlt.qf=text

 b. mlt.FL

Re: solr benchmarks

2011-01-03 Thread dc tech
Tri:
What is the volume of content (# of documents) and index size you are
expecting? What about the document complexity in terms of # of fields, what
are you storing in the index, complexity of the queries etc?

We have used SOLR with 10m documents with 1-3 second response times on the
front end - this is with minimal tuning, 4-5 facet fields, large blobs of
content in the index, jRuby on Rails, complex queries, and low load
conditions (hence caches are probably not warmed much).

We have an external search application almost fully powered by SOLR (except for
web crawl) and the response is typically less than 1 second with
about 100k documents. Solr time is probably 100-200 ms of this.

My sense is that SOLR is as fast as it gets and scales very, very well. On
the user group, I have seen reference to people using SOLR for 100m
documents or more. It would be useful to get your use case(s).





On Mon, Jan 3, 2011 at 10:44 AM, Jak Akdemir jakde...@gmail.com wrote:

 Hi,
 You can find benchmark results but these are not directly based on index
 size vs. response time
 http://wiki.apache.org/solr/SolrPerformanceData

 On Sat, Jan 1, 2011 at 4:06 AM, Tri Nguyen tringuye...@yahoo.com wrote:

  Hi,
 
  I remember going through some page that had graphs of response times
 based
  on index size for solr.
 
  Anyone know of such pages?
 
  Internally, we have some requirements for response times and I'm trying
 to
  figure out when to shard the index.
 
  Thanks,
 
  Tri



Re: Indexing Hanging during GC?

2010-08-12 Thread dc tech
I am a little confused - how did 180k documents become 100m index documents?
We have over 20 indices (for different content sets), one with 5m
documents (about a couple of pages each) and another with 100k+ docs.
We can index the 5m collection in a couple of days (limitation is in
the source) which is 100k documents an hour without breaking a sweat.



On 8/12/10, Rebecca Watson bec.wat...@gmail.com wrote:
 Hi,

 When indexing large amounts of data I hit a problem whereby Solr
 becomes unresponsive
 and doesn't recover (even when left overnight!). I think I've hit some
 GC problems / some GC tuning
 is required, and I wanted to know if anyone has ever hit this problem.
 I can replicate this error (albeit taking longer to do so) using
 Solr/Lucene analysers
 only so I thought other people might have hit this issue before over
 large data sets

 Background on my problem follows -- but I guess my main question is -- can
 Solr
 become so overwhelmed by update posts that it becomes completely
 unresponsive??

 Right now I think the problem is that the java GC is hanging but I've
 been working
 on this all week and it took a while to figure out it might be
 GC-based / wasn't a
 direct result of my custom analysers so i'd appreciate any advice anyone has
 about indexing large document collections.

  I also have a second question for those in the know -- do we have a chance
 of indexing/searching over our large dataset with what little hardware
 we already
 have available??

 thanks in advance :)

 bec

 a bit of background: ---

 I've got a large collection of articles we want to index/search over
 -- about 180k
 in total. Each article has say 500-1000 sentences and each sentence has
 about
 15 fields, many of which are multi-valued and we store most fields as well
 for
 display/highlighting purposes. So I'd guess over 100 million index
 documents.

 In our small test collection of 700 articles this results in a single index
 of
 about 13GB.

 Our pipeline processes PDF files through to Solr native xml which we call
  index.xml files, i.e. in <add><doc>... format, ready to post straight to
  Solr's
  update handler.

 We create the index.xml files as we pull in information from
 a few sources and creation of these files from their original PDF form is
 farmed out across a grid and is quite time-consuming so we distribute this
 process rather than creating index.xml files on the fly...

 We do a lot of linguistic processing and to enable search functionality
 of our resulting terms requires analysers that split terms/ join terms
 together
 i.e. custom analysers that perform string operations and are quite
 time-consuming/
 have large overhead compared to most analysers (they take approx
 20-30% more time
 and use twice as many short-lived objects than the text field type).

 Right now i'm working on my new Imac:
 quad-core 2.8 GHz intel Core i7
 16 GB 1067 MHz DDR3 RAM
 2TB hard-drive (about half free)
 Version 10.6.4 OSX

 Production environment:
 2 linux boxes each with:
 8-core Intel(R) Xeon(R) CPU @ 2.00GHz
 16GB RAM

 I use java 1.6 and Solr version 1.4.1 with multi-cores (a single core
 right now).

 I setup Solr to use autocommit as we'll have several document collections /
 post
 to Solr from different data sets:

  <!-- autocommit pending docs if certain criteria are met. Future
  versions may expand the available criteria -->
  <autoCommit>
    <maxDocs>50</maxDocs> <!-- every 1000 articles -->
    <maxTime>90</maxTime> <!-- every 15 minutes -->
  </autoCommit>

 I also have
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>1024</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
 -

 *** First question:
 Has anyone else found that Solr hangs/becomes unresponsive after too
 many documents are indexed at once i.e. Solr can't keep up with the post
 rate?

 I've got LCF crawling my local test set (file system connection
 required only) and
 posting documents to Solr using 6GB of RAM. As I said above, these documents
  are in native Solr XML format (<add><doc>) with one file per article, so
  each
  <add> contains all the sentence-level documents for the article.

 With LCF I post about 2.5/3k articles (files) per hour -- so about
 2.5k*500 /3600 =
 350 docs per second post-rate -- is this normal/expected??

 Eventually, after about 3000 files (an hour or so) Solr starts to
 hang/becomes
 unresponsive and with Jconsole/GC logging I can see that the Old-Gen space
 is
 about 90% full and the following is the end of the solr log file-- where you
 can see GC has been called:
 --
 3012.290: [GC Before GC:
 Statistics for BinaryTreeDictionary:
 
 Total Free Space: 53349392
 Max   Chunk Size: 3200168
 Number of Blocks: 66
 Av.  Block  Size: 808324
 Tree  Height: 13
 Before GC:
 Statistics for BinaryTreeDictionary:
 
 Total Free 

Re: Indexing Hanging during GC?

2010-08-12 Thread dc tech
1) I assume you are doing batching interspersed with commits
2) Why do you need sentence level Lucene docs?
3) Are your custom handlers/parsers a part of the SOLR JVM? I would not be
surprised if you have a memory/connection leak there (or something is not
releasing a resource explicitly)

In general, we have NEVER had a problem in loading Solr.

On 8/12/10, Rebecca Watson bec.wat...@gmail.com wrote:
 sorry -- i used the term documents too loosely!

 180k scientific articles with between 500-1000 sentences each
 and we index sentence-level index documents
 so i'm guessing about 100 million lucene index documents in total.

 an update on my progress:

 i used GC settings of:
 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled
   -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8
 -XX:CMSInitiatingOccupancyFraction=70

 which allowed the indexing process to run to 11.5k articles and
  for about 2 hours before I got the same kind of hanging/unresponsive Solr
 with
 this as the tail of the solr logs:

 Before GC:
 Statistics for BinaryTreeDictionary:
 
 Total Free Space: 2416734
 Max   Chunk Size: 2412032
 Number of Blocks: 3
 Av.  Block  Size: 805578
 Tree  Height: 3
  5980.480: [ParNew: 1887488K->1887488K(1887488K), 0.193 secs]5980.480:
 [CMS

 I also saw (in jconsole) that the number of threads rose from the
 steady 32 used for the
 2 hours to 72 before Solr finally became unresponsive...

 i've got the following GC info params switched on (as many as i could
 find!):
 -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
   -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
   -XX:PrintFLSStatistics=1

 with 11.5k docs in about 2 hours this was 11.5k * 500 / 2 = 2.875
 million fairly small
 docs per hour!! this produced an index of about 40GB to give you an
 idea of index
 size...

  because I've already got the documents in Solr native XML format,
  i.e. one file per article, each with <add><doc>...</doc></add>,
  i.e. posting each set of sentence docs per article in every LCF file post...
 this means that LCF can throw documents at Solr very fast and i think
 i'm
 breaking it GC-wise.

 i'm going to try adding in System.gc() calls to see if this runs ok
 (albeit slower)...
 otherwise i'm pretty much at a loss as to what could be causing this GC
 issue/
 solr hanging if it's not a GC issue...

 thanks :)

 bec

 On 12 August 2010 21:42, dc tech dctech1...@gmail.com wrote:
 I am a little confused - how did 180k documents become 100m index
 documents?
 We have over 20 indices (for different content sets), one with 5m
 documents (about a couple of pages each) and another with 100k+ docs.
 We can index the 5m collection in a couple of days (limitation is in
 the source) which is 100k documents an hour without breaking a sweat.



 On 8/12/10, Rebecca Watson bec.wat...@gmail.com wrote:
 Hi,

 When indexing large amounts of data I hit a problem whereby Solr
 becomes unresponsive
  and doesn't recover (even when left overnight!). I think I've hit some
  GC problems / some GC tuning
  is required, and I wanted to know if anyone has ever hit this
  problem.
 I can replicate this error (albeit taking longer to do so) using
 Solr/Lucene analysers
 only so I thought other people might have hit this issue before over
 large data sets

 Background on my problem follows -- but I guess my main question is --
 can
 Solr
 become so overwhelmed by update posts that it becomes completely
 unresponsive??

 Right now I think the problem is that the java GC is hanging but I've
 been working
 on this all week and it took a while to figure out it might be
 GC-based / wasn't a
 direct result of my custom analysers so i'd appreciate any advice anyone
 has
 about indexing large document collections.

  I also have a second question for those in the know -- do we have a
 chance
 of indexing/searching over our large dataset with what little hardware
 we already
 have available??

 thanks in advance :)

 bec

 a bit of background: ---

 I've got a large collection of articles we want to index/search over
 -- about 180k
 in total. Each article has say 500-1000 sentences and each sentence has
 about
 15 fields, many of which are multi-valued and we store most fields as
 well
 for
 display/highlighting purposes. So I'd guess over 100 million index
 documents.

 In our small test collection of 700 articles this results in a single
 index
 of
 about 13GB.

 Our pipeline processes PDF files through to Solr native xml which we call
  index.xml files, i.e. in <add><doc>... format, ready to post straight to
 Solr's
 update handler.

 We create the index.xml files as we pull in information from
 a few sources and creation of these files from their original PDF form is
 farmed out across a grid and is quite time-consuming so we distribute
 this
 process rather than creating index.xml files on the fly...

 We do a lot of linguistic processing

Re: Facet Fields - ID vs. Display Value

2010-08-09 Thread dc tech
I think it depends on what you need:
1) Simple, unique category - facet directly on the display value
2) Categories may be duplicates from a display perspective (e.g. authors):
store display#id in the facet field but show only the display part
3) Internationalization requirements - store the ID but have the UI pull and
display the translated labels
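
A hedged sketch of option 2 (the values are hypothetical): index the facet
value as display#id, show only the part before the '#', and filter on the
full token when the user clicks it:

    indexed facet value:  John Smith#au1234
    UI label:             John Smith
    filter query:         fq=author_facet:"John Smith#au1234"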

On 8/9/10, Frank A fsa...@gmail.com wrote:
 What I meant (which I realize now wasn't very clear) was if I have
 something like categoryID and categorylabel - is the normal practice
 to define categoryID as the facet field and then have the UI layer
 display the label?  Or would it be normal to directly use
 categorylabel as the facet field?



 On Mon, Aug 9, 2010 at 6:01 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:
 Hi Frank,

 I'm not sure what you mean by that.
 If the question is about what should be shown in the UI, it should be
 something
 pretty and human-readable, such as the original facet string value,
 assuming it
 was nice and clean.

 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



 - Original Message 
 From: Frank A fsa...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Mon, August 9, 2010 5:19:57 PM
 Subject: Facet Fields - ID vs. Display Value

 Is there a general best practice on whether facet fields should be on
 IDs  or Display values?

 -Frank




-- 
Sent from my mobile device


Re: want to display elevated results on my display result screen differently.

2010-08-03 Thread dc tech
Have you looked at the relevance scores? I would speculate that elevated
matches would have a constant, high score.

On 8/3/10, Vishal.Arora vis...@value-one.com wrote:

 Suppose I have an elevate.xml file and I elevate the IDs Artist:11650 and
 Artist:510 when I search for corgan.
 This is the elevate file:
   <elevate>
     <query text="corgan">
       <doc id="Artist:11650"/> <!-- the Smashing Pumpkins -->
       <doc id="Artist:510"/> <!-- Green Day -->
       <doc id="Artist:35656" exclude="true"/> <!-- Starchildren -->
     </query>
     <!-- other queries... -->
   </elevate>


 Is there any way (a query parameter) which gives us a clue as to which IDs
 are elevated when an actual search is done for corgan?

 When we search, the result XML structure is the same as a normal search
 without elevation. I want to display elevated results on my display result
 screen differently.

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Show-elevated-Result-Differently-tp1002081p1018879.html
 Sent from the Solr - User mailing list archive at Nabble.com.


-- 
Sent from my mobile device


Re: Solr searching performance issues, using large documents

2010-07-29 Thread dc tech
Are you storing the entire log file text in SOLR? That's almost 3GB of
text that you are storing in SOLR. Try the following:
1) Is this first-time performance, or on repeat queries with the same fields?
2) Optimize the index and test performance again
3) Index without storing the text and see what the performance looks like.
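
A minimal sketch of point 3, assuming the big text lives in a schema field
called body (field names hypothetical): keep it searchable but unstored, with
a small stored field for display:

    <field name="body" type="text" indexed="true" stored="false"/>
    <field name="body_preview" type="string" indexed="false" stored="true"/>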


On 7/29/10, Peter Spam ps...@mac.com wrote:
 Any ideas?  I've got 5000 documents with an average size of 850k each, and
 it sometimes takes 2 minutes for a query to come back when highlighting is
 turned on!  Help!


 -Pete

 On Jul 21, 2010, at 2:41 PM, Peter Spam wrote:

 From the mailing list archive, Koji wrote:

 1. Provide another field for highlighting and use copyField to copy
 plainText to the highlighting field.

 and Lance wrote:
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg35548.html

 If you want to highlight field X, doing the
 termOffsets/termPositions/termVectors will make highlighting that field
 faster. You should make a separate field and apply these options to that
 field.

 Now: doing a copyfield adds a value to a multiValued field. For a text
 field, you get a multi-valued text field. You should only copy one value
 to the highlighted field, so just copyField the document to your special
 field. To enforce this, I would add multiValued=false to that field,
 just to avoid mistakes.

 So, all_text should be indexed without the term* attributes, and should
 not be stored. Then your document stored in a separate field that you use
 for highlighting and has the term* attributes.

 I've been experimenting with this, and here's what I've tried:

   <field name="body" type="text_pl" indexed="true" stored="false"
    multiValued="true" termVectors="true" termPositions="true"
    termOffsets="true"/>
   <field name="body_all" type="text_pl" indexed="false" stored="true"
    multiValued="true"/>
   <copyField source="body" dest="body_all"/>

 ... but it's still very slow (10+ seconds).  Why is it better to have two
 fields (one indexed but not stored, and the other not indexed but stored)
 rather than just one field that's both indexed and stored?


 From the Perf wiki page http://wiki.apache.org/solr/SolrPerformanceFactors

 If you aren't always using all the stored fields, then enabling lazy
 field loading can be a huge boon, especially if compressed fields are
 used.

 What does this mean?  How do you load a field lazily?

 Thanks for your time, guys - this has started to become frustrating, since
 it works so well, but is very slow!


 -Pete

 On Jul 20, 2010, at 5:36 PM, Peter Spam wrote:

 Data set: About 4,000 log files (will eventually grow to millions).
 Average log file is 850k.  Largest log file (so far) is about 70MB.

 Problem: When I search for common terms, the query time goes from under
 2-3 seconds to about 60 seconds.  TermVectors etc are enabled.  When I
 disable highlighting, performance improves a lot, but is still slow for
 some queries (7 seconds).  Thanks in advance for any ideas!


 -Peter


 -

 4GB RAM server
 % java -Xms2048M -Xmx3072M -jar start.jar

 -

 schema.xml changes:

   <fieldType name="text_pl" class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
        generateNumberParts="0" catenateWords="0" catenateNumbers="0"
        catenateAll="0" splitOnCaseChange="0"/>
     </analyzer>
   </fieldType>

 ...

  <field name="body" type="text_pl" indexed="true" stored="true"
   multiValued="false" termVectors="true" termPositions="true"
   termOffsets="true"/>
  <field name="timestamp" type="date" indexed="true" stored="true"
   default="NOW" multiValued="false"/>
  <field name="version" type="string" indexed="true" stored="true" multiValued="false"/>
  <field name="device" type="string" indexed="true" stored="true" multiValued="false"/>
  <field name="filename" type="string" indexed="true" stored="true" multiValued="false"/>
  <field name="filesize" type="long" indexed="true" stored="true" multiValued="false"/>
  <field name="pversion" type="int" indexed="true" stored="true" multiValued="false"/>
  <field name="first2md5" type="string" indexed="false" stored="true" multiValued="false"/>
  <field name="ckey" type="string" indexed="true" stored="true" multiValued="false"/>

 ...

 <dynamicField name="*" type="ignored" multiValued="true"/>
 <defaultSearchField>body</defaultSearchField>
 <solrQueryParser defaultOperator="AND"/>

 -

 solrconfig.xml changes:

   <maxFieldLength>2147483647</maxFieldLength>
   <ramBufferSizeMB>128</ramBufferSizeMB>

 -

 The query:

 rowStr = 

Re: Solr using 1500 threads - is that normal?

2010-07-28 Thread dc tech
1,500 threads seems extreme by any standards so there is something
happening in your install. Even with appservers for web apps,
typically 100 would be a fair # of threads.


On 7/28/10, Christos Constantinou ch...@simpleweb.co.uk wrote:
 Hi,

 Solr seems to be crashing after a JVM exception that new threads cannot be
 created. I am writing in hope of advice from someone that has experienced
 this before. The exception that is causing the problem is:

 Exception in thread btpool0-5 java.lang.OutOfMemoryError: unable to create
 new native thread

 The memory that is allocated to Solr is 3072MB, which should be enough
 memory for a ~6GB data set. The documents are not big either, they have
 around 10 fields of which only one stores large text ranging between 1k-50k.

 The top command at the time of the crash shows Solr using around 1500
 threads, which I assume is not normal. Could it be that the threads are
 crashing one by one and new ones are created to cope with the queries?

 In the log file, right after the the exception, there are several thousand
 commits before the server stalls completely. Normally, the log file would
 report 20-30 document existence queries per second, then 1 commit per 5-30
 seconds, and some more infrequent faceted document searches on the data.
 However after the exception, there are only commits until the end of the log
 file.

 I am wondering if anyone has experienced this before or if it is some sort
 of known bug from Solr 1.4? Is there a way to increase the details of the
 exception in the logfile?

 I am attaching the output of a grep Exception command on the logfile.

 Jul 28, 2010 8:19:31 AM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
 exceeded limit of maxWarmingSearchers=2, try again later.
 Jul 28, 2010 8:19:31 AM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
 exceeded limit of maxWarmingSearchers=2, try again later.
 Jul 28, 2010 8:19:31 AM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
 exceeded limit of maxWarmingSearchers=2, try again later.
 Jul 28, 2010 8:19:32 AM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
 exceeded limit of maxWarmingSearchers=2, try again later.
 Jul 28, 2010 8:20:18 AM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
 exceeded limit of maxWarmingSearchers=2, try again later.
 Jul 28, 2010 8:20:48 AM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
 exceeded limit of maxWarmingSearchers=2, try again later.
 Jul 28, 2010 8:22:43 AM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
 exceeded limit of maxWarmingSearchers=2, try again later.
 Jul 28, 2010 8:27:53 AM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
 exceeded limit of maxWarmingSearchers=2, try again later.
 Jul 28, 2010 8:27:53 AM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
 exceeded limit of maxWarmingSearchers=2, try again later.
 Jul 28, 2010 8:27:53 AM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
 exceeded limit of maxWarmingSearchers=2, try again later.
 Jul 28, 2010 8:28:50 AM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
 exceeded limit of maxWarmingSearchers=2, try again later.
 Jul 28, 2010 8:33:19 AM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
 exceeded limit of maxWarmingSearchers=2, try again later.
 Jul 28, 2010 8:35:08 AM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
 exceeded limit of maxWarmingSearchers=2, try again later.
 Jul 28, 2010 8:35:58 AM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
 exceeded limit of maxWarmingSearchers=2, try again later.
 Jul 28, 2010 8:35:59 AM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
 exceeded limit of maxWarmingSearchers=2, try again later.
 Jul 28, 2010 8:44:31 AM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
 exceeded limit of maxWarmingSearchers=2, try again later.
 Jul 28, 2010 8:51:49 AM org.apache.solr.common.SolrException log
 SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
 exceeded limit of 

Re: Performance issues when querying on large documents

2010-07-24 Thread dc tech
Are you storing the full 1,000 pages in the index? If so, that is
probably not helping either.

On 7/23/10, ahammad ahmed.ham...@gmail.com wrote:

 Hello,

 I have an index with lots of different types of documents. One of those
 types basically contains extracts of PDF docs. Some of those PDFs can have
 1000+ pages, so there would be a lot of stuff to search through.

 I am experiencing really terrible performance when querying. My whole index
 has about 270k documents, but less than 1000 of those are the PDF extracts.
 The slow querying occurs when I search only on those PDF extracts (by
 specifying filters), and return 100 results. Returning 100 results definitely
 adds to the issue, but even cutting that down can be slow.

 Is there a way to improve querying with such large results? To give an idea,
 querying for a single word can take a little over a minute, which isn't
 really viable for an application that revolves around searching. For now, I
 have limited the results to 20, which makes the query execute in roughly
 10-15 seconds. However, I would like to have the option of returning 100
 results.

 Thanks a lot.






Re: Personalized Search

2010-05-21 Thread dc tech
In our specific case, we would get the user's folders and then do a
function query that provides a boost if the document.folder is in {my
folder list}.

Another approach that will work for our intranet use is to add the
userids in a multi-valued field as others have suggested.
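
A sketch of the first approach using dismax at query time — the folder ids,
field name, and boost factor are illustrative assumptions:

  q=quarterly+report
  &defType=dismax
  &bq=folder:(f12 OR f31 OR f58)^4

The bq clause only boosts: documents outside 'my folders' still match, they
just rank lower.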



On 5/20/10, MitchK mitc...@web.de wrote:

 Hi dc,



 - at query time, specify boosts for 'my items' items

 Do you mean something like a document boost, or do you want to include
 something like
 OR myItemId:100^100
 ?

 Can you tell us how you would specify document boosts at query time? Or
 are you querying something like a boolean field (i.e. isFavorite:true^10) or
 a numeric field?

 Kind regards
 - Mitch




Re: Personalized Search

2010-05-21 Thread dc tech
Excluding favorited items is an easier problem:
- get the results
- get the exclude list from the db
- scan the results and drop any items on the exclude list

You'd have to write some code to manage 'holes' in the result list, i.e.
fetch more results to backfill the page.

You could combine this with the Solr batch-based approach to reduce the
holes (see the sketch after these steps):
- every night, update the item.users field (a simple string-type field
will do)
- query with negative criteria, i.e.
   content:search_term AND -users:userid
- then do the steps outlined earlier
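
As a request sketch, the negative clause is arguably better placed in a
filter query: facet counts then reflect the cut-down result set, and Solr can
cache the user's filter independently of the main query. The field name and
user id below are illustrative:

  q=content:search_term
  &fq=-users:u12345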

On 5/21/10, Rih tanrihae...@gmail.com wrote:

 - keep the SOLR index independent of bought/like

 - have a db table with user prefs on a per item basis


 I have the same idea thus far.

 at query time, specify boosts for 'my items' items


 I believe this works if you want to sort results by faved/not faved. But how
 does it scale if users have already favorited/liked hundreds of items? The
 query can get quite long.

 Looking forward to your idea.



 On Thu, May 20, 2010 at 6:37 PM, dc tech dctech1...@gmail.com wrote:

 Another approach would be to do query time boosts of 'my' items under
 the assumption that count is limited:
 - keep the SOLR index independent of bought/like
 - have a db table with user prefs on a per item basis
 - at query time, specify boosts for 'my items' items

 We are planning to do this in the context of document management where
 documents in 'my (used/favorited) folders' provide a boost factor
 to the results.



 On 5/20/10, findbestopensource findbestopensou...@gmail.com wrote:
  Hi Rih,
 
  Are you going to include one or two common fields (bought or like) per
  member/visitor, OR a unique field per member/visitor?
 
  If one or two common fields are included, then there will not be any
  impact on performance. If you want to include a unique field, then you
  need to consider a multi-valued field; otherwise you will certainly hit
  the wall.
 
  Regards
  Aditya
  www.findbestopensource.com
 
 
 
 
  On Thu, May 20, 2010 at 12:13 PM, Rih tanrihae...@gmail.com wrote:
 
  Has anybody done personalized search with Solr? I'm thinking of including
  fields such as bought or like per member/visitor via dynamic fields to a
  product search schema. Another option is to have a multi-valued field that
  can contain user IDs. What are the possible performance issues with this
  setup?
 
  Looking forward to your ideas.
 
  Rih
 
 






Re: Personalized Search

2010-05-20 Thread dc tech
Another approach would be to do query time boosts of 'my' items under
the assumption that count is limited:
- keep the SOLR index independent of bought/like
- have a db table with user prefs on a per item basis
- at query time, specify boosts for 'my items' items

We are planning to do this in the context of document management where
documents in 'my (used/favorited) folders' provide a boost factor
to the results.



On 5/20/10, findbestopensource findbestopensou...@gmail.com wrote:
 Hi Rih,

 Are you going to include one or two common fields (bought or like) per
 member/visitor, OR a unique field per member/visitor?

 If one or two common fields are included, then there will not be any
 impact on performance. If you want to include a unique field, then you need
 to consider a multi-valued field; otherwise you will certainly hit the wall.

 Regards
 Aditya
 www.findbestopensource.com




 On Thu, May 20, 2010 at 12:13 PM, Rih tanrihae...@gmail.com wrote:

 Has anybody done personalized search with Solr? I'm thinking of including
 fields such as bought or like per member/visitor via dynamic fields to a
 product search schema. Another option is to have a multi-valued field that
 can contain user IDs. What are the possible performance issues with this
 setup?

 Looking forward to your ideas.

 Rih





Re: Score cutoff

2010-05-04 Thread dc tech
Michael,
The cutoff filter would be very useful for us as well. We want to use
it for a 'more like this' feature, where only the top n similar docs tend
to be really similar (see the sketch below).
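
One way to approximate a relative score cutoff without patching Solr is a
two-request sketch using the frange parser over the query() function; since
the cutoff is applied as a filter query, facet counts reflect it. The 60%
factor and the scores below are illustrative assumptions:

  # request 1: fetch only the top score
  q=ipod&rows=1&fl=score              -> maxScore, say 3.5

  # request 2: the client computes 0.6 * 3.5 = 2.1 and filters on it
  q=ipod&fq={!frange l=2.1}query($q)

The extra round trip is one cheap query; and because raw Lucene scores are
not comparable across queries, the bound has to be recomputed per query.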



On 5/4/10, Michael Kuhlmann michael.kuhlm...@zalando.de wrote:
 On 03.05.2010 23:32, Satish Kumar wrote:
 Hi,

 Can someone give clues on how to implement this feature? This is a very
 important requirement for us, so any help is greatly appreciated.


 Hi,

 I just implemented exactly this feature. You need to patch Solr to make
 this work.

 We at Zalando are planning to set up a technology blog where we'll offer
 such tools, but at the moment this is not done. I can make a patch out
 of my work and send it to you today.

 Greetings,
 Michael

 On Tue, Apr 27, 2010 at 5:54 PM, Satish Kumar 
 satish.kumar.just.d...@gmail.com wrote:

 Hi,

 For some of our queries, the top xx (five or so) results are of very high
 quality and results after xx are very poor. The difference in score between
 the high-quality and poor-quality results is large: for example, 3.5 for
 high quality versus 0.8 for poor quality. We want to exclude results whose
 score is less than 60% or so of the first result's. Is there a filter that
 does this? If not, can someone please give some hints on how to implement
 it (we want to do this as part of Solr relevance ranking so that the facet
 counts, etc. will be correct).


 Thanks,
 Satish







SOLR Based Search - Response Times - what do you consider slow or fast?

2010-05-04 Thread dc tech
We are using SOLR in a production setup with a jRuby on Rails front end and
about 20 different instances of SOLR running on heavy-duty hardware. The
setup is a load-balanced front end (jRoR) on a pair of machines, with the
SOLR backends on a different machine. We have plenty of memory and CPU, and
the machines are not particularly loaded (5% CPU). Loads are in the range
of 12,000 to 16,000 searches a day, so not a huge number. Our overall
response time (front end + SOLR) averages 0.5s to 0.7s, with SOLR typically
taking about 100-300 ms.

How does this compare with your experience? Would you say the performance is
good/bad/ugly?