Anyway to know changed documents?
Hi everyone, I have two servers whose indexes should be synchronized. I update A's index by sending document objects over HTTP. Is there any config or plug-in that lets Solr know which documents changed and push them to B? Any suggestion will be appreciated. Thanks :)
Re: Anyway to know changed documents?
I think you should look at the index-time timestamp field. There are examples in the wiki. paul On 1 June 2011 at 08:07, 京东 wrote: Hi everyone, I have two servers whose indexes should be synchronized. I update A's index by sending document objects over HTTP. Is there any config or plug-in that lets Solr know which documents changed and push them to B? Any suggestion will be appreciated. Thanks :)
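For reference, the timestamp field Paul mentions is declared like this in the example schema shipped with Solr (a sketch; the field name is arbitrary, the important part is the NOW default):

```xml
<!-- Automatically records the index time of each document; documents
     changed since the last sync can then be found with a range query,
     e.g. q=timestamp:[2011-06-01T00:00:00Z TO *] -->
<field name="timestamp" type="date" indexed="true" stored="true"
       default="NOW" multiValued="false"/>
```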
Re: Solr vs ElasticSearch
Thanks Shashi, this is oddly coincidental with another issue being put into Solr (SOLR-2193) to help solve some of the NRT issues; the timing is impeccable. At base, however, Solr uses Lucene, as does ES. I think the main advantage of ES is the auto-sharding etc. I think it uses a gossip protocol to capitalize on this however... Hmm... On Tue, May 31, 2011 at 10:01 PM, Shashi Kant sk...@sloan.mit.edu wrote: Here is a very interesting comparison http://engineering.socialcast.com/2011/05/realtime-search-solr-vs-elasticsearch/ -Original Message- From: Mark Sent: May-31-11 10:33 PM To: solr-user@lucene.apache.org Subject: Solr vs ElasticSearch I've been hearing more and more about ElasticSearch. Can anyone give me a rough overview of how these two technologies differ? What are the strengths/weaknesses of each? Why would one choose one over the other? Thanks
Re: Solr vs ElasticSearch
Well, I recently chose it for a personal project, and the deciding factor for me was that it had nice integration with CouchDB. Thanks, Bryan Rasmussen On Wed, Jun 1, 2011 at 4:33 AM, Mark static.void@gmail.com wrote: I've been hearing more and more about ElasticSearch. Can anyone give me a rough overview of how these two technologies differ? What are the strengths/weaknesses of each? Why would one choose one over the other? Thanks
Query problem in Solr
Hi all, We're using Solr to search on a Shop index and a Product index. Currently a Shop has a field `shop_keyword` which also contains the keywords of the products assigned to it. The shop keywords are separated by a space. Consequently, if there is a product with the keyword apple and another with orange, a search for shops matching `Apple AND Orange` returns the shop for these products. However, this is incorrect: we want a search for `Apple AND Orange` to return only shop(s) having a product with both apple and orange as keywords. We tried solving this by making shop keywords multi-valued and assigning the keywords of every product of the shop as a new value in shop keywords. However, as was confirmed in another post http://markmail.org/thread/xce4qyzs5367yplo#query:+page:1+mid:76eerw5yqev2aanu+state:results, Solr does not support requiring that all words match within the same value of a multi-valued field. (Hope I explained myself well.) How can we go about this? Ideally, we shouldn't have to change our search infrastructure dramatically. Thanks! Krt_Malta
Synonyms valid only in specific categories of data
Hello to all, I have a collection of text phrases in more than 20 languages that I'm indexing in solr. Each phrase belongs to one of about 30 different phrase categories. I have specified different fields for each language and added a synonym filter at query time. I would however like the synonym filter to take into account the category as well. So, a specific synonym should be valid and used only in one or more categories per language. (the category is indexed in another field). Is this somehow possible in the current SynonymFilterFactory implementation? Hope it makes sense. Thank you, Spyros
Re: Synonyms valid only in specific categories of data
I don't think you can assign a synonyms file dynamically to a field. You would need to create a separate field for each language/category combination, each referencing its own synonyms file. That would be a lot of fields. On 1 June 2011 09:59, Spyros Kapnissis ska...@yahoo.com wrote: Hello to all, I have a collection of text phrases in more than 20 languages that I'm indexing in solr. Each phrase belongs to one of about 30 different phrase categories. I have specified different fields for each language and added a synonym filter at query time. I would however like the synonym filter to take into account the category as well. So, a specific synonym should be valid and used only in one or more categories per language. (the category is indexed in another field). Is this somehow possible in the current SynonymFilterFactory implementation? Hope it makes sense. Thank you, Spyros
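A sketch of what that per-field approach would look like in schema.xml, with hypothetical type and file names (one type per language/category pair):

```xml
<fieldType name="text_en_news" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- one synonyms file per language/category combination -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_en_news.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```

With roughly 20 languages and 30 categories, this multiplies out to some 600 field types, which is why it quickly becomes unwieldy.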
Re: Anyway to know changed documents?
If your index size is smaller (a few 100 MBs), you can consider Solr's operational script tools, provided with the distribution, to sync indexes from the master to the slave servers. They copy the latest index snapshot from the master to the slave(s). The Solr wiki provides good info on how to set them up as cron jobs, so no manual intervention is required. BTW, Solr 1.4+ also has a feature where only the changed segments get synched (but then the index need not be optimized) -- View this message in context: http://lucene.472066.n3.nabble.com/Anyway-to-know-changed-documents-tp3009527p3010015.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Re: Anyway to know changed documents?
Thanks pravesh ^_^ You said Solr 1.4+ also has a feature where only the changed segments get synched. Can you give me a document or some more detailed information please? I've looked through the online documents but didn't find anything. Thanks very much. From: pravesh Sent: 2011-06-01 17:44:55 To: solr-user Subject: Re: Anyway to know changed documents? If your index size is smaller (a few 100 MBs), you can consider Solr's operational script tools, provided with the distribution, to sync indexes from the master to the slave servers. They copy the latest index snapshot from the master to the slave(s). The Solr wiki provides good info on how to set them up as cron jobs, so no manual intervention is required. BTW, Solr 1.4+ also has a feature where only the changed segments get synched (but then the index need not be optimized)
Re: Query problem in Solr
> We're using Solr to search on a Shop index and a Product index
Do you have 2 separate indexes (using distributed shard search)? I suspect you actually have only a single index.
> Currently a Shop has a field `shop_keyword` which also contains the keywords of the products assigned to it.
You mean, for a shop, you first concatenate the keywords of all its products and then save them in the shop_keyword field? In that case there is no way you can identify which keyword occurs in which product in your index. You might need to change the index structure: when you post documents, post a single document per product (with fields like title, price, shop-id, etc.) instead of a single document per shop. Hope I make myself clear
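A sketch of what posting one document per product might look like (field names here are hypothetical); a query like keywords:(apple AND orange) would then only match products that carry both keywords, and the shop can be recovered from the shop_id field:

```xml
<add>
  <doc>
    <field name="id">product-1</field>
    <field name="shop_id">shop-42</field>
    <field name="keywords">apple</field>
  </doc>
  <doc>
    <field name="id">product-2</field>
    <field name="shop_id">shop-42</field>
    <field name="keywords">orange</field>
  </doc>
</add>
```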
Re: Re: Anyway to know changed documents?
The Solr wiki will provide help on this. You might be interested in the pure Java-based replication too. I'm not sure whether the Solr operational scripts have this feature (synching only changed segments). You might need to change the configuration in solrconfig.xml
Re: Solr vs ElasticSearch
On Tue, 31 May 2011 19:38 -0700, Jason Rutherglen jason.rutherg...@gmail.com wrote: Mark, Nice email address. I personally have no idea, maybe ask Shay Banon to post an answer? I think it's possible to make Solr more elastic, eg, it's currently difficult to make it move cores between servers without a lot of manual labor. I'm likely to try playing with moving cores between hosts soon. In theory it shouldn't be hard. We'll see what the practice is like! Upayavira --- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
Re: Obtaining query AST?
That's pretty awesome. Thanks Renaud! On Tue, 2011-05-31 at 22:56 +0100, Renaud Delbru wrote: Hi, have a look at the flexible query parser of Lucene (contrib package) [1]. It provides a framework to easily create different parsing logic. You should be able to access the AST and to modify as you want how it is translated into a Lucene query (look at processors and pipeline processors). Once you have your own query parser, it is straightforward to plug it into Solr. [1] http://lucene.apache.org/java/3_1_0/api/contrib-queryparser/index.html
Re: Problem with caps and star symbol
Thanks for your point. I was really tripping over that issue. But now I need a bit more help. As far as I have noticed, in the case of a value like role_delete, WordDelimiterFilterFactory indexes two words, role and delete, and a search with either the term role or delete will include that document. Now, in the case of a value like role_delete, I want to index all four terms: [role_delete, roledelete, role, delete]. In total, both the original value and the words produced by WordDelimiterFilterFactory would be indexed. Is that possible? Can some additional filter combined with WordDelimiterFilterFactory do that, or can any other filter perform such an operation? On Tue, May 31, 2011 at 8:07 PM, Erick Erickson erickerick...@gmail.com wrote: I think you're tripping over the issue that wildcards aren't analyzed, they don't go through your analysis chain. So the casing matters. Try lowercasing the input and I believe you'll see more like what you expect... Best Erick On Mon, May 30, 2011 at 12:07 AM, Saumitra Chowdhury saumi...@smartitengineering.com wrote: I am sending some xml to understand the scenario.
Indexed term = ROLE_DELETE, Search Term = roledelete

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : roledelete</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="creationDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="displayName">Global Role for Deletion</str>
      <str name="id">role:9223372036854775802</str>
      <str name="lastModifiedDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="name">ROLE_DELETE</str>
    </doc>
  </result>
</response>

Indexed term = ROLE_DELETE, Search Term = role

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">5</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : role</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="creationDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="displayName">Global Role for Deletion</str>
      <str name="id">role:9223372036854775802</str>
      <str name="lastModifiedDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="name">ROLE_DELETE</str>
    </doc>
  </result>
</response>

Indexed term = ROLE_DELETE, Search Term = role*

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : role*</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="creationDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="displayName">Global Role for Deletion</str>
      <str name="id">role:9223372036854775802</str>
      <str name="lastModifiedDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="name">ROLE_DELETE</str>
    </doc>
  </result>
</response>

Indexed term = ROLE_DELETE, Search Term = Role*

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : Role*</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>
Indexed term = ROLE_DELETE, Search Term = ROLE_DELETE*

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : ROLE_DELETE*</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>

I am also adding an analysis html. On Mon, May 30, 2011 at 7:19 AM, Erick Erickson erickerick...@gmail.com wrote: I'd start by looking at the analysis page from the Solr admin page. That will give you an idea of the transformations the various steps carry out, it's invaluable! Best Erick On May 26, 2011 12:53 AM, Saumitra Chowdhury saumi...@smartitengineering.com wrote: Hi all, In my schema.xml I am using WordDelimiterFilterFactory, LowerCaseFilterFactory, StopFilterFactory for the index analyzer and an extra SynonymFilterFactory for the query analyzer. I am indexing a field named 'name'. Now if a value with all caps like NAME_BILL is indexed, I am able to get this document as a search result with the terms name_bill, NAME_BILL, namebill, namebill*, nameb* ... But for terms like the following: NAME_BILL*, name_bill*, namebill*, NAME* the result does not show this document. Can anyone please explain why this is happening? In fact star * is not giving any result in many cases, especially if it is used after the full value of a field. A portion of my schema is given below.

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer
Re: Solr memory consumption
My OS is also CentOS (5.4). If it were 10gb all the time it would be OK, but it grows to 13-15gb and hurts other services =\ It could be environment-specific (the specifics of your top implementation, OS, etc.). I have, on CentOS, 2986m of virtual memory showing although -Xmx2g; you have 10g virtual although -Xmx6g. Don't trust it too much... the top command may count OS buffers for opened files, network sockets, the JVM DLLs themselves, etc. (which is outside Java GC responsibility) in addition to JVM memory... it counts all memory, I'm not sure... If you don't have big values for %wa (which means I/O wait - disk swap usage), everything is fine... -Original Message- From: Denis Kuzmenok Sent: May-31-11 4:18 PM To: solr-user@lucene.apache.org Subject: Solr memory consumption I run multi-core Solr with the flags -Xms3g -Xmx6g -D64, but I see this in top after 6-8 hours, and it's still rising: 17485 test214 10.0g 7.4g 9760 S 308.2 31.3 448:00.75 java -Xms3g -Xmx6g -D64 -Dsolr.solr.home=/home/test/solr/example/multicore/ -jar start.jar Are there any ways to limit memory for sure? Thanks
London open source search social - 13th June
Hi guys, Just to let you know we're meeting up to talk all-things-search on Monday 13th June. There's usually a good mix of backgrounds and experience levels, so if you're free and in the London area it'd be good to see you there. Details: 7pm - The Elgin - 96 Ladbrooke Grove http://www.meetup.com/london-search-social/events/20387881/ Greetings search geeks! We've booked the next meetup for the 13th June. As usual, the plan is to meet up and geek out over a friendly beer. I know my co-organiser René has been working on some interesting search projects, and I've recently left Empora to work on my own project, so by June I should hopefully have some war stories about using @elasticsearch in production. The format is completely open though, so please bring your own topics if you've got them. Hope to see you there! -- Richard Marr
Re: Anyway to know changed documents?
You may be interested in Solr's replication feature? http://wiki.apache.org/solr/SolrReplication On 6/1/2011 2:07 AM, wrote: Hi everyone, I have two servers whose indexes should be synchronized. I update A's index by sending document objects over HTTP. Is there any config or plug-in that lets Solr know which documents changed and push them to B? Any suggestion will be appreciated. Thanks :)
Re: Anyway to know changed documents?
On 6/1/2011 6:12 AM, pravesh wrote: The Solr wiki will provide help on this. You might be interested in the pure Java-based replication too. I'm not sure whether the Solr operational scripts have this feature (synching only changed segments). You might need to change the configuration in solrconfig.xml Yes, this feature has been in the Java/HTTP-based replication since Solr 1.4
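For reference, the Java/HTTP replication Shawn mentions is configured through a ReplicationHandler in solrconfig.xml on both sides; a minimal sketch (the host name and poll interval below are placeholders):

```xml
<!-- On the master -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
  </lst>
</requestHandler>

<!-- On the slave: polls the master and fetches only the changed segments -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```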
Re: Bulk indexing, UpdateProcessor overwriteDupes and poor IO performances
Lee, Thank you very much for your answer. Using the signature field as the uniqueKey is effectively what I was doing, so the overwriteDupes=true parameter in my solrconfig was somewhat redundant, although I wasn't aware of it! =D In practice it works perfectly, and that's the nice part. By the way, I wonder what happens when we enter the following code snippet when the id field is the same as the signature field, from addDoc@DirectUpdateHandler2(AddUpdateCommand):

if (del) { // ensure id remains unique
    BooleanQuery bq = new BooleanQuery();
    bq.add(new BooleanClause(new TermQuery(updateTerm), Occur.MUST_NOT));
    bq.add(new BooleanClause(new TermQuery(idTerm), Occur.MUST));
    writer.deleteDocuments(bq);
}

Maybe all my problems started from here... When I have some time, I'll try to reproduce using a different uniqueKey field and turning overwriteDupes back on, to see if the problem came from the signature field being the same as the uniqueKey field *and* having overwriteDupes on. If so, maybe a simple configuration check should be performed to avoid the issue. Otherwise it means that having overwriteDupes turned on simply doesn't scale, and that should be added to the wiki's Deduplication page, IMHO. Thank you again. Regards, -- Tanguy On 31/05/2011 14:58, lee carroll wrote: Tanguy, you might have tried this already, but can you set overwriteDupes to false and set the signature key to be the id? That way Solr will manage updates. From the wiki http://wiki.apache.org/solr/Deduplication:

<!-- An example dedup update processor that creates the id field on the fly based on the hash code of some other fields. This example has overwriteDupes set to false since we are using the id field as the signatureField and Solr will maintain uniqueness based on that anyway. -->
-- HTH Lee On 30 May 2011 08:32, Tanguy Moal tanguy.m...@gmail.com wrote: Hello, Sorry for re-posting this but it seems my message got lost in the mailing list's message stream without catching anyone's attention... =D Shortly: has anyone already experienced dramatic indexing slowdowns during large bulk imports with overwriteDupes turned on and a fairly high duplicate rate (around 4-8x)? It seems to produce a lot of deletions, which in turn appear to make the merging of segments pretty slow, by noticeably increasing the number of small read operations occurring simultaneously with the regular large write operations of the merge. Added to the poor IO performance of a commodity SATA drive, indexing takes ages. I temporarily bypassed that limitation by disabling the overwriting of duplicates, but that changes the way I query the index, requiring me to turn on field collapsing at search time. Is this a known limitation? Does anyone have a few hints on how to optimize the handling of index-time deduplication? More details on my setup and the state of my understanding are in my previous message hereafter. Thank you very much in advance. Regards, Tanguy On 05/25/11 15:35, Tanguy Moal wrote: Dear list, I'm posting here after some unsuccessful investigations. In my setup I push documents to Solr using the StreamingUpdateSolrServer. I'm sending a comfortable initial amount of documents (~250M) and wished to perform overwriting of duplicated documents at index time, during the update, taking advantage of the UpdateProcessorChain. At the beginning of the indexing stage, everything is quite fast; documents arrive at a rate of about 1000 doc/s. The only extra processing during the import is the computation of a couple of hashes used to identify documents uniquely given their content, using both stock (MD5Signature) and custom (derived from Lookup3Signature) update processors. I send a commit command to the server every 500k documents sent.
During a first period, the server is CPU bound. After a short while (~10 minutes), the rate at which documents are received starts to fall dramatically, the server becoming IO bound. At first I thought this was a normal speed decrease during the commit, while my push client waits for the flush to occur. That would have been a normal slowdown. What caught my attention was that, unexpectedly, the server was performing a lot of small reads, far more than the number of writes, which seem to be larger. The combination of the many small reads with the constant amount of bigger writes seems to create a lot of IO contention on my commodity SATA drive, and the ETA of my built index started to increase scarily =D I then restarted the JVM with JMX enabled so I could investigate a little more. I then realized that the UpdateHandler was performing many reads while processing the update request. Are there any known limitations around the UpdateProcessorChain when overwriteDupes is set to true? I turned that off, which of course breaks the
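For reference, the deduplication setup discussed in this thread lives in an update processor chain in solrconfig.xml; a sketch along the lines of the wiki example Lee quotes (the fields list and signature class here are illustrative):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- the signature is written into the uniqueKey field and
         overwriteDupes stays false: Solr then maintains uniqueness
         through its normal overwrite-by-id behaviour -->
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```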
Re: Index vs. Query Time Aware Filters
Could you post one of your pairs of definitions? I don't recognize queryMode and a web search doesn't turn anything up, so I'm puzzled. Best Erick On Wed, Jun 1, 2011 at 1:13 AM, Mike Schultz mike.schu...@gmail.com wrote: We have very long schema files for each of our language-dependent query shards. One thing that is doubling the configuration length of our main text processing field definition is that we have to repeat the exact same filter chain for the query-time version EXCEPT with a queryMode=true parameter. Is there a way for a filter to figure out whether it's the index-time or query-time version? A similar wish would be for the filter to be able to figure out the name of the field currently being indexed. This would allow a filter to set a parameter at runtime based on the field name, instead of boilerplate-copying the same filter chain definition in schema.xml EXCEPT for one parameter. The motivation is again to reduce errors and increase readability of the schema file.
Re: Query problem in Solr
If I read this correctly, one approach is to specify an increment gap on a multiValued field, then search for phrases with a slop less than that increment gap. I.e. positionIncrementGap="100" in your field definition, and search for "apple orange"~99. If this is gibberish, please post some examples and we'll try something else. Best Erick On Wed, Jun 1, 2011 at 4:21 AM, Kurt Sultana kurtanat...@gmail.com wrote: Hi all, We're using Solr to search on a Shop index and a Product index. Currently a Shop has a field `shop_keyword` which also contains the keywords of the products assigned to it. The shop keywords are separated by a space. Consequently, if there is a product with the keyword apple and another with orange, a search for shops matching `Apple AND Orange` returns the shop for these products. However, this is incorrect: we want a search for `Apple AND Orange` to return only shop(s) having a product with both apple and orange as keywords. We tried solving this by making shop keywords multi-valued and assigning the keywords of every product of the shop as a new value in shop keywords. However, as was confirmed in another post http://markmail.org/thread/xce4qyzs5367yplo#query:+page:1+mid:76eerw5yqev2aanu+state:results, Solr does not support requiring that all words match within the same value of a multi-valued field. (Hope I explained myself well.) How can we go about this? Ideally, we shouldn't have to change our search infrastructure dramatically. Thanks! Krt_Malta
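A sketch of Erick's suggestion (the type and field names are hypothetical): with a positionIncrementGap of 100 between successive values of the multiValued field, a phrase query with a slop of 99 cannot straddle two products' keyword lists, so matches stay within one value:

```xml
<fieldType name="text_gap" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="shop_keywords" type="text_gap" indexed="true" stored="true"
       multiValued="true"/>
```

The query would then be shop_keywords:"apple orange"~99.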
Re: Problem with caps and star symbol
Take a look here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory I think you want generateWordParts=1, catenateWords=1 and preserveOriginal=1, but check it out with the admin/analysis page. Oh, and your index-time and query-time patterns for WDFF will probably be different, see the example schema. Best Erick On Wed, Jun 1, 2011 at 7:40 AM, Saumitra Chowdhury saumi...@smartitengineering.com wrote: Thanks for your point. I was really tripping over that issue. But now I need a bit more help. As far as I have noticed, in the case of a value like role_delete, WordDelimiterFilterFactory indexes two words, role and delete, and a search with either the term role or delete will include that document. Now, in the case of a value like role_delete, I want to index all four terms: [role_delete, roledelete, role, delete]. In total, both the original value and the words produced by WordDelimiterFilterFactory would be indexed. Is that possible? Can some additional filter combined with WordDelimiterFilterFactory do that, or can any other filter perform such an operation? On Tue, May 31, 2011 at 8:07 PM, Erick Erickson erickerick...@gmail.com wrote: I think you're tripping over the issue that wildcards aren't analyzed, they don't go through your analysis chain. So the casing matters. Try lowercasing the input and I believe you'll see more like what you expect... Best Erick On Mon, May 30, 2011 at 12:07 AM, Saumitra Chowdhury saumi...@smartitengineering.com wrote: I am sending some xml to understand the scenario.
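Erick's suggested settings would look something like this in the index-time analyzer chain (a sketch to verify on the admin/analysis page):

```xml
<!-- For role_delete: generateWordParts yields role, delete;
     catenateWords yields roledelete; preserveOriginal keeps role_delete -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
```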
Re: collapse component with pivot faceting
You might have more luck going the other way, applying the field collapsing patch to trunk. This is currently being worked on, see: https://issues.apache.org/jira/browse/SOLR-2564 Best Erick On Wed, Jun 1, 2011 at 12:22 AM, Isha Garg isha.g...@orkash.com wrote: Hi, Actually I am currently using Solr version 3.0. I applied the field collapsing patch. Field collapsing works fine with collapse.facet=after for any facet.field, but when I try a facet.pivot query after collapse.facet=after it does not show any results. Also, pivot faceting is not present in Solr 3.0. So which pivot faceting patch should I use with Solr 3.0? Solr 4.0 supports pivot faceting but does not have the field collapsing feature. Can anyone guide me on which Solr version supports both field collapsing and pivot faceting? Thanks in Advance! Isha Garg On Tuesday 31 May 2011 07:39 PM, Erick Erickson wrote: Please provide a more detailed request. This is so general that it's hard to respond. What is the use-case you're trying to understand/implement? You might review: http://wiki.apache.org/solr/UsingMailingLists Best Erick On Mon, May 30, 2011 at 4:31 AM, Isha Garg isha.g...@orkash.com wrote: Hi All! Can anyone tell me how pivot faceting works in combination with field collapsing? Please guide me in this respect. Thanks! Isha Garg
Re: Edgengram
Hi Tomás, Thank you very much for your suggestion. I took another crack at it using your recommendation and it worked ideally. The only thing I had to change was

<analyzer type="query">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>

to

<analyzer type="query">
  <tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>

The first did not produce any results but the second worked beautifully. Thanks! Brian Lamb 2011/5/31 Tomás Fernández Löbbe tomasflo...@gmail.com: ...or also use the LowerCaseTokenizerFactory at query time for consistency, but not the edge ngram filter. 2011/5/31 Tomás Fernández Löbbe tomasflo...@gmail.com: Hi Brian, I don't know if I understand what you are trying to achieve. You want the term query abcdefg to have an idf of 1 instead of 7? I think using the KeywordTokenizerFactory at query time should work. It would be something like:

<fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
  <analyzer type="index">
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

This way, at query time abcdefg won't be turned into a ab abc abcd abcde abcdef abcdefg. At index time it will. Regards, Tomás On Tue, May 31, 2011 at 1:07 PM, Brian Lamb brian.l...@journalexperts.com wrote:

<fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
  </analyzer>
</fieldType>

I believe I used that link when I initially set up the field and it worked great (and I'm still using it in other places). In this particular example however it does not appear to be practical for me. I mentioned that I have a similarity class that returns 1 for the idf, and in the case of an edgengram it returns 1 * length of the search string.
Thanks, Brian Lamb

On Tue, May 31, 2011 at 11:34 AM, bmdakshinamur...@gmail.com wrote: Can you specify the analyzer you are using for your queries? Maybe you could use a KeywordAnalyzer for your queries so you don't end up matching parts of your query. http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ This should help you.

On Tue, May 31, 2011 at 8:24 PM, Brian Lamb brian.l...@journalexperts.com wrote: In this particular case, I will be doing a solr search based on user preferences. So I will not be depending on the user to type abcdefg. That will be automatically generated based on user selections. The contents of the field do not contain spaces, and since I am creating the search parameters, case isn't important either. Thanks, Brian Lamb

On Tue, May 31, 2011 at 9:44 AM, Erick Erickson erickerick...@gmail.com wrote: That'll work for your case, although be aware that string types aren't analyzed at all, so case matters, as do spaces etc. What is the use-case here? If you explain it a bit there might be better answers. Best Erick

On Fri, May 27, 2011 at 9:17 AM, Brian Lamb brian.l...@journalexperts.com wrote: For this, I ended up just changing it to string and using abcdefg* to match. That seems to work so far. Thanks, Brian Lamb

On Wed, May 25, 2011 at 4:53 PM, Brian Lamb brian.l...@journalexperts.com wrote: Hi all, I'm running into some confusion with the way edgengram works. I have the field set up as:

<fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100" side="front"/>
  </analyzer>
</fieldType>

I've also set up my own similarity class that returns 1 as the idf score.
What I've found this does is if I match a string abcdefg against a field containing abcdefghijklmnop, then the idf will score that as a 7: 7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2 abcdefg=2) I get why that's happening, but is there a way to avoid that? Do I need to do a new field type to achieve the desired effect? Thanks, Brian Lamb -- Thanks and Regards, DakshinaMurthy BM
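To make the scoring above concrete, here is a plain-Python sketch (not Lucene code) of the front edge n-grams the filter emits. With the n-gram filter in the query analysis chain, abcdefg expands to seven terms, which is why a similarity returning idf=1 per term still sums to 7; with KeywordTokenizerFactory and no n-gram filter at query time, the query stays one term.

```python
# Plain-Python sketch (not Lucene code) of what EdgeNGramFilterFactory
# with side="front" does to a single term at analysis time.
def edge_ngrams(term, min_gram=1, max_gram=25):
    """Return the front edge n-grams of a term, shortest first."""
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]

# "abcdefg" becomes 7 query terms, so idf=1 per matching term sums to 7.
print(edge_ngrams("abcdefg"))
```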
Re: Synonyms valid only in specific categories of data
Yes that would probably be a lot of fields.. I guess a way would be to extend the SynonymFilter and change the format of the synonyms.txt file to take the categories into account. Thanks again for your answer. From: lee carroll lee.a.carr...@googlemail.com To: solr-user@lucene.apache.org Sent: Wednesday, June 1, 2011 12:23 PM Subject: Re: Synonyms valid only in specific categories of data I don't think you can assign a synonyms file dynamically to a field. you would need to create multiple fields for each lang / cat phrases and have their own synonyms file referenced for each field. that would be a lot of fields. On 1 June 2011 09:59, Spyros Kapnissis ska...@yahoo.com wrote: Hello to all, I have a collection of text phrases in more than 20 languages that I'm indexing in solr. Each phrase belongs to one of about 30 different phrase categories. I have specified different fields for each language and added a synonym filter at query time. I would however like the synonym filter to take into account the category as well. So, a specific synonym should be valid and used only in one or more categories per language. (the category is indexed in another field). Is this somehow possible in the current SynonymFilterFactory implementation? Hope it makes sense. Thank you, Spyros
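The per-category idea Spyros describes, extending the synonym lookup to take the category into account, could look roughly like this. This is a plain-Python sketch with invented names and an invented file layout, purely to illustrate the lookup a category-aware SynonymFilter would perform; it is not Solr code.

```python
# Hypothetical sketch: synonym expansion keyed on (language, category)
# instead of language alone. The data layout is invented for illustration.
SYNONYMS = {
    ("en", "lodging"): {"inn": ["hotel", "guesthouse"]},
    ("en", "dining"):  {"inn": ["tavern"]},  # same term, different category
}

def expand(term, lang, category):
    """Return the term plus any synonyms valid for this language/category."""
    return [term] + SYNONYMS.get((lang, category), {}).get(term, [])

print(expand("inn", "en", "lodging"))
print(expand("inn", "en", "dining"))
```

In Solr terms, the category (already indexed in another field) would have to reach the filter at query time, which is why the thread concludes that either many fields or a custom filter is needed.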
Re: Solr vs ElasticSearch
I'm likely to try playing with moving cores between hosts soon. In theory it shouldn't be hard. We'll see what the practice is like! Right, in theory it's quite simple, in practice I've setup a master, then a slave, then had to add replication to both, then call create core, then replicate, then unload core on the master. It's nightmarish to setup. The problem is, it freezes each core into a respective role, so if I wanted to then 'move' the slave, I can't because it's still setup as a slave. On Wed, Jun 1, 2011 at 4:14 AM, Upayavira u...@odoko.co.uk wrote: On Tue, 31 May 2011 19:38 -0700, Jason Rutherglen jason.rutherg...@gmail.com wrote: Mark, Nice email address. I personally have no idea, maybe ask Shay Banon to post an answer? I think it's possible to make Solr more elastic, eg, it's currently difficult to make it move cores between servers without a lot of manual labor. I'm likely to try playing with moving cores between hosts soon. In theory it shouldn't be hard. We'll see what the practice is like! Upayavira --- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
Re: Solr vs ElasticSearch
On Wed, 01 Jun 2011 07:52 -0700, Jason Rutherglen jason.rutherg...@gmail.com wrote: I'm likely to try playing with moving cores between hosts soon. In theory it shouldn't be hard. We'll see what the practice is like! Right, in theory it's quite simple, in practice I've setup a master, then a slave, then had to add replication to both, then call create core, then replicate, then unload core on the master. It's nightmarish to setup. The problem is, it freezes each core into a respective role, so if I wanted to then 'move' the slave, I can't because it's still setup as a slave.

Yep, I'm expecting it to require some changes to both the CoreAdminHandler and the ReplicationHandler. Probably the ReplicationHandler would need a 'one-off' replication command. And some way to delete the core when it has been transferred. Upayavira --- Enterprise Search Consultant at Sourcesense UK, Making Sense of Open Source
Re: Solr vs ElasticSearch
And some way to delete the core when it has been transferred. Right, I manually added that to CoreAdminHandler. I opened an issue to try to solve this problem: SOLR-2569
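The manual sequence described in this thread (create a core on the target, pull the index from the source, then unload the source core) maps onto CoreAdmin and ReplicationHandler HTTP calls. The sketch below only builds the URLs involved; host names and paths are placeholders, and the parameter names are the 1.4/3.x-era ones, so treat this as an illustration of the sequence rather than a working migration tool.

```python
# Sketch of the manual "move a core" sequence from the thread, expressed as
# the HTTP endpoints involved. Hosts/paths are placeholders.
from urllib.parse import urlencode

def create_core(host, name, instance_dir):
    """CoreAdmin CREATE on the destination host."""
    return f"http://{host}/solr/admin/cores?" + urlencode(
        {"action": "CREATE", "name": name, "instanceDir": instance_dir})

def pull_index(host, core, master_url):
    """One-off replication pull; masterUrl overrides the configured master."""
    return f"http://{host}/solr/{core}/replication?" + urlencode(
        {"command": "fetchindex", "masterUrl": master_url})

def unload_core(host, name):
    """CoreAdmin UNLOAD on the source host once the index has moved."""
    return f"http://{host}/solr/admin/cores?" + urlencode(
        {"action": "UNLOAD", "core": name})
```

As the thread notes, the catch is that `fetchindex` is only accepted by a core configured as a slave, which is what freezes the roles.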
Re: Edgengram
Be a little careful here. LowerCaseTokenizerFactory is different from KeywordTokenizerFactory. LowerCaseTokenizerFactory will give you more than one term. e.g. the string Intelligence can't be MeaSurEd will give you 5 terms, any of which may match, i.e. intelligence, can, t, be, measured. Whereas KeywordTokenizerFactory followed by, say, LowerCaseFilter would give you exactly one token: intelligence can't be measured. So searching for measured would get a hit in the first case but not in the second. Searching for intellig* would hit both. Neither is better, just make sure they do what you want! This page will help a lot: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseTokenizerFactory as will the admin/analysis page. Best Erick

On Wed, Jun 1, 2011 at 10:43 AM, Brian Lamb brian.l...@journalexperts.com wrote: Hi Tomás, Thank you very much for your suggestion. I took another crack at it using your recommendation and it worked ideally.
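Erick's distinction between the two tokenizers can be seen with a toy re-implementation. This is plain Python, not Lucene code, and the letter-matching is an ASCII-only approximation of the real LowerCaseTokenizer.

```python
import re

# Toy illustration (not Lucene code) of the two analysis chains discussed:
# LowerCaseTokenizer splits on anything that isn't a letter and lowercases;
# KeywordTokenizer + a lowercase filter keeps the whole input as one token.
def lowercase_tokenizer(text):
    # ASCII-only approximation of LowerCaseTokenizerFactory
    return re.findall(r"[a-z]+", text.lower())

def keyword_then_lowercase(text):
    # KeywordTokenizerFactory followed by a lowercase filter
    return [text.lower()]

s = "Intelligence can't be MeaSurEd"
print(lowercase_tokenizer(s))     # ['intelligence', 'can', 't', 'be', 'measured']
print(keyword_then_lowercase(s))  # ["intelligence can't be measured"]
```

Running both on the same input makes it obvious why "measured" matches only the first chain.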
Re: Solr vs ElasticSearch
On 6/1/2011 10:52 AM, Jason Rutherglen wrote: nightmarish to setup. The problem is, it freezes each core into a respective role, so if I wanted to then 'move' the slave, I can't because it's still setup as a slave. Don't know if this helps or not, but you CAN set up a core as both a master and a slave. Normally this is to make it a repeater, still always taking from the same upstream and sending downstream. But there might be a way to hack it for your needs without actually changing Java code: a core _can_ be both a master and slave simultaneously, and there might be a way to change its masterURL (where it pulls from when acting as a slave) without restarting the core too. You can supply a 'custom' (not configured) masterURL in a manual 'pull' command (over HTTP), but of course usually slaves poll rather than be directed by manual 'pull' commands.
Re: Solr vs ElasticSearch
On 6/1/2011 11:26 AM, Upayavira wrote: Probably the ReplicationHandler would need a 'one-off' replication command... It's got one already, if you mean a command you can issue to a slave to tell it to pull replication right now. The thing is, you can only issue this command if the core is configured as a slave. You can turn off polling though. You can include a custom masterURL in the one-off pull command, which overrides whatever masterURL is configured in the core --- but you still need a masterURL configured in the core, or Solr will complain on startup if the core is configured as slave without a masterURL. (And if it's not configured as a slave, you can't issue the one-off pull command.) This is all from my experience on 1.4; don't know if things change in 3.1, probably not.
Re: Solr vs ElasticSearch
Jonathan, This is all true, however it ends up being hacky (this is from experience) and the core on the source needs to be deleted. Feel free to post to the issue. Jason
Re: What's your query result cache's stats?
On 5/31/2011 3:02 PM, Markus Jelsma wrote: Hi, I've seen the stats page many times, of quite a few installations and even more servers. There's one issue that keeps bothering me: the cumulative hit ratio of the query result cache, it's almost never higher than 50%. What are your stats? How do you deal with it?

Below are my stats. I will be lowering my warm counts dramatically when I respin for 3.1. The 28 second warm time is too high for me. I don't think it's going to make a lot of difference in performance. I think most of the warming benefit is realized after the first few queries.

queryResultCache: Concurrent LRU Cache(maxSize=1024, initialSize=1024, minSize=921, acceptableSize=972, cleanupThread=true, autowarmCount=64, regenerator=org.apache.solr.search.SolrIndexSearcher$3@60c0c8b5)
lookups : 932
hits : 528
hitratio : 0.56
inserts : 403
evictions : 0
size : 449
warmupTime : 28198
cumulative_lookups : 980357
cumulative_hits : 622726
cumulative_hitratio : 0.63
cumulative_inserts : 369692
cumulative_evictions : 83711

documentCache: LRU Cache(maxSize=16384, initialSize=4096)
lookups : 68543
hits : 57286
hitratio : 0.83
inserts : 11357
evictions : 0
size : 11357
warmupTime : 0
cumulative_lookups : 219118491
cumulative_hits : 179119106
cumulative_hitratio : 0.81
cumulative_inserts : 3385
cumulative_evictions : 32833254

filterCache: LRU Cache(maxSize=512, initialSize=512, autowarmCount=32, regenerator=org.apache.solr.search.SolrIndexSearcher$2@6910b640)
lookups : 859
hits : 464
hitratio : 0.54
inserts : 465
evictions : 0
size : 464
warmupTime : 27747
cumulative_lookups : 682600
cumulative_hits : 355130
cumulative_hitratio : 0.52
cumulative_inserts : 327479
cumulative_evictions : 161624
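As a sanity check on the numbers above: hitratio is just hits divided by lookups, and the stats page appears to truncate (not round) to two decimal places. A plain-Python recomputation, not Solr code:

```python
# Recompute the hitratio figures from the stats above. The displayed value
# appears to be truncated, not rounded, to two decimal places.
def hit_ratio(hits, lookups):
    return int(hits * 100 // lookups) / 100

print(hit_ratio(528, 932))         # queryResultCache, current searcher: 0.56
print(hit_ratio(622726, 980357))   # queryResultCache, cumulative: 0.63
print(hit_ratio(355130, 682600))   # filterCache, cumulative: 0.52
```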
Re: What's your query result cache's stats?
I believe you need SOME query cache even with low hit counts, for things like a user paging through results. You want the query to still be in the cache when they go to the next page or what have you. Other operations like this may depend on the query cache too for good performance. So even with a low hit rate, you still want enough query cache that all the current queries, all the queries someone is in the middle of doing something with and may do more with, can stay in the cache (what things those are can depend on your particular client interface). So the cache hit count may not actually be a good guide to sizing your query cache. Correct me if I'm wrong, but this is what I've been thinking.
Re: Solr memory consumption
Here is output after about 24 hours running solr. Maybe there is some way to limit memory consumption? :(

test@d6 ~/solr/example $ java -Xms3g -Xmx6g -D64 -Dsolr.solr.home=/home/test/solr/example/multicore/ -jar start.jar
2011-05-31 17:05:14.265:INFO::Logging to STDERR via org.mortbay.log.StdErrLog
2011-05-31 17:05:14.355:INFO::jetty-6.1-SNAPSHOT
2011-05-31 17:05:16.447:INFO::Started SocketConnector@0.0.0.0:4900
#
# A fatal error has been detected by the Java Runtime Environment:
#
# java.lang.OutOfMemoryError: requested 32744 bytes for ChunkPool::allocate. Out of swap space?
#
# Internal Error (allocation.cpp:117), pid=17485, tid=1090320704
# Error: ChunkPool::allocate
#
# JRE version: 6.0_17-b17
# Java VM: OpenJDK 64-Bit Server VM (14.0-b16 mixed mode linux-amd64 )
# Derivative: IcedTea6 1.7.5
# Distribution: Custom build (Wed Oct 13 13:04:40 EDT 2010)
# An error report file with more information is saved as:
# /mnt/data/solr/example/hs_err_pid17485.log
#
# If you would like to submit a bug report, please include
# instructions how to reproduce the bug and visit:
# http://icedtea.classpath.org/bugzilla
#
Aborted

I run multiple-core solr with flags: -Xms3g -Xmx6g -D64, but i see this in top after 6-8 hours and still raising:

17485 test214 10.0g 7.4g 9760 S 308.2 31.3 448:00.75 java -Xms3g -Xmx6g -D64 -Dsolr.solr.home=/home/test/solr/example/multicore/ -jar start.jar

Are there any ways to limit memory for sure? Thanks
Re: Solr vs ElasticSearch
On Wed, 01 Jun 2011 11:47 -0400, Jonathan Rochkind rochk...@jhu.edu wrote: It's got one already, if you mean a command you can issue to a slave to tell it to pull replication right now. The thing is, you can only issue this command if the core is configured as a slave.

Right, but this wouldn't be a slave - so I'd want to wire the destination core so that it can accept a 'pull request' without being correctly configured. Stuff to look at. Upayavira
Re: Index vs. Query Time Aware Filters
I should have explained that the queryMode parameter is for our own custom filter. So the result is that we have 8 filters in our field definition. All the filter parameters (30 or so) of the query time and index time are identical EXCEPT for our one custom filter, which needs to know if it's in query time or index time mode. If we could determine inside our custom code whether we're indexing or querying, then we could omit the query time definition entirely, save about 50 lines of configuration, and be much less error prone. One possible solution would be if we could get at the SolrCore from within a filter. Then at init time we could iterate through the filter chains and determine when we find a factory == this. (I've done this in other places where it's useful to know the name of a ValueSourceParser, for example.)
Re: K-Stemmer for Solr 3.1
Thanks. I'll have to create a Jira account to vote I guess. We are already using KStemmer in 1.4.2 production and I would like to upgrade to 3.1. In the meantime, what is another stemmer I could use out of the box that would behave similarly to KStemmer? Thanks

On 5/28/11 10:02 AM, Steven A Rowe wrote: Hi Mark, Yonik Seeley indicated on LUCENE-152 that he is considering contributing Lucid's KStemmer version to Lucene: https://issues.apache.org/jira/browse/LUCENE-152?focusedCommentId=13035647&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13035647 You can vote on the issue to communicate your interest. Steve

-----Original Message----- From: Mark [mailto:static.void@gmail.com] Sent: Friday, May 27, 2011 7:31 PM To: solr-user@lucene.apache.org Subject: Re: K-Stemmer for Solr 3.1 Where can one find the KStemmer source for 4.0?

On 5/12/11 11:28 PM, Bernd Fehling wrote: I backported a Lucid KStemmer version from solr 4.0 which I found somewhere. Just changed from import org.apache.lucene.analysis.util.CharArraySet; // solr4.0 to import org.apache.lucene.analysis.CharArraySet; // solr3.1 Bernd

On 12.05.2011 16:32, Mark wrote: java.lang.AbstractMethodError: org.apache.lucene.analysis.TokenStream.incrementToken()Z Would you mind explaining your modifications? Thanks

On 5/11/11 11:14 PM, Bernd Fehling wrote: On 12.05.2011 02:05, Mark wrote: It appears that the older version of the Lucid Works KStemmer is incompatible with Solr 3.1. Has anyone been able to get this to work? If not, what are you using as an alternative? Thanks Lucid KStemmer works nice with Solr 3.1 after some minor mods to KStemFilter.java and KStemFilterFactory.java. What problems do you have? Bernd
Re: Solr memory consumption
Are you in fact out of swap space, as the java error suggested? The way JVMs work, always: if you tell it -Xmx6g, it WILL use all 6g eventually. The JVM doesn't garbage collect until it's going to run out of heap space, until it gets to your Xmx. It will keep using RAM until it reaches your Xmx. If your Xmx is set so high you don't have enough RAM available, that will be a problem; you don't want to set Xmx like this. Ideally you don't even want to swap, but normally the OS will swap to give you enough RAM if necessary. If you don't have swap space for it to do that, to give the JVM the 6g you've configured it to take, well, that seems to be what the Java error message is telling you. Of course sometimes error messages are misleading. But yes, if you set Xmx to 6G, the process WILL use all 6G eventually. This is just how the JVM works.

On 6/1/2011 12:15 PM, Denis Kuzmenok wrote: # java.lang.OutOfMemoryError: requested 32744 bytes for ChunkPool::allocate. Out of swap space? Are there any ways to limit memory for sure?
Re: Solr vs ElasticSearch
You _could_ configure it as a slave, if you plan to sometimes use it as a slave. It can be configured as both a master and a slave. You can configure it as a slave, but turn off automatic polling. And then issue one-off replicate commands whenever you want. But yeah, it gets messy, your use case is definitely not what ReplicationHandler is expecting, definitely some Java improvements would be nice, agreed.

On 6/1/2011 12:20 PM, Upayavira wrote: Right, but this wouldn't be a slave - so I'd want to wire the destination core so that it can accept a 'pull request' without being correctly configured. Stuff to look at. Upayavira
Re: Solr memory consumption
So what should i do to avoid that error? I can use 10G on server, now i try to run with flags: java -Xms6G -Xmx6G -XX:MaxPermSize=1G -XX:PermSize=512M -D64. Or should i set Xmx to lower numbers, and what about other params? Sorry, I don't know much about java/jvm =(

Wednesday, June 1, 2011, 7:29:50 PM, you wrote: Are you in fact out of swap space, as the java error suggested?
best way to update custom fieldcache after index commit?
Hi, We use solr and the lucene fieldcache like this: static DocTerms myfieldvalues = org.apache.lucene.search.FieldCache.DEFAULT.getTerms(reader, myField); which is initialized at first use and will stay in memory for fast retrieval of field values based on DocID. The problem is after an index/commit, the lucene fieldcache is reloaded in the new searcher, but this static list needs to be updated as well. What is the best way to handle this? Basically we want to update those custom fieldcaches whenever there is a commit. The possible solutions I can think of: 1) manually call a request handler to clean up those custom caches after commit, which is a hack and ugly. 2) use some listener event (not sure whether I can use the newSearcher event listener in Solr); also there seems to be a lucene ticket (https://issues.apache.org/jira/browse/LUCENE-2474, Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean custom caches that use the IndexReader (getFieldCacheKey)), though it's not clear to me how to use it. Any of your suggestions/comments is much appreciated. Thanks! oleole
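A common pattern for this problem is to stop holding the value in a static and instead key the cache on the reader it was built from (which is what getFieldCacheKey enables on the Lucene side): a commit opens a new reader, the lookup misses, and the value is rebuilt while stale entries are dropped. The sketch below is a toy Python illustration of that pattern, not Solr/Lucene API code.

```python
# Toy illustration (not Solr/Lucene code) of keying a custom cache on the
# reader it was built from, so a new searcher after a commit naturally
# triggers a rebuild and stale entries are evicted.
class PerReaderCache:
    def __init__(self, loader):
        self.loader = loader  # function(reader_key) -> expensive value
        self.cache = {}       # reader_key -> value

    def get(self, reader_key):
        if reader_key not in self.cache:
            self.cache.clear()  # drop values built from stale readers
            self.cache[reader_key] = self.loader(reader_key)
        return self.cache[reader_key]

loads = []
cache = PerReaderCache(lambda r: loads.append(r) or f"terms@{r}")
cache.get("reader1")
cache.get("reader1")  # cached, no reload
cache.get("reader2")  # new reader after commit -> rebuild
print(loads)  # ['reader1', 'reader2']
```

In Solr this maps naturally onto a newSearcher listener or onto LUCENE-2474's eviction listener, as the original mail suggests.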
Re: Solr memory consumption
There is no simple answer. All I can say is that you don't usually want an Xmx larger than the RAM you actually have available, and you _can't_ use more than you have in RAM+swap; the Java error seems to be suggesting you are using more than is available in RAM+swap. That may not be what's going on; JVM memory issues are indeed confusing. Why don't you start smaller and see what happens? But if you end up needing more RAM for your Solr than you have available on the server, then you're just going to need more RAM. You may have to learn something about Java/the JVM to do memory tuning for Solr. Or just start with the default parameters from the Solr example Jetty, and if you don't run into any problems, great. Starting with the example Jetty shipped with Solr would be the easiest way to get started for someone who doesn't know much about Java/the JVM.

On 6/1/2011 12:37 PM, Denis Kuzmenok wrote: So what should I do to avoid that error? I can use 10G on the server; now I try to run with flags: java -Xms6G -Xmx6G -XX:MaxPermSize=1G -XX:PermSize=512M -D64. Or should I set Xmx to lower numbers, and what about the other params? Sorry, I don't know much about Java/JVM =(

Wednesday, June 1, 2011, 7:29:50 PM, you wrote: Are you in fact out of swap space, as the Java error suggested? The way JVMs work, if you tell it -Xmx6g, it WILL use all 6g eventually. The JVM doesn't garbage collect until it's about to run out of heap space, i.e. until it gets to your Xmx; it will keep using RAM until it reaches your Xmx. If your Xmx is set so high that you don't have enough RAM available, that will be a problem; you don't want to set Xmx like this. Ideally you don't even want to swap, but normally the OS will swap to give you enough RAM if necessary. If you don't have swap space for it to do that, to give the JVM the 6g you've configured it to take, well, that seems to be what the Java error message is telling you. Of course, sometimes error messages are misleading.

But yes, if you set Xmx to 6G, the process WILL use all 6G eventually. This is just how the JVM works.
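As a concrete sketch of the "start smaller, from the example defaults" advice above (the 2g heap below is an illustrative assumption, not a recommendation for any particular box):

```shell
# Check what is actually free before picking an Xmx value
free -m

# Start the example Jetty shipped with Solr with a deliberately modest,
# fixed-size heap; raise it only if you hit OutOfMemoryError under real load.
cd apache-solr-1.4.0/example
java -Xms2g -Xmx2g -jar start.jar
```

Setting Xms equal to Xmx just avoids heap-resizing pauses; the important part is keeping Xmx below the RAM actually free on the machine.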
Re: Solr memory consumption
Overall memory on the server is 24G, with 24G of swap; most of the time the swap is free and not used at all, which is why "no free swap" sounds strange to me.
Re: Solr memory consumption
PermSize and MaxPermSize don't need to be higher than 64M. You should read up on JVM tuning; the permanent generation is only used for the code that's being executed.

So what should I do to avoid that error? I can use 10G on the server; now I try to run with flags: java -Xms6G -Xmx6G -XX:MaxPermSize=1G -XX:PermSize=512M -D64. Or should I set Xmx to lower numbers, and what about the other params? Sorry, I don't know much about Java/JVM =(
Re: Solr memory consumption
Could be related to your crazy high MaxPermSize, like Marcus said. I'm no JVM tuning expert either; few people are, it's confusing. So if you don't understand it either, why are you trying to throw in very non-standard parameters you don't understand? Just start with whatever the Solr example Jetty has, and only change things if you have a reason to (that you understand).

On 6/1/2011 1:19 PM, Denis Kuzmenok wrote: Overall memory on the server is 24G, with 24G of swap; most of the time the swap is free and not used at all, which is why "no free swap" sounds strange to me.
Limit data stored from fmap.content with Solr cell
Hello everyone, I have just gotten into extracting information from files with Solr Cell. Some of the files we are indexing are large and have a lot of content. I would like to limit the amount of data I index to a specified number of characters (for example, 300 chars), which I will use as a document preview. Is it possible to set this as a parameter with the fmap.content param, or must I index it all and then do a copyField, but with just a specified number of characters? Thanks in advance Greg
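One way to get the preview without limiting the extraction itself (a schema.xml sketch; field names and types here are assumptions) is to let Solr Cell fill the full-content field and copy only the first 300 characters into a stored preview field, since copyField supports a maxChars attribute:

```xml
<!-- schema.xml sketch: "content" receives the full Tika/Solr Cell extraction;
     "preview" stores only the first 300 characters for display. -->
<field name="content" type="text" indexed="true" stored="false"/>
<field name="preview" type="text" indexed="false" stored="true"/>
<copyField source="content" dest="preview" maxChars="300"/>
```

Note maxChars truncates what is copied, not what is indexed in the source field.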
Re: Solr memory consumption
There were no parameters at all, and Java hit out-of-memory almost every day; then I tried adding parameters but nothing changed. Xms/Xmx didn't solve the problem either. Now I'm trying MaxPermSize, because it's the last thing I haven't tried yet :(

Wednesday, June 1, 2011, 9:00:56 PM, you wrote: Could be related to your crazy high MaxPermSize, like Marcus said. I'm no JVM tuning expert either; few people are, it's confusing.
Newbie question: how to deal with different # of search results per page due to pagination then grouping
Apologize if this question has already been raised. I tried searching but couldn't find the relevant posts. We've indexed a bunch of documents by different authors. Then for search results, we'd like to show the authors that have 1 or more documents matching the search keywords. The problem is right now our solr search method first paginates results to 100 documents per page, then we take the results and group by authors. This results in different number of authors per page. (Some authors may only have one matching document and others 5 or 10.) How do we change it to somehow show the same number of authors (say 25) per page? I mean alternatively we could just show all the documents themselves ordered by author, but it's not the user experience we're looking for. Thanks so much. And please let me know if you need more details not provided here. B -- View this message in context: http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-different-of-search-results-per-page-due-to-pagination-then-grouping-tp3012168p3012168.html Sent from the Solr - User mailing list archive at Nabble.com.
Change default scoring formula
Hi All, I need to change the default scoring formula of Solr. How should I hack the code to do so? Also, is there any way to stop Solr from doing its default scoring and sorting? Thanks, Gaurav -- View this message in context: http://lucene.472066.n3.nabble.com/Change-default-scoring-formula-tp3012196p3012196.html Sent from the Solr - User mailing list archive at Nabble.com.
Debugging a Solr/Jetty Hung Process
About once a day a Solr/Jetty process gets hung on my server, consuming 100% of one of the CPUs. Once this happens the server no longer responds to requests. I've looked through the logs to try to see if anything stands out, but so far I've found nothing out of the ordinary. My current remedy is to log in and just kill the single process that's hung. Once that happens everything goes back to normal and I'm good for a day or so. I'm currently running the following: solr-jetty-1.4.0+ds1-1ubuntu1, which is comprised of Solr 1.4.0 and Jetty 6.1.22, on Ubuntu 10.10. I'm pretty new to managing a Jetty/Solr instance, so at this point I'm just looking for advice on how I should go about troubleshooting this problem. Chris
Re: Debugging a Solr/Jetty Hung Process
Taking a thread dump will tell you what's going on. Bill

On Wed, Jun 1, 2011 at 3:04 PM, Chris Cowan chrisco...@plus3network.com wrote: About once a day a Solr/Jetty process gets hung on my server, consuming 100% of one of the CPUs. Once this happens the server no longer responds to requests.
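For reference, a thread dump can be taken from a shell while the process is hung; these are standard JDK mechanisms, shown as a sketch (replace <pid> with the Jetty process id):

```shell
# Find the Solr/Jetty process id
ps aux | grep start.jar

# SIGQUIT makes the JVM print a full thread dump to its stdout/log
# without killing the process
kill -3 <pid>

# Alternatively, jstack (shipped with the JDK) writes the dump to a file
jstack <pid> > /tmp/solr-threads.txt
```

Nothing needs to be configured ahead of time; both work on an already-running JVM.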
Re: CLOSE_WAIT after connecting to multiple shards from a primary shard
Hi Otis, Sending to the solr-user mailing list. We see these CLOSE_WAIT connections even when I do a simple HTTP request via curl, that is, even a simple curl using a primary and secondary shard query, e.g.:

curl http://primaryshardhost:8180/solr/core0/select?q=*%3A*&shards=secondaryshardhost1:8090/solr/appgroup1_11053000_11053100

While fetching data it is in ESTABLISHED state:

-sh-3.2$ netstat | grep ESTABLISHED | grep 8090
tcp 0 0 primaryshardhost:36805 secondaryshardhost1:8090 ESTABLISHED

After the request has come back, it is in CLOSE_WAIT state:

-sh-3.2$ netstat | grep CLOSE_WAIT | grep 8090
tcp 1 0 primaryshardhost:36805 secondaryshardhost1:8090 CLOSE_WAIT

Why does Solr keep the connection to the shards in CLOSE_WAIT? Is this a feature of Solr? If we modify an OS property (I don't know how) to clean up the CLOSE_WAITs, will it cause an issue with subsequent searches? Can someone help me please? thanks, Mukunda

On Mon, May 30, 2011 at 5:59 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi, A few things: 1) why not send this to the Solr list? 2) you talk about searching, but the code sample is about optimizing the index. 3) I don't have the SolrJ API in front of me, but isn't there a CommonsHttpSolrServer ctor that takes in a URL instead of an HttpClient instance? Try that one. Otis - Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/

- Original Message From: Mukunda Madhava mukunda...@gmail.com To: gene...@lucene.apache.org Sent: Mon, May 30, 2011 1:54:07 PM Subject: CLOSE_WAIT after connecting to multiple shards from a primary shard

Hi, We have a primary Solr shard and multiple secondary shards. We query data from the secondary shards by specifying the shards param in the query params. But we found that after receiving the data, there are a large number of CLOSE_WAITs on the secondary shards from the primary shards, e.g.:

tcp 1 0 primaryshardhost:56109 secondaryshardhost1:8090 CLOSE_WAIT
tcp 1 0 primaryshardhost:51049 secondaryshardhost1:8090 CLOSE_WAIT
tcp 1 0 primaryshardhost:49537 secondaryshardhost1:8089 CLOSE_WAIT
tcp 1 0 primaryshardhost:44109 secondaryshardhost2:8090 CLOSE_WAIT
tcp 1 0 primaryshardhost:32041 secondaryshardhost2:8090 CLOSE_WAIT
tcp 1 0 primaryshardhost:48533 secondaryshardhost2:8089 CLOSE_WAIT

We open the Solr connections as below:

SimpleHttpConnectionManager cm = new SimpleHttpConnectionManager(true);
cm.closeIdleConnections(0L);
HttpClient httpClient = new HttpClient(cm);
solrServer = new CommonsHttpSolrServer(url, httpClient);
solrServer.optimize();

But still we see these issues. Any ideas? -- Thanks, Mukunda
Re: Debugging a Solr/Jetty Hung Process
I'm pretty green... is that something I can do while the event is happening, or is there something I need to configure ahead of time to capture the dump? I've tried to reproduce the problem by putting the server under load, but that doesn't seem to be the trigger. Chris

On Jun 1, 2011, at 12:06 PM, Bill Au wrote: Taking a thread dump will tell you what's going on. Bill
Re: Debugging a Solr/Jetty Hung Process
Sorry... I just found it. I will try that next time. I have a feeling it won't work, since the server usually stops accepting connections. Chris

On Jun 1, 2011, at 12:12 PM, Chris Cowan wrote: I'm pretty green... is that something I can do while the event is happening, or is there something I need to configure ahead of time to capture the dump?
Re: Edgengram
I think in my case LowerCaseTokenizerFactory will be sufficient, because there will never be spaces in this particular field. But thank you for the useful link! Thanks, Brian Lamb

On Wed, Jun 1, 2011 at 11:44 AM, Erick Erickson erickerick...@gmail.com wrote: Be a little careful here. LowerCaseTokenizerFactory is different from KeywordTokenizerFactory. LowerCaseTokenizerFactory will give you more than one term, e.g. the string "Intelligence can't be MeaSurEd" will give you 5 terms, any of which may match, i.e. intelligence, can, t, be, measured. Whereas KeywordTokenizerFactory followed by, say, LowerCaseFilter would give you exactly one token: "intelligence can't be measured". So searching for measured would get a hit in the first case but not in the second. Searching for intellig* would hit both. Neither is better; just make sure they do what you want! This page will help a lot: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseTokenizerFactory as will the admin/analysis page. Best Erick

On Wed, Jun 1, 2011 at 10:43 AM, Brian Lamb brian.l...@journalexperts.com wrote: Hi Tomás, Thank you very much for your suggestion. I took another crack at it using your recommendation and it worked ideally. The only thing I had to change was

<analyzer type="query">
  <tokenizer class="solr.KeywordTokenizerFactory" />
</analyzer>

to

<analyzer type="query">
  <tokenizer class="solr.LowerCaseTokenizerFactory" />
</analyzer>

The first did not produce any results but the second worked beautifully. Thanks! Brian Lamb

2011/5/31 Tomás Fernández Löbbe tomasflo...@gmail.com: ...or also use the LowerCaseTokenizerFactory at query time for consistency, but not the edge ngram filter.

2011/5/31 Tomás Fernández Löbbe tomasflo...@gmail.com: Hi Brian, I don't know if I understand what you are trying to achieve. You want the term query abcdefg to have an idf of 1 instead of 7? I think using the KeywordTokenizerFactory at query time should work.
It would be something like:

<fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
  <analyzer type="index">
    <tokenizer class="solr.LowerCaseTokenizerFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory" />
  </analyzer>
</fieldType>

This way, at query time abcdefg won't be turned into a ab abc abcd abcde abcdef abcdefg. At index time it will. Regards, Tomás

On Tue, May 31, 2011 at 1:07 PM, Brian Lamb brian.l...@journalexperts.com wrote:

<fieldType name="edgengram" class="solr.TextField" positionIncrementGap="1000">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front" />
  </analyzer>
</fieldType>

I believe I used that link when I initially set up the field and it worked great (and I'm still using it in other places). In this particular example, however, it does not appear to be practical for me. I mentioned that I have a similarity class that returns 1 for the idf, and in the case of an edgengram, it returns 1 * length of the search string. Thanks, Brian Lamb

On Tue, May 31, 2011 at 11:34 AM, bmdakshinamur...@gmail.com wrote: Can you specify the analyzer you are using for your queries? Maybe you could use a KeywordAnalyzer for your queries so you don't end up matching parts of your query. http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ This should help you.

On Tue, May 31, 2011 at 8:24 PM, Brian Lamb brian.l...@journalexperts.com wrote: In this particular case, I will be doing a Solr search based on user preferences, so I will not be depending on the user to type abcdefg; that will be automatically generated based on user selections. The contents of the field do not contain spaces, and since I am creating the search parameters, case isn't important either.
Thanks, Brian Lamb

On Tue, May 31, 2011 at 9:44 AM, Erick Erickson erickerick...@gmail.com wrote: That'll work for your case, although be aware that string types aren't analyzed at all, so case matters, as do spaces, etc. What is the use case here? If you explain it a bit there might be better answers. Best Erick

On Fri, May 27, 2011 at 9:17 AM, Brian Lamb brian.l...@journalexperts.com wrote: For this, I ended up just changing it to string and using abcdefg* to match. That seems to work so far. Thanks, Brian Lamb On Wed, May
Re: Change default scoring formula
Hi Gaurav, not sure what your use case is (and if no sorting at all is ever required, are Solr/Lucene what you need?). You can certainly sort by a field (or more) in descending or ascending order by using the sort parameter. You can customize the scoring algorithm by overriding the DefaultSimilarity class, but first make sure that this is what you need, as most use cases can be implemented with the default similarity plus queries / filter queries / function queries, etc. Regards, Tomás

On Wed, Jun 1, 2011 at 4:02 PM, ngaurav2005 ngaurav2...@gmail.com wrote: Hi All, I need to change the default scoring formula of Solr. How should I hack the code to do so? Also, is there any way to stop Solr from doing its default scoring and sorting? Thanks, Gaurav
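As an illustrative sketch of the override route (the method signatures below are from the Lucene 2.9/3.x API that this generation of Solr uses; the class and package names are assumptions, not a drop-in implementation):

```java
package com.example;

import org.apache.lucene.search.DefaultSimilarity;

// Sketch: neutralize tf and idf so that neither term frequency nor term
// rarity influences the score. Registered in schema.xml with:
//   <similarity class="com.example.FlatSimilarity"/>
public class FlatSimilarity extends DefaultSimilarity {
    @Override
    public float idf(int docFreq, int numDocs) {
        return 1.0f; // every term weighted equally, regardless of rarity
    }

    @Override
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f; // match / no-match only
    }
}
```

Compile it against the Lucene jar shipped with your Solr version and drop the resulting jar in Solr's lib directory.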
Re: Newbie question: how to deal with different # of search results per page due to pagination then grouping
There's no great way to do that. One approach would be using facets, but that will just get you the author names (as stored in fields), and not the documents under them. If you really only want to show the author names, facets could work. One issue with facets, though, is that Solr won't tell you the total number of facet values for your query, so it's tricky to provide next/prev paging through them.

There is also a 'field collapsing' feature that I think is not in a released Solr, but may be in the Solr repo. I'm not sure it will quite do what you want either, although it's related and worth a look: http://wiki.apache.org/solr/FieldCollapsing

Another vaguely related thing that is also not yet in a released Solr is a 'join' function. That could possibly be used to do what you want, although it'd be tricky too: https://issues.apache.org/jira/browse/SOLR-2272

Jonathan

On 6/1/2011 2:56 PM, beccax wrote: We've indexed a bunch of documents by different authors. Then for search results, we'd like to show the authors that have 1 or more documents matching the search keywords. ... How do we change it to somehow show the same number of authors (say 25) per page?
Searching using a PDF
Is it possible to do a search based on a PDF file? I know it's possible to update the index with a PDF, but can you do just a regular search with it? Thanks, Brian Lamb
Re: Debugging a Solr/Jetty Hung Process
First guess (and it really is just a guess) would be Java garbage collection taking over. There are some JVM parameters you can use to tune the GC process; especially if the machine is multi-core, making sure GC happens in a separate thread is helpful. But figuring out exactly what's going on requires confusing JVM debugging, at which I am no expert either.

On 6/1/2011 3:04 PM, Chris Cowan wrote: About once a day a Solr/Jetty process gets hung on my server, consuming 100% of one of the CPUs. Once this happens the server no longer responds to requests.
Re: best way to update custom fieldcache after index commit?
How are you implementing your custom cache? If you're defining it in the solrconfig, couldn't you implement the regenerator? See: http://wiki.apache.org/solr/SolrCaching#User.2BAC8-Generic_Caches Best Erick

On Wed, Jun 1, 2011 at 12:38 PM, oleole oleol...@gmail.com wrote: Hi, We use the Solr and Lucene FieldCache like this:

static DocTerms myfieldvalues = org.apache.lucene.search.FieldCache.DEFAULT.getTerms(reader, myField);

which is initialized at first use and stays in memory for fast retrieval of field values based on docID. The problem is that after an index/commit, the Lucene FieldCache is reloaded in the new searcher, but this static list needs to be updated as well. What is the best way to handle this? Basically we want to update these custom fieldcaches whenever there is a commit. The possible solutions I can think of: 1) manually call a request handler to clean up the custom caches after commit, which is a hack and ugly; 2) use some listener event (not sure whether I can use the newSearcher event listener in Solr); there also seems to be a Lucene ticket (https://issues.apache.org/jira/browse/LUCENE-2474, "Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean custom caches that use the IndexReader (getFieldCacheKey)"), though it's not clear to me how to use it. Any suggestions/comments are much appreciated. Thanks! oleole
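For reference, the user-cache-with-regenerator approach Erick points at looks roughly like this in solrconfig.xml (the cache name and regenerator class here are assumptions for illustration):

```xml
<!-- solrconfig.xml sketch: a generic user cache with a custom regenerator.
     When a commit opens a new searcher, autowarming invokes the regenerator,
     which can rebuild entries against the new IndexReader. -->
<cache name="myFieldValueCache"
       class="solr.LRUCache"
       size="4096"
       initialSize="1024"
       autowarmCount="1024"
       regenerator="com.example.MyCacheRegenerator"/>
```

This avoids the static-field problem entirely: the cache is tied to the searcher's lifecycle rather than to the classloader.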
Re: Limit data stored from fmap.content with Solr cell
If you can live with an across-the-board limit, you can set maxFieldLength in your solrconfig.xml file. Note that this is in terms rather than chars, though... Best Erick

On Wed, Jun 1, 2011 at 2:22 PM, Greg Georges greg.geor...@biztree.com wrote: Hello everyone, I have just gotten into extracting information from files with Solr Cell. Some of the files we are indexing are large and have a lot of content. I would like to limit the amount of data I index to a specified number of characters (for example, 300 chars), which I will use as a document preview.
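For reference, the setting lives in the index section of solrconfig.xml; the value below is illustrative:

```xml
<!-- solrconfig.xml: cap how many tokens get indexed per field.
     Note the unit is terms (tokens), not characters. -->
<indexDefaults>
  <maxFieldLength>10000</maxFieldLength>
</indexDefaults>
```

Because the cut-off is counted in tokens after analysis, the number of stored characters it corresponds to varies with the content.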
NRT facet search options comparison
Hi, I need to provide NRT search with faceting. I've been looking at the options out there and wondered if anyone could clarify some questions I have, and perhaps share your NRT experiences. The various NRT options:

1) Solr - Solr doesn't have NRT yet. What is the expected time frame for NRT? Is it a few months, or more like a year? How would Solr faceting work with NRT? My understanding is that faceting in Solr relies on caching, which doesn't go well with NRT updates. When NRT arrives, would facet performance take a huge drop because of this caching issue?

2) ElasticSearch - ES supports NRT, so that's great. Does anyone have experiences with ES that they could share? Does faceting work with NRT in ES? Are any Solr features missing in ES?

3) Solr-RA - I read on this list about Solr-RA, which has NRT support. Has anyone used it? Can you share your experiences? Again, I'm not sure whether faceting would work with Solr-RA NRT. Solr-RA is based on Solr, so faceting in Solr-RA relies on caching, I suppose. Does NRT affect facet performance?

4) Zoie plugin for Solr - Zoie is an NRT search library. I tried but couldn't get the Zoie plugin to work with Solr; I always got an error message about opening too many Searchers. Has anyone gotten this to work?

Any other options? Thanks Andy
Re: Searching using a PDF
I'm not quite sure what you mean by regular search. When you index a PDF (Presumably through Tika or Solr Cell) the text is indexed into your index and you can certainly search that. Additionally, there may be meta data indexed in specific fields (e.g. author, date modified, etc). But what does search based on a PDF file mean in your context? Best Erick On Wed, Jun 1, 2011 at 3:41 PM, Brian Lamb brian.l...@journalexperts.com wrote: Is it possible to do a search based on a PDF file? I know its possible to update the index with a PDF but can you do just a regular search with it? Thanks, Brian Lamb
Re: Change default scoring formula
Thanks Tomás. Well, I am sorting results by a function query. I don't want Solr to spend extra effort calculating the score for each document and eating up my CPU cycles. Also, I need to use an if-condition in the score calculation, which I emulated through the map function, but the map function does not accept a function as one of its values. This forces me to write my own scoring algorithm. Can you help me with the steps, or a link to any post which explains step by step how to override the default scoring algorithm (the DefaultSimilarity class)? Thanks in advance. Gaurav -- View this message in context: http://lucene.472066.n3.nabble.com/Change-default-scoring-formula-tp3012196p3012372.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Spellcheck Phrases
Tanner, I just entered SOLR-2571 to fix the float-parsing-bug that breaks thresholdTokenFrequency. Its just a 1-line code fix so I also included a patch that should cleanly apply to solr 3.1. See https://issues.apache.org/jira/browse/SOLR-2571 for info and patches. This parameter appears absent from the wiki. And as it has always been broken for me, I haven't tested it. However, my understanding it should be set as the minimum percentage of documents in which a term has to occur in order for it to appear in the spelling dictionary. For instance in the config below, a term would have to occur in at least 1% of the documents for it to be part of the spelling dictionary. This might be a good setting for long fields but for the short fields in my application, I was thinking of setting this to something like 1/1000 of 1% ... searchComponent name=spellcheck class=solr.SpellCheckComponent str name=queryAnalyzerFieldTypetext/str lst name=spellchecker str name=namespellchecker/str str name=fieldSpelling_Dictionary/str str name=fieldTypetext/str str name=spellcheckIndexDir./spellchecker/str str name=thresholdTokenFrequency.01/str /lst /searchComponent James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Tanner Postert [mailto:tanner.post...@gmail.com] Sent: Friday, May 27, 2011 6:04 PM To: solr-user@lucene.apache.org Subject: Re: Spellcheck Phrases are there any updates on this? any third party apps that can make this work as expected? On Wed, Feb 23, 2011 at 12:38 PM, Dyer, James james.d...@ingrambook.comwrote: Tanner, Currently Solr will only make suggestions for words that are not in the dictionary, unless you specifiy spellcheck.onlyMorePopular=true. However, if you do that, then it will try to improve every word in your query, even the ones that are spelled correctly (so while it might change brake to break it might also change leg to log.) 
You might be able to alleviate some of the pain by setting the thresholdTokenFrequency so as to remove misspelled and rarely-used words from your dictionary, although I personally haven't been able to get this parameter to work. It also doesn't seem to be documented on the wiki, but it is in the 1.4.1 source code, in the class IndexBasedSpellChecker. It's also mentioned in Smiley & Pugh's book. I tried setting it like this, but got a ClassCastException on the float value:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_spelling</str>
  <lst name="spellchecker">
    <str name="name">spellchecker</str>
    <str name="field">Spelling_Dictionary</str>
    <str name="fieldType">text_spelling</str>
    <str name="buildOnOptimize">true</str>
    <str name="thresholdTokenFrequency">.001</str>
  </lst>
</searchComponent>

I have it on my to-do list to look into this further but haven't yet. If you decide to try it and can get it to work, please let me know how you do it. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Tanner Postert [mailto:tanner.post...@gmail.com] Sent: Wednesday, February 23, 2011 12:53 PM To: solr-user@lucene.apache.org Subject: Spellcheck Phrases right now when I search for 'brake a leg', solr returns valid results with no indication of misspelling, which is understandable since all of those terms are valid words and are probably found in a few pieces of our content. My question is: is there any way for it to recognize that the phrase should be "break a leg" and not "brake a leg" and suggest the proper phrase?
RE: Newbie question: how to deal with different # of search results per page due to pagination then grouping
Don't manually group by author from your results, the list will always be incomplete... use faceting instead to show the authors of the books you have found in your search. http://wiki.apache.org/solr/SolrFacetingOverview -Original Message- From: beccax [mailto:bec...@gmail.com] Sent: Wednesday, June 01, 2011 11:56 AM To: solr-user@lucene.apache.org Subject: Newbie question: how to deal with different # of search results per page due to pagination then grouping Apologize if this question has already been raised. I tried searching but couldn't find the relevant posts. We've indexed a bunch of documents by different authors. Then for search results, we'd like to show the authors that have 1 or more documents matching the search keywords. The problem is right now our solr search method first paginates results to 100 documents per page, then we take the results and group by authors. This results in different number of authors per page. (Some authors may only have one matching document and others 5 or 10.) How do we change it to somehow show the same number of authors (say 25) per page? I mean alternatively we could just show all the documents themselves ordered by author, but it's not the user experience we're looking for. Thanks so much. And please let me know if you need more details not provided here. B -- View this message in context: http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-diff erent-of-search-results-per-page-due-to-pagination-then-grouping-tp30121 68p3012168.html Sent from the Solr - User mailing list archive at Nabble.com.
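For reference, the faceting approach amounts to a request along these lines (a sketch only; the field name `author`, the core URL, and the parameter values are illustrative assumptions, not from the thread):

```
http://localhost:8983/solr/select?q=YOUR_KEYWORDS&rows=0
    &facet=true&facet.field=author&facet.limit=25&facet.mincount=1
```

Setting facet.mincount=1 keeps authors with zero matching documents out of the list, and rows=0 skips fetching the documents themselves when you only want the author counts.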
RE: Newbie question: how to deal with different # of search results per page due to pagination then grouping
I think facet.offset allows facet paging nicely by letting you index into the list of facet values. It is working for me... http://wiki.apache.org/solr/SimpleFacetParameters#facet.offset -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Wednesday, June 01, 2011 12:41 PM To: solr-user@lucene.apache.org Subject: Re: Newbie question: how to deal with different # of search results per page due to pagination then grouping There's no great way to do that. One approach would be using facets, but that will just get you the author names (as stored in fields), and not the documents under it. If you really only want to show the author names, facets could work. One issue with facets though is Solr won't tell you the total number of facet values for your query, so it's tricky to provide next/prev paging through them. There is also a 'field collapsing' feature that I think is not in a released Solr, but may be in the Solr repo. I'm not sure it will quite do what you want either though, although it's related and worth a look. http://wiki.apache.org/solr/FieldCollapsing Another vaguely related thing that is also not yet in a released Solr, is a 'join' function. That could possibly be used to do what you want, although it'd be tricky too. https://issues.apache.org/jira/browse/SOLR-2272 Jonathan On 6/1/2011 2:56 PM, beccax wrote: Apologize if this question has already been raised. I tried searching but couldn't find the relevant posts. We've indexed a bunch of documents by different authors. Then for search results, we'd like to show the authors that have 1 or more documents matching the search keywords. The problem is right now our solr search method first paginates results to 100 documents per page, then we take the results and group by authors. This results in different number of authors per page. (Some authors may only have one matching document and others 5 or 10.) How do we change it to somehow show the same number of authors (say 25) per page? 
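Concretely, facet.offset combined with facet.limit slices the facet value list into pages (a sketch; the `author` field name is assumed):

```
# page 1 of authors
...&facet=true&facet.field=author&facet.limit=25&facet.offset=0
# page 2 of authors
...&facet=true&facet.field=author&facet.limit=25&facet.offset=25
```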
I mean alternatively we could just show all the documents themselves ordered by author, but it's not the user experience we're looking for. Thanks so much. And please let me know if you need more details not provided here. B -- View this message in context: http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-diff erent-of-search-results-per-page-due-to-pagination-then-grouping-tp30121 68p3012168.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Newbie question: how to deal with different # of search results per page due to pagination then grouping
How do you know whether to provide a 'next' button, or whether you are at the end of your facet list? On 6/1/2011 4:47 PM, Robert Petersen wrote: I think facet.offset allows facet paging nicely by letting you index into the list of facet values. It is working for me... http://wiki.apache.org/solr/SimpleFacetParameters#facet.offset -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Wednesday, June 01, 2011 12:41 PM To: solr-user@lucene.apache.org Subject: Re: Newbie question: how to deal with different # of search results per page due to pagination then grouping There's no great way to do that. One approach would be using facets, but that will just get you the author names (as stored in fields), and not the documents under it. If you really only want to show the author names, facets could work. One issue with facets though is Solr won't tell you the total number of facet values for your query, so it's tricky to provide next/prev paging through them. There is also a 'field collapsing' feature that I think is not in a released Solr, but may be in the Solr repo. I'm not sure it will quite do what you want either though, although it's related and worth a look. http://wiki.apache.org/solr/FieldCollapsing Another vaguely related thing that is also not yet in a released Solr, is a 'join' function. That could possibly be used to do what you want, although it'd be tricky too. https://issues.apache.org/jira/browse/SOLR-2272 Jonathan On 6/1/2011 2:56 PM, beccax wrote: Apologize if this question has already been raised. I tried searching but couldn't find the relevant posts. We've indexed a bunch of documents by different authors. Then for search results, we'd like to show the authors that have 1 or more documents matching the search keywords. The problem is right now our solr search method first paginates results to 100 documents per page, then we take the results and group by authors. This results in different number of authors per page.
(Some authors may only have one matching document and others 5 or 10.) How do we change it to somehow show the same number of authors (say 25) per page? I mean alternatively we could just show all the documents themselves ordered by author, but it's not the user experience we're looking for. Thanks so much. And please let me know if you need more details not provided here. B -- View this message in context: http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-diff erent-of-search-results-per-page-due-to-pagination-then-grouping-tp30121 68p3012168.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Newbie question: how to deal with different # of search results per page due to pagination then grouping
Yes, that is exactly the issue... we're thinking we might just always show a next button, and if you go too far you simply get zero results. The user gets what the user asks for, and could simply back up, if desired, to where the facet still has values. You could also detect an empty facet result on the front end. Another option is to expand one facet at a time and page only the facet pane, not the whole page, using an ajax call. -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Wednesday, June 01, 2011 2:30 PM To: solr-user@lucene.apache.org Cc: Robert Petersen Subject: Re: Newbie question: how to deal with different # of search results per page due to pagination then grouping How do you know whether to provide a 'next' button, or whether you are at the end of your facet list? On 6/1/2011 4:47 PM, Robert Petersen wrote: I think facet.offset allows facet paging nicely by letting you index into the list of facet values. It is working for me... http://wiki.apache.org/solr/SimpleFacetParameters#facet.offset -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Wednesday, June 01, 2011 12:41 PM To: solr-user@lucene.apache.org Subject: Re: Newbie question: how to deal with different # of search results per page due to pagination then grouping There's no great way to do that. One approach would be using facets, but that will just get you the author names (as stored in fields), and not the documents under it. If you really only want to show the author names, facets could work. One issue with facets though is Solr won't tell you the total number of facet values for your query, so it's tricky to provide next/prev paging through them. There is also a 'field collapsing' feature that I think is not in a released Solr, but may be in the Solr repo. I'm not sure it will quite do what you want either though, although it's related and worth a look.
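A common workaround for the missing facet total (not from the thread, just a standard sketch): request one more facet value than you display, and use the extra value only as a "has next page" probe. A minimal illustration in Python, independent of any live Solr instance (`author` is an assumed field name):

```python
PAGE_SIZE = 25

def facet_page_params(page, page_size=PAGE_SIZE):
    """Build facet paging params, asking for one extra value to detect a next page."""
    return {
        "facet": "true",
        "facet.field": "author",
        "facet.limit": page_size + 1,       # one extra value as the "has next" probe
        "facet.offset": page * page_size,   # page is 0-based
    }

def split_page(facet_values, page_size=PAGE_SIZE):
    """Given the returned facet values, return (values to display, has_next)."""
    return facet_values[:page_size], len(facet_values) > page_size
```

If `split_page` reports `has_next` as False, you can hide the next button instead of letting the user page into an empty result.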
http://wiki.apache.org/solr/FieldCollapsing Another vaguely related thing that is also not yet in a released Solr, is a 'join' function. That could possibly be used to do what you want, although it'd be tricky too. https://issues.apache.org/jira/browse/SOLR-2272 Jonathan On 6/1/2011 2:56 PM, beccax wrote: Apologize if this question has already been raised. I tried searching but couldn't find the relevant posts. We've indexed a bunch of documents by different authors. Then for search results, we'd like to show the authors that have 1 or more documents matching the search keywords. The problem is right now our solr search method first paginates results to 100 documents per page, then we take the results and group by authors. This results in different number of authors per page. (Some authors may only have one matching document and others 5 or 10.) How do we change it to somehow show the same number of authors (say 25) per page? I mean alternatively we could just show all the documents themselves ordered by author, but it's not the user experience we're looking for. Thanks so much. And please let me know if you need more details not provided here. B -- View this message in context: http://lucene.472066.n3.nabble.com/Newbie-question-how-to-deal-with-diff erent-of-search-results-per-page-due-to-pagination-then-grouping-tp30121 68p3012168.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr memory consumption
Hey Denis, * How big is your index in terms of number of documents and index size? * Is it a production system with many search requests? * Is there any pattern to the OOM errors? I.e., right after you start your Solr app, after some search activity, after specific Solr queries, etc.? * What are 1) cache settings, 2) facet and sort-by fields, 3) commit frequency and warmup queries? etc. Generally you might want to connect to your JVM using the jconsole tool and monitor your heap usage (and other JVM/Solr numbers): * http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html * http://wiki.apache.org/solr/SolrJmx#Remote_Connection_to_Solr_JMX HTH, Alexey 2011/6/1 Denis Kuzmenok forward...@ukr.net: There were no parameters at all, and java hit out-of-memory almost every day; then I tried to add parameters but nothing changed. Xms/Xmx did not solve the problem either. Now I am trying MaxPermSize, because it's the last thing I haven't tried yet :( Wednesday, June 1, 2011, 9:00:56 PM, you wrote: Could be related to your crazy high MaxPermSize like Marcus said. I'm no JVM tuning expert either. Few people are; it's confusing. So if you don't understand it either, why are you trying to throw in very non-standard parameters you don't understand? Just start with whatever the Solr example jetty has, and only change things if you have a reason to (that you understand). On 6/1/2011 1:19 PM, Denis Kuzmenok wrote: Overall memory on the server is 24G, plus 24G of swap; most of the time swap is free and not used at all, which is why "no free swap" sounds strange to me..
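For the jconsole route, the JVM needs JMX enabled at startup. A typical set of flags (standard JVM options, shown here for local testing only since authentication and SSL are disabled; the port number is arbitrary):

```
java -Dcom.sun.management.jmxremote \
     -Dcom.sun.management.jmxremote.port=9010 \
     -Dcom.sun.management.jmxremote.ssl=false \
     -Dcom.sun.management.jmxremote.authenticate=false \
     -jar start.jar
```

Then point jconsole at hostname:9010 and watch heap usage, GC activity, and permgen over time to see which space is actually running out.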
Re: DIH render html entities
Maybe HTMLStripTransformer is what you are looking for: * http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer On Tue, May 31, 2011 at 5:35 PM, Erick Erickson erickerick...@gmail.com wrote: Convert them to what? Individual fields in your docs? Text? If the former, you might get some joy from the XPathEntityProcessor. If you want to just strip the markup and index all the content, you might get some joy from the various *html* analyzers listed here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters Best Erick On Fri, May 27, 2011 at 5:19 AM, anass talby anass.ta...@gmail.com wrote: Sorry, my question was not clear. When I get data from the database, some fields contain HTML special chars, and what I want to do is just convert them automatically. On Fri, May 27, 2011 at 1:00 PM, Gora Mohanty g...@mimirtech.com wrote: On Fri, May 27, 2011 at 3:50 PM, anass talby anass.ta...@gmail.com wrote: Is there any way to render html entities in DIH for a specific field? [...] This does not make too much sense: what do you mean by rendering HTML entities? DIH just indexes, so where would it render HTML to, even if it could? Please take a look at http://wiki.apache.org/solr/UsingMailingLists Regards, Gora -- Anass
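If the goal is just stripping markup from a database column during import, the DIH transformer configuration looks roughly like this (a sketch; the entity name, SQL query, and column name are made up for illustration):

```xml
<entity name="item" transformer="HTMLStripTransformer"
        query="select id, body_html from items">
  <!-- stripHTML="true" tells the transformer to remove HTML tags from this column -->
  <field column="body_html" stripHTML="true"/>
</entity>
```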
Re: Better Spellcheck
I've tried to use a spellcheck dictionary built from my own content, but my content ends up having a lot of misspelled words, so the spellcheck ends up being less than effective. You can try the sp.dictionary.threshold parameter to solve this problem: * http://wiki.apache.org/solr/SpellCheckerRequestHandler#sp.dictionary.threshold It also misses phrases. When someone searches for "Untied States" I would hope the spellcheck would suggest "United States", but it just recognizes that "untied" is a valid word and doesn't suggest anything. So you are asking about an auto-suggest component and not spellcheck, right? These are two different use cases. If you want auto-suggest and you have some search logs for your system, then you can probably use the following solution: * http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ If you don't have significant search log history and want to populate your auto-suggest dictionary from the index or some text file, you should check: * http://wiki.apache.org/solr/Suggester
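For the index-based Suggester linked above, the wiki-era configuration is roughly as follows (a sketch; the field name and threshold value are illustrative, not prescriptive):

```xml
<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">suggest_field</str>
    <!-- drop terms that appear in fewer than 0.5% of documents,
         which filters out most one-off misspellings -->
    <float name="threshold">0.005</float>
  </lst>
</searchComponent>
```

The threshold here addresses exactly the "dictionary full of my own misspellings" complaint: rare terms never make it into the suggest dictionary.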
Re: Documents update
Will it be slow if there are 3-5 million key/value rows? AFAIK it shouldn't affect search time significantly, as Solr caches it in memory after you reload the Solr core / issue a commit. But obviously you need more memory, and commit/reload will take more time.
Re: NRT facet search options comparison
Hi Andy: Here is a white paper that shows screenshots of faceting working with Solr and RankingAlgorithm under NRT: http://solr-ra.tgels.com/wiki/en/Near_Real_Time_Search The implementation (src) is also available with the download and is described in the document below: http://solr-ra.tgels.com/papers/NRT_Solr_RankingAlgorithm.pdf The faceting test was done with the mbartists demo from the book Solr 1.4 Enterprise Search Server and is approximately 390k docs. Regards, - Nagendra Nagarajayya http://solr-ra.tgels.com http://rankingalgorithm.tgels.com On 6/1/2011 12:52 PM, Andy wrote: Hi, I need to provide NRT search with faceting. Been looking at the options out there. Wondered if anyone could clarify some questions I have and perhaps share your NRT experiences. The various NRT options: 1) Solr -Solr doesn't have NRT, yet. What is the expected time frame for NRT? Is it a few months or more like a year? -How would Solr faceting work with NRT? My understanding is that faceting in Solr relies on caching, which doesn't go well with NRT updates. When NRT arrives, would facet performance take a huge drop when using with NRT because of this caching issue? 2) ElasticSearch -ES supports NRT so that's great. Does anyone have experiences with ES that they could share? Does faceting work with NRT in ES? Any Solr features that are missing in ES? 3) Solr-RA -Read in this list about Solr-RA, which has NRT support. Has anyone used it? Can you share your experiences? -Again not sure if facet would work with Solr-RA NRT. Solr-RA is based on Solr, so faceting in Solr-RA relies on caching I suppose. Does NRT affect facet performance? 4) Zoie plugin for Solr -Zoie is a NRT search library. I tried but couldn't get the Zoie plugin to work with Solr. Always got the error message of opening too many Searchers. Has anyone got this to work? Any other options? Thanks Andy
Re: NRT facet search options comparison
Nagendra, Thanks. Can you comment on the performance impact of NRT on facet search? The pages you linked to don't really touch on that. My concern is that with NRT, the facet cache will be constantly invalidated. How will that impact the performance of faceting? Do you have any benchmark comparing the performance of facet search with and without NRT? Thanks Andy --- On Wed, 6/1/11, Nagendra Nagarajayya nnagaraja...@transaxtions.com wrote: From: Nagendra Nagarajayya nnagaraja...@transaxtions.com Subject: Re: NRT facet search options comparison To: solr-user@lucene.apache.org Date: Wednesday, June 1, 2011, 11:29 PM Hi Andy: Here is a white paper that shows screenshots of faceting working with Solr and RankingAlgorithm under NRT: http://solr-ra.tgels.com/wiki/en/Near_Real_Time_Search The implementation (src) is also available with the download and is described in the below document: http://solr-ra.tgels.com/papers/NRT_Solr_RankingAlgorithm.pdf The faceting test was done with the mbartists demo from the book, Solr-14-Enterprise-Search-Server and is approx around 390k docs. Regards, - Nagendra Nagarajayya http://solr-ra.tgels.com http://rankingalgorithm.tgels.com On 6/1/2011 12:52 PM, Andy wrote: Hi, I need to provide NRT search with faceting. Been looking at the options out there. Wondered if anyone could clarify some questions I have and perhaps share your NRT experiences. The various NRT options: 1) Solr -Solr doesn't have NRT, yet. What is the expected time frame for NRT? Is it a few months or more like a year? -How would Solr faceting work with NRT? My understanding is that faceting in Solr relies on caching, which doesn't go well with NRT updates. When NRT arrives, would facet performance take a huge drop when using with NRT because of this caching issue? 2) ElasticSearch -ES supports NRT so that's great. Does anyone have experiences with ES that they could share? Does faceting work with NRT in ES? Any Solr features that are missing in ES? 
3) Solr-RA -Read in this list about Solr-RA, which has NRT support. Has anyone used it? Can you share your experiences? -Again not sure if facet would work with Solr-RA NRT. Solr-RA is based on Solr, so faceting in Solr-RA relies on caching I suppose. Does NRT affect facet performance? 4) Zoie plugin for Solr -Zoie is a NRT search library. I tried but couldn't get the Zoie plugin to work with Solr. Always got the error message of opening too many Searchers. Has anyone got this to work? Any other options? Thanks Andy
How to do custom scoring using query parameters?
Hi All, We need to score documents based on some parameters received in the query string. This was not possible via a function query: we need an if condition, which can be emulated through the map function, but one of the output values of our if condition has to be a function, whereas map only accepts constants. So if I rephrase my requirements: 1. Calculate a score for each document using query parameters (search parameters). 2. Sort the documents based on that score. I know that I can change default scoring by overriding the DefaultSimilarity class, but how can this class receive the query parameters required for the score calculation? Also, once the score is calculated, how can I sort the results by it? Regards, Gaurav -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-do-custom-scoring-using-query-parameters-tp3013788p3013788.html Sent from the Solr - User mailing list archive at Nabble.com.
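For simpler variants of this requirement, Solr 3.1+ can sort directly by a function query that dereferences a request parameter, which avoids custom Similarity code entirely. A sketch (the parameter name `w` and field `popularity` are hypothetical, and this does not cover the if/map limitation described above):

```
select?q=*:*&w=2.5&sort=product($w,log(popularity)) desc
```

Here `$w` is substituted with the value of the `w` request parameter at query time, so the client can tune the weighting per request without any Java code.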
Re: Problem with caps and star symbol
It's working as I was looking for. Thanks, Mr. Erick. On Wed, Jun 1, 2011 at 8:29 PM, Erick Erickson erickerick...@gmail.com wrote: Take a look here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory I think you want generateWordParts=1, catenateWords=1 and preserveOriginal=1, but check it out with the admin/analysis page. Oh, and your index-time and query-time patterns for WDFF will probably be different, see the example schema. Best Erick On Wed, Jun 1, 2011 at 7:40 AM, Saumitra Chowdhury saumi...@smartitengineering.com wrote: Thanks for your point. I was really tripping over that issue. But now I need a bit more help. As far as I have noticed, in the case of a value like *role_delete*, WordDelimiterFilterFactory indexes two words, *role* and *delete*, and a search with either term will match that document. Now, in the case of a value like *role_delete*, I want to index all four terms: [*role_delete, roledelete, role, delete*]. In total, both the original word and the words produced by WordDelimiterFilterFactory would be indexed. Is that possible? Can any additional filter combined with WordDelimiterFilterFactory do that, or can any other filter do such an operation? On Tue, May 31, 2011 at 8:07 PM, Erick Erickson erickerick...@gmail.com wrote: I think you're tripping over the issue that wildcards aren't analyzed, they don't go through your analysis chain. So the casing matters. Try lowercasing the input and I believe you'll see more like what you expect... Best Erick On Mon, May 30, 2011 at 12:07 AM, Saumitra Chowdhury saumi...@smartitengineering.com wrote: I am sending some XML to understand the scenario.
Indexed term = ROLE_DELETE, Search Term = roledelete

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : roledelete</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="creationDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="displayName">Global Role for Deletion</str>
      <str name="id">role:9223372036854775802</str>
      <str name="lastModifiedDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="name">ROLE_DELETE</str>
    </doc>
  </result>
</response>

Indexed term = ROLE_DELETE, Search Term = role

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">5</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : role</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="creationDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="displayName">Global Role for Deletion</str>
      <str name="id">role:9223372036854775802</str>
      <str name="lastModifiedDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="name">ROLE_DELETE</str>
    </doc>
  </result>
</response>

Indexed term = ROLE_DELETE, Search Term = role*

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : role*</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="creationDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="displayName">Global Role for Deletion</str>
      <str name="id">role:9223372036854775802</str>
      <str name="lastModifiedDate">Mon May 30 13:09:14 BDST 2011</str>
      <str name="name">ROLE_DELETE</str>
    </doc>
  </result>
</response>

Indexed term = ROLE_DELETE, Search Term = Role*

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : Role*</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>
Indexed term = ROLE_DELETE, Search Term = ROLE_DELETE*

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">4</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">name : ROLE_DELETE*</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>

I am also attaching an analysis HTML. On Mon, May 30, 2011 at 7:19 AM, Erick Erickson erickerick...@gmail.com wrote: I'd start by looking at the analysis page from the Solr admin page. That will give you an idea of the transformations the various steps carry out, it's invaluable! Best Erick On May 26, 2011 12:53 AM, Saumitra Chowdhury saumi...@smartitengineering.com wrote: Hi all, In my schema.xml I am using WordDelimiterFilterFactory, LowerCaseFilterFactory, StopFilterFactory for index
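The WordDelimiterFilterFactory setup Erick describes earlier in this thread (generateWordParts, catenateWords, preserveOriginal) could be sketched in schema.xml like this, so that role_delete yields the tokens role_delete, role, delete, and roledelete at index time (an illustration under assumed names, not the poster's actual schema):

```xml
<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- preserveOriginal keeps role_delete itself alongside the split/catenated parts -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- query side usually avoids catenation to prevent over-matching -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Remember that wildcard terms bypass this analysis chain entirely, which is why Role* found nothing in the responses above while role* matched: lowercase the input yourself before issuing wildcard queries.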