RE: Problem comitting on 40GB index

2010-01-13 Thread Frederico Azeiteiro
Sorry, my bad... I replied to a current mailing list message, only changing the 
subject... I didn't know about this "hijacking" problem. It will not happen again.

Just to close this issue: if I understand correctly, to run an optimize on a 40G 
index I will need:
- 40G if all activity on the index is stopped
- 80G if the index is being searched
- 120G if the index is being searched and a commit is performed.

Is this correct?

Thanks.
Frederico
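Erick confirms these numbers downthread. They amount to a simple worst-case multiplier — roughly one extra index-sized copy per concurrent consumer of the old segments — sketched here (the rule of thumb is from this thread, not official documentation):

```java
public class OptimizeSpace {

    // Rule of thumb from this thread (not an official formula): optimize
    // needs one extra index-sized copy on its own, plus one more if open
    // searchers pin the old segments, plus one more if a commit overlaps.
    static long extraGb(long indexGb, boolean searching, boolean committing) {
        int copies = 1 + (searching ? 1 : 0) + (committing ? 1 : 0);
        return indexGb * copies;
    }

    public static void main(String[] args) {
        System.out.println(extraGb(40, false, false)); // 40
        System.out.println(extraGb(40, true, false));  // 80
        System.out.println(extraGb(40, true, true));   // 120
    }
}
```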
-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: terça-feira, 12 de Janeiro de 2010 19:18
To: solr-user@lucene.apache.org
Subject: Re: Problem comitting on 40GB index

Huh?

On Tue, Jan 12, 2010 at 2:00 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:


 : Subject: Problem comitting on 40GB index
 : In-Reply-To: 
 7a9c48b51001120345h5a57dbd4o8a8a39fc4a98a...@mail.gmail.com

 http://people.apache.org/~hossman/#threadhijack
 Thread Hijacking on Mailing Lists

 When starting a new discussion on a mailing list, please do not reply to
 an existing message, instead start a fresh email.  Even if you change the
 subject line of your email, other mail headers still track which thread
 you replied to and your question is hidden in that thread and gets less
 attention.   It makes following discussions in the mailing list archives
 particularly difficult.
 See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



 -Hoss




Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

2010-01-13 Thread Shalin Shekhar Mangar
On Wed, Jan 13, 2010 at 7:48 AM, Lance Norskog goks...@gmail.com wrote:

 You can do this stripping in the DataImportHandler. You would have to
 write your own stripping code using regular expressions.


Note that DIH has a HTMLStripTransformer which wraps Solr's HTMLStripReader.

-- 
Regards,
Shalin Shekhar Mangar.
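For illustration, a minimal data-config.xml entity using that transformer might look like this (the entity name, query, and column are hypothetical, not from the original posts):

```xml
<entity name="page" transformer="HTMLStripTransformer"
        query="SELECT id, body FROM pages">
  <!-- stripHTML="true" tells the transformer to strip markup from this column -->
  <field column="body" stripHTML="true"/>
</entity>
```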


Problem with 'sint' in More Like This feature

2010-01-13 Thread vi...@8kmiles.com

Hi,

I am using the More Like This feature. I have configured it in 
solrconfig.xml as a dedicated request handler and I am using SolrJ.

It's working properly when the similarity fields are all text data types.
But when I add a field whose datatype is 'sint', it's throwing an exception.

Exception - Caused by: org.apache.solr.common.SolrException: 
java.lang.NumberFormatException: For input string: "?"


Any help / suggestion is much appreciated.

Thanks,
Vijay
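For context, a dedicated MLT request handler in solrconfig.xml typically looks like the sketch below (the field names are placeholders, not from the original post). Note that the mlt.fl fields generally need to be text-analyzed; a sortable integer ('sint') field indexes its terms in an encoded form, which is one plausible source of the NumberFormatException:

```xml
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
  <lst name="defaults">
    <!-- fields used for similarity; placeholder names -->
    <str name="mlt.fl">name,description</str>
    <int name="mlt.mintf">1</int>
    <int name="mlt.mindf">1</int>
  </lst>
</requestHandler>
```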


Re: Queries of type field:value not functioning

2010-01-13 Thread Chantal Ackermann

try /solr/select?q.alt=*:*&qt=dismax
or /solr/select?q=some search term&qt=dismax

dismax should be configured in solrconfig.xml by default, but you have 
to adapt it to list the fields from your schema.xml


and for anything with known field:
/solr/select?q=field:value&qt=standard

Cheers,
Chantal
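For illustration, a dismax requestHandler entry in solrconfig.xml might look like this (the field names in qf are placeholders; adapt them to your schema.xml):

```xml
<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- replace with fields from your schema.xml -->
    <str name="qf">title^2.0 description</str>
    <!-- q.alt is used when no q is given; *:* matches everything -->
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>
```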



Siddhant Goel schrieb:

Hi all,

Any query I make of the form field:value does not return any documents,
and the same is true for the *:* query. The index size is close to 1GB now, so
it should be returning some documents. The rest of the queries are functioning
properly. Any help?

Thanks,

--
- Siddhant




Restricting Facet to FilterQuery in combination with mincount

2010-01-13 Thread Chantal Ackermann

Hi all,

is it possible to restrict the returned facets to only those that apply 
to the filter query but still use mincount=0? Keeping those that have a 
count of 0 but apply to the filter, and at the same time leaving out 
those that are not covered by the filter (and thus 0, as well).



Some longer explanation of the question:


Example (don't nail me down on biology here, it's just for illustration):
q=type:mammal&facet.mincount=0&facet.field=type

returns facets for all values stored in the field type. Results would 
look like:


mammal(2123)
bird(0)
dinosaur(0)
fish(0)
...

In this case setting facet.mincount=1 solves the problem. But consider:

q=area:water&fq=type:mammal&facet.field=name&facet.mincount=0

would return something like
dolphin (20)
blue whale (20)
salmon (0) = not covered by filter query
lion (0)
dog (0)
... (all sorts of animals, every possible value in field name)

My question is: how can I exclude those facets from the result that are 
not covered by the filter query. In this example: how can I exclude the 
non-mammals from the facets but keep all those mammals that are not 
matched by the actual query parameter?


Thanks!
Chantal
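The thread leaves this open. One possible workaround — assuming the candidate values can be fetched first (for example, by faceting on name with q=type:mammal and facet.mincount=1) — is to enumerate just those values as explicit facet.query parameters, since each facet.query reports its count even when it is 0 (animal names here are from the example above):

```
q=area:water&fq=type:mammal&facet=on
  &facet.query=name:dolphin
  &facet.query=name:"blue whale"
  &facet.query=name:lion
  &facet.query=name:dog
```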


Re: Multi language support

2010-01-13 Thread Robert Muir
right, but we should not encourage users to significantly degrade
overall relevance for all movies due to a few movies and a band (very
special cases, as I said).

In English, not using stopwords doesn't really degrade
relevance that much, so it's a reasonable decision to make. This is not
true in other languages!

Instead, systems that worry about all-stopword queries should use
CommonGrams. it will work better for these cases, without taking away
from overall relevance.
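A field type using CommonGrams might be sketched like this (the type name and tokenizer are illustrative; the index-side filter emits common-word bigrams alongside plain tokens, and the query-side variant keeps only the grams where they apply):

```xml
<fieldType name="text_cg" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- emits both plain tokens and common-word bigrams -->
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- query-side variant prefers the bigrams where they apply -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```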

On Wed, Jan 13, 2010 at 1:08 AM, Walter Underwood wun...@wunderwood.org wrote:
 There is a band named The The. And a producer named Don Was. For a list 
 of all-stopword movie titles at Netflix, see this post:

 http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html

 My favorite is To Be and To Have (Être et Avoir), which is all stopwords in 
 two languages. And a very good movie.

 wunder

 On Jan 12, 2010, at 6:55 PM, Robert Muir wrote:

 sorry, i forgot to include this 2009 paper comparing what stopwords do
 across 3 languages:

 http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf

 in my opinion, if stopwords annoy your users for very special cases
 like 'the the' then, instead consider using commongrams +
 defaultsimilarity.discountOverlaps = true so that you still get the
 benefits.

 as you can see from the above paper, they can be extremely important
 depending on the language, they just don't matter so much for English.

 On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog goks...@gmail.com wrote:
 There are a lot of projects that don't use stopwords any more. You
 might consider dropping them altogether.

 On Mon, Jan 11, 2010 at 2:25 PM, Don Werve d...@madwombat.com wrote:
 This is the way I've implemented multilingual search as well.

 2010/1/11 Markus Jelsma mar...@buyways.nl

 Hello,


 We have implemented language specific search in Solr using language
 specific fields and field types. For instance, an en_text field type can
 use an English stemmer and a list of stopwords and synonyms. We, however,
 did not use language-specific stopwords; instead we used one list shared by
 both languages.

 So you would have a field type like:
 <fieldType name="en_text" class="solr.TextField" ...>
   <analyzer type="...">
     <filter class="solr.StopFilterFactory" words="stopwords.en.txt"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synoyms.en.txt"/>

 etc etc.



 Cheers,

 -
 Markus Jelsma          Buyways B.V.
 Technisch Architect    Friesestraatweg 215c
 http://www.buyways.nl  9743 AD Groningen


 Alg. 050-853 6600      KvK  01074105
 Tel. 050-853 6620      Fax. 050-3118124
 Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17


 On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote:

 Hi Solr users.

 I'm trying to set up a site with Solr search integrated. And I use the
 SolJava API to feed the index with search documents. At the moment I
 have only activated search on the English portion of the site. I'm
 interested in using as many features of Solr as possible. Synonyms,
 stopwords, and stems all sound quite interesting and useful, but how do
 I set this up in a good way for a multilingual site?

 The site doesn't have a huge text mass, so performance issues don't
 really bother me, but I'd still like to hear your suggestions before I
 try to implement a solution.

 Best regards

 Daniel





 --
 Lance Norskog
 goks...@gmail.com




 --
 Robert Muir
 rcm...@gmail.com






-- 
Robert Muir
rcm...@gmail.com


Re: Problem comitting on 40GB index

2010-01-13 Thread Erick Erickson
That's my understanding... But fortunately disk space is cheap <G>


On Wed, Jan 13, 2010 at 5:01 AM, Frederico Azeiteiro 
frederico.azeite...@cision.com wrote:

 Sorry, my bad... I replied to a current mailing list message only changing
 the subject... Didn't know about this  Hijacking problem. Will not happen
 again.

 Just for close this issue, if I understand correctly, for an index of 40G,
 I will need, for running an optimize:
 - 40G if all activity on index is stopped
 - 80G if index is being searched...)
 - 120G if index is being searched and if a commit is performed.

 Is this correct?

 Thanks.
 Frederico
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: terça-feira, 12 de Janeiro de 2010 19:18
 To: solr-user@lucene.apache.org
 Subject: Re: Problem comitting on 40GB index

 Huh?

 On Tue, Jan 12, 2010 at 2:00 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:

 
  : Subject: Problem comitting on 40GB index
  : In-Reply-To: 
  7a9c48b51001120345h5a57dbd4o8a8a39fc4a98a...@mail.gmail.com
 
  http://people.apache.org/~hossman/#threadhijack
  Thread Hijacking on Mailing Lists
 
  When starting a new discussion on a mailing list, please do not reply to
  an existing message, instead start a fresh email.  Even if you change the
  subject line of your email, other mail headers still track which thread
  you replied to and your question is hidden in that thread and gets less
  attention.   It makes following discussions in the mailing list archives
  particularly difficult.
  See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking
 
 
 
  -Hoss
 
 



Re: DataImportHandler - synchronous execution

2010-01-13 Thread Alexey Serba
Hi,

I created Jira issue SOLR-1721 and attached simple patch ( no
documentation ) for this.

HIH,
Alex
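As a sketch of the workaround described below (prior to SOLR-1721), DIH can be triggered from SolrJ by sending a request to the handler's path. This assumes a handler registered at /dataimport and a running EmbeddedSolrServer; without the patch, the import itself still runs in DIH's own background thread, so the call returns before the import finishes:

```java
// Sketch only: assumes a DIH handler at /dataimport and a live server.
ModifiableSolrParams params = new ModifiableSolrParams();
params.set("command", "full-import");
QueryRequest req = new QueryRequest(params);
req.setPath("/dataimport");
server.request(req); // returns before the import finishes (pre-SOLR-1721)
```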

2010/1/13 Noble Paul നോബിള്‍  नोब्ळ् noble.p...@corp.aol.com:
 it can be added

 On Tue, Jan 12, 2010 at 10:18 PM, Alexey Serba ase...@gmail.com wrote:
 Hi,

 I found that there's no explicit option to run DataImportHandler in a
 synchronous mode. I need that option to run DIH from SolrJ (
 EmbeddedSolrServer ) in the same thread. Currently I pass dummy stream
 to DIH as a workaround for this, but I think it makes sense to add
 specific option for that. Any objections?

 Alex




 --
 -
 Noble Paul | Systems Architect| AOL | http://aol.com



Re: Multi language support

2010-01-13 Thread Paul Libbrecht
Isn't the conclusion here that stopword- and stemming-free matching should be
the best match, if any, and that we should then gently degrade to weaker
forms of matching?


paul


Le 13-janv.-10 à 07:08, Walter Underwood a écrit :

There is a band named The The. And a producer named Don Was. For  
a list of all-stopword movie titles at Netflix, see this post:


http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html

My favorite is To Be and To Have (Être et Avoir), which is all  
stopwords in two languages. And a very good movie.


wunder

On Jan 12, 2010, at 6:55 PM, Robert Muir wrote:

sorry, i forgot to include this 2009 paper comparing what stopwords do
across 3 languages:

http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf

in my opinion, if stopwords annoy your users for very special cases
like 'the the' then, instead consider using commongrams +
defaultsimilarity.discountOverlaps = true so that you still get the
benefits.

as you can see from the above paper, they can be extremely important
depending on the language, they just don't matter so much for English.


On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog goks...@gmail.com  
wrote:

There are a lot of projects that don't use stopwords any more. You
might consider dropping them altogether.

On Mon, Jan 11, 2010 at 2:25 PM, Don Werve d...@madwombat.com  
wrote:

This is the way I've implemented multilingual search as well.

2010/1/11 Markus Jelsma mar...@buyways.nl


Hello,


We have implemented language-specific search in Solr using
language-specific fields and field types. For instance, an en_text field
type can use an English stemmer and a list of stopwords and synonyms. We,
however, did not use language-specific stopwords; instead we used one list
shared by both languages.

So you would have a field type like:
<fieldType name="en_text" class="solr.TextField" ...>
  <analyzer type="...">
    <filter class="solr.StopFilterFactory" words="stopwords.en.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synoyms.en.txt"/>


etc etc.



Cheers,

-
Markus Jelsma  Buyways B.V.
Technisch ArchitectFriesestraatweg 215c
http://www.buyways.nl  9743 AD Groningen


Alg. 050-853 6600  KvK  01074105
Tel. 050-853 6620  Fax. 050-3118124
Mob. 06-5025 8350  In: http://www.linkedin.com/in/markus17


On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote:


Hi Solr users.

I'm trying to set up a site with Solr search integrated. And I use the
SolJava API to feed the index with search documents. At the moment I
have only activated search on the English portion of the site. I'm
interested in using as many features of Solr as possible. Synonyms,
stopwords, and stems all sound quite interesting and useful, but how do
I set this up in a good way for a multilingual site?

The site doesn't have a huge text mass, so performance issues don't
really bother me, but I'd still like to hear your suggestions before I
try to implement a solution.

Best regards

Daniel








--
Lance Norskog
goks...@gmail.com





--
Robert Muir
rcm...@gmail.com







Problem indexing files

2010-01-13 Thread Thomas Stuettner

Hi all,

I'm trying to add multiple files to Solr 1.4 with SolrJ.
With this program, one doc is added to Solr:

SolrServer server = SolrHelper.getServer();
server.deleteByQuery("*:*"); // delete everything!
server.commit();
QueryResponse rsp = server.query(new SolrQuery("*:*"));
Assert.assertEquals(0, rsp.getResults().getNumFound());

ContentStreamUpdateRequest up =
        new ContentStreamUpdateRequest("/update/extract");

up.addFile(new File("d:/temp/test.txt"));
//up.addFile(new File("d:/temp/test2.txt")); // <-- Nothing is added if I
// remove the comment from this line.

up.setParam("literal.contid", "doc1");
up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

NamedList<Object> result = server.request(up);

UpdateResponse test = server.commit();

But no doc at all is added if I uncomment the second addFile.
What's wrong with this?


Thanks,

Thomas
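One thing worth trying (a sketch, not a confirmed fix): both files in the code above are sent with the same literal.contid, so the resulting documents may collide on that id. Sending one extract request per file, each with a distinct id, avoids that; the paths and field name below are from the original post, the loop itself is new:

```java
// Sketch: index each file as its own document with a unique contid.
String[] paths = { "d:/temp/test.txt", "d:/temp/test2.txt" };
for (String path : paths) {
    ContentStreamUpdateRequest up =
            new ContentStreamUpdateRequest("/update/extract");
    up.addFile(new File(path));
    up.setParam("literal.contid", new File(path).getName()); // unique per file
    server.request(up);
}
server.commit();
```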


Boosting fields with localsolr

2010-01-13 Thread Kevin Thorley
I have tried several variations now, but have been unable to come up with a way 
to boost fields in a localsolr query.  What I need to do is do a localsolr 
search and sort the result set so that a specific value is at the top.  My idea 
was to use a nested dismax query with a boost field like this (with field names 
changed to protect the guilty):

qt=geo & lat=44.47 & long=-73.15 & radius=10 & _query_:"{!dismax qf=year 
bf=author:kevin^2}2010" & sort=score desc

In plain english, find all posts in the given radius from the year 2010 with 
the posts by author 'kevin' appearing at the top of the result set.

This didn't work, as _query_ wasn't recognized by the localsolr handler.  I 
then tried the opposite, putting the localsolr query in a nested query, but the 
dismax handler didn't parse the nested query.

So, is there any way to accomplish what I am trying?  

Thanks,
Kevin

Re: Boosting fields with localsolr

2010-01-13 Thread Kevin Thorley
On Jan 13, 2010, at 10:44 AM, Kevin Thorley wrote:

 I have tried several variations now, but have been unable to come up with a 
 way to boost fields in a localsolr query.  What I need to do is do a 
 localsolr search and sort the result set so that a specific value is at the 
 top.  My idea was to use a nested dismax query with a boost field like this 
 (with field names changed to protect the guilty):
 
 qt=geo & lat=44.47 & long=-73.15 & radius=10 & _query_:"{!dismax qf=year 
 bf=author:kevin^2}2010" & sort=score desc

Sorry if this caused any confusion... the bf param above should have been bq

 In plain english, find all posts in the given radius from the year 2010 with 
 the posts by author 'kevin' appearing at the top of the result set.
 
 This didn't work, as _query_ wasn't recognized by the localsolr handler.  I 
 then tried the opposite, putting the localsolr query in a nested query, but 
 the dismax handler didn't parse the nested query.
 
 So, is there any way to accomplish what I am trying?  
 
 Thanks,
 Kevin



RE: Need help Migrating to Solr

2010-01-13 Thread Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS]
I don't have experience with migrating, but you should consider using the 
example schema.xml in the distro as a starting basis for creating your schema.  

-Original Message-
From: Abin Mathew [mailto:abin.mat...@toostep.com] 
Sent: Tuesday, January 12, 2010 8:42 PM
To: solr-user@lucene.apache.org
Subject: Need help Migrating to Solr

Hi

I am new to the Solr technology. We have been using Lucene to handle
searching in our web application www.toostep.com, a knowledge-sharing
platform developed in Java using the Spring MVC architecture and iBatis
as the persistence framework. Now that the application is getting very
complex, we have decided to implement Solr on top of Lucene.
If anyone has expertise in this area, please give me some guidelines on where
to start and how to form the schema for Solr.

Thanks and Regards
Abin Mathew


copyField with Analyzer?

2010-01-13 Thread Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS]
Hi all,
I tried creating a case-insensitive string field populated from the values of a 
string field, via copyField.  This didn't work; it seems copyField does its job 
before the analyzer on the case-insensitive string field is invoked.

Is there another way I might accomplish this field replication on the server?
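For reference, copyField normally does support this: it copies the raw input value, and the destination field's own index-time analyzer is then applied to the copy. A sketch with hypothetical field names (string_ci as defined in the case-insensitive string type thread later in this digest):

```xml
<field name="title" type="string" indexed="true" stored="true"/>
<!-- the destination's lowercasing analyzer runs at index time on the copy -->
<field name="title_ci" type="string_ci" indexed="true" stored="false"/>
<copyField source="title" dest="title_ci"/>
```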



Tim Harsch
Sr. Software Engineer
Dell Perot Systems



RE: Problem comitting on 40GB index

2010-01-13 Thread Marc Des Garets
Just curious, have you checked if the hanging you are experiencing is not 
garbage collection related?

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 13 January 2010 13:33
To: solr-user@lucene.apache.org
Subject: Re: Problem comitting on 40GB index

That's my understanding.. But fortunately disk space is cheap G


On Wed, Jan 13, 2010 at 5:01 AM, Frederico Azeiteiro 
frederico.azeite...@cision.com wrote:

 Sorry, my bad... I replied to a current mailing list message only changing
 the subject... Didn't know about this  Hijacking problem. Will not happen
 again.

 Just for close this issue, if I understand correctly, for an index of 40G,
 I will need, for running an optimize:
 - 40G if all activity on index is stopped
 - 80G if index is being searched...)
 - 120G if index is being searched and if a commit is performed.

 Is this correct?

 Thanks.
 Frederico
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: terça-feira, 12 de Janeiro de 2010 19:18
 To: solr-user@lucene.apache.org
 Subject: Re: Problem comitting on 40GB index

 Huh?

 On Tue, Jan 12, 2010 at 2:00 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:

 
  : Subject: Problem comitting on 40GB index
  : In-Reply-To: 
  7a9c48b51001120345h5a57dbd4o8a8a39fc4a98a...@mail.gmail.com
 
  http://people.apache.org/~hossman/#threadhijack
  Thread Hijacking on Mailing Lists
 
  When starting a new discussion on a mailing list, please do not reply to
  an existing message, instead start a fresh email.  Even if you change the
  subject line of your email, other mail headers still track which thread
  you replied to and your question is hidden in that thread and gets less
  attention.   It makes following discussions in the mailing list archives
  particularly difficult.
  See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking
 
 
 
  -Hoss
 
 

--
This transmission is strictly confidential, possibly legally privileged, and 
intended solely for the 
addressee.  Any views or opinions expressed within it are those of the author 
and do not necessarily 
represent those of 192.com, i-CD Publishing (UK) Ltd or any of it's subsidiary 
companies.  If you 
are not the intended recipient then you must not disclose, copy or take any 
action in reliance of this 
transmission. If you have received this transmission in error, please notify 
the sender as soon as 
possible.  No employee or agent is authorised to conclude any binding agreement 
on behalf of 
i-CD Publishing (UK) Ltd with another party by email without express written 
confirmation by an 
authorised employee of the Company. http://www.192.com (Tel: 08000 192 192).  
i-CD Publishing (UK) Ltd 
is incorporated in England and Wales, company number 3148549, VAT No. GB 
673128728.

Re: Question

2010-01-13 Thread Bill Bell
On Wed, Jan 13, 2010 at 10:17 AM, Bill Bell bb...@kaango.com wrote:
 I am using Solr 1.4, and have 3 cores defined in solr.xml. Question on
 replication

 1. How do I set up rsync replication from master to slaves? It was
 easy to do with just one core and one script.conf, but with multiple
 cores what is the easiest way?

 2. I got the system to work by changing the snappuller to pass in a -c
 script.conf but there has got to be an easier way?

 3. On the master I have 3 rsync daemons running. Is it possible to do
 it with one?

 The script.conf really needs multiple data_dir settings...

 --
 Bill Bell
 Vice President of Technology
 bb...@kaango.com
 mobile 720.256.8076
 Kaango, LLC - www.kaango.com




-- 
Bill Bell
Vice President of Technology
bb...@kaango.com
mobile 720.256.8076
Kaango, LLC - www.kaango.com


Re: How to display Highlight with VelocityResponseWriter?

2010-01-13 Thread qiuyan . xu

Thanks a lot. It works now. When I added the line
#set($hl = $response.highlighting)
I got the highlighting. But I wonder if there's any document that
describes the usage of that. I mean, I didn't know the names of those
methods; I just managed to guess them.


best regards,
Qiuyan

Quoting Sascha Szott sz...@zib.de:


Qiuyan,


with highlight can also be displayed in the web gui. I've added <bool
name="hl">true</bool> into the standard responseHandler and it already
works, i.e. without velocity. But the same line doesn't take effect in
/itas. Should I configure anything else? Thanks in advance.
First of all, just a few notes on the /itas request handler in your  
solrconfig.xml:


1. The entry

<arr name="components">
  <str>highlight</str>
</arr>

is obsolete, since the highlighting component is a default search  
component [1].


2. Note that since you didn't specify a value for hl.fl highlighting  
will only affect the fields listed inside of qf.


3. Why did you override the default value of hl.fragmenter? In most  
cases the default fragmenting algorithm (gap) works fine - and maybe  
in yours as well?



To make sure all your hl related settings are correct, can you post  
an xml output (change the wt parameter to xml) for a search with  
highlighted results.


And finally, can you post the vtl code snippet that should produce  
the highlighted output.


-Sascha

[1] http://wiki.apache.org/solr/SearchComponent













RE: Problem comitting on 40GB index

2010-01-13 Thread Frederico Azeiteiro
The hanging hasn't happened again since yesterday, and I haven't run out of 
space again. This is still a dev environment, so the number of searches is very 
low. Maybe I'm just lucky...

Where can I see the garbage collection info?
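This question goes unanswered in the thread. One common way is to enable GC logging on the Solr JVM at startup (an assumption: a Sun JDK and Jetty's start.jar; the exact flags vary by JVM):

```
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -Xloggc:logs/gc.log -jar start.jar
```

Alternatively, jconsole (bundled with the JDK) can be attached to the running process to watch heap usage and GC activity live.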

-Original Message- 
From: Marc Des Garets [mailto:marc.desgar...@192.com] 
Sent: quarta-feira, 13 de Janeiro de 2010 17:20
To: solr-user@lucene.apache.org
Subject: RE: Problem comitting on 40GB index

Just curious, have you checked if the hanging you are experiencing is not 
garbage collection related?

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 13 January 2010 13:33
To: solr-user@lucene.apache.org
Subject: Re: Problem comitting on 40GB index

That's my understanding.. But fortunately disk space is cheap G


On Wed, Jan 13, 2010 at 5:01 AM, Frederico Azeiteiro 
frederico.azeite...@cision.com wrote:

 Sorry, my bad... I replied to a current mailing list message only changing
 the subject... Didn't know about this  Hijacking problem. Will not happen
 again.

 Just for close this issue, if I understand correctly, for an index of 40G,
 I will need, for running an optimize:
 - 40G if all activity on index is stopped
 - 80G if index is being searched...)
 - 120G if index is being searched and if a commit is performed.

 Is this correct?

 Thanks.
 Frederico
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: terça-feira, 12 de Janeiro de 2010 19:18
 To: solr-user@lucene.apache.org
 Subject: Re: Problem comitting on 40GB index

 Huh?

 On Tue, Jan 12, 2010 at 2:00 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:

 
  : Subject: Problem comitting on 40GB index
  : In-Reply-To: 
  7a9c48b51001120345h5a57dbd4o8a8a39fc4a98a...@mail.gmail.com
 
  http://people.apache.org/~hossman/#threadhijack
  Thread Hijacking on Mailing Lists
 
  When starting a new discussion on a mailing list, please do not reply to
  an existing message, instead start a fresh email.  Even if you change the
  subject line of your email, other mail headers still track which thread
  you replied to and your question is hidden in that thread and gets less
  attention.   It makes following discussions in the mailing list archives
  particularly difficult.
  See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking
 
 
 
  -Hoss
 
 



Interesting OutOfMemoryError on a 170M index

2010-01-13 Thread Minutello, Nick

Hi,

I have a bit of an interesting OutOfMemoryError that I'm trying to
figure out.

My client & Solr server are running in the same JVM (for deployment
simplicity). FWIW, I'm using Jetty to host Solr. I'm using the supplied
code for the http-based client interface. Solr 1.3.0.

My app is adding about 20,000 documents per minute to the index - one at
a time (it is listening to an event stream and for every event, it adds
a new document to the index).
The size of the documents, however, is tiny - the total index growth is
only about 170M (after about 1 hr and the OutOfMemoryError)
At this point, there is zero querying happening - just updates to the
index (only adding documents, no updates or deletes)
After about an hour or so, my JVM runs out of heap space - and if I look
at the memory utilisation over time, it looks like a classic memory
leak. It slowly ramps up until we end up with constant FULL GC's and
eventual OOME. Max heap space is 512M.

In Solr, I'm using autocommit (to buffer the updates)
<autoCommit>
  <maxDocs>1</maxDocs>
  <maxTime>1000</maxTime>
</autoCommit>

(Aside: Now, I'm not sure if I am meant to call commit or not on the
client SolrServer class if I am using autocommit - but as it turns out,
I get OOME whether I do that or not)

Any suggestions/advice of quick things to check before I dust off the
profiler?

Thanks in advance.

Cheers,
Nick

=== 
 Please access the attached hyperlink for an important electronic 
communications disclaimer: 
 http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html 
 
=== 
 


case-insensitive string type

2010-01-13 Thread Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS]
Hi I have a field:

<field name="srcANYSTRStrCI" type="string_ci" indexed="true" stored="true"
       multiValued="true"/>

With type definition:
<!-- A case-insensitive version of the string type -->
<fieldType name="string_ci" class="solr.StrField"
           sortMissingLast="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

When searching that field I can't get a case-insensitive match. It behaves as
if it were a regular string: for instance, I can do a prefix query and, as long
as the prefix matches the case of the value, it works, but if I change the
prefix case it doesn't.

Essentially I am trying to get case-insensitive matching that supports wild 
cards...

Tim Harsch
Sr. Software Engineer
Dell Perot Systems
(650) 604-0374



Re: case-insensitive string type

2010-01-13 Thread Rob Casson
from http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

On wildcard and fuzzy searches, no text analysis is performed on
the search word.

i'd just lowercase the wildcard-ed search term in your client code,
before you send it to solr.

hth,
rob
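A minimal client-side sketch of that advice (the field name is from this thread; the helper and class names are made up):

```java
import java.util.Locale;

public class WildcardQueryBuilder {

    // Lowercase the user's wildcard term before sending it to Solr, since
    // wildcard and fuzzy terms bypass the field's analyzers entirely.
    static String buildQuery(String field, String userTerm) {
        return field + ":" + userTerm.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // prints srcANYSTRStrCI:foo*
        System.out.println(buildQuery("srcANYSTRStrCI", "Foo*"));
    }
}
```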

On Wed, Jan 13, 2010 at 2:18 PM, Harsch, Timothy J. (ARC-TI)[PEROT
SYSTEMS] timothy.j.har...@nasa.gov wrote:
 Hi I have a field:

 <field name="srcANYSTRStrCI" type="string_ci" indexed="true" stored="true"
        multiValued="true"/>

 With type definition:
 <!-- A case-insensitive version of the string type -->
 <fieldType name="string_ci" class="solr.StrField"
            sortMissingLast="true" omitNorms="true">
   <analyzer type="index">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>

 When searching that field I can't get a case-insensitive match.  It works as 
 if it is a regular string, for instance I can do a prefix query and so long 
 as the prefix matches the case of the value it works, but if I change the 
 prefix case it doesn't

 Essentially I am trying to get case-insensitive matching that supports wild 
 cards...

 Tim Harsch
 Sr. Software Engineer
 Dell Perot Systems
 (650) 604-0374




Re: case-insensitive string type

2010-01-13 Thread Erick Erickson
What do you get when you add debugQuery=on to your lower-case query?

And does Luke show you what you expect in the index?
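For instance (the field and term here are illustrative, taken from this thread), the parsed form of the query appears in the debug section of the response:

```
/solr/select?q=srcANYSTRStrCI:foo*&debugQuery=on
```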


On Wed, Jan 13, 2010 at 2:18 PM, Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS]
timothy.j.har...@nasa.gov wrote:

 Hi I have a field:

 <field name="srcANYSTRStrCI" type="string_ci" indexed="true" stored="true"
        multiValued="true"/>

 With type definition:
 <!-- A case-insensitive version of the string type -->
 <fieldType name="string_ci" class="solr.StrField"
            sortMissingLast="true" omitNorms="true">
   <analyzer type="index">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>

 When searching that field I can't get a case-insensitive match.  It works
 as if it is a regular string, for instance I can do a prefix query and so
 long as the prefix matches the case of the value it works, but if I change
 the prefix case it doesn't

 Essentially I am trying to get case-insensitive matching that supports wild
 cards...

 Tim Harsch
 Sr. Software Engineer
 Dell Perot Systems
 (650) 604-0374




RE: case-insensitive string type

2010-01-13 Thread Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS]
I considered that, but I'm also having the issue that I can't get an exact 
match as case insensitive either.

-Original Message-
From: Rob Casson [mailto:rob.cas...@gmail.com] 
Sent: Wednesday, January 13, 2010 11:26 AM
To: solr-user@lucene.apache.org
Subject: Re: case-insensitive string type

from http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

On wildcard and fuzzy searches, no text analysis is performed on
the search word.

i'd just lowercase the wildcard-ed search term in your client code,
before you send it to solr.

hth,
rob
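
Rob's suggestion — lowercasing the wildcard term on the client before it reaches Solr — can be sketched like this. This is a minimal illustration, not code from the thread; the class and method names are hypothetical, and only the field name `srcANYSTRStrCI` comes from the discussion:

```java
import java.util.Locale;

public class WildcardLowercaser {

    // Lowercase only the user-supplied term text; the '*' and '?'
    // wildcard characters are unaffected by toLowerCase().
    static String lowercaseTerm(String term) {
        return term.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String userInput = "MiXCase*";
        // Build the query string for the case-insensitive field
        // before handing it to the Solr client library.
        String q = "srcANYSTRStrCI:" + lowercaseTerm(userInput);
        System.out.println(q);  // srcANYSTRStrCI:mixcase*
    }
}
```

Because Solr skips text analysis for wildcard and fuzzy terms, this client-side step is the only place the lowercasing can happen for such queries.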

On Wed, Jan 13, 2010 at 2:18 PM, Harsch, Timothy J. (ARC-TI)[PEROT
SYSTEMS] timothy.j.har...@nasa.gov wrote:
 Hi I have a field:

 <field name="srcANYSTRStrCI" type="string_ci" indexed="true" stored="true"
 multiValued="true" />

 With type definition:
 <!-- A Case insensitive version of string type -->
 <fieldType name="string_ci" class="solr.StrField"
         sortMissingLast="true" omitNorms="true">
   <analyzer type="index">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>

 When searching that field I can't get a case-insensitive match.  It works as 
 if it is a regular string, for instance I can do a prefix query and so long 
 as the prefix matches the case of the value it works, but if I change the 
 prefix case it doesn't

 Essentially I am trying to get case-insensitive matching that supports wild 
 cards...

 Tim Harsch
 Sr. Software Engineer
 Dell Perot Systems
 (650) 604-0374




RE: case-insensitive string type

2010-01-13 Thread Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS]
From the query
http://localhost:8080/solr/select?q=idxPartition%3ASOMEPART%20AND%20srcANYSTRStrCI:%22mixcase%20or%20lower%22debugQuery=on

Debug info attached


-Original Message-
From: Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS] 
[mailto:timothy.j.har...@nasa.gov] 
Sent: Wednesday, January 13, 2010 11:28 AM
To: solr-user@lucene.apache.org
Subject: RE: case-insensitive string type

I considered that, but I'm also having the issue that I can't get an exact 
match as case insensitive either.

-Original Message-
From: Rob Casson [mailto:rob.cas...@gmail.com] 
Sent: Wednesday, January 13, 2010 11:26 AM
To: solr-user@lucene.apache.org
Subject: Re: case-insensitive string type

from http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

On wildcard and fuzzy searches, no text analysis is performed on
the search word.

i'd just lowercase the wildcard-ed search term in your client code,
before you send it to solr.

hth,
rob

On Wed, Jan 13, 2010 at 2:18 PM, Harsch, Timothy J. (ARC-TI)[PEROT
SYSTEMS] timothy.j.har...@nasa.gov wrote:
 Hi I have a field:

 <field name="srcANYSTRStrCI" type="string_ci" indexed="true" stored="true"
 multiValued="true" />

 With type definition:
 <!-- A Case insensitive version of string type -->
 <fieldType name="string_ci" class="solr.StrField"
         sortMissingLast="true" omitNorms="true">
   <analyzer type="index">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>

 When searching that field I can't get a case-insensitive match.  It works as 
 if it is a regular string, for instance I can do a prefix query and so long 
 as the prefix matches the case of the value it works, but if I change the 
 prefix case it doesn't

 Essentially I am trying to get case-insensitive matching that supports wild 
 cards...

 Tim Harsch
 Sr. Software Engineer
 Dell Perot Systems
 (650) 604-0374


<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">62</int>
  <lst name="params"><str name="debugQuery">on</str>
    <str name="q">idxPartition:SOMEPART AND srcANYSTRStrCI:"mixcase or lower"</str></lst></lst>
<result name="response" numFound="0" start="0"/>
<lst name="debug">
  <str name="rawquerystring">idxPartition:SOMEPART AND srcANYSTRStrCI:"mixcase or lower"</str>
  <str name="querystring">idxPartition:SOMEPART AND srcANYSTRStrCI:"mixcase or lower"</str>
  <str name="parsedquery">+idxPartition:SOMEPART +srcANYSTRStrCI:mixcase or lower</str>
  <str name="parsedquery_toString">+idxPartition:SOMEPART +srcANYSTRStrCI:mixcase or lower</str>
  <lst name="explain"/>
  <str name="QParser">LuceneQParser</str>
  <lst name="timing"><double name="time">31.0</double>
    <!-- per-component prepare/process timings (all 0.0) omitted --></lst>
</lst>
</response>

RE: case-insensitive string type

2010-01-13 Thread Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS]
The value in the srcANYSTRStrCI field is miXCAse or LowER according to Luke.

-Original Message-
From: Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS] 
[mailto:timothy.j.har...@nasa.gov] 
Sent: Wednesday, January 13, 2010 11:31 AM
To: solr-user@lucene.apache.org
Subject: RE: case-insensitive string type

From the query
http://localhost:8080/solr/select?q=idxPartition%3ASOMEPART%20AND%20srcANYSTRStrCI:%22mixcase%20or%20lower%22debugQuery=on

Debug info attached


-Original Message-
From: Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS] 
[mailto:timothy.j.har...@nasa.gov]
Sent: Wednesday, January 13, 2010 11:28 AM
To: solr-user@lucene.apache.org
Subject: RE: case-insensitive string type

I considered that, but I'm also having the issue that I can't get an exact 
match as case insensitive either.

-Original Message-
From: Rob Casson [mailto:rob.cas...@gmail.com]
Sent: Wednesday, January 13, 2010 11:26 AM
To: solr-user@lucene.apache.org
Subject: Re: case-insensitive string type

from http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

On wildcard and fuzzy searches, no text analysis is performed on the 
search word.

i'd just lowercase the wildcard-ed search term in your client code, before you 
send it to solr.

hth,
rob

On Wed, Jan 13, 2010 at 2:18 PM, Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS] 
timothy.j.har...@nasa.gov wrote:
 Hi I have a field:

 <field name="srcANYSTRStrCI" type="string_ci" indexed="true"
 stored="true" multiValued="true" />

 With type definition:
 <!-- A Case insensitive version of string type -->
 <fieldType name="string_ci" class="solr.StrField"
         sortMissingLast="true" omitNorms="true">
   <analyzer type="index">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>

 When searching that field I can't get a case-insensitive match.  It 
 works as if it is a regular string, for instance I can do a prefix 
 query and so long as the prefix matches the case of the value it 
 works, but if I change the prefix case it doesn't

 Essentially I am trying to get case-insensitive matching that supports wild 
 cards...

 Tim Harsch
 Sr. Software Engineer
 Dell Perot Systems
 (650) 604-0374




RE: case-insensitive string type

2010-01-13 Thread Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS]
I created a document that has a string field and a case insensitive string 
field using my string_ci type, both have the same value sent at document 
creation time: miXCAse or LowER.

I attach two debug query results.  One against the string type and one against 
mine.  The query is only different by changing the query field.

Against the string field there are results. Against mine there are none.  Looking at 
the debug info, querying my type does lower-case the query value, it seems.  
Does this mean the index-time analyzer is failing?  Would the fact that Luke 
shows the value as case-preserved in both the string field and the string_ci 
field support this?

-Original Message-
From: Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS] 
[mailto:timothy.j.har...@nasa.gov] 
Sent: Wednesday, January 13, 2010 11:35 AM
To: solr-user@lucene.apache.org
Subject: RE: case-insensitive string type

The value in the srcANYSTRStrCI field is miXCAse or LowER according to Luke.

-Original Message-
From: Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS] 
[mailto:timothy.j.har...@nasa.gov] 
Sent: Wednesday, January 13, 2010 11:31 AM
To: solr-user@lucene.apache.org
Subject: RE: case-insensitive string type

From the query
http://localhost:8080/solr/select?q=idxPartition%3ASOMEPART%20AND%20srcANYSTRStrCI:%22mixcase%20or%20lower%22debugQuery=on

Debug info attached


-Original Message-
From: Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS] 
[mailto:timothy.j.har...@nasa.gov]
Sent: Wednesday, January 13, 2010 11:28 AM
To: solr-user@lucene.apache.org
Subject: RE: case-insensitive string type

I considered that, but I'm also having the issue that I can't get an exact 
match as case insensitive either.

-Original Message-
From: Rob Casson [mailto:rob.cas...@gmail.com]
Sent: Wednesday, January 13, 2010 11:26 AM
To: solr-user@lucene.apache.org
Subject: Re: case-insensitive string type

from http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

On wildcard and fuzzy searches, no text analysis is performed on the 
search word.

i'd just lowercase the wildcard-ed search term in your client code, before you 
send it to solr.

hth,
rob

On Wed, Jan 13, 2010 at 2:18 PM, Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS] 
timothy.j.har...@nasa.gov wrote:
 Hi I have a field:

 <field name="srcANYSTRStrCI" type="string_ci" indexed="true"
 stored="true" multiValued="true" />

 With type definition:
 <!-- A Case insensitive version of string type -->
 <fieldType name="string_ci" class="solr.StrField"
         sortMissingLast="true" omitNorms="true">
   <analyzer type="index">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>

 When searching that field I can't get a case-insensitive match.  It 
 works as if it is a regular string, for instance I can do a prefix 
 query and so long as the prefix matches the case of the value it 
 works, but if I change the prefix case it doesn't

 Essentially I am trying to get case-insensitive matching that supports wild 
 cards...

 Tim Harsch
 Sr. Software Engineer
 Dell Perot Systems
 (650) 604-0374


<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">32</int>
  <lst name="params"><str name="debugQuery">on</str>
    <str name="q">srcANYSTRStr:(miXCAse or LowER)</str></lst></lst>
<result name="response" numFound="1" start="0">
  <doc><str name="idxKey">TimsUniqueKey</str>
    <arr name="srcANYSTRStr"><str>miXCAse or LowER</str></arr>
    <arr name="srcANYSTRStrCI"><str>miXCAse or LowER</str></arr></doc></result>
<lst name="debug">
  <str name="rawquerystring">srcANYSTRStr:(miXCAse or LowER)</str>
  <str name="querystring">srcANYSTRStr:(miXCAse or LowER)</str>
  <str name="parsedquery">srcANYSTRStr:miXCAse or LowER</str>
  <str name="parsedquery_toString">srcANYSTRStr:miXCAse or LowER</str>
  <lst name="explain"><str name="TimsUniqueKey">
9.250228 = (MATCH) fieldWeight(srcANYSTRStr:miXCAse or LowER in 0), product of:
  1.0 = tf(termFreq(srcANYSTRStr:miXCAse or LowER)=1)
  9.250228 = idf(docFreq=1, maxDocs=7657)
  1.0 = fieldNorm(field=srcANYSTRStr, doc=0)
</str></lst>
  <str name="QParser">LuceneQParser</str>
  <lst name="timing"><double name="time">16.0</double>
    <!-- per-component timings (all 0.0); attachment truncated here in the archive --></lst>
</lst>
</response>

RE: case-insensitive string type

2010-01-13 Thread Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS]
That seems to work.

But why?  Does string type not support LowerCaseFilterFactory?  Or 
KeywordTokenizerFactory?

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Wednesday, January 13, 2010 11:51 AM
To: solr-user@lucene.apache.org
Subject: RE: case-insensitive string type

 The value in the srcANYSTRStrCI field
 is miXCAse or LowER according to Luke.

Can you try this fieldType (that uses class="solr.TextField") declaration and 
re-start tomcat & re-index:

 <fieldType name="string_ci" class="solr.TextField" sortMissingLast="true"
 omitNorms="true">
   <analyzer>
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.TrimFilterFactory"/>
   </analyzer>
 </fieldType>








  


RE: case-insensitive string type

2010-01-13 Thread Harsch, Timothy J. (ARC-TI)[PEROT SYSTEMS]
Thanks, I know I read that sometime back but I guess I thought that was because 
there were no analyzer tags defined on the string field in the schema.  I 
guess cause I'm still kind of a noob - I didn't take that to mean that it 
couldn't be made to have analyzers.  A subtle but important distinction I guess.

So my concern now is that my use case needs a field that behaves like 
string (case-sensitive) plus a case-insensitive version of the same.  Is it the 
case that solr.StrField and solr.TextField with LowerCaseFilterFactory and 
KeywordTokenizerFactory only differ in their treatment of character case?

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Wednesday, January 13, 2010 12:18 PM
To: solr-user@lucene.apache.org
Subject: RE: case-insensitive string type

 That seems to work.
 
 But why?  Does string type not support
 LowerCaseFilterFactory?  Or KeywordTokenizerFactory?

From apache-solr-1.4.0\example\solr\conf\schema.xml :

The StrField type is not analyzed, but indexed/stored verbatim. 

solr.TextField allows the specification of custom text analyzers specified as 
a tokenizer and a list of token filters.
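
Given that distinction, one way to get both behaviors is to keep the case-preserving StrField and copy into a TextField-based case-insensitive twin; the destination field's analyzer runs at index time, so the copied value is lowercased in the index. A hedged sketch, reusing the field names from this thread (attributes are illustrative, not from the original schema):

```xml
<!-- Sketch: exact-case field plus a case-insensitive twin. -->
<field name="srcANYSTRStr"   type="string"    indexed="true" stored="true"  multiValued="true"/>
<field name="srcANYSTRStrCI" type="string_ci" indexed="true" stored="false" multiValued="true"/>
<copyField source="srcANYSTRStr" dest="srcANYSTRStrCI"/>
```

Note this only works if string_ci is the solr.TextField-based type above; copyField into a StrField-based type stores the value verbatim, since StrField ignores analyzers.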




  


Re: Multi language support

2010-01-13 Thread Lance Norskog
Robert Muir: Thank you for the pointer to that paper!

On Wed, Jan 13, 2010 at 6:29 AM, Paul Libbrecht p...@activemath.org wrote:
 Isn't the conclusion here that some stopword and stemming free matching
 should be the best match if ever and to then gently degrade to  weaker forms
 of matching?

 paul


 Le 13-janv.-10 à 07:08, Walter Underwood a écrit :

 There is a band named The The. And a producer named Don Was. For a
 list of all-stopword movie titles at Netflix, see this post:

 http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html

 My favorite is To Be and To Have (Être et Avoir), which is all stopwords
 in two languages. And a very good movie.

 wunder

 On Jan 12, 2010, at 6:55 PM, Robert Muir wrote:

 sorry, i forgot to include this 2009 paper comparing what stopwords do
 across 3 languages:


 http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf

 in my opinion, if stopwords annoy your users for very special cases
 like 'the the' then, instead consider using commongrams +
 defaultsimilarity.discountOverlaps = true so that you still get the
 benefits.

 as you can see from the above paper, they can be extremely important
 depending on the language, they just don't matter so much for English.

 On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog goks...@gmail.com wrote:

 There are a lot of projects that don't use stopwords any more. You
 might consider dropping them altogether.

 On Mon, Jan 11, 2010 at 2:25 PM, Don Werve d...@madwombat.com wrote:

 This is the way I've implemented multilingual search as well.

 2010/1/11 Markus Jelsma mar...@buyways.nl

 Hello,


 We have implemented language specific search in Solr using language
 specific fields and field types. For instance, an en_text field type
 can
 use an English stemmer, and list of stopwords and synonyms. We,
 however
 did not use specific stopwords, instead we used one list shared by
 both
 languages.

 So you would have a field type like:
 <fieldType name="en_text" class="solr.TextField" ...>
   <analyzer type="...">
     <filter class="solr.StopFilterFactory" words="stopwords.en.txt"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synoyms.en.txt"/>

 etc etc.



 Cheers,

 -
 Markus Jelsma          Buyways B.V.
 Technisch Architect    Friesestraatweg 215c
 http://www.buyways.nl  9743 AD Groningen


 Alg. 050-853 6600      KvK  01074105
 Tel. 050-853 6620      Fax. 050-3118124
 Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17


 On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote:

 Hi Solr users.

 I'm trying to set up a site with Solr search integrated. And I use
 the
 SolJava API to feed the index with search documents. At the moment I
 have only activated search on the English portion of the site. I'm
 interested in using as many features of solr as possible. Synonyms,
 Stopwords and stems all sounds quite interesting and useful but how
 do
 I set up this in a good way for a multilingual site?

 The site don't have a huge text mass so performance issues don't
 really bother me but still I'd like to hear your suggestions before I
 try to implement an solution.

 Best regards

 Daniel





 --
 Lance Norskog
 goks...@gmail.com




 --
 Robert Muir
 rcm...@gmail.com







-- 
Lance Norskog
goks...@gmail.com


Re: copyField with Analyzer?

2010-01-13 Thread Lance Norskog
You can do this filtering in the DataImportHandler. The regular
expression tool is probably enough:

http://wiki.apache.org/solr/DataImportHandler#RegexTransformer

On Wed, Jan 13, 2010 at 8:57 AM, Harsch, Timothy J. (ARC-TI)[PEROT
SYSTEMS] timothy.j.har...@nasa.gov wrote:
 Hi all,
 I tried creating a case-insensitive string using the values provided to a 
 string, via CopyField.  This didn't work, since copyField does its job 
 before the analyzer on the case-insensitive string field is invoked.

 Is there another way I might accomplish this field replication on the server?



 Tim Harsch
 Sr. Software Engineer
 Dell Perot Systems





-- 
Lance Norskog
goks...@gmail.com


Re: Interesting OutOfMemoryError on a 170M index

2010-01-13 Thread Lance Norskog
The time in autocommit is in milliseconds. You are committing every
second while indexing.  This then causes a build-up of successive index
readers that absorb each commit, which is probably the out-of-memory.
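
A corrected configuration along these lines would commit far less often. The values below are illustrative, not a recommendation for this exact workload:

```xml
<!-- Sketch: commit at most every 60 s (60000 ms) or every 10,000 docs,
     whichever comes first. maxTime is in milliseconds. -->
<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>60000</maxTime>
</autoCommit>
```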

On Wed, Jan 13, 2010 at 10:36 AM, Minutello, Nick
nick.minute...@credit-suisse.com wrote:

 Hi,

 I have a bit of an interesting OutOfMemoryError that I'm trying to
 figure out.

 My client & Solr server are running in the same JVM (for deployment
 simplicity). FWIW, I'm using Jetty to host Solr. I'm using the supplied
 code for the http-based client interface. Solr 1.3.0.

 My app is adding about 20,000 documents per minute to the index - one at
 a time (it is listening to an event stream and for every event, it adds
 a new document to the index).
 The size of the documents, however, is tiny - the total index growth is
 only about 170M (after about 1 hr and the OutOfMemoryError)
 At this point, there is zero querying happening - just updates to the
 index (only adding documents, no updates or deletes)
 After about an hour or so, my JVM runs out of heap space - and if I look
 at the memory utilisation over time, it looks like a classic memory
 leak. It slowly ramps up until we end up with constant FULL GC's and
 eventual OOME. Max heap space is 512M.

 In Solr, I'm using autocommit (to buffer the updates)
        <autoCommit>
          <maxDocs>1</maxDocs>
          <maxTime>1000</maxTime>
        </autoCommit>

 (Aside: Now, I'm not sure if I am meant to call commit or not on the
 client SolrServer class if I am using autocommit - but as it turns out,
 I get OOME whether I do that or not)

 Any suggestions/advice of quick things to check before I dust off the
 profiler?

 Thanks in advance.

 Cheers,
 Nick

 ===
  Please access the attached hyperlink for an important electronic 
 communications disclaimer:
  http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
  ===





-- 
Lance Norskog
goks...@gmail.com


RE: Interesting OutOfMemoryError on a 170M index

2010-01-13 Thread Minutello, Nick
Agreed, commit every second.

Assuming I understand what you're saying correctly:
There shouldn't be any index readers - as at this point, just writing to the 
index.
Did I understand correctly what you meant?

-Nick

-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com] 
Sent: 13 January 2010 22:28
To: solr-user@lucene.apache.org
Subject: Re: Interesting OutOfMemoryError on a 170M index

The time in autocommit is in milliseconds. You are committing every second 
while indexing.  This then causes a build-up of sucessive index readers that 
absorb each commit, which is probably the out-of-memory.

On Wed, Jan 13, 2010 at 10:36 AM, Minutello, Nick 
nick.minute...@credit-suisse.com wrote:

 Hi,

 I have a bit of an interesting OutOfMemoryError that I'm trying to 
 figure out.

 My client  Solr server are running in the same JVM (for deployment 
 simplicity). FWIW, I'm using Jetty to host Solr. I'm using the 
 supplied code for the http-based client interface. Solr 1.3.0.

 My app is adding about 20,000 documents per minute to the index - one 
 at a time (it is listening to an event stream and for every event, it 
 adds a new document to the index).
 The size of the documents, however, is tiny - the total index growth 
 is only about 170M (after about 1 hr and the OutOfMemoryError) At this 
 point, there is zero querying happening - just updates to the index 
 (only adding documents, no updates or deletes) After about an hour or 
 so, my JVM runs out of heap space - and if I look at the memory 
 utilisation over time, it looks like a classic memory leak. It slowly 
 ramps up until we end up with constant FULL GC's and eventual OOME. 
 Max heap space is 512M.

 In Solr, I'm using autocommit (to buffer the updates)
         <autoCommit>
           <maxDocs>1</maxDocs>
           <maxTime>1000</maxTime>
         </autoCommit>

 (Aside: Now, I'm not sure if I am meant to call commit or not on the 
 client SolrServer class if I am using autocommit - but as it turns 
 out, I get OOME whether I do that or not)

 Any suggestions/advice of quick things to check before I dust off the 
 profiler?

 Thanks in advance.

 Cheers,
 Nick






--
Lance Norskog
goks...@gmail.com



Need deployment strategy

2010-01-13 Thread Paul Rosen

Hi all,

The way the indexing works on our system is as follows:

We have a separate staging server with a copy of our web app. The 
clients will index a number of documents in a batch on the staging 
server (this happens about once a week), then they play with the results 
on the staging server for a day until satisfied. Only then do they give 
the ok to deploy.


What I've been doing is, when they want to deploy, I do the following:

1) merge and optimize the index on the staging server,

2) copy it to the production server,

3) stop solr on production,

4) copy the new index on top of the old one,

5) start solr on production.

This works, but has the following disadvantages:

1) The index is getting bigger, so it takes longer to zip it and 
transfer it.


2) The user has only added a few records, yet we copy over all of them. 
If a bug happens that causes an unrelated document to get deleted or 
replaced on staging, we wouldn't notice, and we'd propagate the problem 
to the server. I'd sleep better if I were only moving the records that 
were new or changed and leaving the records that already work in place.


3) solr is down on production for about 5 minutes, so users during that 
time are getting errors.


I was looking for some kind of replication strategy where I can run a 
task on the production server to tell it to merge a core from the 
staging server. Is that possible?


I can open up port 8983 on the staging server only to the production 
server, but then what do I do on production to get the core?


Thanks,
Paul


RE: Problem comitting on 40GB index

2010-01-13 Thread Sven Maurmann

Hi!

Garbage collection is an issue of the underlying JVM. You may use
-XX:+PrintGCDetails as an argument to your JVM in order to collect
details of the garbage collection. If you also use the parameter
-XX:+PrintGCTimeStamps you get the time stamps of the garbage
collection.

For further information you may want to refer to the paper

http://java.sun.com/j2se/reference/whitepapers/memorymanagement_whitepaper.pdf

which points you to a few other utilities related to GC.
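
For example, assuming the stock Jetty start.jar launch from the Solr example distribution (heap size, log path, and launch command are illustrative), the flags might be added like this:

```shell
java -Xmx512m -verbose:gc \
     -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -Xloggc:gc.log \
     -jar start.jar
```

Long pauses in gc.log that line up with the observed hangs would point to GC rather than Solr itself.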

Best,

Sven Maurmann

--On Mittwoch, 13. Januar 2010 18:03 + Frederico Azeiteiro 
frederico.azeite...@cision.com wrote:



The hanging didn't happen again since yesterday. I never run out of space
again. This is still a dev environment, so the number of searches is very
low. Maybe I'm just lucky...

Where can I see the garbage collection info?

-Original Message-
From: Marc Des Garets [mailto:marc.desgar...@192.com]
Sent: quarta-feira, 13 de Janeiro de 2010 17:20
To: solr-user@lucene.apache.org
Subject: RE: Problem comitting on 40GB index

Just curious, have you checked if the hanging you are experiencing is not
garbage collection related?

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 13 January 2010 13:33
To: solr-user@lucene.apache.org
Subject: Re: Problem comitting on 40GB index

That's my understanding.. But fortunately disk space is cheap G


On Wed, Jan 13, 2010 at 5:01 AM, Frederico Azeiteiro 
frederico.azeite...@cision.com wrote:


Sorry, my bad... I replied to a current mailing list message only
changing the subject... Didn't know about this  Hijacking problem.
Will not happen again.

Just for close this issue, if I understand correctly, for an index of
40G, I will need, for running an optimize:
- 40G if all activity on index is stopped
- 80G if index is being searched...)
- 120G if index is being searched and if a commit is performed.

Is this correct?

Thanks.
Frederico
-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: terça-feira, 12 de Janeiro de 2010 19:18
To: solr-user@lucene.apache.org
Subject: Re: Problem comitting on 40GB index

Huh?

On Tue, Jan 12, 2010 at 2:00 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : Subject: Problem comitting on 40GB index
 : In-Reply-To: 
 7a9c48b51001120345h5a57dbd4o8a8a39fc4a98a...@mail.gmail.com

 http://people.apache.org/~hossman/#threadhijack
 Thread Hijacking on Mailing Lists

 When starting a new discussion on a mailing list, please do not reply
 to an existing message, instead start a fresh email.  Even if you
 change the subject line of your email, other mail headers still track
 which thread you replied to and your question is hidden in that
 thread and gets less attention.   It makes following discussions in
 the mailing list archives particularly difficult.
 See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



 -Hoss




--
This transmission is strictly confidential, possibly legally privileged,
and intended solely for the  addressee.  Any views or opinions expressed
within it are those of the author and do not necessarily  represent those
of 192.com, i-CD Publishing (UK) Ltd or any of it's subsidiary companies.
If you  are not the intended recipient then you must not disclose, copy
or take any action in reliance of this  transmission. If you have
received this transmission in error, please notify the sender as soon as
possible.  No employee or agent is authorised to conclude any binding
agreement on behalf of  i-CD Publishing (UK) Ltd with another party by
email without express written confirmation by an  authorised employee of
the Company. http://www.192.com (Tel: 08000 192 192).  i-CD Publishing
(UK) Ltd  is incorporated in England and Wales, company number 3148549,
VAT No. GB 673128728.



Re: Question

2010-01-13 Thread Otis Gospodnetic
Bill,

If you are using Solr 1.4, don't bother with rsync, use the Java-based 
replication - info on zee Wiki.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
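
For reference, a minimal sketch of that Java-based replication setup in solrconfig.xml, per core. Host name, core name, and intervals are illustrative; see the SolrReplication wiki page for the full option set:

```xml
<!-- On the master core -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- On each slave core -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/core0/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```

Since the handler is configured per core, the multi-core case needs no shared script.conf or extra rsync daemons — each slave core polls its own master core.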





From: Bill Bell bb...@kaango.com
To: solr-user@lucene.apache.org
Sent: Wed, January 13, 2010 12:21:44 PM
Subject: Re: Question

On Wed, Jan 13, 2010 at 10:17 AM, Bill Bell bb...@kaango.com wrote:
 I am using Solr 1.4, and have 3 cores defined in solr.xml. Question on
 replication

 1. How do I set up rsync replication from master to slaves? It was
 easy to do with just one core and one script.conf, but with multiple
 cores what is the easiest way?

 2. I got the system to work by changing the snappuller to pass in a -c
 script.conf but there has got to be an easier way?

 3. On the master I have 3 rsync daemons running. Is it possible to do
 it with one?

 The script.conf really needs multiple data_dir settings...

 --
 Bill Bell
 Vice President of Technology
 bb...@kaango.com
 mobile 720.256.8076
 Kaango, LLC - www.kaango.com




-- 
Bill Bell
Vice President of Technology
bb...@kaango.com
mobile 720.256.8076
Kaango, LLC - www.kaango.com


Re: Queries of type field:value not functioning

2010-01-13 Thread Otis Gospodnetic
Hi,

Pointers:
* What happens when you don't use a field name?
* What are your logs showing?
* What is debugQuery=on showing?
* What is the Analysis page for some of the problematic queries showing?

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch





From: Siddhant Goel siddhantg...@gmail.com
To: solr-user@lucene.apache.org
Sent: Wed, January 13, 2010 5:38:53 AM
Subject: Queries of type field:value not functioning

Hi all,

Any query I make of the type field:value does not return any documents.
The same is true for the *:* query, which doesn't return any results
either. The index size is close to 1GB now, so it should be returning some
documents. The rest of the queries are functioning properly. Any help?

Thanks,

-- 
- Siddhant


Re: Interesting OutOfMemoryError on a 170M index

2010-01-13 Thread Ryan McKinley


On Jan 13, 2010, at 5:34 PM, Minutello, Nick wrote:


Agreed, commit every second.


Do you need the index to be updated this often?  Are you reading from
it every second, and do you need results that are that fresh?


If not, i imagine increasing the auto-commit time to 1min or even 10  
secs would help some.


Re, calling commit from the client with auto-commit...  if you are  
using auto-commit, you should not call commit from the client


ryan





Assuming I understand what you're saying correctly:
There shouldn't be any index readers - as at this point, just  
writing to the index.

Did I understand correctly what you meant?

-Nick

-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: 13 January 2010 22:28
To: solr-user@lucene.apache.org
Subject: Re: Interesting OutOfMemoryError on a 170M index

The time in autocommit is in milliseconds. You are committing every
second while indexing. This then causes a build-up of successive
index readers that absorb each commit, which is probably the
out-of-memory.


On Wed, Jan 13, 2010 at 10:36 AM, Minutello, Nick nick.minute...@credit-suisse.com 
 wrote:


Hi,

I have a bit of an interesting OutOfMemoryError that I'm trying to
figure out.

My client & Solr server are running in the same JVM (for deployment
simplicity). FWIW, I'm using Jetty to host Solr. I'm using the
supplied code for the http-based client interface. Solr 1.3.0.

My app is adding about 20,000 documents per minute to the index - one
at a time (it is listening to an event stream and for every event, it
adds a new document to the index).
The size of the documents, however, is tiny - the total index growth
is only about 170M (after about 1 hr and the OutOfMemoryError) At  
this

point, there is zero querying happening - just updates to the index
(only adding documents, no updates or deletes) After about an hour or
so, my JVM runs out of heap space - and if I look at the memory
utilisation over time, it looks like a classic memory leak. It slowly
ramps up until we end up with constant FULL GC's and eventual OOME.
Max heap space is 512M.

In Solr, I'm using autocommit (to buffer the updates)
    <autoCommit>
      <maxDocs>1</maxDocs>
      <maxTime>1000</maxTime>
    </autoCommit>

(Aside: Now, I'm not sure if I am meant to call commit or not on the
client SolrServer class if I am using autocommit - but as it turns
out, I get OOME whether I do that or not)

Any suggestions/advice of quick things to check before I dust off the
profiler?

Thanks in advance.

Cheers,
Nick

==============================================================================
Please access the attached hyperlink for an important electronic
communications disclaimer:

http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
==============================================================================






--
Lance Norskog
goks...@gmail.com







RE: Interesting OutOfMemoryError on a 170M index

2010-01-13 Thread Minutello, Nick

 if you are using auto-commit, you should not call commit from the
client
Cheers, thanks.

 Do you need the index to be updated this often?  
Wouldn't increasing the autocommit time make it worse? (ie more
documents buffered)
I can extend it and see what effect it has

-Nick
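
For comparison, a longer autocommit window in solrconfig.xml might look like this (values illustrative; maxTime is in milliseconds, so 1000 means a commit every second):

```xml
<autoCommit>
  <maxDocs>10000</maxDocs>  <!-- cap on buffered docs, bounds memory -->
  <maxTime>60000</maxTime>  <!-- 60 s instead of 1 s -->
</autoCommit>
```

A longer maxTime mainly reduces how often new index readers are opened after each commit; maxDocs still bounds how many documents sit in the in-memory buffer.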

 

-Original Message-
From: Ryan McKinley [mailto:ryan...@gmail.com] 
Sent: 13 January 2010 23:16
To: solr-user@lucene.apache.org
Subject: Re: Interesting OutOfMemoryError on a 170M index


On Jan 13, 2010, at 5:34 PM, Minutello, Nick wrote:

 Agreed, commit every second.

Do you need the index to be updated this often?  Are you reading from it
every second?  and need results that are that fresh

If not, i imagine increasing the auto-commit time to 1min or even 10
secs would help some.

Re, calling commit from the client with auto-commit...  if you are using
auto-commit, you should not call commit from the client

ryan




 Assuming I understand what you're saying correctly:
 There shouldn't be any index readers - as at this point, just writing 
 to the index.
 Did I understand correctly what you meant?

 -Nick

 -Original Message-
 From: Lance Norskog [mailto:goks...@gmail.com]
 Sent: 13 January 2010 22:28
 To: solr-user@lucene.apache.org
 Subject: Re: Interesting OutOfMemoryError on a 170M index

 The time in autocommit is in milliseconds. You are committing every
 second while indexing. This then causes a build-up of successive index
 readers that absorb each commit, which is probably the out-of-memory.

 On Wed, Jan 13, 2010 at 10:36 AM, Minutello, Nick 
 nick.minute...@credit-suisse.com
  wrote:

 Hi,

 I have a bit of an interesting OutOfMemoryError that I'm trying to 
 figure out.

 My client & Solr server are running in the same JVM (for deployment 
 simplicity). FWIW, I'm using Jetty to host Solr. I'm using the 
 supplied code for the http-based client interface. Solr 1.3.0.

 My app is adding about 20,000 documents per minute to the index - one

 at a time (it is listening to an event stream and for every event, it

 adds a new document to the index).
 The size of the documents, however, is tiny - the total index growth 
 is only about 170M (after about 1 hr and the OutOfMemoryError) At 
 this point, there is zero querying happening - just updates to the 
 index (only adding documents, no updates or deletes) After about an 
 hour or so, my JVM runs out of heap space - and if I look at the 
 memory utilisation over time, it looks like a classic memory leak. It

 slowly ramps up until we end up with constant FULL GC's and eventual 
 OOME.
 Max heap space is 512M.

 In Solr, I'm using autocommit (to buffer the updates)
 <autoCommit>
   <maxDocs>1</maxDocs>
   <maxTime>1000</maxTime>
 </autoCommit>

 (Aside: Now, I'm not sure if I am meant to call commit or not on the 
 client SolrServer class if I am using autocommit - but as it turns 
 out, I get OOME whether I do that or not)

 Any suggestions/advice of quick things to check before I dust off the

 profiler?

 Thanks in advance.

 Cheers,
 Nick






 --
 Lance Norskog
 goks...@gmail.com






Re: How to display Highlight with VelocityResponseWriter?

2010-01-13 Thread Sascha Szott
Hi Qiuyan,

 Thanks a lot. It works now. When i added the line
 #set($hl = $response.highlighting)
 i got the highlighting. But i wonder if there's any document that
 describes the usage of that. I mean i didn't know the name of those
 methods. Actually i just managed to guess it.
Solritas (aka VelocityResponseWriter) binds a number of objects into a so
called VelocityContext (consult [1] for a complete list). You can think of
it as a map that allows you to access objects by symbolic names, e.g., an
instance of QueryResponse is stored under response (that's why you write
$response in your template).

Since $response is an instance of QueryResponse you can call all methods
on it the API [2] provides. Furthermore, Velocity incorporates a
JavaBean-like introspection mechanism that lets you write
$response.highlighting instead of $response.getHighlighting() (only a bit
of syntactic sugar).

-Sascha

[1] http://wiki.apache.org/solr/VelocityResponseWriter#line-93
[2]
http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/response/QueryResponse.html
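
A minimal template fragment along these lines, assuming the unique key field is named id and the highlighted field is named text (adjust both to your schema), might look like:

```
#set($hl = $response.highlighting)
#foreach($doc in $response.results)
  #set($docId = $doc.getFieldValue('id'))
  #if($hl.get($docId))
    #foreach($fragment in $hl.get($docId).get('text'))
      $fragment<br/>
    #end
  #end
#end
```

Here $hl is the map keyed by unique document id, and each entry maps field names to lists of highlighted fragments.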

 Quoting Sascha Szott sz...@zib.de:

 Qiuyan,

 the highlighting can also be displayed in the web GUI. I've added <bool
 name="hl">true</bool> into the standard responseHandler and it already
 works, i.e. without Velocity. But the same line doesn't take effect in
 /itas. Should i configure anything else? Thanks in advance.
 First of all, just a few notes on the /itas request handler in your
 solrconfig.xml:

 1. The entry

 <arr name="components">
   <str>highlight</str>
 </arr>

 is obsolete, since the highlighting component is a default search
 component [1].

 2. Note that since you didn't specify a value for hl.fl highlighting
 will only affect the fields listed inside of qf.

 3. Why did you override the default value of hl.fragmenter? In most
 cases the default fragmenting algorithm (gap) works fine - and maybe
 in yours as well?


 To make sure all your hl related settings are correct, can you post
 an xml output (change the wt parameter to xml) for a search with
 highlighted results.

 And finally, can you post the vtl code snippet that should produce
 the highlighted output.

 -Sascha

 [1] http://wiki.apache.org/solr/SearchComponent














RE: Interesting OutOfMemoryError on a 170M index

2010-01-13 Thread Minutello, Nick
 
Hm, Ryan, you may have inadvertently solved the problem. :)

Going flat out in a loop, indexing 1 doc at a time, I can only index
about 17,000 per minute - roughly what I was seeing with my app
running... which makes me suspicious. The number is too close to be
coincidental.

It could very well be that I may be getting many more than 17,000
updates per minute - and because I cant index them fast enough, the
event queue in the underlying library (that is providing me the events)
may be growing without bound... 

So, looks like I have to increase the throughput with the indexing.
(indexing 1 at a time is far from ideal - even with the buffering). I
may have to either implement some client-side buffering to make it more
efficient - or eliminate the http layer (go embedded).

Thanks.

-Nick
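
Client-side batching is usually the cheapest win here: instead of one HTTP POST per document, buffer documents and ship each batch in a single request (with SolrJ, the sender below would call SolrServer.add(Collection) on a CommonsHttpSolrServer instead of adding one document at a time). A self-contained sketch of the buffering logic; all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Buffer documents and hand each full batch to a sender in one call,
// cutting the per-document HTTP round-trip overhead.
public class DocBatcher {
    interface BatchSender { void send(List<String> batch); }

    static int batchesSent = 0;  // instrumentation for the demo below

    private final int batchSize;
    private final BatchSender sender;
    private final List<String> buffer = new ArrayList<>();

    DocBatcher(int batchSize, BatchSender sender) {
        this.batchSize = batchSize;
        this.sender = sender;
    }

    synchronized void add(String doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) flush();
    }

    synchronized void flush() {
        if (buffer.isEmpty()) return;
        sender.send(new ArrayList<>(buffer));  // one request per batch
        batchesSent++;
        buffer.clear();
        // rely on server-side autoCommit; no explicit commit here
    }

    public static void main(String[] args) {
        DocBatcher b = new DocBatcher(100, batch -> { /* POST batch to Solr here */ });
        for (int i = 0; i < 950; i++) b.add("doc" + i);
        b.flush();  // ship the final partial batch
        System.out.println("batches sent: " + batchesSent);
    }
}
```

Embedding Solr removes the HTTP hop entirely, but batching alone often recovers most of the throughput while keeping the deployment simple.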


-Original Message-
From: Minutello, Nick 
Sent: 13 January 2010 23:29
To: solr-user@lucene.apache.org
Subject: RE: Interesting OutOfMemoryError on a 170M index


 if you are using auto-commit, you should not call commit from the
client
Cheers, thanks.

 Do you need the index to be updated this often?  
Wouldn't increasing the autocommit time make it worse? (ie more
documents buffered) I can extend it and see what effect it has

-Nick

 

-Original Message-
From: Ryan McKinley [mailto:ryan...@gmail.com]
Sent: 13 January 2010 23:16
To: solr-user@lucene.apache.org
Subject: Re: Interesting OutOfMemoryError on a 170M index


On Jan 13, 2010, at 5:34 PM, Minutello, Nick wrote:

 Agreed, commit every second.

Do you need the index to be updated this often?  Are you reading from it
every second?  and need results that are that fresh

If not, i imagine increasing the auto-commit time to 1min or even 10
secs would help some.

Re, calling commit from the client with auto-commit...  if you are using
auto-commit, you should not call commit from the client

ryan




 Assuming I understand what you're saying correctly:
 There shouldn't be any index readers - as at this point, just writing 
 to the index.
 Did I understand correctly what you meant?

 -Nick

 -Original Message-
 From: Lance Norskog [mailto:goks...@gmail.com]
 Sent: 13 January 2010 22:28
 To: solr-user@lucene.apache.org
 Subject: Re: Interesting OutOfMemoryError on a 170M index

 The time in autocommit is in milliseconds. You are committing every
 second while indexing. This then causes a build-up of successive index
 readers that absorb each commit, which is probably the out-of-memory.

 On Wed, Jan 13, 2010 at 10:36 AM, Minutello, Nick 
 nick.minute...@credit-suisse.com
  wrote:

 Hi,

 I have a bit of an interesting OutOfMemoryError that I'm trying to 
 figure out.

 My client & Solr server are running in the same JVM (for deployment 
 simplicity). FWIW, I'm using Jetty to host Solr. I'm using the 
 supplied code for the http-based client interface. Solr 1.3.0.

 My app is adding about 20,000 documents per minute to the index - one

 at a time (it is listening to an event stream and for every event, it

 adds a new document to the index).
 The size of the documents, however, is tiny - the total index growth 
 is only about 170M (after about 1 hr and the OutOfMemoryError) At 
 this point, there is zero querying happening - just updates to the 
 index (only adding documents, no updates or deletes) After about an 
 hour or so, my JVM runs out of heap space - and if I look at the 
 memory utilisation over time, it looks like a classic memory leak. It

 slowly ramps up until we end up with constant FULL GC's and eventual 
 OOME.
 Max heap space is 512M.

 In Solr, I'm using autocommit (to buffer the updates)
 <autoCommit>
   <maxDocs>1</maxDocs>
   <maxTime>1000</maxTime>
 </autoCommit>

 (Aside: Now, I'm not sure if I am meant to call commit or not on the 
 client SolrServer class if I am using autocommit - but as it turns 
 out, I get OOME whether I do that or not)

 Any suggestions/advice of quick things to check before I dust off the

 profiler?

 Thanks in advance.

 Cheers,
 Nick






 --
 Lance Norskog
 goks...@gmail.com






Re: Encountering a roadblock with my Solr schema design...use dedupe?

2010-01-13 Thread Chris Hostetter

: Dedupe is completely the wrong word. Deduping is something else
: entirely - it is about trying not to index the same document twice.

Dedup can also certainly be used with field collapsing -- that was one of 
the initial use cases identified for the SignatureUpdateProcessorFactory 
... you can compute an 'expensive' signature when adding a document, index 
it, and then FieldCollapse on that signature field.

This gives you query-time deduplication based on a value computed when 
indexing. The canonical example is multiple URLs referencing the same 
content but with slightly different boilerplate markup. You can use a 
Signature class that recognizes the boilerplate and computes an identical 
signature value for each URL whose content is the same, but still index 
all of the URLs and their content as distinct documents ... so use cases 
where people only want distinct URLs work using field collapse, while by 
default all matching documents can still be returned and searches on text 
in the boilerplate markup also still work.


-Hoss
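
The SignatureUpdateProcessorFactory Hoss describes is wired up as an update processor chain in solrconfig.xml; a sketch, with signature field and source fields illustrative:

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">url,content</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With overwriteDupes=false every document stays in the index, and the signature field is then available for query-time collapsing as described above.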



Re: Encountering a roadblock with my Solr schema design...use dedupe?

2010-01-13 Thread Kelly Taylor

Hoss,

Would you suggest using dedup for my use case; and if so, do you know of a
working example I can reference?

I don't have an issue using the patched version of Solr, but I'd much rather
use the GA version.

-Kelly



hossman wrote:
 
 
 : Dedupe is completely the wrong word. Deduping is something else
 : entirely - it is about trying not to index the same document twice.
 
 Dedup can also certainly be used with field collapsing -- that was one of 
 the initial use cases identified for the SignatureUpdateProcessorFactory 
 ... you can compute an 'expensive' signature when adding a document, index 
 it, and then FieldCollapse on that signature field.
 
  This gives you query-time deduplication based on a value computed when 
  indexing. The canonical example is multiple URLs referencing the same 
  content but with slightly different boilerplate markup. You can use a 
  Signature class that recognizes the boilerplate and computes an identical 
  signature value for each URL whose content is the same, but still index 
  all of the URLs and their content as distinct documents ... so use cases 
  where people only want distinct URLs work using field collapse, while by 
  default all matching documents can still be returned and searches on text 
  in the boilerplate markup also still work.
 
 
 -Hoss
 
 
 

-- 
View this message in context: 
http://old.nabble.com/Encountering-a-roadblock-with-my-Solr-schema-design...use-dedupe--tp27118977p27155115.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: What is this error means?

2010-01-13 Thread Ellery Leung

Hi Israel

Thank you for your response.

However, I used both ini_set and set the _defaultTimeout to 6000, but the
error still occurs with the same message.

Now, when I start building the index, the error pops up even faster than
it did before the change.

So do you have any idea?

Thank you in advance for your help.
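
For completeness, the global socket timeout Israel refers to is a php.ini setting (the value here is illustrative):

```ini
; php.ini -- default is 60 seconds; raise it for slow commits
default_socket_timeout = 600
```

Alternatively, set it at runtime with ini_set('default_socket_timeout', '600') before constructing Apache_Solr_Service.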




Israel Ekpo wrote:
 
 Ellery,
 
 A preliminary look at the source code indicates that the error is
 happening
 because the solr server is taking longer than expected to respond to the
 client
 
 http://code.google.com/p/solr-php-client/source/browse/trunk/Apache/Solr/Service.php
 
 The default timeout handed down to Apache_Solr_Service::_sendRawPost() is
 60 seconds, since you were calling the addDocument() method.
 
 So if it took longer than that (1 minute), then it will exit with that
 error
 message.
 
 You will have to increase the default value to something very high like 10
 minutes or so on line 252 in the source code since there is no way to
 specify that in the constructor or the addDocument method.
 
 Another alternative will be to update the default_socket_timeout in the
 php.ini file or in the code using ini_set
 
 I hope that helps
 
 
 
 On Tue, Jan 12, 2010 at 9:33 PM, Ellery Leung elleryle...@be-o.com
 wrote:
 

 Hi, here is the stack trace:

 Fatal error:  Uncaught exception 'Exception' with message '"0"
 Status: Communication Error' in
 C:\nginx\html\lib\SolrPhpClient\Apache\Solr\Service.php:385
 Stack trace:
 #0 C:\nginx\html\lib\SolrPhpClient\Apache\Solr\Service.php(652):
 Apache_Solr_Service->_sendRawPost('http://127.0.0', '<add allowDups=...')
 #1 C:\nginx\html\lib\SolrPhpClient\Apache\Solr\Service.php(676):
 Apache_Solr_Service->add('<add allowDups=...')
 #2
 C:\nginx\html\apps\milio\lib\System\classes\SolrSearchEngine.class.php(221):
 Apache_Solr_Service->addDocument(Object(Apache_Solr_Document))
 #3
 C:\nginx\html\apps\milio\lib\System\classes\SolrSearchEngine.class.php(262):
 SolrSearchEngine->buildIndex(Array, 'key')
 #4
 C:\nginx\html\apps\milio\lib\System\classes\Indexer\Indexer.class.php(51):
 SolrSearchEngine->createFullIndex('contacts', Array, 'key', 'www')
 #5 C:\nginx\html\apps\milio\lib\System\functions\createIndex.php(64):
 Indexer->create('www')
 #6 {main}
   thrown in C:\nginx\html\lib\SolrPhpClient\Apache\Solr\Service.php on line 385

 C:\nginx\html\apps\milio\htdocs\Contacts>pause
 Press any key to continue . . .

 Thanks for helping me.


 Grant Ingersoll-6 wrote:
 
  Do you have a stack trace?
 
  On Jan 12, 2010, at 2:54 AM, Ellery Leung wrote:
 
  When I am building the index for around 2 ~ 25000 records,
 sometimes
  I
  came across with this error:
 
 
 
  Uncaught exception Exception with message '0' Status: Communication
  Error
 
 
 
  I searched Google & Yahoo but found no answer.
 
 
 
  I am now committing documents to Solr every 10 records fetched from a
  SQLite database with PHP 5.3.
 
 
 
  Platform: Windows 7 Home
 
  Web server: Nginx
 
  Solr Specification Version: 1.4.0
 
  Solr Implementation Version: 1.4.0 833479 - grantingersoll -
 2009-11-06
  12:33:40
 
  Lucene Specification Version: 2.9.1
 
  Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25
 
  Solr hosted in jetty 6.1.3
 
 
 
  All the above are in one single test machine.
 
 
 
  The situation is that sometimes when I build the index, it can be
 created
  successfully.  But sometimes it will just stop with the above error.
 
 
 
  Any clue?  Please help.
 
 
 
  Thank you in advance.
 
 
 
 

 --
 View this message in context:
 http://old.nabble.com/What-is-this-error-means--tp27123815p27138658.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 
 
 -- 
 Good Enough is not good enough.
 To give anything less than your best is to sacrifice the gift.
 Quality First. Measure Twice. Cut Once.
 http://www.israelekpo.com/
 
 




RE: Reverse sort facet query [SOLR-1672]

2010-01-13 Thread Chris Hostetter

: i.e. just extend facet.sort to allow a 'count desc'. By convention, ok 
: to use a a space in the name? - or would count.desc (and count.asc as 
: alias for count) be more compliant?

i would use space to remain consistent with the existing sort 
param. 

it might even make sense to refactor (re/ab)use the existing sort 
parsing code in QueryParsing.parseSort ... but now that that also know 
about parsing functions it's a bit hairry ... so that does seem a little 
crazy.




: 
:  
: 
: Peter
:  
: 
: 
: _
: We want to hear all your funny, exciting and crazy Hotmail stories. Tell us 
now
: http://clk.atdmt.com/UKM/go/195013117/direct/01/



-Hoss



Re: What is this error means?

2010-01-13 Thread Ellery Leung

Here are a workaround of this issue:

On line 382 of SolrPhpClient/Apache/Solr/Service.php, I changed the code to:

while (true) {
    $str = file_get_contents($url, false, $this->_postContext);
    if (empty($str) == false) {
        break;
    }
}

$response = new Apache_Solr_Response($str, $http_response_header,
    $this->_createDocuments, $this->_collapseSingleValueArrays);

I found that, for some strange reason on Windows, when you post data to be
indexed, Solr sometimes fails to receive it. Therefore I added an infinite
loop: if no response is received ($str is empty), we post it again.

Side effect: when I watch the Windows console, it sometimes
prints:

Failed to open stream: HTTP request failed!

I haven't researched it yet, but the index is built successfully.

Hope it helps someone.





Ellery Leung wrote:
 
 Hi Israel
 
 Thank you for your response.
 
 However, I use both ini_set and set the _defaultTimeout to 6000 but the
 error still occur with same error message.
 
 Now, when I start build the index, the error pops up much faster than
 changing it before.
 
 So do you have any idea?
 
 Thank you in advance for your help.
 
 
 
 
 Israel Ekpo wrote:
 
 Ellery,
 
 A preliminary look at the source code indicates that the error is
 happening
 because the solr server is taking longer than expected to respond to the
 client
 
 http://code.google.com/p/solr-php-client/source/browse/trunk/Apache/Solr/Service.php
 
 The default timeout handed down to Apache_Solr_Service::_sendRawPost() is
 60 seconds, since you were calling the addDocument() method.
 
 So if it took longer than that (1 minute), then it will exit with that
 error
 message.
 
 You will have to increase the default value to something very high like
 10
 minutes or so on line 252 in the source code since there is no way to
 specify that in the constructor or the addDocument method.
 
 Another alternative will be to update the default_socket_timeout in the
 php.ini file or in the code using ini_set
 
 I hope that helps
 
 
 
 On Tue, Jan 12, 2010 at 9:33 PM, Ellery Leung elleryle...@be-o.com
 wrote:
 

 Hi, here is the stack trace:

 Fatal error:  Uncaught exception 'Exception' with message '"0"
 Status: Communication Error' in
 C:\nginx\html\lib\SolrPhpClient\Apache\Solr\Service.php:385
 Stack trace:
 #0 C:\nginx\html\lib\SolrPhpClient\Apache\Solr\Service.php(652):
 Apache_Solr_Service->_sendRawPost('http://127.0.0', '<add allowDups=...')
 #1 C:\nginx\html\lib\SolrPhpClient\Apache\Solr\Service.php(676):
 Apache_Solr_Service->add('<add allowDups=...')
 #2
 C:\nginx\html\apps\milio\lib\System\classes\SolrSearchEngine.class.php(221):
 Apache_Solr_Service->addDocument(Object(Apache_Solr_Document))
 #3
 C:\nginx\html\apps\milio\lib\System\classes\SolrSearchEngine.class.php(262):
 SolrSearchEngine->buildIndex(Array, 'key')
 #4
 C:\nginx\html\apps\milio\lib\System\classes\Indexer\Indexer.class.php(51):
 SolrSearchEngine->createFullIndex('contacts', Array, 'key', 'www')
 #5 C:\nginx\html\apps\milio\lib\System\functions\createIndex.php(64):
 Indexer->create('www')
 #6 {main}
   thrown in C:\nginx\html\lib\SolrPhpClient\Apache\Solr\Service.php on line 385

 C:\nginx\html\apps\milio\htdocs\Contacts>pause
 Press any key to continue . . .

 Thanks for helping me.


 Grant Ingersoll-6 wrote:
 
  Do you have a stack trace?
 
  On Jan 12, 2010, at 2:54 AM, Ellery Leung wrote:
 
  When I am building the index for around 2 ~ 25000 records,
 sometimes
  I
  came across with this error:
 
 
 
  Uncaught exception Exception with message '0' Status: Communication
  Error
 
 
 
   I searched Google & Yahoo but found no answer.
 
 
 
  I am now committing document to solr on every 10 records fetched from
 a
  SQLite Database with PHP 5.3.
 
 
 
  Platform: Windows 7 Home
 
  Web server: Nginx
 
  Solr Specification Version: 1.4.0
 
  Solr Implementation Version: 1.4.0 833479 - grantingersoll -
 2009-11-06
  12:33:40
 
  Lucene Specification Version: 2.9.1
 
  Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25
 
  Solr hosted in jetty 6.1.3
 
 
 
  All the above are in one single test machine.
 
 
 
  The situation is that sometimes when I build the index, it can be
 created
  successfully.  But sometimes it will just stop with the above error.
 
 
 
  Any clue?  Please help.
 
 
 
  Thank you in advance.
 
 
 
 

 --
 View this message in context:
 http://old.nabble.com/What-is-this-error-means--tp27123815p27138658.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 
 
 -- 
 Good Enough is not good enough.
 To give anything less than your best is to sacrifice the gift.
 Quality First. Measure Twice. Cut Once.
 http://www.israelekpo.com/
 
 
 
 




Re: Queries of type field:value not functioning

2010-01-13 Thread Siddhant Goel
Hi,

Thanks for the responses.
q.alt did the job. It turns out the dismax query parser was the culprit, as
it doesn't handle queries of the type *:*. Putting the query in q.alt, or
adding defType=lucene (as pointed out to me on the IRC channel), worked.

Thanks,


-- 
- Siddhant
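
For anyone hitting the same wall: dismax treats q as plain user keywords, so *:* must go in q.alt (which is used when q is absent) or the request must switch parsers. Illustrative request URLs, with host, port, and handler assumed:

```
http://localhost:8983/solr/select?defType=dismax&q.alt=*:*&rows=10
http://localhost:8983/solr/select?q=*:*&defType=lucene
```

The same q.alt value can also be baked into the request handler defaults in solrconfig.xml.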