date:20090513

Hmmm, maybe we need to think about someway to hook this into the build  
process or make it easier to just drop it into the conf or lib dirs.   
I'm no web.xml expert, but I'm sure you're not the first one to want  
to do this kind of thing.


The easiest way _might_ be to patch build.xml to take a property for  
the location of the web.xml, defaulting to the current Solr one.   
Then, people who want to use their own version could just pass in - 
Dweb.xml=path to my web.xml.  The downside to this is that it may  
cause problems for us devs when users ask questions about strange  
behavior and it turns out they have mucked up the web.xml


FYI: dist-war is in build.xml, not common-build.xml.

-Grant

On May 12, 2009, at 5:52 AM, Jacob Singh wrote:


Hi folks,

I just wrote a Servlet Filter to handle authentication for our
service.  Here's what I did:

1. Created a dir in contrib
2. Put my project in there, I took the dataimporthandler build.xml as
an example and modified it to suit my needs.  Worked great!
3. ant dist now builds my jar and includes it

I now need to modify web.xml to add my filter-mapping, init params,
etc.  How can I do this cleanly?  Or do I need to manually open up the
archive and edit it and then re-war it?

In common-build I don't see a target for dist-war, so don't see how it
is possible...

Thanks!
Jacob

--

+1 510 277-0891 (o)
+91  33 7458 (m)

web: http://pajamadesign.com

Skype: pajamadesign
Yahoo: jacobsingh
AIM: jacobsingh
gTalk: jacobsi...@gmail.com


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search

Re: camel-casing and dismax troubles

2009-05-13 Thread Yonik Seeley

On Tue, May 12, 2009 at 7:19 PM, Geoffrey Young
ge...@modperlcookbook.org wrote:
 hi all :)

 I'm having trouble with camel-cased query strings and the dismax handler.

 a user query

  LeAnn Rimes

 isn't matching the indexed term

  Leann Rimes

This is the camel-case case that can't currently be handled by a
single WordDelimiterFilter.

If the indexeddoc had LeAnn, then it would be indexed as
le,ann/leann and hence queries of both forms le ann and
leann would match.

However since the indexed term is simply leann, a
WordDelimiterFilter configured to split won't match (a search for
LeAnn will be translated into a search for le ann.

One way to work around this now is to do a copyField into another
field that catenates split terms in the query analyzer instead of
generating/splitting, and then search across both fields.

BTW, your parsed query below shows you turned on both catenation and
generation (or perhaps preserveOriginal) for split subwords in your
query analyzer.  Unfortunately this configuration doesn't work due to
the ambiguity of what it means to have multiple terms at the same
position (this is the same problem for multi-word synonyms at query
time).  The query shown below looks for leann or le followed by
ann and hence an indexed term of leann won't match.

-Yonik
http://www.lucidimagination.com

 even though both are lower-cased in the end.  furthermore, the
 analysis tool shows a match.

 the debug query looks like

  parsedquery:+((DisjunctionMaxQuery((search-en:\(leann le)
 ann\)) DisjunctionMaxQuery((search-en:rimes)))~2) (),

 I have a feeling it's due to how the broken up tokens are added back
 into the token stream with PreserveOriginal, and some strange
 interaction between that order and dismax, but I'm not entirely sure.

 configs follow.  thoughts appreciated.

 --Geoff

  fieldType name=search-en class=solr.TextField
 positionIncrementGap=100
    analyzer type=index
      tokenizer class=solr.WhitespaceTokenizerFactory/
      filter class=solr.ISOLatin1AccentFilterFactory /
      filter class=solr.WordDelimiterFilterFactory preserveOriginal=1
                                                      generateWordParts=1
                                                      generateNumberParts=1
                                                      catenateWords=1
                                                      catenateNumbers=1
                                                      catenateAll=1/
      filter class=solr.LowerCaseFilterFactory/
      filter class=solr.SynonymFilterFactory
 synonyms=synonyms.txt ignoreCase=true expand=true/
      filter class=solr.StopFilterFactory ignoreCase=false
 words=stopwords-en.txt/
    /analyzer

    analyzer type=query
      tokenizer class=solr.WhitespaceTokenizerFactory/
      filter class=solr.ISOLatin1AccentFilterFactory /
      filter class=solr.WordDelimiterFilterFactory preserveOriginal=1
                                                      generateWordParts=1
                                                      generateNumberParts=1
                                                      catenateWords=0
                                                      catenateNumbers=0
                                                      catenateAll=0/
      filter class=solr.LowerCaseFilterFactory/
      filter class=solr.StopFilterFactory ignoreCase=false
 words=stopwords-en.txt/
    /analyzer
  /fieldType

Delete documents from index with dataimport

2009-05-13 Thread Andrew McCombe

Hi

Is it possible, through dataimport handler to remove an existing
document from the Solr index?

I import/update from my database where the active field is true.
However, if the client then set's active to false, the document stays
in the Solr index and doesn't get removed.

Regards
Andrew

RE: Selective Searches Based on User Identity

2009-05-13 Thread Terence Gannon

Yes, the ownerUid will likely be assigned once and never changed.  But
you still need it, in order to keep track of who has contributed which
document.

I've been going over some of the simpler query scenarios, and Solr is
capable of handling them without having to resort to an external
RDBMS.  In order to limit documents to those which a given user owns,
or those to which he has been granted access, the syntax fragment
would be something like;

ownerUid:ab2734 or grantedUid:ab2734

where abs2734 is the uid for the user doing the query.  However, I'm
less comfortable with more complex query scenarios, particularly if
the concept of groups is eventually introduced, which is likely in my
scenario.
In the latter case, it may be necessary to use an external RDBMS.
I'll plead ignorance of the 'ineluctable filter query' and will have
to read up on that one.

With respect to updates to rights, they are not likely to be that
frequent, but when they are, they entire document will have to be
reindexed rather than simply updating the grantedUid and/or deniedUid
fields.  I don't believe Solr supports the updating of individual
fields, at least not yet.  This may be another reason to eventually go
to an external RDBMS.

Thanks very much for your help!

Terence

-Original Message-
From: Michael Ludwig
Sent: May 13, 2009 05:27
To: solr-user@lucene.apache.org
Subject: Re: Selective Searches Based on User Identity

Terence Gannon schrieb:
 Paul -- thanks for the reply, I appreciate it.  That's a very
 practical approach, and is worth taking a closer look at.  Actually,
 taking your idea one step further, perhaps three fields; 1) ownerUid
 (uid of the document's owner) 2) grantedUid (uid of users who have
 been granted access), and 3) deniedUid (uid of users specifically
 denied access to the document).

Grants might change quite a bit, the owner will likely remain the same.

Wouldn't it be better to include only the owner in the document and
store grants someplace else, like in an RDBMS or - if you don't want
one - a lightweight embedded database like BDB?

That way you could have your application tag an ineluctable filter query
onto each and every user query, which would ensure to include only those
documents in the results the owner of which has granted the user access.

Considering that I'm a Solr/Lucene newbie, this approach might have a
disadvantage that escapes me, which is why other people haven't made
this particular suggestion. If so, I'd be happy to learn why this isn't
preferable.

Michael Ludwig

Solr vs Sphinx

2009-05-13 Thread wojtekpia


I came across this article praising Sphinx:
http://www.theregister.co.uk/2009/05/08/dziuba_sphinx/. The article
specifically mentions Solr as an 'aging' technology, and states that
performance on Sphinx is 2x-4x faster than Solr. Has anyone compared Sphinx
to Solr? Or used Sphinx in the past? I realize that you can't just say one
is faster than the other because it depends so much on configuration,
requirements, # docs, size of each doc, etc. I'm just looking for general
observations. I've found other articles comparing Solr with Sphinx and most
state that performance is similar between the two. 

Thanks,

Wojtek
-- 
View this message in context: 
http://www.nabble.com/Solr-vs-Sphinx-tp23524676p23524676.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Replication master+slave

2009-05-13 Thread Bryan Talbot

I see that Nobel's final comment in SOLR-1154 is that config files  
need to be able to include snippets from external files.  In my  
limited testing, a simple patch to enable XInclude support seems to  
work.




--- src/java/org/apache/solr/core/Config.java   (revision 774137)
+++ src/java/org/apache/solr/core/Config.java   (working copy)
@@ -100,8 +100,10 @@
  if (lis == null) {
lis = loader.openConfig(name);
  }
-  javax.xml.parsers.DocumentBuilder builder =  
DocumentBuilderFactory.newInstance().newDocumentBuilder();

-  doc = builder.parse(lis);
+  javax.xml.parsers.DocumentBuilderFactory dbf =  
DocumentBuilderFactory.newInstance();

+  dbf.setNamespaceAware(true);
+  dbf.setXIncludeAware(true);
+  doc = dbf.newDocumentBuilder().parse(lis);

DOMUtil.substituteProperties(doc, loader.getCoreProperties());
} catch (ParserConfigurationException e)  {



This allows a clause like this to include the contents of  
replication.xml if it exists.  If it's not found an exception will be  
thrown.


!-- include external file to define replication configuration --
xi:include href=http://localhost:8983/solr/corename/admin/file/?file=replication.xml 


 xmlns:xi=http://www.w3.org/2001/XInclude;
/xi:include


If the file is optional and no exception should be thrown if the file  
is missing, simply include a fallback action: in this case the  
fallback is empty and does nothing.


!-- include external file to define replication configuration --
xi:include href=http://localhost:8983/solr/forum_en/admin/file/?file=replication.xml 


 xmlns:xi=http://www.w3.org/2001/XInclude;
xi:fallback/
/xi:include


-Bryan




On May 12, 2009, at May 12, 8:05 PM, Jian Han Guo wrote:

I was looking at the same problem, and had a discussion with Noble.  
You can

use a hack to achieve what you want, see

https://issues.apache.org/jira/browse/SOLR-1154

Thanks,

Jianhan


On Tue, May 12, 2009 at 5:13 PM, Bryan Talbot  
btal...@aeriagames.comwrote:


So how are people managing solrconfig.xml files which are largely  
the same

other than differences for replication?

I don't think it's a good thing to maintain two copies of the  
same file

and I'd like to avoid that.  Maybe enabling the XInclude feature in
DocumentBuilders would make it possible to modularize configuration  
files to

make this possible?


http://java.sun.com/j2se/1.5.0/docs/api/javax/xml/parsers/DocumentBuilderFactory.html#setXIncludeAware(boolean) 
http://java.sun.com/j2se/1.5.0/docs/api/javax/xml/parsers/DocumentBuilderFactory.html#setXIncludeAware%28boolean%29 




-Bryan





On May 12, 2009, at May 12, 11:43 AM, Shalin Shekhar Mangar wrote:

On Tue, May 12, 2009 at 10:42 PM, Bryan Talbot  
btal...@aeriagames.com

wrote:


For replication in 1.4, the wiki at
http://wiki.apache.org/solr/SolrReplication says that a node can  
be both

the master and a slave:

A node can act as both master and slave. In that case both the  
master and

slave configuration lists need to be present inside the
ReplicationHandler
requestHandler in the solrconfig.xml.

What does this mean?  Does the core then poll itself for updates?




No. This type of configuration is meant for repeaters. Suppose  
there are
slaves in multiple data-centers (say data center A and B). There  
is always

a
single master (say in A). One of the slaves in B is used as a  
master for

the
other slaves in B. Therefore, this one slave in B is both a master  
as well

as the slave.



I'd like to have a single set of configuration files that are  
shared by

masters and slaves and avoid duplicating configuration details in
multiple
files (one for master and one for slave) to ease management and  
failover.

Is this possible?


You wouldn't want the master to be a slave. So I guess you'd need  
to have

a
separate file. Also, it needs to be a separate file so that the  
slave does

not become a master when the solrconfig.xml is replicated.



When I attempt to setup a multi server master-slave configuration  
and
include both master and slave replication configuration options,  
I into

some
problems.  I'm  running a nightly build from May 7.


Not sure what happened. Is that the url for this solr (meaning  
same solr

url
is master and slave of itself)? If yes, that is not a valid  
configuration.


--
Regards,
Shalin Shekhar Mangar.

Re: Solr vs Sphinx

2009-05-13 Thread Yonik Seeley

It's probably the case that every search engine out there is faster
than Solr at one thing or another, and that Solr is faster or better
at some other things.

I prefer to spend my time improving Solr rather than engage in
benchmarking wars... and Solr 1.4 will have a ton of speed
improvements over Solr 1.3.

-Yonik
http://www.lucidimagination.com

Re: camel-casing and dismax troubles

2009-05-13 Thread Geoffrey Young

On Wed, May 13, 2009 at 6:23 AM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Tue, May 12, 2009 at 7:19 PM, Geoffrey Young
 ge...@modperlcookbook.org wrote:
 hi all :)

 I'm having trouble with camel-cased query strings and the dismax handler.

 a user query

  LeAnn Rimes

 isn't matching the indexed term

  Leann Rimes

 This is the camel-case case that can't currently be handled by a
 single WordDelimiterFilter.

 If the indexeddoc had LeAnn, then it would be indexed as
 le,ann/leann and hence queries of both forms le ann and
 leann would match.

 However since the indexed term is simply leann, a
 WordDelimiterFilter configured to split won't match (a search for
 LeAnn will be translated into a search for le ann.

but the concatparts and/or concatall should handle splicing the tokens
back together, right?


 One way to work around this now is to do a copyField into another
 field that catenates split terms in the query analyzer instead of
 generating/splitting, and then search across both fields.

yeah, unforunately, that's not an option for me :)


 BTW, your parsed query below shows you turned on both catenation and
 generation (or perhaps preserveOriginal) for split subwords in your
 query analyzer.  Unfortunately this configuration doesn't work due to
 the ambiguity of what it means to have multiple terms at the same
 position (this is the same problem for multi-word synonyms at query
 time).  The query shown below looks for leann or le followed by
 ann and hence an indexed term of leann won't match.

ugh.  ok, thanks for letting me know.

I'm not using the same concat parameters on the index as the query
based on the solr wiki docs.  but I've always wondered if that was a
good idea.  I'll see if matching them up helps at all.

thanks.  I'll let you know what I find.

--Geoff

Re: Solr vs Sphinx



On May 13, 2009, at 11:55 AM, wojtekpia wrote:



I came across this article praising Sphinx:
http://www.theregister.co.uk/2009/05/08/dziuba_sphinx/. The article
specifically mentions Solr as an 'aging' technology,


Solr is the same age as Sphinx (2006), so if Solr is aging, then so is  
Sphinx.  But, hey aren't we all aging?  It sure beats not aging.  ;-)   
That being said, we are always open to suggestions and improvements.   
Lucene has seen a massive speedup on indexing that comes through in  
Solr in the past year (and it was fast before), and Solr 1.4 looks to  
be faster than 1.3 (and it was fast before, too.)  The Solr community  
is clearly interested in moving things forward and staying fresh, as  
is the Lucene community.



and states that
performance on Sphinx is 2x-4x faster than Solr. Has anyone compared  
Sphinx
to Solr? Or used Sphinx in the past? I realize that you can't just  
say one

is faster than the other because it depends so much on configuration,
requirements, # docs, size of each doc, etc. I'm just looking for  
general
observations. I've found other articles comparing Solr with Sphinx  
and most

state that performance is similar between the two.


I can't speak to Sphinx, as I haven't used it.

As for performance tests, those are always apples and oranges.  If one  
camp does them, then the other camp says You don't know how to use  
our product and vice versa.  I think that applies here.  So, when you  
see things like Internal tests show that is always a red flag in my  
mind.  I've contacted others in the past who have done comparisons  
and after one round of emailing it was almost always clear that they  
didn't know what best practices are for any given product and thus  
were doing things sub-optimally.


One thing in the article that is worthwhile to consider is the fact  
that some (most?) people would likely benefit from not removing  
stopwords, as they can enhance phrase based searching and thus improve  
relevance.  Obviously, with Solr, it is easy to keep stopwords by  
simply removing the StopwordFilterFactor from the analysis process and  
then dealing with them appropriately at query time.  However, it is  
likely the case that too many Solr users simply rely on the example  
schema when it comes to setup instead of actively investigating what  
the proper choices are for their situation.


Finally, an old baseball saying comes to mind: Pitchers only bother  
to throw at .300 hitters.  Solr is a pretty darn full featured search  
platform with a large and active community, a commercial friendly  
license, and it also performs quite well.


-Grant

Re: Solr vs Sphinx

2009-05-13 Thread Todd Benge

Our company has a large search deployment serving  50 M search hits / per
day.

We've been leveraging Lucene for several years and have recently deployed
Solr for the distributed search feature.  We were hitting scaling limits
with lucene due to our index size.

I did an evaluation of Sphinx and found Solr / Lucene to be more suitable
for our needs and much more flexible.  Performance in the Solr deployment (
especially with 1.4) has been better than expected.

Thanks to all the Solr developers for a great product.

Hopefully we'll have the opportunity to contribute to the project as it
moves forward.

Todd

On Wed, May 13, 2009 at 10:33 AM, Grant Ingersoll gsing...@apache.orgwrote:


 On May 13, 2009, at 11:55 AM, wojtekpia wrote:


 I came across this article praising Sphinx:
 http://www.theregister.co.uk/2009/05/08/dziuba_sphinx/. The article
 specifically mentions Solr as an 'aging' technology,


 Solr is the same age as Sphinx (2006), so if Solr is aging, then so is
 Sphinx.  But, hey aren't we all aging?  It sure beats not aging.  ;-)  That
 being said, we are always open to suggestions and improvements.  Lucene has
 seen a massive speedup on indexing that comes through in Solr in the past
 year (and it was fast before), and Solr 1.4 looks to be faster than 1.3 (and
 it was fast before, too.)  The Solr community is clearly interested in
 moving things forward and staying fresh, as is the Lucene community.

  and states that
 performance on Sphinx is 2x-4x faster than Solr. Has anyone compared
 Sphinx
 to Solr? Or used Sphinx in the past? I realize that you can't just say one
 is faster than the other because it depends so much on configuration,
 requirements, # docs, size of each doc, etc. I'm just looking for general
 observations. I've found other articles comparing Solr with Sphinx and
 most
 state that performance is similar between the two.


 I can't speak to Sphinx, as I haven't used it.

 As for performance tests, those are always apples and oranges.  If one camp
 does them, then the other camp says You don't know how to use our product
 and vice versa.  I think that applies here.  So, when you see things like
 Internal tests show that is always a red flag in my mind.  I've contacted
 others in the past who have done comparisons and after one round of
 emailing it was almost always clear that they didn't know what best
 practices are for any given product and thus were doing things
 sub-optimally.

 One thing in the article that is worthwhile to consider is the fact that
 some (most?) people would likely benefit from not removing stopwords, as
 they can enhance phrase based searching and thus improve relevance.
  Obviously, with Solr, it is easy to keep stopwords by simply removing the
 StopwordFilterFactor from the analysis process and then dealing with them
 appropriately at query time.  However, it is likely the case that too many
 Solr users simply rely on the example schema when it comes to setup instead
 of actively investigating what the proper choices are for their situation.

 Finally, an old baseball saying comes to mind: Pitchers only bother to
 throw at .300 hitters.  Solr is a pretty darn full featured search platform
 with a large and active community, a commercial friendly license, and it
 also performs quite well.

 -Grant

Re: master/slave failure scenario

2009-05-13 Thread Jay Hill

- Migrate configuration files from old master (or backup) to new master.
- Replicate from a slave to the new master.
- Resume indexing to new master.

-Jay

On Wed, May 13, 2009 at 4:26 AM, nk 11 nick.cass...@gmail.com wrote:

 Nice.
 What if the master fails permanently (like a disk crash...) and the new
 master is a clean machine?
 2009/5/13 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com

  On Wed, May 13, 2009 at 12:10 PM, nk 11 nick.cass...@gmail.com wrote:
   Hello
  
   I'm kind of new to Solr and I've read about replication, and the fact
  that a
   node can act as both master and slave.
   I a replica fails and then comes back on line I suppose that it will
  resyncs
   with the master.
  right
  
   But what happnes if the master fails? A slave that is configured as
  master
   will kick in? What if that slave is not yes fully sync'ed with the
 failed
   master and has old data?
  if the master fails you can't index the data. but the slaves will
  continue serving the requests with the last index. You an bring back
  the master up and resume indexing.
 
  
   What happens when the original master comes back on line? He will
 remain
  a
   slave because there is another node with the master role?
  
   Thank you!
  
 
 
 
  --
  -
  Noble Paul | Principal Engineer| AOL | http://aol.com

Re: master/slave failure scenario

2009-05-13 Thread Bryan Talbot


Or ...

1. Promote existing slave to new master
2. Add new slave to cluster




-Bryan




On May 13, 2009, at May 13, 9:48 AM, Jay Hill wrote:

- Migrate configuration files from old master (or backup) to new  
master.

- Replicate from a slave to the new master.
- Resume indexing to new master.

-Jay

On Wed, May 13, 2009 at 4:26 AM, nk 11 nick.cass...@gmail.com wrote:


Nice.
What if the master fails permanently (like a disk crash...) and the  
new

master is a clean machine?
2009/5/13 Noble Paul നോബിള്‍ नोब्ळ्  
noble.p...@corp.aol.com


On Wed, May 13, 2009 at 12:10 PM, nk 11 nick.cass...@gmail.com  
wrote:

Hello

I'm kind of new to Solr and I've read about replication, and the  
fact

that a

node can act as both master and slave.
I a replica fails and then comes back on line I suppose that it  
will

resyncs

with the master.

right


But what happnes if the master fails? A slave that is configured as

master

will kick in? What if that slave is not yes fully sync'ed with the

failed

master and has old data?

if the master fails you can't index the data. but the slaves will
continue serving the requests with the last index. You an bring back
the master up and resume indexing.



What happens when the original master comes back on line? He will

remain

a

slave because there is another node with the master role?

Thank you!





--
-
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: camel-casing and dismax troubles

2009-05-13 Thread Yonik Seeley

On Wed, May 13, 2009 at 12:29 PM, Geoffrey Young
ge...@modperlcookbook.org wrote:
 However since the indexed term is simply leann, a
 WordDelimiterFilter configured to split won't match (a search for
 LeAnn will be translated into a search for le ann.

 but the concatparts and/or concatall should handle splicing the tokens
 back together, right?

Yes, but you can't do both at once on the query side (split and
concat)... you have to pick one or the other (hence the workaround of
using more than one field).

-Yonik
http://www.lucidimagination.com

Re: how to manually add data to indexes generated by nutch-1.0 using solr

2009-05-13 Thread alxsss


 I forget to say that when I do 

curl http://localhost:8983/solr/update -H Content-Type: text/xml 
--data-binary 'commit waitFlush=false waitSearcher=false/'
?xml version=1.0 encoding=UTF-8?
response
lst name=responseHeaderint name=status0/intint 
name=QTime453/int/lst
/response


and search for added keywords gives 0 results. Does status 0 mean that addition 
was successful?

Thanks.
Alex.


 


 

-Original Message-
From: Erik Hatcher e...@ehatchersolutions.com
To: solr-user@lucene.apache.org
Sent: Tue, 12 May 2009 6:48 pm
Subject: Re: how to manually add data to indexes generated by nutch-1.0 using 
solr









send a commit/ request afterwards, or you can add ?commit=true to the /update 
request with the adds.?
?

?  Erik?
?

On May 12, 2009, at 8:57 PM, alx...@aim.com wrote:?
?

?

 Tried to add a new record using?

?

?

?

 curl http://localhost:8983/solr/update -H Content-Type: text/xml -- 
 data-binary 'add?

 doc boost=2.5?

 field name=segment20090512170318/field?

 field name=digest86937aaee8e748ac3007ed8b66477624/field?

 field name=boost0.21189615/field?

 field name=urltest.com/field?

 field name=titletest test/field?

 field name=tstamp 20090513003210909/field?

 /doc /add'?

?

 I get?

?

 ?xml version=1.0 encoding=UTF-8??

 response?

 lst name=responseHeaderint name=status0/intint  
 name=QTime71/int/lst?

 /response?

?

?

 and added records are not found in the search.?

?

 Any ideas what went wrong??

?

?

 Thanks.?

 Alex.?

?

?

?

?

 -Original Message-?

 From: alx...@aim.com?

 To: solr-u...@lucene.apache.org?

 Sent: Mon, 11 May 2009 12:14 pm?

 Subject: how to manually add data to indexes generated by nutch-1.0  using 
 solr?

?

?

?

?

?

?

?

?

?

?

 Hello,?

?

 I had? Nutch -1.0 to crawl fetch and index a lot of files. Then I  needed 
 to??

?

 index a few files also. But I know keywords for those files and their??

 locations. I need to add them manually. I took a look to two  tutorials on 
 the?

 wiki, but did not find any info about this issue.?

 Is there a tutorial on, step by step procedure of adding data to?  nutch 
 index?

 using solr? manually??

?

 Thanks in advance.?

 Alex.?

?

?

?

?

?
?

Re: master/slave failure scenario

2009-05-13 Thread nk 11

This is more interesting.Such a procedure would involve taking down and
reconfiguring the slave?

On Wed, May 13, 2009 at 7:55 PM, Bryan Talbot btal...@aeriagames.comwrote:

 Or ...

 1. Promote existing slave to new master
 2. Add new slave to cluster




 -Bryan





 On May 13, 2009, at May 13, 9:48 AM, Jay Hill wrote:

  - Migrate configuration files from old master (or backup) to new master.
 - Replicate from a slave to the new master.
 - Resume indexing to new master.

 -Jay

 On Wed, May 13, 2009 at 4:26 AM, nk 11 nick.cass...@gmail.com wrote:

  Nice.
 What if the master fails permanently (like a disk crash...) and the new
 master is a clean machine?
 2009/5/13 Noble Paul നോബിള്‍ नोब्ळ् noble.p...@corp.aol.com

  On Wed, May 13, 2009 at 12:10 PM, nk 11 nick.cass...@gmail.com wrote:

 Hello

 I'm kind of new to Solr and I've read about replication, and the fact

 that a

 node can act as both master and slave.
 I a replica fails and then comes back on line I suppose that it will

 resyncs

 with the master.

 right


 But what happnes if the master fails? A slave that is configured as

 master

 will kick in? What if that slave is not yes fully sync'ed with the

 failed

 master and has old data?

 if the master fails you can't index the data. but the slaves will
 continue serving the requests with the last index. You an bring back
 the master up and resume indexing.


 What happens when the original master comes back on line? He will

 remain

 a

 slave because there is another node with the master role?

 Thank you!




 --
 -
 Noble Paul | Principal Engineer| AOL | http://aol.com

Re: Commits taking too long

Hi,

  This problem is still haunting us. I've reduced the merge factor to
50, but as my index get fat (anything over 20G), the commit starts
taking much longer. Some info,

1) Less than 20 G index size, 5000 records commit takes around 15sec
2) Over 20G the commit starts taking 50-70sec for 5K records
3) mergefactor = 50
4) Using multicore - each core is around 70G (currently there are 5
cores maintained by single Solr instance)
5) RAM = 6G
6) OS = OS X 10.5
7) JVM Options:

export JAVA_OPTS=-Xdebug
-Xrunjdwp:transport=dt_socket,server=y,address=3090,suspend=n \
  -server -Xms${MIN_JVM_HEAP}m -Xmx${MAX_JVM_HEAP}m \
  -XX:NewRatio=2 -XX:MaxPermSize=512m \
  -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=${AC_ROOT}/data/pmiJavaHeapDump.hprof \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-Xloggc:gc.log -Dsun.rmi.dgc.client.gcInterval=360
-Dsun.rmi.dgc.server.gcInterval=360 \
  -Droot.dir=$AC_ROOT

export CATALINA_OPTS=-server -Xms${MIN_JVM_HEAP}m -Xmx${MAX_JVM_HEAP}m 
\
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=50
-XX:-UseGCOverheadLimit

I also see following from GC log to coincide with commit slowness,

40387.691: [GC 40387.691: [ParNew (promotion failed):
132131K-149120K(149120K), 186.3768727 secs]40574.068: [CMSbailing out
to foreground collection
40736.670: [CMS-concurrent-mark: 168.574/356.749 secs] [Times:
user=276.41 sys=1192.51, real=356.77 secs]
 (concurrent mode failure): 6116976K-5908559K(6121088K), 174.0819842
secs] 6229178K-5908559K(6270208K), 360.4589949 secs] [Times:
user=267.90 sys=1185.49, real=360.48 secs]
40748.155: [GC [1 CMS-initial-mark: 5908559K(6121088K)]
5910029K(6270208K), 0.0014832 secs] [Times: user=0.00 sys=0.00,
real=0.00 secs]
40748.156: [CMS-concurrent-mark-start]
40748.513: [GC 40748.513: [ParNew: 127872K-21248K(149120K), 0.7482810
secs] 6036431K-6050277K(6270208K), 0.7483775 secs] [Times: user=1.66
sys=0.71, real=0.75 secs]
40749.613: [GC 40749.613: [ParNew: 149120K-149120K(149120K),
0.227 secs]40749.613: [CMS40784.961: [CMS-concurrent-mark:
36.055/36.805 secs] [Times: user=20.74 sys=31.41, real=36.81 secs]
 (concurrent mode failure): 6029029K-4899386K(6121088K), 44.2068275
secs] 6178149K-4899386K(6270208K), 44.2069457 secs] [Times:
user=26.05 sys=30.21, real=44.21 secs]

Few questions,

1) Should I lower the merge factor even more? Low merge factor seems
to cause more frequent commit pauses.
2)  Do I need more RAM to maintain large indexes?
3) Should I not have any core bigger than 20G?
4) Any other configuration (Solr or JVM) that can help with this?
5) Does search has to wait until commit completes? Right now the
search doesn't return while the commit is happening.

We are using Solr 1.4 (nightly build from 3/29/09).

Thanks,
-vivek

On Wed, Apr 15, 2009 at 11:41 AM, Mark Miller markrmil...@gmail.com wrote:
 vivek sar wrote:

 Hi,

  I've index where I commit every 50K records (using Solrj). Usually
 this commit takes 20sec to complete, but every now and then the commit
 takes way too long - from 10 min to 30 min. I see more delays as the
 index size continues to grow - once it gets over 5G I start seeing
 long commit cycles more frequently. See this for ex.,

 Apr 15, 2009 12:04:13 AM org.apache.solr.update.DirectUpdateHandler2
 commit
 INFO: start commit(optimize=false,waitFlush=false,waitSearcher=false)
 Apr 15, 2009 12:39:58 AM org.apache.solr.core.SolrDeletionPolicy onCommit
 INFO: SolrDeletionPolicy.onCommit: commits:num=2

  commit{dir=/Users/vivek/demo/afterchat/solr/multicore/20090414_1/data/index,segFN=segments_fq,version=1239747075391,generation=566,filenames=[_19m.cfs,
 _jm.cfs, _1bk.cfs, _193.cfx, _19z.cfs, _1b8.cfs, _1bf.cfs, _10g.cfs, _
 2s.cfs, _1bf.cfx, _18x.cfx, _19c.cfx, _193.cfs, _18x.cfs, _1b7.cfs,
 _1aw.cfs, _1aq.cfs, _1bi.cfx, _1a6.cfs, _19l.cfs, _1ad.cfs, _1a6.cfx,
 _1as.cfs, _19l.cfx, _1aa.cfs, _1an.cfs, _19d.cfs, _1a3.cfx, _1a3.cfs,
 _19g.cfs, _b7.cfs, _19
 e.cfs, _19b.cfs, _1ab.cfs, _1b3.cfx, _19j.cfs, _190.cfs, _uu.cfs,
 _1b3.cfs, _1ak.cfs, _19p.cfs, _195.cfs, _194.cfs, _19i.cfx, _199.cfs,
 _19i.cfs, _19o.cfx, _196.cfs, _199.cfx, _196.cfx, _19o.cfs, _190.cfx,
 _xn.cfs, _1b0.cfx, _1at.
 cfs, _1av.cfs, _1ao.cfs, _1a9.cfx, _1b0.cfs, _5l.cfs, _1ao.cfx,
 _1ap.cfs, _1b6.cfx, _19a.cfs, _139.cfs, _1a1.cfs, _s1.cfs, _1b6.cfs,
 _1a9.cfs, _197.cfs, _1bd.cfs, _19n.cfs, _1au.cfx, _1au.cfs, _1a5.cfs,
 _1be.cfs, segments_fq, _1b4.cfs, _gt.cfs, _1ag.cfs, _18z.cfs,
 _162.cfs, _1a4.cfs, _198.cfs, _19x.cfs, _1ah.cfs, _1ai.cfs, _19q.cfs,
 _1a7.cfs, _1ae.cfs, _19h.cfs, _19x.cfx, _1a2.cfs, _1bj.cfs, _1bb.cfs,
 _1b1.cfs, _1ai.cfx, _19r.cfs, _18y.cfs, _19u.cfx, _1a8.
 cfs, _19u.cfs, _1aj.cfs, _19r.cfx, _1ac.cfs, _1az.cfs, _1ac.cfx,
 _19y.cfs, _1bc.cfx, _19s.cfs, _1ar.cfs, _1al.cfx, _1bg.cfs, _18v.cfs,
 _1ar.cfx, _1bc.cfs, _1a0.cfx, _1b2.cfs, _1af.cfs, _1bi.cfs, _1af.cfx,
 _19f.cfs, _1a0.cfs, _1bh.cfs, _19f.cfx, _19c.cfs, _e0.cfs, _1ax.cfx,
 _1b5.cfs, _191.cfs, _18w.cfs, _19t.cfs,

Re: how to manually add data to indexes generated by nutch-1.0 using solr

2009-05-13 Thread Erik Hatcher

Try a search for *:* and see if you get results for that.  If so, you  
have your documents indexed, but you need to dig into things like  
query parser configuration and analysis to see why things aren't  
matching.  Perhaps you're not querying the field you think you are?


Erik

On May 13, 2009, at 1:15 PM, alx...@aim.com wrote:



I forget to say that when I do

curl http://localhost:8983/solr/update -H Content-Type: text/xml -- 
data-binary 'commit waitFlush=false waitSearcher=false/'

?xml version=1.0 encoding=UTF-8?
response
lst name=responseHeaderint name=status0/intint  
name=QTime453/int/lst

/response


and search for added keywords gives 0 results. Does status 0 mean  
that addition was successful?


Thanks.
Alex.







-Original Message-
From: Erik Hatcher e...@ehatchersolutions.com
To: solr-user@lucene.apache.org
Sent: Tue, 12 May 2009 6:48 pm
Subject: Re: how to manually add data to indexes generated by  
nutch-1.0 using solr










send a commit/ request afterwards, or you can add ?commit=true to  
the /update request with the adds.?

?

?  Erik?
?

On May 12, 2009, at 8:57 PM, alx...@aim.com wrote:?
?


?



Tried to add a new record using?



?



?



?


curl http://localhost:8983/solr/update -H Content-Type: text/xml  
-- data-binary 'add?



doc boost=2.5?



field name=segment20090512170318/field?



field name=digest86937aaee8e748ac3007ed8b66477624/field?



field name=boost0.21189615/field?



field name=urltest.com/field?



field name=titletest test/field?



field name=tstamp 20090513003210909/field?



/doc /add'?



?



I get?



?



?xml version=1.0 encoding=UTF-8??



response?


lst name=responseHeaderint name=status0/intint   
name=QTime71/int/lst?



/response?



?



?



and added records are not found in the search.?



?



Any ideas what went wrong??



?



?



Thanks.?



Alex.?



?



?



?



?



-Original Message-?



From: alx...@aim.com?



To: solr-u...@lucene.apache.org?



Sent: Mon, 11 May 2009 12:14 pm?


Subject: how to manually add data to indexes generated by nutch-1.0  
 using solr?



?



?



?



?



?



?



?



?



?



?



Hello,?



?


I had? Nutch -1.0 to crawl fetch and index a lot of files. Then I   
needed to??



?


index a few files also. But I know keywords for those files and  
their??


locations. I need to add them manually. I took a look to two   
tutorials on the?



wiki, but did not find any info about this issue.?


Is there a tutorial on, step by step procedure of adding data to?   
nutch index?



using solr? manually??



?



Thanks in advance.?



Alex.?



?



?



?



?



?

?

Re: Selective Searches Based on User Identity

2009-05-13 Thread Michael Ludwig


Hi Terence,

Terence Gannon schrieb:

Yes, the ownerUid will likely be assigned once and never changed.  But
you still need it, in order to keep track of who has contributed which
document.


Yes, of course!


I've been going over some of the simpler query scenarios, and Solr is
capable of handling them without having to resort to an external
RDBMS.


The database is only to store grants - it's not to help with searching.
It would look like this:

  grantee| grant
  ---+--
  fritz  | fred,frank,egon
  frank  | egon,fritz
  egon   | terence,frank
  ...

Each user is granted to access to his own documents and to those he
had received grants for.


In order to limit documents to those which a given user owns,
or those to which he has been granted access, the syntax fragment
would be something like;

ownerUid:ab2734 or grantedUid:ab2734


I think it could be:

  ownerUid:egon OR ownerUid:terence OR ownerUid:frank

No need to embed grants in the document.

Ah, I see my mistake now. You want grants based on the document, not on
the user - I had overlooked that fact. That makes my suggestion invalid.


I'll plead ignorance of the 'ineluctable filter query' and will have
to read up on that one.


I meant a filter query that the application tags onto the query on
behalf of the user and without the user being able to do anything about
it so he cannot circumvent the filter.

Best regards,

Michael Ludwig

Solr memory requirements?

Hi,

  I'm pretty sure this has been asked before, but I couldn't find a
complete answer in the forum archive. Here are my questions,

1) When solr starts up what does it loads up in the memory? Let's say
I've 4 cores with each core 50G in size. When Solr comes up how much
of it would be loaded in memory?

2) How much memory is required during index time? If I'm committing
50K records at a time (1 record = 1KB) using solrj, how much memory do
I need to give to Solr.

3) Is there a minimum memory requirement by Solr to maintain a certain
size index? Is there any benchmark on this?

Here are some of my configuration from solrconfig.xml,

1) ramBufferSizeMB64/ramBufferSizeMB
2) All the caches (under query tag) are commented out
3) Few others,
  a)  enableLazyFieldLoadingtrue/enableLazyFieldLoading==
would this require memory?
  b)  queryResultWindowSize50/queryResultWindowSize
  c) queryResultMaxDocsCached200/queryResultMaxDocsCached
  d) HashDocSet maxSize=3000 loadFactor=0.75/
  e) useColdSearcherfalse/useColdSearcher
  f)  maxWarmingSearchers2/maxWarmingSearchers

The problem we are having is following,

I've given Solr RAM of 6G. As the total index size (all cores
combined) start growing the Solr memory consumption  goes up. With 800
million documents, I see Solr already taking up all the memory at
startup. After that the commits, searches everything become slow. We
will be having distributed setup with multiple Solr instances (around
8) on four boxes, but our requirement is to have each Solr instance at
least maintain around 1.5 billion documents.

We are trying to see if we can somehow reduce the Solr memory
footprint. If someone can provide a pointer on what parameters affect
memory and what effects it has we can then decide whether we want that
parameter or not. I'm not sure if there is any minimum Solr
requirement for it to be able maintain large indexes. I've used Lucene
before and that didn't require anything by default - it used up memory
only during index and search times - not otherwise.

Any help is very much appreciated.

Thanks,
-vivek

Re: Solr memory requirements?


Hi,
Some answers:
1) .tii files in the Lucene index.  When you sort, all distinct values for the 
field(s) used for sorting.  Similarly for facet fields.  Solr caches.
2) ramBufferSizeMB dictates, more or less, how much Lucene/Solr will consume 
during indexing.  There is no need to commit every 50K docs unless you want to 
trigger snapshot creation.
3) see 1) above

1.5 billion docs per instance where each doc is cca 1KB?  I doubt that's going 
to fly. :)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: vivek sar vivex...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, May 13, 2009 3:04:46 PM
 Subject: Solr memory requirements?
 
 Hi,
 
   I'm pretty sure this has been asked before, but I couldn't find a
 complete answer in the forum archive. Here are my questions,
 
 1) When solr starts up what does it loads up in the memory? Let's say
 I've 4 cores with each core 50G in size. When Solr comes up how much
 of it would be loaded in memory?
 
 2) How much memory is required during index time? If I'm committing
 50K records at a time (1 record = 1KB) using solrj, how much memory do
 I need to give to Solr.
 
 3) Is there a minimum memory requirement by Solr to maintain a certain
 size index? Is there any benchmark on this?
 
 Here are some of my configuration from solrconfig.xml,
 
 1) 64
 2) All the caches (under query tag) are commented out
 3) Few others,
   a)  true==
 would this require memory?
   b)  50
   c) 200
   d) 
   e) false
   f)  2
 
 The problem we are having is following,
 
 I've given Solr RAM of 6G. As the total index size (all cores
 combined) start growing the Solr memory consumption  goes up. With 800
 million documents, I see Solr already taking up all the memory at
 startup. After that the commits, searches everything become slow. We
 will be having distributed setup with multiple Solr instances (around
 8) on four boxes, but our requirement is to have each Solr instance at
 least maintain around 1.5 billion documents.
 
 We are trying to see if we can somehow reduce the Solr memory
 footprint. If someone can provide a pointer on what parameters affect
 memory and what effects it has we can then decide whether we want that
 parameter or not. I'm not sure if there is any minimum Solr
 requirement for it to be able maintain large indexes. I've used Lucene
 before and that didn't require anything by default - it used up memory
 only during index and search times - not otherwise.
 
 Any help is very much appreciated.
 
 Thanks,
 -vivek

Re: Replication master+slave


This looks nice and simple.  I don't know enough about this stuff to see any 
issues.  If there are no issues.?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Bryan Talbot btal...@aeriagames.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, May 13, 2009 11:26:41 AM
 Subject: Re: Replication master+slave
 
 I see that Nobel's final comment in SOLR-1154 is that config files need to be 
 able to include snippets from external files.  In my limited testing, a 
 simple 
 patch to enable XInclude support seems to work.
 
 
 
 --- src/java/org/apache/solr/core/Config.java   (revision 774137)
 +++ src/java/org/apache/solr/core/Config.java   (working copy)
 @@ -100,8 +100,10 @@
   if (lis == null) {
 lis = loader.openConfig(name);
   }
 -  javax.xml.parsers.DocumentBuilder builder = 
 DocumentBuilderFactory.newInstance().newDocumentBuilder();
 -  doc = builder.parse(lis);
 +  javax.xml.parsers.DocumentBuilderFactory dbf = 
 DocumentBuilderFactory.newInstance();
 +  dbf.setNamespaceAware(true);
 +  dbf.setXIncludeAware(true);
 +  doc = dbf.newDocumentBuilder().parse(lis);
 
 DOMUtil.substituteProperties(doc, loader.getCoreProperties());
 } catch (ParserConfigurationException e)  {
 
 
 
 This allows a clause like this to include the contents of replication.xml if 
 it 
 exists.  If it's not found an exception will be thrown.
 
 
 
 href=http://localhost:8983/solr/corename/admin/file/?file=replication.xml;
  xmlns:xi=http://www.w3.org/2001/XInclude;
 
 
 
 If the file is optional and no exception should be thrown if the file is 
 missing, simply include a fallback action: in this case the fallback is empty 
 and does nothing.
 
 
 
 href=http://localhost:8983/solr/forum_en/admin/file/?file=replication.xml;
  xmlns:xi=http://www.w3.org/2001/XInclude;
 
 
 
 
 -Bryan
 
 
 
 
 On May 12, 2009, at May 12, 8:05 PM, Jian Han Guo wrote:
 
  I was looking at the same problem, and had a discussion with Noble. You can
  use a hack to achieve what you want, see
  
  https://issues.apache.org/jira/browse/SOLR-1154
  
  Thanks,
  
  Jianhan
  
  
  On Tue, May 12, 2009 at 5:13 PM, Bryan Talbot wrote:
  
  So how are people managing solrconfig.xml files which are largely the same
  other than differences for replication?
  
  I don't think it's a good thing to maintain two copies of the same file
  and I'd like to avoid that.  Maybe enabling the XInclude feature in
  DocumentBuilders would make it possible to modularize configuration files 
  to
  make this possible?
  
  
  
 http://java.sun.com/j2se/1.5.0/docs/api/javax/xml/parsers/DocumentBuilderFactory.html#setXIncludeAware(boolean)
  
  
  -Bryan
  
  
  
  
  
  On May 12, 2009, at May 12, 11:43 AM, Shalin Shekhar Mangar wrote:
  
  On Tue, May 12, 2009 at 10:42 PM, Bryan Talbot 
  wrote:
  
  For replication in 1.4, the wiki at
  http://wiki.apache.org/solr/SolrReplication says that a node can be both
  the master and a slave:
  
  A node can act as both master and slave. In that case both the master and
  slave configuration lists need to be present inside the
  ReplicationHandler
  requestHandler in the solrconfig.xml.
  
  What does this mean?  Does the core then poll itself for updates?
  
  
  
  No. This type of configuration is meant for repeaters. Suppose there are
  slaves in multiple data-centers (say data center A and B). There is always
  a
  single master (say in A). One of the slaves in B is used as a master for
  the
  other slaves in B. Therefore, this one slave in B is both a master as well
  as the slave.
  
  
  
  I'd like to have a single set of configuration files that are shared by
  masters and slaves and avoid duplicating configuration details in
  multiple
  files (one for master and one for slave) to ease management and failover.
  Is this possible?
  
  
  You wouldn't want the master to be a slave. So I guess you'd need to have
  a
  separate file. Also, it needs to be a separate file so that the slave does
  not become a master when the solrconfig.xml is replicated.
  
  
  
  When I attempt to setup a multi server master-slave configuration and
  include both master and slave replication configuration options, I into
  some
  problems.  I'm  running a nightly build from May 7.
  
  
  Not sure what happened. Is that the url for this solr (meaning same solr
  url
  is master and slave of itself)? If yes, that is not a valid configuration.
  
  --
  Regards,
  Shalin Shekhar Mangar.

Re: Replication master+slave

2009-05-13 Thread Peter Wolanin

Indeed - that looks nice - having some kind of conditional includes
would make many things easier.

-Peter

On Wed, May 13, 2009 at 4:22 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:

This looks nice and simple. I don't know enough about this stuff to see any
issues. If there are no issues.?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
From: Bryan Talbot btal...@aeriagames.com
To: solr-user@lucene.apache.org
Sent: Wednesday, May 13, 2009 11:26:41 AM
Subject: Re: Replication master+slave

I see that Nobel's final comment in SOLR-1154 is that config files need to be
able to include snippets from external files. In my limited testing, a
simple
patch to enable XInclude support seems to work.

--- src/java/org/apache/solr/core/Config.java (revision 774137)
+++ src/java/org/apache/solr/core/Config.java (working copy)
@@ -100,8 +100,10 @@
if (lis == null) {
lis = loader.openConfig(name);
}
- javax.xml.parsers.DocumentBuilder builder =
DocumentBuilderFactory.newInstance().newDocumentBuilder();
- doc = builder.parse(lis);
+ javax.xml.parsers.DocumentBuilderFactory dbf =
DocumentBuilderFactory.newInstance();
+ dbf.setNamespaceAware(true);
+ dbf.setXIncludeAware(true);
+ doc = dbf.newDocumentBuilder().parse(lis);

DOMUtil.substituteProperties(doc, loader.getCoreProperties());
} catch (ParserConfigurationException e) {

This allows a clause like this to include the contents of replication.xml if
it
exists. If it's not found an exception will be thrown.

href=http://localhost:8983/solr/corename/admin/file/?file=replication.xml;
xmlns:xi=http://www.w3.org/2001/XInclude;

If the file is optional and no exception should be thrown if the file is
missing, simply include a fallback action: in this case the fallback is empty
and does nothing.

href=http://localhost:8983/solr/forum_en/admin/file/?file=replication.xml;
xmlns:xi=http://www.w3.org/2001/XInclude;

-Bryan

On May 12, 2009, at May 12, 8:05 PM, Jian Han Guo wrote:

I was looking at the same problem, and had a discussion with Noble. You can
use a hack to achieve what you want, see

https://issues.apache.org/jira/browse/SOLR-1154

Thanks,

Jianhan

On Tue, May 12, 2009 at 5:13 PM, Bryan Talbot wrote:

So how are people managing solrconfig.xml files which are largely the same
other than differences for replication?

I don't think it's a good thing to maintain two copies of the same file
and I'd like to avoid that. Maybe enabling the XInclude feature in
DocumentBuilders would make it possible to modularize configuration files
to
make this possible?

http://java.sun.com/j2se/1.5.0/docs/api/javax/xml/parsers/DocumentBuilderFactory.html#setXIncludeAware(boolean)

-Bryan

On May 12, 2009, at May 12, 11:43 AM, Shalin Shekhar Mangar wrote:

On Tue, May 12, 2009 at 10:42 PM, Bryan Talbot
wrote:

For replication in 1.4, the wiki at
http://wiki.apache.org/solr/SolrReplication says that a node can be both
the master and a slave:

A node can act as both master and slave. In that case both the master
and
slave configuration lists need to be present inside the
ReplicationHandler
requestHandler in the solrconfig.xml.

What does this mean? Does the core then poll itself for updates?

No. This type of configuration is meant for repeaters. Suppose there
are
slaves in multiple data-centers (say data center A and B). There is
always
a
single master (say in A). One of the slaves in B is used as a master for
the
other slaves in B. Therefore, this one slave in B is both a master as
well
as the slave.

I'd like to have a single set of configuration files that are shared by
masters and slaves and avoid duplicating configuration details in
multiple
files (one for master and one for slave) to ease management and
failover.
Is this possible?

You wouldn't want the master to be a slave. So I guess you'd need to have
a
separate file. Also, it needs to be a separate file so that the slave
does
not become a master when the solrconfig.xml is replicated.

When I attempt to setup a multi server master-slave configuration and
include both master and slave replication configuration options, I into
some
problems. I'm running a nightly build from May 7.

Not sure what happened. Is that the url for this solr (meaning same solr
url
is master and slave of itself)? If yes, that is not a valid
configuration.

--
Regards,
Shalin Shekhar Mangar.

--
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia. Inc.
peter.wola...@acquia.com

Re: Solr memory requirements?

Thanks Otis.

Our use case doesn't require any sorting or faceting. I'm wondering if
I've configured anything wrong.

I got total of 25 fields (15 are indexed and stored, other 10 are just
stored). All my fields are basic data type - which I thought are not
sorted. My id field is unique key.

Is there any field here that might be getting sorted?

 field name=id type=long indexed=true stored=true
required=true omitNorms=true compressed=false/

   field name=atmps type=integer indexed=false stored=true
compressed=false/
   field name=bcid type=string indexed=true stored=true
omitNorms=true compressed=false/
   field name=cmpcd type=string indexed=true stored=true
omitNorms=true compressed=false/
   field name=ctry type=string indexed=true stored=true
omitNorms=true compressed=false/
   field name=dlt type=date indexed=false stored=true
default=NOW/HOUR  compressed=false/
   field name=dmn type=string indexed=true stored=true
omitNorms=true compressed=false/
   field name=eaddr type=string indexed=true stored=true
omitNorms=true compressed=false/
   field name=emsg type=string indexed=false stored=true
compressed=false/
   field name=erc type=string indexed=false stored=true
compressed=false/
   field name=evt type=string indexed=true stored=true
omitNorms=true compressed=false/
   field name=from type=string indexed=true stored=true
omitNorms=true compressed=false/
   field name=lfid type=string indexed=true stored=true
omitNorms=true compressed=false/
   field name=lsid type=string indexed=true stored=true
omitNorms=true compressed=false/
   field name=prsid type=string indexed=true stored=true
omitNorms=true compressed=false/
   field name=rc type=string indexed=false stored=true
compressed=false/
   field name=rmcd type=string indexed=false stored=true
compressed=false/
   field name=rmscd type=string indexed=false stored=true
compressed=false/
   field name=scd type=string indexed=true stored=true
omitNorms=true compressed=false/
   field name=sip type=string indexed=false stored=true
compressed=false/
   field name=ts type=date indexed=true stored=false
default=NOW/HOUR omitNorms=true/


   !-- catchall field, containing all other searchable text fields (implemented
via copyField further on in this schema  --
   field name=all type=text_ws indexed=true stored=false
omitNorms=true multiValued=true/

Thanks,
-vivek

On Wed, May 13, 2009 at 1:10 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:

 Hi,
 Some answers:
 1) .tii files in the Lucene index.  When you sort, all distinct values for 
 the field(s) used for sorting.  Similarly for facet fields.  Solr caches.
 2) ramBufferSizeMB dictates, more or less, how much Lucene/Solr will consume 
 during indexing.  There is no need to commit every 50K docs unless you want 
 to trigger snapshot creation.
 3) see 1) above

 1.5 billion docs per instance where each doc is cca 1KB?  I doubt that's 
 going to fly. :)

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
 From: vivek sar vivex...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, May 13, 2009 3:04:46 PM
 Subject: Solr memory requirements?

 Hi,

   I'm pretty sure this has been asked before, but I couldn't find a
 complete answer in the forum archive. Here are my questions,

 1) When solr starts up what does it loads up in the memory? Let's say
 I've 4 cores with each core 50G in size. When Solr comes up how much
 of it would be loaded in memory?

 2) How much memory is required during index time? If I'm committing
 50K records at a time (1 record = 1KB) using solrj, how much memory do
 I need to give to Solr.

 3) Is there a minimum memory requirement by Solr to maintain a certain
 size index? Is there any benchmark on this?

 Here are some of my configuration from solrconfig.xml,

 1) 64
 2) All the caches (under query tag) are commented out
 3) Few others,
       a)  true    ==
 would this require memory?
       b)  50
       c) 200
       d)
       e) false
       f)  2

 The problem we are having is following,

 I've given Solr RAM of 6G. As the total index size (all cores
 combined) start growing the Solr memory consumption  goes up. With 800
 million documents, I see Solr already taking up all the memory at
 startup. After that the commits, searches everything become slow. We
 will be having distributed setup with multiple Solr instances (around
 8) on four boxes, but our requirement is to have each Solr instance at
 least maintain around 1.5 billion documents.

 We are trying to see if we can somehow reduce the Solr memory
 footprint. If someone can provide a pointer on what parameters affect
 memory and what effects it has we can then decide whether we want that
 parameter or not. I'm not sure if there is any minimum Solr
 requirement for it to be able maintain large indexes. I've used Lucene
 before and that didn't require anything by default - it used up memory
 only during index and search times - not otherwise.

 Any

Re: Solr memory requirements?


Hi,

Sorting is triggered by the sort parameter in the URL, not a characteristic of 
a field. :)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: vivek sar vivex...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, May 13, 2009 4:42:16 PM
 Subject: Re: Solr memory requirements?
 
 Thanks Otis.
 
 Our use case doesn't require any sorting or faceting. I'm wondering if
 I've configured anything wrong.
 
 I got total of 25 fields (15 are indexed and stored, other 10 are just
 stored). All my fields are basic data type - which I thought are not
 sorted. My id field is unique key.
 
 Is there any field here that might be getting sorted?
 
 
 required=true omitNorms=true compressed=false/
 
   
 compressed=false/
   
 omitNorms=true compressed=false/
   
 omitNorms=true compressed=false/
   
 omitNorms=true compressed=false/
   
 default=NOW/HOUR  compressed=false/
   
 omitNorms=true compressed=false/
   
 omitNorms=true compressed=false/
   
 compressed=false/
   
 compressed=false/
   
 omitNorms=true compressed=false/
   
 omitNorms=true compressed=false/
   
 omitNorms=true compressed=false/
   
 omitNorms=true compressed=false/
   
 omitNorms=true compressed=false/
   
 compressed=false/
   
 compressed=false/
   
 compressed=false/
   
 omitNorms=true compressed=false/
   
 compressed=false/
   
 default=NOW/HOUR omitNorms=true/
 
 
   
   
 omitNorms=true multiValued=true/
 
 Thanks,
 -vivek
 
 On Wed, May 13, 2009 at 1:10 PM, Otis Gospodnetic
 wrote:
 
  Hi,
  Some answers:
  1) .tii files in the Lucene index.  When you sort, all distinct values for 
  the 
 field(s) used for sorting.  Similarly for facet fields.  Solr caches.
  2) ramBufferSizeMB dictates, more or less, how much Lucene/Solr will 
  consume 
 during indexing.  There is no need to commit every 50K docs unless you want 
 to 
 trigger snapshot creation.
  3) see 1) above
 
  1.5 billion docs per instance where each doc is cca 1KB?  I doubt that's 
  going 
 to fly. :)
 
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
  From: vivek sar 
  To: solr-user@lucene.apache.org
  Sent: Wednesday, May 13, 2009 3:04:46 PM
  Subject: Solr memory requirements?
 
  Hi,
 
I'm pretty sure this has been asked before, but I couldn't find a
  complete answer in the forum archive. Here are my questions,
 
  1) When solr starts up what does it loads up in the memory? Let's say
  I've 4 cores with each core 50G in size. When Solr comes up how much
  of it would be loaded in memory?
 
  2) How much memory is required during index time? If I'm committing
  50K records at a time (1 record = 1KB) using solrj, how much memory do
  I need to give to Solr.
 
  3) Is there a minimum memory requirement by Solr to maintain a certain
  size index? Is there any benchmark on this?
 
  Here are some of my configuration from solrconfig.xml,
 
  1) 64
  2) All the caches (under query tag) are commented out
  3) Few others,
a)  true==
  would this require memory?
b)  50
c) 200
d)
e) false
f)  2
 
  The problem we are having is following,
 
  I've given Solr RAM of 6G. As the total index size (all cores
  combined) start growing the Solr memory consumption  goes up. With 800
  million documents, I see Solr already taking up all the memory at
  startup. After that the commits, searches everything become slow. We
  will be having distributed setup with multiple Solr instances (around
  8) on four boxes, but our requirement is to have each Solr instance at
  least maintain around 1.5 billion documents.
 
  We are trying to see if we can somehow reduce the Solr memory
  footprint. If someone can provide a pointer on what parameters affect
  memory and what effects it has we can then decide whether we want that
  parameter or not. I'm not sure if there is any minimum Solr
  requirement for it to be able maintain large indexes. I've used Lucene
  before and that didn't require anything by default - it used up memory
  only during index and search times - not otherwise.
 
  Any help is very much appreciated.
 
  Thanks,
  -vivek

Re: Solr memory requirements?

Otis,

In that case, I'm not sure why Solr is taking up so much memory as
soon as we start it up. I checked for .tii file and there is only one,

-rw-r--r--  1 search  staff  20306 May 11 21:47 ./20090510_1/data/index/_3au.tii

I have all the cache disabled - so that shouldn't be a problem too. My
ramBuffer size is only 64MB.

I read note on sorting,
http://wiki.apache.org/solr/SchemaDesign?highlight=(sort), and see
something related to FieldCache. I don't see this as parameter defined
in either solrconfig.xml or schema.xml. Could this be something that
can load things in memory at startup? How can we disable it?

I'm trying to find out if there is a way to tell how much memory Solr
would consume and way to cap it.

Thanks,
-vivek




On Wed, May 13, 2009 at 1:50 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:

 Hi,

 Sorting is triggered by the sort parameter in the URL, not a characteristic 
 of a field. :)

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
 From: vivek sar vivex...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, May 13, 2009 4:42:16 PM
 Subject: Re: Solr memory requirements?

 Thanks Otis.

 Our use case doesn't require any sorting or faceting. I'm wondering if
 I've configured anything wrong.

 I got total of 25 fields (15 are indexed and stored, other 10 are just
 stored). All my fields are basic data type - which I thought are not
 sorted. My id field is unique key.

 Is there any field here that might be getting sorted?


 required=true omitNorms=true compressed=false/


 compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 default=NOW/HOUR  compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 compressed=false/

 compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 compressed=false/

 compressed=false/

 compressed=false/

 omitNorms=true compressed=false/

 compressed=false/

 default=NOW/HOUR omitNorms=true/




 omitNorms=true multiValued=true/

 Thanks,
 -vivek

 On Wed, May 13, 2009 at 1:10 PM, Otis Gospodnetic
 wrote:
 
  Hi,
  Some answers:
  1) .tii files in the Lucene index.  When you sort, all distinct values for 
  the
 field(s) used for sorting.  Similarly for facet fields.  Solr caches.
  2) ramBufferSizeMB dictates, more or less, how much Lucene/Solr will 
  consume
 during indexing.  There is no need to commit every 50K docs unless you want 
 to
 trigger snapshot creation.
  3) see 1) above
 
  1.5 billion docs per instance where each doc is cca 1KB?  I doubt that's 
  going
 to fly. :)
 
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
  From: vivek sar
  To: solr-user@lucene.apache.org
  Sent: Wednesday, May 13, 2009 3:04:46 PM
  Subject: Solr memory requirements?
 
  Hi,
 
    I'm pretty sure this has been asked before, but I couldn't find a
  complete answer in the forum archive. Here are my questions,
 
  1) When solr starts up what does it loads up in the memory? Let's say
  I've 4 cores with each core 50G in size. When Solr comes up how much
  of it would be loaded in memory?
 
  2) How much memory is required during index time? If I'm committing
  50K records at a time (1 record = 1KB) using solrj, how much memory do
  I need to give to Solr.
 
  3) Is there a minimum memory requirement by Solr to maintain a certain
  size index? Is there any benchmark on this?
 
  Here are some of my configuration from solrconfig.xml,
 
  1) 64
  2) All the caches (under query tag) are commented out
  3) Few others,
        a)  true    ==
  would this require memory?
        b)  50
        c) 200
        d)
        e) false
        f)  2
 
  The problem we are having is following,
 
  I've given Solr RAM of 6G. As the total index size (all cores
  combined) start growing the Solr memory consumption  goes up. With 800
  million documents, I see Solr already taking up all the memory at
  startup. After that the commits, searches everything become slow. We
  will be having distributed setup with multiple Solr instances (around
  8) on four boxes, but our requirement is to have each Solr instance at
  least maintain around 1.5 billion documents.
 
  We are trying to see if we can somehow reduce the Solr memory
  footprint. If someone can provide a pointer on what parameters affect
  memory and what effects it has we can then decide whether we want that
  parameter or not. I'm not sure if there is any minimum Solr
  requirement for it to be able maintain large indexes. I've used Lucene
  before and that didn't require anything by default - it used up memory
  only during index and search times - not otherwise.
 
  Any help is very much appreciated.
 
  Thanks,
  -vivek

SOLR date boost

2009-05-13 Thread Jack Godwin

With solr 1.3 I'm having a problem boosting new documents to the top.  I
used the recommended BoostFunction  recip(rord(created_at),1,1000,1000)
but older documents, sometimes 5 years old, make it to the top 3 documents.
 I've started using ord(created_at)^0.0005 and get better results, but I
don't think I should be... From what I understand rord is descending order
and ord is ascending order, so why does this work?  Does Solr 1.3 still have
issues with date fields?
Thanks,
Jack

Re: Solr memory requirements?

Have you done any profiling to see where the hotspots are?  I realize  
that may be difficult on an index of that size, but maybe you can  
approximate on a smaller version.  Also, do you have warming queries?


You might also look into setting the termIndexInterval at the Lucene  
level.  This is not currently exposed in Solr (AFAIK), but likely  
could be added fairly easily as part of the index parameters.  http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/index/IndexWriter.html#setTermIndexInterval(int)


-Grant

On May 13, 2009, at 5:12 PM, vivek sar wrote:


Otis,

In that case, I'm not sure why Solr is taking up so much memory as
soon as we start it up. I checked for .tii file and there is only one,

-rw-r--r--  1 search  staff  20306 May 11 21:47 ./20090510_1/data/ 
index/_3au.tii


I have all the cache disabled - so that shouldn't be a problem too. My
ramBuffer size is only 64MB.

I read note on sorting,
http://wiki.apache.org/solr/SchemaDesign?highlight=(sort), and see
something related to FieldCache. I don't see this as parameter defined
in either solrconfig.xml or schema.xml. Could this be something that
can load things in memory at startup? How can we disable it?

I'm trying to find out if there is a way to tell how much memory Solr
would consume and way to cap it.

Thanks,
-vivek




On Wed, May 13, 2009 at 1:50 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:


Hi,

Sorting is triggered by the sort parameter in the URL, not a  
characteristic of a field. :)


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 

From: vivek sar vivex...@gmail.com
To: solr-user@lucene.apache.org
Sent: Wednesday, May 13, 2009 4:42:16 PM
Subject: Re: Solr memory requirements?

Thanks Otis.

Our use case doesn't require any sorting or faceting. I'm  
wondering if

I've configured anything wrong.

I got total of 25 fields (15 are indexed and stored, other 10 are  
just

stored). All my fields are basic data type - which I thought are not
sorted. My id field is unique key.

Is there any field here that might be getting sorted?


required=true omitNorms=true compressed=false/


compressed=false/

omitNorms=true compressed=false/

omitNorms=true compressed=false/

omitNorms=true compressed=false/

default=NOW/HOUR  compressed=false/

omitNorms=true compressed=false/

omitNorms=true compressed=false/

compressed=false/

compressed=false/

omitNorms=true compressed=false/

omitNorms=true compressed=false/

omitNorms=true compressed=false/

omitNorms=true compressed=false/

omitNorms=true compressed=false/

compressed=false/

compressed=false/

compressed=false/

omitNorms=true compressed=false/

compressed=false/

default=NOW/HOUR omitNorms=true/




omitNorms=true multiValued=true/

Thanks,
-vivek

On Wed, May 13, 2009 at 1:10 PM, Otis Gospodnetic
wrote:


Hi,
Some answers:
1) .tii files in the Lucene index.  When you sort, all distinct  
values for the
field(s) used for sorting.  Similarly for facet fields.  Solr  
caches.
2) ramBufferSizeMB dictates, more or less, how much Lucene/Solr  
will consume
during indexing.  There is no need to commit every 50K docs unless  
you want to

trigger snapshot creation.

3) see 1) above

1.5 billion docs per instance where each doc is cca 1KB?  I doubt  
that's going

to fly. :)


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 

From: vivek sar
To: solr-user@lucene.apache.org
Sent: Wednesday, May 13, 2009 3:04:46 PM
Subject: Solr memory requirements?

Hi,

  I'm pretty sure this has been asked before, but I couldn't  
find a

complete answer in the forum archive. Here are my questions,

1) When solr starts up what does it loads up in the memory?  
Let's say
I've 4 cores with each core 50G in size. When Solr comes up how  
much

of it would be loaded in memory?

2) How much memory is required during index time? If I'm  
committing
50K records at a time (1 record = 1KB) using solrj, how much  
memory do

I need to give to Solr.

3) Is there a minimum memory requirement by Solr to maintain a  
certain

size index? Is there any benchmark on this?

Here are some of my configuration from solrconfig.xml,

1) 64
2) All the caches (under query tag) are commented out
3) Few others,
  a)  true==
would this require memory?
  b)  50
  c) 200
  d)
  e) false
  f)  2

The problem we are having is following,

I've given Solr RAM of 6G. As the total index size (all cores
combined) start growing the Solr memory consumption  goes up.  
With 800

million documents, I see Solr already taking up all the memory at
startup. After that the commits, searches everything become  
slow. We
will be having distributed setup with multiple Solr instances  
(around
8) on four boxes, but our requirement is to have each Solr  
instance at

least maintain around 1.5 billion documents.

We are trying to see if we can somehow reduce the Solr memory
footprint. If someone can

Re: Solr memory requirements?

Just an update on the memory issue - might be useful for others. I
read the following,

 http://wiki.apache.org/solr/SolrCaching?highlight=(SolrCaching)

and looks like the first and new searcher listeners would populate the
FieldCache. Commenting out these two listener entries seems to do the
trick - at least the heap size is not growing as soon as Solr starts
up.

I ran some searches and they all came out fine. Index rate is also
pretty good. Would there be any impact of disabling these listeners?

Thanks,
-vivek

On Wed, May 13, 2009 at 2:12 PM, vivek sar vivex...@gmail.com wrote:
 Otis,

 In that case, I'm not sure why Solr is taking up so much memory as
 soon as we start it up. I checked for .tii file and there is only one,

 -rw-r--r--  1 search  staff  20306 May 11 21:47 
 ./20090510_1/data/index/_3au.tii

 I have all the cache disabled - so that shouldn't be a problem too. My
 ramBuffer size is only 64MB.

 I read note on sorting,
 http://wiki.apache.org/solr/SchemaDesign?highlight=(sort), and see
 something related to FieldCache. I don't see this as parameter defined
 in either solrconfig.xml or schema.xml. Could this be something that
 can load things in memory at startup? How can we disable it?

 I'm trying to find out if there is a way to tell how much memory Solr
 would consume and way to cap it.

 Thanks,
 -vivek




 On Wed, May 13, 2009 at 1:50 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:

 Hi,

 Sorting is triggered by the sort parameter in the URL, not a characteristic 
 of a field. :)

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
 From: vivek sar vivex...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, May 13, 2009 4:42:16 PM
 Subject: Re: Solr memory requirements?

 Thanks Otis.

 Our use case doesn't require any sorting or faceting. I'm wondering if
 I've configured anything wrong.

 I got total of 25 fields (15 are indexed and stored, other 10 are just
 stored). All my fields are basic data type - which I thought are not
 sorted. My id field is unique key.

 Is there any field here that might be getting sorted?


 required=true omitNorms=true compressed=false/


 compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 default=NOW/HOUR  compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 compressed=false/

 compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 compressed=false/

 compressed=false/

 compressed=false/

 omitNorms=true compressed=false/

 compressed=false/

 default=NOW/HOUR omitNorms=true/




 omitNorms=true multiValued=true/

 Thanks,
 -vivek

 On Wed, May 13, 2009 at 1:10 PM, Otis Gospodnetic
 wrote:
 
  Hi,
  Some answers:
  1) .tii files in the Lucene index.  When you sort, all distinct values 
  for the
 field(s) used for sorting.  Similarly for facet fields.  Solr caches.
  2) ramBufferSizeMB dictates, more or less, how much Lucene/Solr will 
  consume
 during indexing.  There is no need to commit every 50K docs unless you want 
 to
 trigger snapshot creation.
  3) see 1) above
 
  1.5 billion docs per instance where each doc is cca 1KB?  I doubt that's 
  going
 to fly. :)
 
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
  From: vivek sar
  To: solr-user@lucene.apache.org
  Sent: Wednesday, May 13, 2009 3:04:46 PM
  Subject: Solr memory requirements?
 
  Hi,
 
    I'm pretty sure this has been asked before, but I couldn't find a
  complete answer in the forum archive. Here are my questions,
 
  1) When solr starts up what does it loads up in the memory? Let's say
  I've 4 cores with each core 50G in size. When Solr comes up how much
  of it would be loaded in memory?
 
  2) How much memory is required during index time? If I'm committing
  50K records at a time (1 record = 1KB) using solrj, how much memory do
  I need to give to Solr.
 
  3) Is there a minimum memory requirement by Solr to maintain a certain
  size index? Is there any benchmark on this?
 
  Here are some of my configuration from solrconfig.xml,
 
  1) 64
  2) All the caches (under query tag) are commented out
  3) Few others,
        a)  true    ==
  would this require memory?
        b)  50
        c) 200
        d)
        e) false
        f)  2
 
  The problem we are having is following,
 
  I've given Solr RAM of 6G. As the total index size (all cores
  combined) start growing the Solr memory consumption  goes up. With 800
  million documents, I see Solr already taking up all the memory at
  startup. After that the commits, searches everything become slow. We
  will be having distributed setup with multiple Solr instances (around
  8) on four boxes, but our requirement is to have each Solr instance at
  least

Re: Solr memory requirements?

Disabling first/new searchers did help for the initial load time, but
after 10-15 min the heap memory start climbing up again and reached
max within 20 min. Now the GC is coming up all the time, which is
slowing down the commit and search cycles.

This is still puzzling what does Solr holds in the memory and doesn't release?

I haven't been able to profile as the dump is too big. Would setting
termIndexInterval help - not sure how can that be set using Solr.

Some other query properties under solrconfig,

query
   maxBooleanClauses1024/maxBooleanClauses
   enableLazyFieldLoadingtrue/enableLazyFieldLoading
   queryResultWindowSize50/queryResultWindowSize
   queryResultMaxDocsCached200/queryResultMaxDocsCached
HashDocSet maxSize=3000 loadFactor=0.75/
   useColdSearcherfalse/useColdSearcher
   maxWarmingSearchers2/maxWarmingSearchers
 /query

Currently, I got 800 million documents and have specified 8G heap size.

Any other suggestion on what can I do to control the Solr memory consumption?

Thanks,
-vivek

On Wed, May 13, 2009 at 2:53 PM, vivek sar vivex...@gmail.com wrote:
 Just an update on the memory issue - might be useful for others. I
 read the following,

  http://wiki.apache.org/solr/SolrCaching?highlight=(SolrCaching)

 and looks like the first and new searcher listeners would populate the
 FieldCache. Commenting out these two listener entries seems to do the
 trick - at least the heap size is not growing as soon as Solr starts
 up.

 I ran some searches and they all came out fine. Index rate is also
 pretty good. Would there be any impact of disabling these listeners?

 Thanks,
 -vivek

 On Wed, May 13, 2009 at 2:12 PM, vivek sar vivex...@gmail.com wrote:
 Otis,

 In that case, I'm not sure why Solr is taking up so much memory as
 soon as we start it up. I checked for .tii file and there is only one,

 -rw-r--r--  1 search  staff  20306 May 11 21:47 
 ./20090510_1/data/index/_3au.tii

 I have all the cache disabled - so that shouldn't be a problem too. My
 ramBuffer size is only 64MB.

 I read note on sorting,
 http://wiki.apache.org/solr/SchemaDesign?highlight=(sort), and see
 something related to FieldCache. I don't see this as parameter defined
 in either solrconfig.xml or schema.xml. Could this be something that
 can load things in memory at startup? How can we disable it?

 I'm trying to find out if there is a way to tell how much memory Solr
 would consume and way to cap it.

 Thanks,
 -vivek




 On Wed, May 13, 2009 at 1:50 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:

 Hi,

 Sorting is triggered by the sort parameter in the URL, not a characteristic 
 of a field. :)

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
 From: vivek sar vivex...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, May 13, 2009 4:42:16 PM
 Subject: Re: Solr memory requirements?

 Thanks Otis.

 Our use case doesn't require any sorting or faceting. I'm wondering if
 I've configured anything wrong.

 I got total of 25 fields (15 are indexed and stored, other 10 are just
 stored). All my fields are basic data type - which I thought are not
 sorted. My id field is unique key.

 Is there any field here that might be getting sorted?


 required=true omitNorms=true compressed=false/


 compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 default=NOW/HOUR  compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 compressed=false/

 compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 omitNorms=true compressed=false/

 compressed=false/

 compressed=false/

 compressed=false/

 omitNorms=true compressed=false/

 compressed=false/

 default=NOW/HOUR omitNorms=true/




 omitNorms=true multiValued=true/

 Thanks,
 -vivek

 On Wed, May 13, 2009 at 1:10 PM, Otis Gospodnetic
 wrote:
 
  Hi,
  Some answers:
  1) .tii files in the Lucene index.  When you sort, all distinct values 
  for the
 field(s) used for sorting.  Similarly for facet fields.  Solr caches.
  2) ramBufferSizeMB dictates, more or less, how much Lucene/Solr will 
  consume
 during indexing.  There is no need to commit every 50K docs unless you 
 want to
 trigger snapshot creation.
  3) see 1) above
 
  1.5 billion docs per instance where each doc is cca 1KB?  I doubt that's 
  going
 to fly. :)
 
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
  From: vivek sar
  To: solr-user@lucene.apache.org
  Sent: Wednesday, May 13, 2009 3:04:46 PM
  Subject: Solr memory requirements?
 
  Hi,
 
    I'm pretty sure this has been asked before, but I couldn't find a
  complete answer in the forum archive. Here are my questions,
 
  1) When solr starts up what does it loads up in the memory? Let's say
  I've 4 cores with each core 50G

Re: Solr memory requirements?

2009-05-13 Thread Jack Godwin

Have you checked the maxBufferedDocs?  I had to drop mine down to 1000 with
3 million docs.
Jack

On Wed, May 13, 2009 at 6:53 PM, vivek sar vivex...@gmail.com wrote:

 Disabling first/new searchers did help for the initial load time, but
 after 10-15 min the heap memory start climbing up again and reached
 max within 20 min. Now the GC is coming up all the time, which is
 slowing down the commit and search cycles.

 This is still puzzling what does Solr holds in the memory and doesn't
 release?

 I haven't been able to profile as the dump is too big. Would setting
 termIndexInterval help - not sure how can that be set using Solr.

 Some other query properties under solrconfig,

 query
   maxBooleanClauses1024/maxBooleanClauses
   enableLazyFieldLoadingtrue/enableLazyFieldLoading
   queryResultWindowSize50/queryResultWindowSize
   queryResultMaxDocsCached200/queryResultMaxDocsCached
HashDocSet maxSize=3000 loadFactor=0.75/
   useColdSearcherfalse/useColdSearcher
   maxWarmingSearchers2/maxWarmingSearchers
  /query

 Currently, I got 800 million documents and have specified 8G heap size.

 Any other suggestion on what can I do to control the Solr memory
 consumption?

 Thanks,
 -vivek

 On Wed, May 13, 2009 at 2:53 PM, vivek sar vivex...@gmail.com wrote:
  Just an update on the memory issue - might be useful for others. I
  read the following,
 
   http://wiki.apache.org/solr/SolrCaching?highlight=(SolrCaching)
 
  and looks like the first and new searcher listeners would populate the
  FieldCache. Commenting out these two listener entries seems to do the
  trick - at least the heap size is not growing as soon as Solr starts
  up.
 
  I ran some searches and they all came out fine. Index rate is also
  pretty good. Would there be any impact of disabling these listeners?
 
  Thanks,
  -vivek
 
  On Wed, May 13, 2009 at 2:12 PM, vivek sar vivex...@gmail.com wrote:
  Otis,
 
  In that case, I'm not sure why Solr is taking up so much memory as
  soon as we start it up. I checked for .tii file and there is only one,
 
  -rw-r--r--  1 search  staff  20306 May 11 21:47
 ./20090510_1/data/index/_3au.tii
 
  I have all the cache disabled - so that shouldn't be a problem too. My
  ramBuffer size is only 64MB.
 
  I read note on sorting,
  http://wiki.apache.org/solr/SchemaDesign?highlight=(sort), and see
  something related to FieldCache. I don't see this as parameter defined
  in either solrconfig.xml or schema.xml. Could this be something that
  can load things in memory at startup? How can we disable it?
 
  I'm trying to find out if there is a way to tell how much memory Solr
  would consume and way to cap it.
 
  Thanks,
  -vivek
 
 
 
 
  On Wed, May 13, 2009 at 1:50 PM, Otis Gospodnetic
  otis_gospodne...@yahoo.com wrote:
 
  Hi,
 
  Sorting is triggered by the sort parameter in the URL, not a
 characteristic of a field. :)
 
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
  From: vivek sar vivex...@gmail.com
  To: solr-user@lucene.apache.org
  Sent: Wednesday, May 13, 2009 4:42:16 PM
  Subject: Re: Solr memory requirements?
 
  Thanks Otis.
 
  Our use case doesn't require any sorting or faceting. I'm wondering if
  I've configured anything wrong.
 
  I got total of 25 fields (15 are indexed and stored, other 10 are just
  stored). All my fields are basic data type - which I thought are not
  sorted. My id field is unique key.
 
  Is there any field here that might be getting sorted?
 
 
  required=true omitNorms=true compressed=false/
 
 
  compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  default=NOW/HOUR  compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  compressed=false/
 
  compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  compressed=false/
 
  compressed=false/
 
  compressed=false/
 
  omitNorms=true compressed=false/
 
  compressed=false/
 
  default=NOW/HOUR omitNorms=true/
 
 
 
 
  omitNorms=true multiValued=true/
 
  Thanks,
  -vivek
 
  On Wed, May 13, 2009 at 1:10 PM, Otis Gospodnetic
  wrote:
  
   Hi,
   Some answers:
   1) .tii files in the Lucene index.  When you sort, all distinct
 values for the
  field(s) used for sorting.  Similarly for facet fields.  Solr caches.
   2) ramBufferSizeMB dictates, more or less, how much Lucene/Solr will
 consume
  during indexing.  There is no need to commit every 50K docs unless you
 want to
  trigger snapshot creation.
   3) see 1) above
  
   1.5 billion docs per instance where each doc is cca 1KB?  I doubt
 that's going
  to fly. :)
  
   Otis
   --
   Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  
  
  
   - Original Message 
   From: vivek sar
   To: solr-user@lucene.apache.org

acts_as_solr patch support for Solr Cell style requests

2009-05-13 Thread Thanh Doan

Hi Erik et all,

I am following this  tutorial link
http://www.lucidimagination.com/blog/tag/acts_as_solr/

to play with acts_as_solr and see if we can invoke solr cell right
from our Rails app.

following he tutorial i created  classSolrCellRequest  but dont
know where to save the solr_cell_request.rb file to.

Should I save file solr_cell_request.rb to
/path/to/resume/vendor/plugins/acts_as_solr/lib  directory
or
I have to save it to
/path/to/resume/vendor/plugins/acts_as_solr/lib/solr/request directory
where the Solr::Request::Select class locate?

Thanks!

Thanh Doan

Re: Solr memory requirements?

I think maxBufferedDocs has been deprecated in Solr 1.4 - it's
recommended to use ramBufferSizeMB instead. My ramBufferSizeMB=64.
This shouldn't be a problem I think.

There has to be something else that Solr is holding up in memory. Anyone else?

Thanks,
-vivek

On Wed, May 13, 2009 at 4:01 PM, Jack Godwin god...@gmail.com wrote:
 Have you checked the maxBufferedDocs?  I had to drop mine down to 1000 with
 3 million docs.
 Jack

 On Wed, May 13, 2009 at 6:53 PM, vivek sar vivex...@gmail.com wrote:

 Disabling first/new searchers did help for the initial load time, but
 after 10-15 min the heap memory start climbing up again and reached
 max within 20 min. Now the GC is coming up all the time, which is
 slowing down the commit and search cycles.

 This is still puzzling what does Solr holds in the memory and doesn't
 release?

 I haven't been able to profile as the dump is too big. Would setting
 termIndexInterval help - not sure how can that be set using Solr.

 Some other query properties under solrconfig,

 query
   maxBooleanClauses1024/maxBooleanClauses
   enableLazyFieldLoadingtrue/enableLazyFieldLoading
   queryResultWindowSize50/queryResultWindowSize
   queryResultMaxDocsCached200/queryResultMaxDocsCached
    HashDocSet maxSize=3000 loadFactor=0.75/
   useColdSearcherfalse/useColdSearcher
   maxWarmingSearchers2/maxWarmingSearchers
  /query

 Currently, I got 800 million documents and have specified 8G heap size.

 Any other suggestion on what can I do to control the Solr memory
 consumption?

 Thanks,
 -vivek

 On Wed, May 13, 2009 at 2:53 PM, vivek sar vivex...@gmail.com wrote:
  Just an update on the memory issue - might be useful for others. I
  read the following,
 
   http://wiki.apache.org/solr/SolrCaching?highlight=(SolrCaching)
 
  and looks like the first and new searcher listeners would populate the
  FieldCache. Commenting out these two listener entries seems to do the
  trick - at least the heap size is not growing as soon as Solr starts
  up.
 
  I ran some searches and they all came out fine. Index rate is also
  pretty good. Would there be any impact of disabling these listeners?
 
  Thanks,
  -vivek
 
  On Wed, May 13, 2009 at 2:12 PM, vivek sar vivex...@gmail.com wrote:
  Otis,
 
  In that case, I'm not sure why Solr is taking up so much memory as
  soon as we start it up. I checked for .tii file and there is only one,
 
  -rw-r--r--  1 search  staff  20306 May 11 21:47
 ./20090510_1/data/index/_3au.tii
 
  I have all the cache disabled - so that shouldn't be a problem too. My
  ramBuffer size is only 64MB.
 
  I read note on sorting,
  http://wiki.apache.org/solr/SchemaDesign?highlight=(sort), and see
  something related to FieldCache. I don't see this as parameter defined
  in either solrconfig.xml or schema.xml. Could this be something that
  can load things in memory at startup? How can we disable it?
 
  I'm trying to find out if there is a way to tell how much memory Solr
  would consume and way to cap it.
 
  Thanks,
  -vivek
 
 
 
 
  On Wed, May 13, 2009 at 1:50 PM, Otis Gospodnetic
  otis_gospodne...@yahoo.com wrote:
 
  Hi,
 
  Sorting is triggered by the sort parameter in the URL, not a
 characteristic of a field. :)
 
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
  From: vivek sar vivex...@gmail.com
  To: solr-user@lucene.apache.org
  Sent: Wednesday, May 13, 2009 4:42:16 PM
  Subject: Re: Solr memory requirements?
 
  Thanks Otis.
 
  Our use case doesn't require any sorting or faceting. I'm wondering if
  I've configured anything wrong.
 
  I got total of 25 fields (15 are indexed and stored, other 10 are just
  stored). All my fields are basic data type - which I thought are not
  sorted. My id field is unique key.
 
  Is there any field here that might be getting sorted?
 
 
  required=true omitNorms=true compressed=false/
 
 
  compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  default=NOW/HOUR  compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  compressed=false/
 
  compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  compressed=false/
 
  compressed=false/
 
  compressed=false/
 
  omitNorms=true compressed=false/
 
  compressed=false/
 
  default=NOW/HOUR omitNorms=true/
 
 
 
 
  omitNorms=true multiValued=true/
 
  Thanks,
  -vivek
 
  On Wed, May 13, 2009 at 1:10 PM, Otis Gospodnetic
  wrote:
  
   Hi,
   Some answers:
   1) .tii files in the Lucene index.  When you sort, all distinct
 values for the
  field(s) used for sorting.  Similarly for facet fields.  Solr caches.
   2) ramBufferSizeMB dictates, more or less, how much Lucene/Solr will
 consume
  during indexing.  There is no need to commit every 50K docs unless

Re: Solr memory requirements?

2009-05-13 Thread Erick Erickson

Warning: I'm wy out of my competency range when I comment
on SOLR, but I've seen the statement that string fields are NOT
tokenized while text fields are, and I notice that almost all of your fields
are string type.

Would someone more knowledgeable than me care to comment on whether
this is at all relevant? Offered in the spirit that sometimes there are
things
so basic that only an amateur can see them G

Best
Erick

On Wed, May 13, 2009 at 4:42 PM, vivek sar vivex...@gmail.com wrote:

 Thanks Otis.

 Our use case doesn't require any sorting or faceting. I'm wondering if
 I've configured anything wrong.

 I got total of 25 fields (15 are indexed and stored, other 10 are just
 stored). All my fields are basic data type - which I thought are not
 sorted. My id field is unique key.

 Is there any field here that might be getting sorted?

  field name=id type=long indexed=true stored=true
 required=true omitNorms=true compressed=false/

   field name=atmps type=integer indexed=false stored=true
 compressed=false/
   field name=bcid type=string indexed=true stored=true
 omitNorms=true compressed=false/
   field name=cmpcd type=string indexed=true stored=true
 omitNorms=true compressed=false/
   field name=ctry type=string indexed=true stored=true
 omitNorms=true compressed=false/
   field name=dlt type=date indexed=false stored=true
 default=NOW/HOUR  compressed=false/
   field name=dmn type=string indexed=true stored=true
 omitNorms=true compressed=false/
   field name=eaddr type=string indexed=true stored=true
 omitNorms=true compressed=false/
   field name=emsg type=string indexed=false stored=true
 compressed=false/
   field name=erc type=string indexed=false stored=true
 compressed=false/
   field name=evt type=string indexed=true stored=true
 omitNorms=true compressed=false/
   field name=from type=string indexed=true stored=true
 omitNorms=true compressed=false/
   field name=lfid type=string indexed=true stored=true
 omitNorms=true compressed=false/
   field name=lsid type=string indexed=true stored=true
 omitNorms=true compressed=false/
   field name=prsid type=string indexed=true stored=true
 omitNorms=true compressed=false/
   field name=rc type=string indexed=false stored=true
 compressed=false/
   field name=rmcd type=string indexed=false stored=true
 compressed=false/
   field name=rmscd type=string indexed=false stored=true
 compressed=false/
   field name=scd type=string indexed=true stored=true
 omitNorms=true compressed=false/
   field name=sip type=string indexed=false stored=true
 compressed=false/
   field name=ts type=date indexed=true stored=false
 default=NOW/HOUR omitNorms=true/


   !-- catchall field, containing all other searchable text fields
 (implemented
via copyField further on in this schema  --
   field name=all type=text_ws indexed=true stored=false
 omitNorms=true multiValued=true/

 Thanks,
 -vivek

 On Wed, May 13, 2009 at 1:10 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:
 
  Hi,
  Some answers:
  1) .tii files in the Lucene index.  When you sort, all distinct values
 for the field(s) used for sorting.  Similarly for facet fields.  Solr
 caches.
  2) ramBufferSizeMB dictates, more or less, how much Lucene/Solr will
 consume during indexing.  There is no need to commit every 50K docs unless
 you want to trigger snapshot creation.
  3) see 1) above
 
  1.5 billion docs per instance where each doc is cca 1KB?  I doubt that's
 going to fly. :)
 
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
  From: vivek sar vivex...@gmail.com
  To: solr-user@lucene.apache.org
  Sent: Wednesday, May 13, 2009 3:04:46 PM
  Subject: Solr memory requirements?
 
  Hi,
 
I'm pretty sure this has been asked before, but I couldn't find a
  complete answer in the forum archive. Here are my questions,
 
  1) When solr starts up what does it loads up in the memory? Let's say
  I've 4 cores with each core 50G in size. When Solr comes up how much
  of it would be loaded in memory?
 
  2) How much memory is required during index time? If I'm committing
  50K records at a time (1 record = 1KB) using solrj, how much memory do
  I need to give to Solr.
 
  3) Is there a minimum memory requirement by Solr to maintain a certain
  size index? Is there any benchmark on this?
 
  Here are some of my configuration from solrconfig.xml,
 
  1) 64
  2) All the caches (under query tag) are commented out
  3) Few others,
a)  true==
  would this require memory?
b)  50
c) 200
d)
e) false
f)  2
 
  The problem we are having is following,
 
  I've given Solr RAM of 6G. As the total index size (all cores
  combined) start growing the Solr memory consumption  goes up. With 800
  million documents, I see Solr already taking up all the memory at
  startup. After that the commits, searches everything become slow. We
  will be having distributed setup with multiple Solr instances (around

Re: acts_as_solr patch support for Solr Cell style requests

2009-05-13 Thread Thanh Doan

I created Ruby class SolrCellRequest and saved it to
/path/to/resume/vendor/plugins/acts_as_solr/lib  directory.

Here is code original from the tutorial.

module ActsAsSolr
  class SolrCellRequest  Solr::Request::Select
def initialize(doc,file_name)
 .
 .
  def handler
  'update/extract'
end
  end

  class SolrCellResponse  Solr::Response::Ruby
  end

end

however when I start using it
$ script/console
Loading development environment (Rails 2.2.2)
 solr = Solr::Connection.new(http://localhost:8982/solr;)
 req = SolrCellRequest.new(Solr::Document.new(:id=1), '/path/to/resume.pdf')

I got this error

 req = SolrCellRequest.new(Solr::Document.new(:id=1), 
 '/Users/tcdoan/eric.pdf')
LoadError: Expected
/Users/tcdoan/resume/vendor/plugins/acts_as_solr/lib/solr_cell_request.rb
to define SolrCellRequest
from 
/Library/Ruby/Gems/1.8/gems/activesupport-2.3.2/lib/active_support/dependencies.rb:426:in
`load_missing_constant'
from 
/Library/Ruby/Gems/1.8/gems/activesupport-2.3.2/lib/active_support/dependencies.rb:80:in
`const_missing'
from 
/Library/Ruby/Gems/1.8/gems/activesupport-2.3.2/lib/active_support/dependencies.rb:92:in
`const_missing'
from (irb):2

Can you tell what was wrong here. Thanks.

Thanh


On Wed, May 13, 2009 at 6:11 PM, Thanh Doan tcd...@gmail.com wrote:
 Hi Erik et all,

 I am following this  tutorial link
 http://www.lucidimagination.com/blog/tag/acts_as_solr/

 to play with acts_as_solr and see if we can invoke solr cell right
 from our Rails app.

 following he tutorial i created  class    c  but dont
 know where to save the solr_cell_request.rb file to.

 Should I save file solr_cell_request.rb to
 /path/to/resume/vendor/plugins/acts_as_solr/lib  directory
 or
 I have to save it to
 /path/to/resume/vendor/plugins/acts_as_solr/lib/solr/request directory
 where the Solr::Request::Select class locate?

 Thanks!

 Thanh Doan




-- 
Regards,
Thanh Doan
713-884-0576
http://datamatter.blogspot.com/

Java Environment Problem on Vista

2009-05-13 Thread John Bennett

I'm having difficulty getting Solr running on Vista. I've got the 1.6 
JDK installed, and I've successfully compiled file and run other Java 
programs.


When I run java -jar start.jar in the Apache Solr example directory, I 
get a large number of INFO messages, including:


INFO: JNDI not configured for solr (NoInitialContextEx)

When I visit localhost:8983/solr/, I get a 404 error message:


   HTTP ERROR: 404

NOT_FOUND

RequestURI=/solr/

/Powered by jetty:// http://jetty.mortbay.org/

I've talked to a couple of engineers who suspect that the problem is 
with my Java environment. My environment is configured as follows:


CLASSPATH=.;C:\Program 
Files\Java\jdk1.6.0_13\lib\ext\QTJava.zip;C:\Users\John\Documents\Java;C:\Program 
Files\Java\jdk1.6.0_13;

JAVA_HOME=C:\Program_Files\Java\jdk1.6.0_13
Path=C:\Program Files\Snap\scripts;C:\Program 
Files\Snap;C:\Python25\Scripts;%SystemRoot%\system32;%SystemRoot%;%SystemRoot%\System32\Wbem;c:\Program 
Files\Microsoft SQL Server\90\Tools\binn\;C:\Program Files\Common 
Files\Roxio Shared\DLLShared\;C:\Program Files\Common Files\Roxio 
Shared\9.0\DLLShared\;C:\Program Files\QuickTime\QTSystem\;C:\Program 
Files\Java\jdk1.6.0_13\bin


Any ideas?

Regards,

John

Re: Solr memory requirements?


Even a simple command like this will help:

  jmap -histo:live java pid here | head -30

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: vivek sar vivex...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, May 13, 2009 6:53:29 PM
 Subject: Re: Solr memory requirements?
 
 Disabling first/new searchers did help for the initial load time, but
 after 10-15 min the heap memory start climbing up again and reached
 max within 20 min. Now the GC is coming up all the time, which is
 slowing down the commit and search cycles.
 
 This is still puzzling what does Solr holds in the memory and doesn't release?
 
 I haven't been able to profile as the dump is too big. Would setting
 termIndexInterval help - not sure how can that be set using Solr.
 
 Some other query properties under solrconfig,
 
 
   1024
   true
   50
   200
 
   false
   2
 
 
 Currently, I got 800 million documents and have specified 8G heap size.
 
 Any other suggestion on what can I do to control the Solr memory consumption?
 
 Thanks,
 -vivek
 
 On Wed, May 13, 2009 at 2:53 PM, vivek sar wrote:
  Just an update on the memory issue - might be useful for others. I
  read the following,
 
   http://wiki.apache.org/solr/SolrCaching?highlight=(SolrCaching)
 
  and looks like the first and new searcher listeners would populate the
  FieldCache. Commenting out these two listener entries seems to do the
  trick - at least the heap size is not growing as soon as Solr starts
  up.
 
  I ran some searches and they all came out fine. Index rate is also
  pretty good. Would there be any impact of disabling these listeners?
 
  Thanks,
  -vivek
 
  On Wed, May 13, 2009 at 2:12 PM, vivek sar wrote:
  Otis,
 
  In that case, I'm not sure why Solr is taking up so much memory as
  soon as we start it up. I checked for .tii file and there is only one,
 
  -rw-r--r--  1 search  staff  20306 May 11 21:47 
 ./20090510_1/data/index/_3au.tii
 
  I have all the cache disabled - so that shouldn't be a problem too. My
  ramBuffer size is only 64MB.
 
  I read note on sorting,
  http://wiki.apache.org/solr/SchemaDesign?highlight=(sort), and see
  something related to FieldCache. I don't see this as parameter defined
  in either solrconfig.xml or schema.xml. Could this be something that
  can load things in memory at startup? How can we disable it?
 
  I'm trying to find out if there is a way to tell how much memory Solr
  would consume and way to cap it.
 
  Thanks,
  -vivek
 
 
 
 
  On Wed, May 13, 2009 at 1:50 PM, Otis Gospodnetic
  wrote:
 
  Hi,
 
  Sorting is triggered by the sort parameter in the URL, not a 
  characteristic 
 of a field. :)
 
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
  From: vivek sar 
  To: solr-user@lucene.apache.org
  Sent: Wednesday, May 13, 2009 4:42:16 PM
  Subject: Re: Solr memory requirements?
 
  Thanks Otis.
 
  Our use case doesn't require any sorting or faceting. I'm wondering if
  I've configured anything wrong.
 
  I got total of 25 fields (15 are indexed and stored, other 10 are just
  stored). All my fields are basic data type - which I thought are not
  sorted. My id field is unique key.
 
  Is there any field here that might be getting sorted?
 
 
  required=true omitNorms=true compressed=false/
 
 
  compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  default=NOW/HOUR  compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  compressed=false/
 
  compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  compressed=false/
 
  compressed=false/
 
  compressed=false/
 
  omitNorms=true compressed=false/
 
  compressed=false/
 
  default=NOW/HOUR omitNorms=true/
 
 
 
 
  omitNorms=true multiValued=true/
 
  Thanks,
  -vivek
 
  On Wed, May 13, 2009 at 1:10 PM, Otis Gospodnetic
  wrote:
  
   Hi,
   Some answers:
   1) .tii files in the Lucene index.  When you sort, all distinct values 
 for the
  field(s) used for sorting.  Similarly for facet fields.  Solr caches.
   2) ramBufferSizeMB dictates, more or less, how much Lucene/Solr will 
 consume
  during indexing.  There is no need to commit every 50K docs unless you 
  want 
 to
  trigger snapshot creation.
   3) see 1) above
  
   1.5 billion docs per instance where each doc is cca 1KB?  I doubt 
   that's 
 going
  to fly. :)
  
   Otis
   --
   Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  
  
  
   - Original Message 
   From: vivek sar
   To: solr-user@lucene.apache.org
   Sent: Wednesday, May 13, 2009 3:04:46 PM
   Subject: Solr memory requirements?
  
   Hi,
  
 I'm pretty sure this has been asked before, but I couldn't find a
   complete answer in

Re: Solr memory requirements?


Yeah, I'm not sure why this would help.  There should be nothing in FieldCaches 
unless you sort or use facets.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: vivek sar vivex...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, May 13, 2009 5:53:45 PM
 Subject: Re: Solr memory requirements?
 
 Just an update on the memory issue - might be useful for others. I
 read the following,
 
 http://wiki.apache.org/solr/SolrCaching?highlight=(SolrCaching)
 
 and looks like the first and new searcher listeners would populate the
 FieldCache. Commenting out these two listener entries seems to do the
 trick - at least the heap size is not growing as soon as Solr starts
 up.
 
 I ran some searches and they all came out fine. Index rate is also
 pretty good. Would there be any impact of disabling these listeners?
 
 Thanks,
 -vivek
 
 On Wed, May 13, 2009 at 2:12 PM, vivek sar wrote:
  Otis,
 
  In that case, I'm not sure why Solr is taking up so much memory as
  soon as we start it up. I checked for .tii file and there is only one,
 
  -rw-r--r--  1 search  staff  20306 May 11 21:47 
 ./20090510_1/data/index/_3au.tii
 
  I have all the cache disabled - so that shouldn't be a problem too. My
  ramBuffer size is only 64MB.
 
  I read note on sorting,
  http://wiki.apache.org/solr/SchemaDesign?highlight=(sort), and see
  something related to FieldCache. I don't see this as parameter defined
  in either solrconfig.xml or schema.xml. Could this be something that
  can load things in memory at startup? How can we disable it?
 
  I'm trying to find out if there is a way to tell how much memory Solr
  would consume and way to cap it.
 
  Thanks,
  -vivek
 
 
 
 
  On Wed, May 13, 2009 at 1:50 PM, Otis Gospodnetic
  wrote:
 
  Hi,
 
  Sorting is triggered by the sort parameter in the URL, not a 
  characteristic 
 of a field. :)
 
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
  From: vivek sar 
  To: solr-user@lucene.apache.org
  Sent: Wednesday, May 13, 2009 4:42:16 PM
  Subject: Re: Solr memory requirements?
 
  Thanks Otis.
 
  Our use case doesn't require any sorting or faceting. I'm wondering if
  I've configured anything wrong.
 
  I got total of 25 fields (15 are indexed and stored, other 10 are just
  stored). All my fields are basic data type - which I thought are not
  sorted. My id field is unique key.
 
  Is there any field here that might be getting sorted?
 
 
  required=true omitNorms=true compressed=false/
 
 
  compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  default=NOW/HOUR  compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  compressed=false/
 
  compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  compressed=false/
 
  compressed=false/
 
  compressed=false/
 
  omitNorms=true compressed=false/
 
  compressed=false/
 
  default=NOW/HOUR omitNorms=true/
 
 
 
 
  omitNorms=true multiValued=true/
 
  Thanks,
  -vivek
 
  On Wed, May 13, 2009 at 1:10 PM, Otis Gospodnetic
  wrote:
  
   Hi,
   Some answers:
   1) .tii files in the Lucene index.  When you sort, all distinct values 
   for 
 the
  field(s) used for sorting.  Similarly for facet fields.  Solr caches.
   2) ramBufferSizeMB dictates, more or less, how much Lucene/Solr will 
 consume
  during indexing.  There is no need to commit every 50K docs unless you 
  want 
 to
  trigger snapshot creation.
   3) see 1) above
  
   1.5 billion docs per instance where each doc is cca 1KB?  I doubt 
   that's 
 going
  to fly. :)
  
   Otis
   --
   Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  
  
  
   - Original Message 
   From: vivek sar
   To: solr-user@lucene.apache.org
   Sent: Wednesday, May 13, 2009 3:04:46 PM
   Subject: Solr memory requirements?
  
   Hi,
  
 I'm pretty sure this has been asked before, but I couldn't find a
   complete answer in the forum archive. Here are my questions,
  
   1) When solr starts up what does it loads up in the memory? Let's say
   I've 4 cores with each core 50G in size. When Solr comes up how much
   of it would be loaded in memory?
  
   2) How much memory is required during index time? If I'm committing
   50K records at a time (1 record = 1KB) using solrj, how much memory do
   I need to give to Solr.
  
   3) Is there a minimum memory requirement by Solr to maintain a certain
   size index? Is there any benchmark on this?
  
   Here are some of my configuration from solrconfig.xml,
  
   1) 64
   2) All the caches (under query tag) are commented out
   3) Few others,
 a)  true==
   would this require memory?
 b)  50
 c) 200
 d)
 e) false

Re: Solr memory requirements?


There is constant mixing of indexing concepts and searching concepts in this 
thread.  Are you having problems on the master (indexing) or on the slave 
(searching)?


That .tii is only 20K and you said this is a large index?  That doesn't smell 
right...

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: vivek sar vivex...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, May 13, 2009 5:12:00 PM
 Subject: Re: Solr memory requirements?
 
 Otis,
 
 In that case, I'm not sure why Solr is taking up so much memory as
 soon as we start it up. I checked for .tii file and there is only one,
 
 -rw-r--r--  1 search  staff  20306 May 11 21:47 
 ./20090510_1/data/index/_3au.tii
 
 I have all the cache disabled - so that shouldn't be a problem too. My
 ramBuffer size is only 64MB.
 
 I read note on sorting,
 http://wiki.apache.org/solr/SchemaDesign?highlight=(sort), and see
 something related to FieldCache. I don't see this as parameter defined
 in either solrconfig.xml or schema.xml. Could this be something that
 can load things in memory at startup? How can we disable it?
 
 I'm trying to find out if there is a way to tell how much memory Solr
 would consume and way to cap it.
 
 Thanks,
 -vivek
 
 
 
 
 On Wed, May 13, 2009 at 1:50 PM, Otis Gospodnetic
 wrote:
 
  Hi,
 
  Sorting is triggered by the sort parameter in the URL, not a characteristic 
  of 
 a field. :)
 
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
  From: vivek sar 
  To: solr-user@lucene.apache.org
  Sent: Wednesday, May 13, 2009 4:42:16 PM
  Subject: Re: Solr memory requirements?
 
  Thanks Otis.
 
  Our use case doesn't require any sorting or faceting. I'm wondering if
  I've configured anything wrong.
 
  I got total of 25 fields (15 are indexed and stored, other 10 are just
  stored). All my fields are basic data type - which I thought are not
  sorted. My id field is unique key.
 
  Is there any field here that might be getting sorted?
 
 
  required=true omitNorms=true compressed=false/
 
 
  compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  default=NOW/HOUR  compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  compressed=false/
 
  compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  omitNorms=true compressed=false/
 
  compressed=false/
 
  compressed=false/
 
  compressed=false/
 
  omitNorms=true compressed=false/
 
  compressed=false/
 
  default=NOW/HOUR omitNorms=true/
 
 
 
 
  omitNorms=true multiValued=true/
 
  Thanks,
  -vivek
 
  On Wed, May 13, 2009 at 1:10 PM, Otis Gospodnetic
  wrote:
  
   Hi,
   Some answers:
   1) .tii files in the Lucene index.  When you sort, all distinct values 
   for 
 the
  field(s) used for sorting.  Similarly for facet fields.  Solr caches.
   2) ramBufferSizeMB dictates, more or less, how much Lucene/Solr will 
 consume
  during indexing.  There is no need to commit every 50K docs unless you 
  want 
 to
  trigger snapshot creation.
   3) see 1) above
  
   1.5 billion docs per instance where each doc is cca 1KB?  I doubt that's 
 going
  to fly. :)
  
   Otis
   --
   Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  
  
  
   - Original Message 
   From: vivek sar
   To: solr-user@lucene.apache.org
   Sent: Wednesday, May 13, 2009 3:04:46 PM
   Subject: Solr memory requirements?
  
   Hi,
  
 I'm pretty sure this has been asked before, but I couldn't find a
   complete answer in the forum archive. Here are my questions,
  
   1) When solr starts up what does it loads up in the memory? Let's say
   I've 4 cores with each core 50G in size. When Solr comes up how much
   of it would be loaded in memory?
  
   2) How much memory is required during index time? If I'm committing
   50K records at a time (1 record = 1KB) using solrj, how much memory do
   I need to give to Solr.
  
   3) Is there a minimum memory requirement by Solr to maintain a certain
   size index? Is there any benchmark on this?
  
   Here are some of my configuration from solrconfig.xml,
  
   1) 64
   2) All the caches (under query tag) are commented out
   3) Few others,
 a)  true==
   would this require memory?
 b)  50
 c) 200
 d)
 e) false
 f)  2
  
   The problem we are having is following,
  
   I've given Solr RAM of 6G. As the total index size (all cores
   combined) start growing the Solr memory consumption  goes up. With 800
   million documents, I see Solr already taking up all the memory at
   startup. After that the commits, searches everything become slow. We
   will be having distributed setup with multiple Solr instances (around
   8) on four boxes, but our requirement is to

Re: Replication master+slave


Coincidentally, from 
http://www.cloudera.com/blog/2009/05/07/what%E2%80%99s-new-in-hadoop-core-020/ :

Hadoop configuration files now support XInclude elements for including 
portions of another configuration file (HADOOP-4944). This mechanism allows you 
to make configuration files more modular and reusable.

So others are doing it, too.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Bryan Talbot btal...@aeriagames.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, May 13, 2009 11:26:41 AM
 Subject: Re: Replication master+slave
 
 I see that Nobel's final comment in SOLR-1154 is that config files need to be 
 able to include snippets from external files.  In my limited testing, a 
 simple 
 patch to enable XInclude support seems to work.
 
 
 
 --- src/java/org/apache/solr/core/Config.java   (revision 774137)
 +++ src/java/org/apache/solr/core/Config.java   (working copy)
 @@ -100,8 +100,10 @@
   if (lis == null) {
 lis = loader.openConfig(name);
   }
 -  javax.xml.parsers.DocumentBuilder builder = 
 DocumentBuilderFactory.newInstance().newDocumentBuilder();
 -  doc = builder.parse(lis);
 +  javax.xml.parsers.DocumentBuilderFactory dbf = 
 DocumentBuilderFactory.newInstance();
 +  dbf.setNamespaceAware(true);
 +  dbf.setXIncludeAware(true);
 +  doc = dbf.newDocumentBuilder().parse(lis);
 
 DOMUtil.substituteProperties(doc, loader.getCoreProperties());
 } catch (ParserConfigurationException e)  {
 
 
 
 This allows a clause like this to include the contents of replication.xml if 
 it 
 exists.  If it's not found an exception will be thrown.
 
 
 
 href=http://localhost:8983/solr/corename/admin/file/?file=replication.xml;
  xmlns:xi=http://www.w3.org/2001/XInclude;
 
 
 
 If the file is optional and no exception should be thrown if the file is 
 missing, simply include a fallback action: in this case the fallback is empty 
 and does nothing.
 
 
 
 href=http://localhost:8983/solr/forum_en/admin/file/?file=replication.xml;
  xmlns:xi=http://www.w3.org/2001/XInclude;
 
 
 
 
 -Bryan
 
 
 
 
 On May 12, 2009, at May 12, 8:05 PM, Jian Han Guo wrote:
 
  I was looking at the same problem, and had a discussion with Noble. You can
  use a hack to achieve what you want, see
  
  https://issues.apache.org/jira/browse/SOLR-1154
  
  Thanks,
  
  Jianhan
  
  
  On Tue, May 12, 2009 at 5:13 PM, Bryan Talbot wrote:
  
  So how are people managing solrconfig.xml files which are largely the same
  other than differences for replication?
  
  I don't think it's a good thing to maintain two copies of the same file
  and I'd like to avoid that.  Maybe enabling the XInclude feature in
  DocumentBuilders would make it possible to modularize configuration files 
  to
  make this possible?
  
  
  
 http://java.sun.com/j2se/1.5.0/docs/api/javax/xml/parsers/DocumentBuilderFactory.html#setXIncludeAware(boolean)
  
  
  -Bryan
  
  
  
  
  
  On May 12, 2009, at May 12, 11:43 AM, Shalin Shekhar Mangar wrote:
  
  On Tue, May 12, 2009 at 10:42 PM, Bryan Talbot 
  wrote:
  
  For replication in 1.4, the wiki at
  http://wiki.apache.org/solr/SolrReplication says that a node can be both
  the master and a slave:
  
  A node can act as both master and slave. In that case both the master and
  slave configuration lists need to be present inside the
  ReplicationHandler
  requestHandler in the solrconfig.xml.
  
  What does this mean?  Does the core then poll itself for updates?
  
  
  
  No. This type of configuration is meant for repeaters. Suppose there are
  slaves in multiple data-centers (say data center A and B). There is always
  a
  single master (say in A). One of the slaves in B is used as a master for
  the
  other slaves in B. Therefore, this one slave in B is both a master as well
  as the slave.
  
  
  
  I'd like to have a single set of configuration files that are shared by
  masters and slaves and avoid duplicating configuration details in
  multiple
  files (one for master and one for slave) to ease management and failover.
  Is this possible?
  
  
  You wouldn't want the master to be a slave. So I guess you'd need to have
  a
  separate file. Also, it needs to be a separate file so that the slave does
  not become a master when the solrconfig.xml is replicated.
  
  
  
  When I attempt to setup a multi server master-slave configuration and
  include both master and slave replication configuration options, I into
  some
  problems.  I'm  running a nightly build from May 7.
  
  
  Not sure what happened. Is that the url for this solr (meaning same solr
  url
  is master and slave of itself)? If yes, that is not a valid configuration.
  
  --
  Regards,
  Shalin Shekhar Mangar.

Re: Replication master+slave

2009-05-13 Thread Bryan Talbot

I think the patch I included earlier covers solr core, but it looks  
like at least some other extensions (DIH) create and use their own XML  
parser.  So, if this functionality is to extend to all XML files,  
those will need similar patches.


Here's one for DIH:

--- src/main/java/org/apache/solr/handler/dataimport/ 
DataImporter.java  (revision 774137)
+++ src/main/java/org/apache/solr/handler/dataimport/ 
DataImporter.java  (working copy)

@@ -148,8 +148,10 @@
   void loadDataConfig(String configFile) {

 try {
-  DocumentBuilder builder = DocumentBuilderFactory.newInstance()
-  .newDocumentBuilder();
+  DocumentBuilderFactory dbf =  
DocumentBuilderFactory.newInstance();

+  dbf.setNamespaceAware(true);
+  dbf.setXIncludeAware(true);
+  DocumentBuilder builder = dbf.newDocumentBuilder();
   Document document = builder.parse(new InputSource(new  
StringReader(

   configFile)));



The only down side I can see to this is it doesn't offer very  
expressive conditional inclusion: the file is included if it's present  
otherwise fallback inclusions can be used.  It's also specific to XML  
files and obviously won't work for other types of configuration  
files.  However, it is simple and effective.



-Bryan




On May 13, 2009, at May 13, 6:36 PM, Otis Gospodnetic wrote:



Coincidentally, from http://www.cloudera.com/blog/2009/05/07/what%E2%80%99s-new-in-hadoop-core-020/ 
 :


Hadoop configuration files now support XInclude elements for  
including portions of another configuration file (HADOOP-4944). This  
mechanism allows you to make configuration files more modular and  
reusable.


So others are doing it, too.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 

From: Bryan Talbot btal...@aeriagames.com
To: solr-user@lucene.apache.org
Sent: Wednesday, May 13, 2009 11:26:41 AM
Subject: Re: Replication master+slave

I see that Nobel's final comment in SOLR-1154 is that config files  
need to be
able to include snippets from external files.  In my limited  
testing, a simple

patch to enable XInclude support seems to work.



--- src/java/org/apache/solr/core/Config.java   (revision 774137)
+++ src/java/org/apache/solr/core/Config.java   (working copy)
@@ -100,8 +100,10 @@
 if (lis == null) {
   lis = loader.openConfig(name);
 }
-  javax.xml.parsers.DocumentBuilder builder =
DocumentBuilderFactory.newInstance().newDocumentBuilder();
-  doc = builder.parse(lis);
+  javax.xml.parsers.DocumentBuilderFactory dbf =
DocumentBuilderFactory.newInstance();
+  dbf.setNamespaceAware(true);
+  dbf.setXIncludeAware(true);
+  doc = dbf.newDocumentBuilder().parse(lis);

   DOMUtil.substituteProperties(doc, loader.getCoreProperties());
} catch (ParserConfigurationException e)  {



This allows a clause like this to include the contents of  
replication.xml if it

exists.  If it's not found an exception will be thrown.



href=http://localhost:8983/solr/corename/admin/file/?file=replication.xml 


xmlns:xi=http://www.w3.org/2001/XInclude;



If the file is optional and no exception should be thrown if the  
file is
missing, simply include a fallback action: in this case the  
fallback is empty

and does nothing.



href=http://localhost:8983/solr/forum_en/admin/file/?file=replication.xml 


xmlns:xi=http://www.w3.org/2001/XInclude;




-Bryan




On May 12, 2009, at May 12, 8:05 PM, Jian Han Guo wrote:

I was looking at the same problem, and had a discussion with  
Noble. You can

use a hack to achieve what you want, see

https://issues.apache.org/jira/browse/SOLR-1154

Thanks,

Jianhan


On Tue, May 12, 2009 at 5:13 PM, Bryan Talbot wrote:

So how are people managing solrconfig.xml files which are largely  
the same

other than differences for replication?

I don't think it's a good thing to maintain two copies of the  
same file

and I'd like to avoid that.  Maybe enabling the XInclude feature in
DocumentBuilders would make it possible to modularize  
configuration files to

make this possible?




http://java.sun.com/j2se/1.5.0/docs/api/javax/xml/parsers/DocumentBuilderFactory.html#setXIncludeAware(boolean)



-Bryan





On May 12, 2009, at May 12, 11:43 AM, Shalin Shekhar Mangar wrote:

On Tue, May 12, 2009 at 10:42 PM, Bryan Talbot

wrote:


For replication in 1.4, the wiki at
http://wiki.apache.org/solr/SolrReplication says that a node  
can be both

the master and a slave:

A node can act as both master and slave. In that case both the  
master and

slave configuration lists need to be present inside the
ReplicationHandler
requestHandler in the solrconfig.xml.

What does this mean?  Does the core then poll itself for updates?




No. This type of configuration is meant for repeaters. Suppose  
there are
slaves in multiple data-centers (say data center A and B). There  
is always

a
single master (say in A). One of the slaves in B is used as a  
master for

the
other slaves

Re: Replication master+slave


Bryan, maybe it's time to stick this in JIRA?
http://wiki.apache.org/solr/HowToContribute

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Bryan Talbot btal...@aeriagames.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, May 13, 2009 10:11:21 PM
 Subject: Re: Replication master+slave
 
 I think the patch I included earlier covers solr core, but it looks like at 
 least some other extensions (DIH) create and use their own XML parser.  So, 
 if 
 this functionality is to extend to all XML files, those will need similar 
 patches.
 
 Here's one for DIH:
 
 --- src/main/java/org/apache/solr/handler/dataimport/DataImporter.java  
 (revision 774137)
 +++ src/main/java/org/apache/solr/handler/dataimport/DataImporter.java  
 (working 
 copy)
 @@ -148,8 +148,10 @@
void loadDataConfig(String configFile) {
 
  try {
 -  DocumentBuilder builder = DocumentBuilderFactory.newInstance()
 -  .newDocumentBuilder();
 +  DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
 +  dbf.setNamespaceAware(true);
 +  dbf.setXIncludeAware(true);
 +  DocumentBuilder builder = dbf.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(
configFile)));
 
 
 
 The only down side I can see to this is it doesn't offer very expressive 
 conditional inclusion: the file is included if it's present otherwise 
 fallback 
 inclusions can be used.  It's also specific to XML files and obviously won't 
 work for other types of configuration files.  However, it is simple and 
 effective.
 
 
 -Bryan
 
 
 
 
 On May 13, 2009, at May 13, 6:36 PM, Otis Gospodnetic wrote:
 
  
  Coincidentally, from 
 http://www.cloudera.com/blog/2009/05/07/what%E2%80%99s-new-in-hadoop-core-020/
  :
  
  Hadoop configuration files now support XInclude elements for including 
 portions of another configuration file (HADOOP-4944). This mechanism allows 
 you 
 to make configuration files more modular and reusable.
  
  So others are doing it, too.
  
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  
  
  
  - Original Message 
  From: Bryan Talbot 
  To: solr-user@lucene.apache.org
  Sent: Wednesday, May 13, 2009 11:26:41 AM
  Subject: Re: Replication master+slave
  
  I see that Nobel's final comment in SOLR-1154 is that config files need to 
  be
  able to include snippets from external files.  In my limited testing, a 
 simple
  patch to enable XInclude support seems to work.
  
  
  
  --- src/java/org/apache/solr/core/Config.java   (revision 774137)
  +++ src/java/org/apache/solr/core/Config.java   (working copy)
  @@ -100,8 +100,10 @@
   if (lis == null) {
 lis = loader.openConfig(name);
   }
  -  javax.xml.parsers.DocumentBuilder builder =
  DocumentBuilderFactory.newInstance().newDocumentBuilder();
  -  doc = builder.parse(lis);
  +  javax.xml.parsers.DocumentBuilderFactory dbf =
  DocumentBuilderFactory.newInstance();
  +  dbf.setNamespaceAware(true);
  +  dbf.setXIncludeAware(true);
  +  doc = dbf.newDocumentBuilder().parse(lis);
  
 DOMUtil.substituteProperties(doc, loader.getCoreProperties());
  } catch (ParserConfigurationException e)  {
  
  
  
  This allows a clause like this to include the contents of replication.xml 
  if 
 it
  exists.  If it's not found an exception will be thrown.
  
  
  
  href=http://localhost:8983/solr/corename/admin/file/?file=replication.xml;
  xmlns:xi=http://www.w3.org/2001/XInclude;
  
  
  
  If the file is optional and no exception should be thrown if the file is
  missing, simply include a fallback action: in this case the fallback is 
  empty
  and does nothing.
  
  
  
  href=http://localhost:8983/solr/forum_en/admin/file/?file=replication.xml;
  xmlns:xi=http://www.w3.org/2001/XInclude;
  
  
  
  
  -Bryan
  
  
  
  
  On May 12, 2009, at May 12, 8:05 PM, Jian Han Guo wrote:
  
  I was looking at the same problem, and had a discussion with Noble. You 
  can
  use a hack to achieve what you want, see
  
  https://issues.apache.org/jira/browse/SOLR-1154
  
  Thanks,
  
  Jianhan
  
  
  On Tue, May 12, 2009 at 5:13 PM, Bryan Talbot wrote:
  
  So how are people managing solrconfig.xml files which are largely the 
  same
  other than differences for replication?
  
  I don't think it's a good thing to maintain two copies of the same file
  and I'd like to avoid that.  Maybe enabling the XInclude feature in
  DocumentBuilders would make it possible to modularize configuration 
  files 
 to
  make this possible?
  
  
  
  
 http://java.sun.com/j2se/1.5.0/docs/api/javax/xml/parsers/DocumentBuilderFactory.html#setXIncludeAware(boolean)
  
  
  -Bryan
  
  
  
  
  
  On May 12, 2009, at May 12, 11:43 AM, Shalin Shekhar Mangar wrote:
  
  On Tue, May 12, 2009 at 10:42 PM, Bryan Talbot
  wrote:
  
  For replication in 1.4, the wiki at

Re: Sorting by 'starts with'


Wojtek,

I believe 
http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/spans/SpanFirstQuery.html
 would help, though there is no support for Span queries in Solr.  But there is 
support for custom query parsers, and there is 
http://lucene.apache.org/java/2_4_1/api/contrib-snowball/index.html

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: wojtekpia wojte...@hotmail.com
 To: solr-user@lucene.apache.org
 Sent: Thursday, May 7, 2009 2:41:29 PM
 Subject: Sorting by 'starts with'
 
 
 I have an index of product names. I'd like to sort results so that entries
 starting with the user query come first. 
 E.g. 
 
 q=kitchen
 
 Results would sort something like:
 1. kitchen appliance
 2. kitchenaid dishwasher
 3. fridge for kitchen
 
 It looks like using a query Function Query comes close, but I don't know how
 to write a subquery that only matches if the value starts with the query
 string. 
 
 Has anyone solved a similar need?
 
 Thanks,
 
 Wojtek
 -- 
 View this message in context: 
 http://www.nabble.com/Sorting-by-%27starts-with%27-tp23432815p23432815.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: Creating new QParserPlugin


Andrey,

I urge you to use JIRA for this.  That's exactly what it's for and how it gets 
used.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Andrey Klochkov akloch...@griddynamics.com
 To: solr-user@lucene.apache.org
 Sent: Thursday, May 7, 2009 5:14:26 AM
 Subject: Re: Creating new QParserPlugin
 
 Hi!
 
 I agree that Solr is difficult to extend in many cases. We just patch Solr,
 and I guess many other users patch it too. What I propose is to create some
 Solr-community site (Solr incubator?) to public patches there, and Solr core
 team could then look there and choose patches to apply to the Solr codebase.
 I know that one can use Jira for that, but it's not convinient to use it in
 this way.
 
 On Thu, May 7, 2009 at 2:41 AM, KaktuChakarabati wrote:
 
 
  Hello everyone,
  I am trying to write a new QParserPlugin+QParser, one that will work
  similar
  to how DisMax does, but will give me more control over the
  FunctionQuery-related part of the query processing (e.g in regards to a
  specified bf parameter).
 
  In specific, I want to be able to affect the way the queryNorm (and
  possibly
  other factors) interact with a
  pre-computed value I store in a static field (i.e I compute an index-time
  score for a document that I wish to use in a bf as a ValueSource, without
  being affected by queryNorm or other such extranous considerations.)
 
  While trying this, I notice I run alot into cases where some parts I try to
  override/inherit from are private to a java package namespace, and this
  makes the whole thing very cumbersome.
 
  Examples for this are the DismaxQParser class which is defined as a local
  class inside the DisMaxQParserPlugin.java file (i think this is bad
  practice
  - otherwise, FunctionQParserPlugin/FunctionQParser do have their own
  seperate files, so i think this is a good convention to follow generally).
  Another case is where i try to inherit from FunctionQParser and end up not
  being able to replicate some of the parse() logic, because it uses the
  QueryParsing.StrParser class which is a static inner class and so is only
  accessible from the solr.search namespace.
 
  In short, many such cases seem to arise and i think this poses a
  considerable limitation on
  the possibilities of extending solr.
  If this resonates with more people here, I'd take this issue up with
  solr-dev.
 
  Otherwise, if some of you have some notions about going about what i'm
  trying to do differently,
  I would be happy to hear.
 
  Thanks,
  -Chak
  --
  View this message in context:
  http://www.nabble.com/Creating-new-QParserPlugin-tp23416974p23416974.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 
 -- 
 Andrew Klochkov

Re: Solr memory requirements?