Default result rows
Hi! Where can I define how many rows must be returned in the result? The default is 10, and specifying another value each time through the URL or the advanced interface isn't convenient. Ar cieņu, Mihails
Deleting Solr index
How can I clear the whole Solr index? Ar cieņu, Mihails
Re: Deleting Solr index
just rm -r SOLR_DIR/data/index. 2008/6/18 Mihails Agafonovs [EMAIL PROTECTED]: How can I clear the whole Solr index? Ar cieņu, Mihails -- regards j.L
Re: Default result rows
You can configure this in solrconfig.xml under the "defaults" section for the StandardRequestHandler:

<requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">30</int>
    <str name="fl">*</str>
    <str name="version">2.1</str>
  </lst>
</requestHandler>

2008/6/18 Mihails Agafonovs [EMAIL PROTECTED]: Hi! Where can I define, how many rows must be returned in the result? Default is 10, and specifying other value each time through URL or advanced interface isn't comfortable. Ar cieņu, Mihails -- Regards, Shalin Shekhar Mangar.
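For completeness, a quick curl sketch of how this default interacts with an explicit parameter (the localhost URL and port are assumptions from the example Jetty setup; adjust for your install). With the above in place, a request without a rows parameter returns 30 results, and rows on the URL still overrides the configured default:

curl 'http://localhost:8983/solr/select?q=solr'          # uses the configured default, rows=30
curl 'http://localhost:8983/solr/select?q=solr&rows=5'   # explicit rows overrides the default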
Re: Deleting Solr index
You can delete by query *:* (which matches all documents) http://wiki.apache.org/solr/UpdateXmlMessages 2008/6/18 Mihails Agafonovs [EMAIL PROTECTED]: How can I clear the whole Solr index? Ar cieņu, Mihails -- Regards, Shalin Shekhar Mangar.
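For reference, a minimal curl sketch of the delete-by-query approach described above (the hostname and port are assumptions from the example setup; the XML update syntax is the one documented on the wiki page linked above):

curl http://localhost:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' \
     -d '<delete><query>*:*</query></delete>'
curl http://localhost:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' \
     -d '<commit/>'

The deletion only becomes visible to searchers after the commit.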
Re: Default result rows
Doesn't work :(. None of the parameters in the defaults section is being read; Solr still uses the predefined default parameters. P.S. In the defaults section I should also be able to specify which stylesheet to use, right? Quoting Shalin Shekhar Mangar: You can configure this in solrconfig.xml under the "defaults" section for the StandardRequestHandler:

<requestHandler name="standard" class="solr.StandardRequestHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">30</int>
    <str name="fl">*</str>
    <str name="version">2.1</str>
  </lst>
</requestHandler>

2008/6/18 Mihails Agafonovs <[EMAIL PROTECTED]>: Hi! Where can I define, how many rows must be returned in the result? Default is 10, and specifying other value each time through URL or advanced interface isn't comfortable. Ar cieņu, Mihails -- Regards, Shalin Shekhar Mangar.

Ar cieņu, Mihails
SOLR-236 patch works
I had the patch problem but I manually created that file and the Solr nightly builds fine. After replacing solr.war with apache-solr-solrj-1.3-dev.jar, in solrconfig.xml I added this:

<searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent" />

Then I added this to the standard and dismax request handlers:

<requestHandler name="standard" ...>
  <arr name="components">
    <str>collapse</str>
  </arr>
</requestHandler>

I added collapse.field=field&collapse.threshold=n to the query, and the results collapsed as expected. Can you provide feedback about this particular patch once you try it? I'd like to get it into Solr 1.3, actually, so any feedback would help. Thanks, Otis
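For reference, a sketch of what a collapsing request might look like once the patch is applied (the query, the field name site, and the threshold value are made up for illustration; collapse.field and collapse.threshold are the parameters mentioned above, and their exact behavior depends on the SOLR-236 patch version):

curl 'http://localhost:8983/solr/select?q=ipod&collapse.field=site&collapse.threshold=1&rows=10'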
Re: Did you mean functionality
Yeah, I read it. Thanks a lot, I'm waiting for it! []s, Lucas -- Lucas Frare A. Teixeira [EMAIL PROTECTED] Tel: +55 11 3660.1622 - R3018 Grant Ingersoll wrote: Also see http://wiki.apache.org/solr/SpellCheckComponent I expect to commit fairly soon. On Jun 17, 2008, at 5:46 PM, Otis Gospodnetic wrote: Hi Lucas, Have a look at (the patch in) SOLR-572, lots of work happening there as we speak. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Lucas F. A. Teixeira [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Tuesday, June 17, 2008 4:30:12 PM Subject: Did you mean functionality Hello everybody, I need to integrate the Lucene SpellChecker contrib lib in my application, but I'm using the EmbeddedSolrServer to access all indexes. I want to know what I should do (if someone has any step-by-step guide, link, tutorial or smoke signal) about what I need to do during indexing, and of course how to search through the words generated by this API. I can use the lib itself to search for suggestions, without using Solr, but I'm confused about how to proceed when indexing these docs. Thanks a lot, []s, -- Lucas Frare A. Teixeira [EMAIL PROTECTED] Tel: +55 11 3660.1622 - R3018
Re: Feature idea - delete and commit from web interface ?
A patch for this has been posted before, though I don't know if it can delete. It can add documents and commit from the admin GUI. https://issues.apache.org/jira/browse/SOLR-85 Koji JLIST wrote: It seems that the web interface only supports select but not delete. Is it possible to do delete from the browser? It would be nice to be able to do delete and commit, and even post (put XML in an html form) from the admin web interface :) Also, does delete have to be a POST? A GET should do.
Bug Solr/bin/commit problem - fails to commit correctly and render response
Hello, I am using the solr/bin/commit file to commit index changes after index distribution in the collection distribution operations model. The commit script is printed at the end of the email. When I run the script as is, I get the following error: "commit request to Solr at port 8080 failed". This is corrected with the following addition to the curl line:

rs=`curl http://${solr_hostname}:${solr_port}/solr/update -s -d "<commit/>"`

becomes:

rs=`curl http://${solr_hostname}:${solr_port}/solr/update -s -d "<commit/>" -H 'Content-type:text/xml; charset=utf-8'`

This works, but the log still reports an error, because the response is not what the script expects. Solr returns:

<int name="status">0</int>

but the commit script greps for:

result.*status="0"   [regular expression]

Has anybody else had problems using this commit script? Where can I get the latest version? I got this script from the Solr 1.2 package. Thanks, John

---
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Shell script to force a commit of all changes since last commit
# for a Solr server

orig_dir=$(pwd)
cd ${0%/*}/..
solr_root=$(pwd)
cd ${orig_dir}

unset solr_hostname solr_port webapp_name user verbose debug

. ${solr_root}/bin/scripts-util

# set up variables
prog=${0##*/}
log=${solr_root}/logs/${prog}.log

# define usage string
USAGE="\
usage: $prog [-h hostname] [-p port] [-w webapp_name] [-u username] [-v]
       -h          specify Solr hostname
       -p          specify Solr port number
       -w          specify name of Solr webapp (defaults to solr)
       -u          specify user to sudo to before running script
       -v          increase verbosity
       -V          output debugging info
"

# parse args
while getopts h:p:w:u:vV OPTION
do
  case $OPTION in
  h)
    solr_hostname=$OPTARG
    ;;
  p)
    solr_port=$OPTARG
    ;;
  w)
    webapp_name=$OPTARG
    ;;
  u)
    user=$OPTARG
    ;;
  v)
    verbose="v"
    ;;
  V)
    debug="V"
    ;;
  *)
    echo "$USAGE"
    exit 1
  esac
done

[[ -n $debug ]] && set -x

if [[ -z ${solr_port} ]]
then
  echo "Solr port number missing in $confFile or command line."
  echo "$USAGE"
  exit 1
fi

# use default hostname if not specified
if [[ -z ${solr_hostname} ]]
then
  solr_hostname=localhost
fi

# use default webapp name if not specified
if [[ -z ${webapp_name} ]]
then
  webapp_name=solr
fi

fixUser "$@"

start=`date +%s`

logMessage started by $oldwhoami
logMessage command: $0 $@

rs=`curl http://${solr_hostname}:${solr_port}/solr/update -s -d "<commit/>"`
if [[ $? != 0 ]]
then
  logMessage failed to connect to Solr server at port ${solr_port}
  logMessage commit failed
  logExit failed 1
fi

# check status of commit request
echo $rs | grep '<result.*status="0"' > /dev/null 2>&1
if [[ $? != 0 ]]
then
  logMessage commit request to Solr at port ${solr_port} failed:
  logMessage $rs
  logExit failed 2
fi

logExit ended 0
---
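For reference, a minimal sketch combining the two changes described above: the Content-type header on the curl request, and a status check that matches the <int name="status">0</int> response actually returned (the exact response format can vary by Solr version, so the grep pattern is an assumption based on the output quoted above):

# post the commit with an explicit content type, then check the status element in the response
rs=`curl http://${solr_hostname}:${solr_port}/solr/update -s \
    -H 'Content-type:text/xml; charset=utf-8' -d "<commit/>"`
echo $rs | grep '<int name="status">0</int>' > /dev/null 2>&1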
RE: Bug Solr/bin/commit problem - fails to commit correctly and render response
Ok, I checked out the nightly builds and the two changes have been made. I will use the Solr 1.3 version of solr/bin/commit. Thanks, John -----Original Message----- From: McBride, John [mailto:[EMAIL PROTECTED]] Sent: 18 June 2008 11:48 To: solr-user@lucene.apache.org Subject: Bug Solr/bin/commit problem - fails to commit correctly and render response Hello, I am using the solr/bin/commit file to commit index changes after index distribution in the collection distribution operations model. The commit script is printed at the end of the email. [...]
never deallocate RAM...during search
Hi users, Some days ago I asked a question about RAM use during searches, but I didn't solve my problem with the ideas that some expert users gave me. After running some tests I can now ask a more specific question, hoping someone can help me. My problem is that I need highlighting and I have quite big docs (txt files of 40MB). The conclusion of my tests is that if I set rows to 10, the content of the first 10 results is cached. This is probably normal because it is needed for highlighting, but this memory is never deallocated, even though I set Solr's caches to 0. Because of this, memory grows until it is close to the heap limit; then the GC starts to deallocate memory, but at that point searches are quite slow. Is this normal behavior? Can I configure some Solr parameter to force results to be deallocated after each search? [I'm using Solr 1.2] Another thing I found is that although I comment out (in solrconfig) all of these options: filterCache, queryResultCache, documentCache, enableLazyFieldLoading, useFilterForSortedQuery, boolTofilterOptimizer, the stats page always shows caching:true. I'm probably missing something stupid but I can't find it. If anyone can help me... I'm quite desperate. Rober.
Re: Default result rows
Use rows=NNN in the URL. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Mihails Agafonovs [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Wednesday, June 18, 2008 4:30:53 AM Subject: Default result rows Hi! Where can I define, how many rows must be returned in the result? Default is 10, and specifying other value each time through URL or advanced interface isn't comfortable. Ar cieņu, Mihails
Re: SOLR-236 patch works
That looks right. CollapseComponent replaces QueryComponent. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: JLIST [EMAIL PROTECTED] To: Otis Gospodnetic solr-user@lucene.apache.org Sent: Wednesday, June 18, 2008 5:24:25 AM Subject: SOLR-236 patch works I had the patch problem but I manually created that file and the Solr nightly builds fine. [...]
Re: Feature idea - delete and commit from web interface ?
As for POST vs. GET - don't let REST purists hear you. :) Actually, isn't there a DELETE HTTP method that REST purists would say should be used in case of doc deletion? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: JLIST [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Wednesday, June 18, 2008 4:13:09 AM Subject: Feature idea - delete and commit from web interface ? It seems that the web interface only supports select but not delete. Is it possible to do delete from the browser? It would be nice to be able to do delete and commit, and even post (put XML in an html form) from the admin web interface :) Also, does delete have to be a POST? A GET should do.
Re: never deallocate RAM...during search
Hi, I don't have the answer about why the cache still shows true, but as far as memory usage goes, based on your description I'd guess the memory is allocated and used by the JVM, which typically tries not to run GC unless it needs to. So if you want to get rid of that used memory, you need to talk to the JVM and persuade it to run GC. I don't think there is a way to manage memory usage directly. There is System.gc() that you can call, but that's only a suggestion for the JVM to run GC. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Roberto Nieto [EMAIL PROTECTED] To: solr-user solr-user@lucene.apache.org Sent: Wednesday, June 18, 2008 7:43:12 AM Subject: never deallocate RAM...during search Hi users, Some days ago I asked a question about RAM use during searches, but I didn't solve my problem with the ideas that some expert users gave me. [...]
Re: missing document count?
: not hard, but useful information to have handy without additional : manipulations on my part. : our pages are the results of multiple queries. so, given a max number of : records per page (or total), the rows asked of query2 is max - query1, of

In the common case, counting the number of docs in a result is just as easy as reading some attribute containing the count. It sounds like you have a more complicated case where what you really want is the count of how many docs there are in the entire response (ie: multiple result sections) ... that count is admittedly a little more work but would also be completely useless to most clients if it was included in the response (just as the number of fields in each doc, or the total number of strings in the response) ... there is a lot of metadata that *could* be included in the response, but we don't bother when the client can compute that metadata just as easily as the server -- among other things, it helps keep the response size smaller. This was actually one of the original guiding principles of Solr: support features that are faster/cheaper/easier/more-efficient on the central server than they would be on the clients (sorting, docset caching, faceting, etc...) -Hoss
Re: missing document count?
Chris Hostetter wrote: : not hard, but useful information to have handy without additional : manipulations on my part. : our pages are the results of multiple queries. so, given a max number of : records per page (or total), the rows asked of query2 is max - query1, of in the common case, counting the number of docs in a result is just as easy as reading some attribute containing the count. I suppose :) in my mind, one (potentially) requires just a read, while the other requires some further manipulation. but I suppose most modern languages have optimizations for things like array size :) It sounds like you have a more complicated case where what you really want is the count of how many docs there are in the entire response I don't know how complex it is to ask for documents in the response, but yes :) (ie: multiple result sections) ... multiple results from multiple queries, not a single query. but really, I wasn't planning on having anyone (solr or otherwise) solve my needs. I just find it odd that I need to discern the number of returned results. that count is admittedly a little more work but would also be completely useless to most clients if it was included in the response perhaps :) (just as the number of fields in each doc, or the total number of strings in the response) ... there is a lot of metadata that *could* be included in the response, but we don't bother when the client can compute that metadata just as easily as the server -- among other things, it helps keep the response size smaller. agreed - smaller is better. as for "client as easily as the server", I assumed that solr was keeping track of the document count already, if only to see when the number of documents exceeds the rows parameter. if so, all the people who care about the number of documents in the result (which, I'll assume, is more than those who care about total strings in the response ;) are all re-computing a known value. This was actually one of the original guiding principles of Solr: support features that are faster/cheaper/easier/more-efficient on the central server than they would be on the clients (sorting, docset caching, faceting, etc...) sure, I'll buy that. but in my mind it was only exposing something solr already was calculating anyway. regardless, thanks for taking the time :) --Geoff
Re[2]: Feature idea - delete and commit from web interface ?
GET makes it possible to delete from a browser address bar, which you can not do with DELETE :) As for POST vs. GET - don't let REST purists hear you. :) Actually, isn't there a DELETE HTTP method that REST purists would say should be used in case of doc deletion?
Re[2]: Feature idea - delete and commit from web interface ?
Sounds like the web designer's fault. No permission check and no confirmation for deletion? Never, never delete with a GET. The Ultraseek spider deleted 20K documents on an intranet once because they gave it admin perms and it followed the "delete this page" link on every page.
Re: Re[2]: Feature idea - delete and commit from web interface ?
The spider was given an admin login so it could access all content. Reasonable decision if the pages had been designed well. Even with a confirmation, never delete with a GET. Use POST. If the spider ever discovers the URL that the confirmation uses, it will still delete the content. Luckily, they had a backup. wunder On 6/18/08 1:55 PM, JLIST [EMAIL PROTECTED] wrote: Sounds like the web designer's fault. No permission check and no confirmation for deletion? Never, never delete with a GET. The Ultraseek spider deleted 20K documents on an intranet once because they gave it admin perms and it followed the "delete this page" link on every page.
Re: Re[2]: Feature idea - delete and commit from web interface ?
On Wed, Jun 18, 2008 at 1:55 PM, JLIST [EMAIL PROTECTED] wrote: Sounds like the web designer's fault. No permission check and no confirmation for deletion? Nope ... application designer's fault for misusing the web. Allowing deletes on a GET violates HTTP/1.1 requirements (not just RESTful ones) that GET requests not have side effects, so an app that works that way is going to mess up when HTTP caching is in use ... as lots of people found to their chagrin when they installed Google Desktop's caching capabilities, and the cache played by the standard HTTP rules (GETs are supposed to be idempotent, having no side effects, so it's just fine to issue the same GET as many times as desired). If you want an easy way to do deletes from a browser, just set up a little form that does a POST and includes the id of the document you want to delete. Then you're playing by the rules, and won't make a fool of yourself when crawlers or caches interact with your application. Craig McClanahan Never, never delete with a GET. The Ultraseek spider deleted 20K documents on an intranet once because they gave it admin perms and it followed the "delete this page" link on every page.
Re: scaling / sharding questions
This may be slightly off topic, for which I apologize, but it is related to the question of searching several indexes as Lance describes below, quoting: "We also found that searching a few smaller indexes via the Solr 1.3 Distributed Search feature is actually faster than searching one large index, YMMV." The wiki describing distributed search lists several limitations, which set me to wondering about two limitations in particular, mainly with respect to scoring: 1) No distributed idf. Does this mean that the Lucene scoring algorithm is computed without the idf factor, i.e. we just get term frequency scoring? 2) Doesn't support consistency between stages, e.g. a shard index can be changed between STAGE_EXECUTE_QUERY and STAGE_GET_FIELDS. What does this mean, or where can I find out what it means? Thanks! Phil

Lance Norskog wrote: Yes, I've done this split-by-delete several times. The halved index still uses as much disk space until you optimize it. As to splitting policy: we use an MD5 signature as our unique ID. This has the lovely property that we can wildcard. 'contentid:f*' denotes 1/16 of the whole index. This 1/16 is a very random sample of the whole index. We use this for several things. If we use this for shards, we have a query that matches a shard's contents. The Solr/Lucene syntax does not support modular arithmetic, and so it will not let you query a subset that matches one of your shards. We also found that searching a few smaller indexes via the Solr 1.3 Distributed Search feature is actually faster than searching one large index, YMMV. So for us, a large pile of shards will be optimal anyway, so we have no need to rebalance. It sounds like you're not storing the data in a backing store, but are storing all data in the index itself. We have found this challenging. Cheers, Lance Norskog

-Original Message- From: Jeremy Hinegardner [mailto:[EMAIL PROTECTED] Sent: Friday, June 13, 2008 3:36 PM To: solr-user@lucene.apache.org Subject: Re: scaling / sharding questions Sorry for not keeping this thread alive, let's see what we can do... One option I've thought of for 'resharding' would be splitting an index into two by just copying it, then deleting 1/2 the documents from one, doing a commit, and deleting the other 1/2 from the other index and committing. That is: 1) Take original index 2) copy to b1 and b2 3) delete docs from b1 that match a particular query A 4) delete docs from b2 that do not match a particular query A 5) commit b1 and b2 Has anyone tried something like that? As for how to know where each document is stored, generally we're considering unique_document_id % N. If we rebalance we change N and redistribute, but that probably will take too much time. That makes us move more towards a staggered, age-based approach where the most recent docs filter down to permanent indexes based upon time. Another thought we've had recently is to have many many many physical shards on the indexing writer side, but then merge groups of them into logical shards which are snapshotted to reader Solrs on a frequent basis. I haven't done any testing along these lines, but logically it seems like an idea worth pursuing. enjoy, -jeremy

On Fri, Jun 06, 2008 at 03:14:10PM +0200, Marcus Herou wrote: Cool sharding technique. We as well are thinking of how to move docs from one index to another because we need to re-balance the docs when we add new nodes to the cluster. We only store ids in the index, otherwise we could have moved stuff around with IndexReader.document(x) or so.
Luke (http://www.getopt.org/luke/) is able to reconstruct the indexed Document data, so it should be doable. However I'm thinking of actually just deleting the docs from the old index and adding new Documents to the new node. It would be cool not to waste CPU cycles by reindexing already indexed stuff but... And we as well will have data amounts in the range you are talking about. Perhaps we could share ideas? How do you plan to store where each document is located? I mean you probably need to store info about the Document and its location somewhere, perhaps in a clustered DB? We will probably go for HBase for this. I think the number of documents is less important than the actual data size (just speculating). We currently search 10M (will get much much larger) indexed blog entries on one machine where the JVM has 1G heap, the index size is 3G and response times are still quite fast. This is a readonly node though and is updated every morning with a freshly optimized index. Someone told me that you probably need twice the RAM if you plan to both index and search at the same time. If I were you I would just test indexing X entries of your data and then start searching the index with lower JVM settings each round, and when response times get too slow or you hit OOE then you get a rough estimate of the bare minimum X RAM needed for Y entries. I think we
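For anyone trying the split-by-delete procedure quoted earlier in this thread (steps 1-5), here is a rough sketch using the XML update handler. The index paths, hostnames, and the partitioning query contentid:f* (borrowed from the MD5-prefix idea above) are illustrative only, and both copies should be optimized afterwards to actually reclaim disk space:

# copy the original index twice (with the index not being written to at the time)
cp -r /solr/data/index /solr/b1/data/index
cp -r /solr/data/index /solr/b2/data/index

# on the b1 Solr, delete everything that matches query A; on b2, everything that does not
curl http://b1-host:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' \
     -d '<delete><query>contentid:f*</query></delete>'
curl http://b1-host:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' -d '<commit/>'

curl http://b2-host:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' \
     -d '<delete><query>*:* -contentid:f*</query></delete>'
curl http://b2-host:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' -d '<commit/>'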
Re: scaling / sharding questions
On Wed, Jun 18, 2008 at 5:53 PM, Phillip Farber [EMAIL PROTECTED] wrote: Does this mean that the Lucene scoring algorithm is computed without the idf factor, i.e. we just get term frequency scoring? No, it means that the idf calculation is done locally on a single shard. With a big index that is randomly mixed, this should not have a practical impact. 2) Doesn't support consistency between stages, e.g. a shard index can be changed between STAGE_EXECUTE_QUERY and STAGE_GET_FIELDS What does this mean or where can I find out what it means? STAGE_EXECUTE_QUERY finds the ids of matching documents. STAGE_GET_FIELDS retrieves the fields of matching documents. A change to a document could possibly happen inbetween, and one would end up retrieving a document that no longer matched the query. In practice, this is rarely an issue. -Yonik
Re: Re[2]: Feature idea - delete and commit from web interface ?
The implementation may provide a form where the user can type in a doc id to delete, or a Lucene query. If it has to be a POST, so be it, but let us have the functionality. --Noble On Thu, Jun 19, 2008 at 2:40 AM, Craig McClanahan [EMAIL PROTECTED] wrote: On Wed, Jun 18, 2008 at 1:55 PM, JLIST [EMAIL PROTECTED] wrote: Sounds like the web designer's fault. No permission check and no confirmation for deletion? Nope ... application designer's fault for misusing the web. Allowing deletes on a GET violates HTTP/1.1 requirements (not just RESTful ones) that GET requests not have side effects. [...] -- --Noble Paul
Re: Slight issue with classloading and DataImportHandler
Hi, I am actually providing the fully qualified classname in the configuration and I was still getting a ClassNotFoundException. If you look at the code in SolrResourceLoader, they actually explicitly add the jars in solr-home/lib to the classloader:

static ClassLoader createClassLoader(File f, ClassLoader loader) {
  if( loader == null ) {
    loader = Thread.currentThread().getContextClassLoader();
  }
  if (f.canRead() && f.isDirectory()) {
    File[] jarFiles = f.listFiles();
    URL[] jars = new URL[jarFiles.length];
    try {
      for (int j = 0; j < jarFiles.length; j++) {
        jars[j] = jarFiles[j].toURI().toURL();
        log.info("Adding '" + jars[j].toString() + "' to Solr classloader");
      }
      return URLClassLoader.newInstance(jars, loader);
    } catch (MalformedURLException e) {
      SolrException.log(log, "Can't construct solr lib class loader", e);
    }
  }
  log.info("Reusing parent classloader");
  return loader;
}

This seems to me to be why my class is now found when I include my utilities jar in solr-home/lib. Thanks Brendan

On Jun 18, 2008, at 11:49 PM, Noble Paul നോബിള് नोब्ळ् wrote: hi, DIH does not load class using the SolrResourceLoader. It tries a Class.forName() with the name you provide; if it fails it prepends org.apache.solr.handler.dataimport. and retries. This is true for not just Transformers but also for EntityProcessor, DataSource and Evaluator. The reason for doing so is that we do not use any of the 'solr.' packages in DIH. All our implementations fall into the default package and we can directly use them w/o the package name. So, if you are writing your own implementations, use the default package or provide the fully qualified class name. --Noble

On Thu, Jun 19, 2008 at 8:09 AM, Jon Baer [EMAIL PROTECTED] wrote: Thanks. Yeah, took me a while to figure out I needed to do something like transformer="com.mycompany.solr.MyTransformer" on the entity before it would work ... - Jon

On Jun 18, 2008, at 1:51 PM, Brendan Grainger wrote: Hi, I set up the new DataImportHandler last night to replace some custom import code I'd written and so far I'm loving it, thank you. I had one issue you might want to know about. I have some Solr extensions I've written and packaged in a jar which I place in solr-home/lib as per: http://wiki.apache.org/solr/SolrPlugins#head-59e2685df65335e82f8936ed55d260842dc7a4dc This works well for my handlers, but a custom Transformer I wrote and packaged the same way was throwing a ClassNotFoundException. I tracked it down to the DocBuilder.loadClass method, which was just doing a Class.forName. Anyway, I fixed it for the moment by (probably doing something stupid and) creating a SolrResourceLoader (which I imagine could be an instance variable, but at 3am I just wanted to get it working). Anyway, this fixes the problem:

@SuppressWarnings("unchecked")
static Class loadClass(String name) throws ClassNotFoundException {
  SolrResourceLoader loader = new SolrResourceLoader( null );
  return loader.findClass(name);
  // return Class.forName(name);
}

Brendan -- --Noble Paul
Re: Slight issue with classloading and DataImportHandler
aah!. We always assumed that people put the custom jars in the WEB-INF/lib folder of the solr webapp and hence they are automatically in the classpath; we shall make the necessary changes. --Noble On Thu, Jun 19, 2008 at 10:06 AM, Brendan Grainger [EMAIL PROTECTED] wrote: Hi, I am actually providing the fully qualified classname in the configuration and I was still getting a ClassNotFoundException. If you look at the code in SolrResourceLoader, they actually explicitly add the jars in solr-home/lib to the classloader: [...] -- --Noble Paul
Re: Slight issue with classloading and DataImportHandler
: aah!. We always assumed that people put the custom jars in the : WEB-INF/lib folder of solr webapp and hence they are automatically in : the classpath we shall make the necessary changes . It would be better to use the classloader from the SolrResourceLoader ... that should be safe for anyone with any setup. DIH does not load class using the SolrResourceLoader. It tries a Class.forName() with the name you provide if it fails it prepends org.apache.solr.handler.dataimport. and retries. ... The reason for doing so is that we do not use any of the 'solr.' packages in DIH. All our implementations fall into the default package and we can directly use them w/o the package name. FWIW: there isn't really a solr. package ... solr. can be used as a short-form alias for the likely package when Solr resolves classes, where the likely package varies by context and there can be multiple options that it tries in order. DIH could do the same thing, letting the short form solr. signify that Transformers, Evaluators, etc. are in the o.a.s.handler.dataimport package. The advantage of this over what it sounds like DIH currently does is that if there is an o.a.s.handler.dataimport.WizWatTransformer but someone wants to write their own (package-less) WizWatTransformer, they can, and refer to it simply as WizWatTransformer (whereas to use the one that ships with DIH they would specify solr.WizWatTransformer). There's no ambiguity as to which one someone means unless they create a package called solr ... but then they'd just be looking for trouble :) -Hoss
Re: Seeking suggestions - keyword related site promotion
Is there a fixed set of keywords? If so, I suppose you could simply index these keywords into a field for each site (either through some kind of automatic parser or manually - from personal experience I would recommend manually unless you have tens of thousands of these things), and then search that field with each word in the query (with or). Any site that had one of these keywords would match it if it were used in the query... If there is no list here and you're just indexing all the content of all these sites... isn't that what Nutch is designed for? -- Steve On Jun 18, 2008, at 11:05 PM, JLIST wrote: Hi all, This is what I'm trying to do: since some sources (say, some web sites) are more authoritative than other sources on certain subjects, I'd like to promote those sites when the query contains certain keywords. I'm not sure what is the best way to implement this. I suppose I can index the keywords in a field for all pages from that site but this isn't very efficient, and any changes in the keyword list would require re-indexing all pages of that site. I wonder if there is a more efficient way that can dynamically promote sites from a domain that is considered more related to the queries. Any suggestion is welcome. Thanks, Jack
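One way to express the keyword-field idea above as an actual query, sketched with curl (the field name site_keywords, the query term, and the boost value are hypothetical; this variant uses the dismax handler's bq boost query rather than a plain OR clause, and %5E is the URL-encoded ^ boost operator):

curl 'http://localhost:8983/solr/select?qt=dismax&q=mortgage&bq=site_keywords:mortgage%5E5.0'

Sites whose site_keywords field matches the query term get a score boost, which promotes them in the results without excluding other matching documents.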