Re: production solr - app server choice ?
On Mar 9, 2007, at 6:46 AM, rubdabadub wrote:

> On 3/9/07, Erik Hatcher [EMAIL PROTECTED] wrote:
>> We use Jetty on a few applications with no problem. I recommend it unless and until you outgrow it (but I doubt you will). Resin, in my past experience with it, is fantastic. But no need to even go there until you outgrow Jetty, I don't think. lucenebook.com, for example, is entirely driven by Jetty.
>
> Is it the collex/NINES where you have more than 4 million docs that you are using Jetty?

No. At NINES - http://www.nines.org/collx - we have just over 60k documents currently (see the number in the footer). The index of the UVa library (3.7M records) is not currently deployed other than on my laptop.

The number of documents shouldn't matter as far as which app server you use, though I'm not really sure what the variables would be in determining which app server is best with Solr. I don't think you'll go wrong with Jetty, Tomcat, or Resin - all will respond from Solr quite rapidly, provided you take care of the core Solr caching concerns and set the JVM properties with enough heap and such to operate smoothly.

> I have a lot of docs, i.e. 20 million, and each has a bunch of fields, i.e. 25 per doc - this is why I worry. But I don't think my qps will be as high as I hoped, so Jetty should be just fine.

Testing is the best way to find out, and it's fairly easy to switch app servers and re-test. Again, I'd be surprised if the choice of app server has much relation to performance in your case.

	Erik
Re: production solr - app server choice ?
On 3/9/07, rubdabadub [EMAIL PROTECTED] wrote:

> ...The site is a local portal and the traffic is very high and I am not sure if Jetty is enough, maybe it is...

Just an additional note on this: asking four people what "very high traffic" means might also give you five different answers ;-)

FWIW, I've been testing Solr on the plain Jetty example config at more than 100 semi-random queries per second and it ran just fine, on a medium-range server (dual Xeon 2GHz IIRC). But this is with our data and our type of queries - I agree with Erik that testing is the only way to find out how your setup will perform with your own data and queries.

Simply generating a lot of semi-random requests from a collection of possible query parameters, and feeding the resulting URLs to multiple instances of curl or wget to generate some load, will tell you a lot about how your setup performs and where the hotspots are.

-Bertrand
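Bertrand's suggestion of generating semi-random requests from a pool of query parameters can be sketched like this (a minimal sketch; the host, field names, and terms are illustrative assumptions, not from this thread):

```python
import random
import urllib.parse

# Pools of possible query parameters to draw from (assumed examples;
# substitute values that make sense for your own schema and data).
TERMS = ["billede", "solr", "jetty", "tomcat", "portal", "email"]
FIELDS = ["text", "question_title", "question_text"]

def random_query_url(base="http://localhost:8983/solr/select"):
    """Build one semi-random Solr query URL from the parameter pools."""
    params = {
        "q": "%s:%s" % (random.choice(FIELDS), random.choice(TERMS)),
        "start": random.choice([0, 10, 20]),
        "rows": 10,
    }
    return base + "?" + urllib.parse.urlencode(params)

if __name__ == "__main__":
    # Write a URL list that can then be fed to multiple curl/wget
    # instances to generate load, e.g.:
    #   xargs -P 8 -n 1 curl -s -o /dev/null < urls.txt
    with open("urls.txt", "w") as f:
        for _ in range(1000):
            f.write(random_query_url() + "\n")
```

Varying the fields, terms, and paging keeps the query cache from answering everything, which is what makes the hotspots show up.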
Re: production solr - app server choice ?
I use Jetty and Tomcat 6 under Win2003. They all work well.

2007/3/10, Bertrand Delacretaz [EMAIL PROTECTED]:

> FWIW, I've been testing Solr on the plain Jetty example config at more than 100 semi-random queries per second and it ran just fine, on a medium-range server (dual Xeon 2GHz IIRC). But this is with our data and our type of queries - I agree with Erik that testing is the only way to find out how your setup will perform with your own data and queries. ...

--
regards
jl
Re: production solr - app server choice ?
Thanks for the feedback! I was planning to test, but I wanted to know what others were using. I have been using Tomcat extensively but got tired of it (no technical reason). Jetty sounds too simple, so I thought I'd ask :-) I've never tried Resin, but it has a good reputation.

The local portal is using Tomcat and it serves approximately 20 req/second at peak times. I don't know how high a load this is, as I have no other reference. I know for sure the local portal is no Google :-) I think, as Erik mentioned, it's probably the Solr config that will increase or decrease performance. I am currently reading up on and testing against the performance pages.

Any other advice is always welcome. Thanks again for all the input.

On 3/10/07, James liu [EMAIL PROTECTED] wrote:

> I use jetty and tomcat 6 under win2003. They all work well. ...
Adding data as UTF-8
Hi,

I've been working on adding some Solr integration into my current project, but have run into a problem with non-ASCII characters. I send a document like the following:

---
<?xml version="1.0" encoding="UTF-8"?>
<add><doc>
<field name="question_id">228</field>
<field name="question_title">Vedhæft billede til min formular</field>
<field name="userid">26</field>
<field name="question_text">Jeg har lavet en side som skal info om værkstedet Badsetuen i Odense, som er under kraftig omlægning af kommunen - dvs nedskæring. Jeg har her oprettet en formular hvor brugere kan sende en tekst på email om deres håndværk udført på stedet. Jeg mangler et felt til at vedhæfte billede http://www.badstuen.dannyboyd.dk/ Nogle ideer ?</field>
<field name="question_date">2006-05-17T08:44:23Z</field>
<field name="question_tags">Upload</field>
<field name="question_tags">HTML</field>
<field name="question_tags">Email</field>
<field name="question_tags">Vedhæftning</field>
</doc></add>
---

But when I do a search like /solr/select/?q=billede (the default search field is "text", which is a multiValued copyField from question_title and question_text), I get the document back as:

---
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">...</lst>
<result name="response" numFound="1" start="0">
<doc>
<date name="question_date">2006-05-17T08:44:23Z</date>
<int name="question_id">228</int>
<arr name="question_tags"><str>Upload</str><str>HTML</str><str>Email</str><str>Vedhæftning</str></arr>
<str name="question_text">Jeg har lavet en side som skal info om værkstedet Badsetuen i Odense, som er under kraftig omlægning af kommunen - dvs nedskæring. Jeg har her oprettet en formular hvor brugere kan sende en tekst pÃ¥ email om deres hÃ¥ndværk udført pÃ¥ stedet. Jeg mangler et felt til at vedhæfte billede http://www.badstuen.dannyboyd.dk/ Nogle ideer ?</str>
<str name="question_title">Vedhæft billede til min formular</str>
<str name="userid">26</str>
</doc>
</result>
</response>
---

Which is basically the same text, but displayed as if it were ISO-8859-1. How can this be?
Do I have to send off some header saying it is UTF-8, or should I just send the data as UTF-8? (That produces the correct encoding in the answers, but sounds like a silly way of doing it.) Any ideas?

Btw, the install script listed at http://wiki.apache.org/solr/SolrTomcat is a bit wrong. Should I just contribute the fixes (new Solr dir and name to fetch) to the wiki, or would any of you guys rather do it yourselves?

Regards
-fangel
Re: Adding data as UTF-8
On Saturday 10 March 2007 21:39, Bertrand Delacretaz wrote:

> On 3/10/07, Morten Fangel [EMAIL PROTECTED] wrote:
>> ...I send a document like the following: --- <?xml version="1.0" encoding="UTF-8"?>...
>
> I assume you're using your own code to send the document?

Indeed. Solr will be integrated (almost) transparently into my framework ;) It'll work pretty much like the acts_as_solr RoR implementation, if I'm not totally mistaken about that particular implementation.

> Currently you need to include a Content-type: text/xml; charset=UTF-8 header in your HTTP POST request, and (as you're doing) the XML needs to be encoded in UTF-8.

Super. Indeed, that fixed it.

-fangel
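The fix Bertrand describes - UTF-8 bytes in the body plus a Content-type header that says so - can be sketched in a few lines (the localhost update URL and the sample fields are illustrative assumptions, not from this thread):

```python
import urllib.request

# A minimal add document; the Danish text exercises the non-ASCII path.
doc = """<?xml version="1.0" encoding="UTF-8"?>
<add><doc>
<field name="question_id">228</field>
<field name="question_title">Vedhæft billede til min formular</field>
</doc></add>"""

# Encode the body as UTF-8 and say so in the Content-type header;
# without the charset parameter, the receiving end may fall back to
# its platform default encoding and mangle the non-ASCII characters.
body = doc.encode("utf-8")
headers = {"Content-type": "text/xml; charset=UTF-8"}

req = urllib.request.Request(
    "http://localhost:8983/solr/update",  # assumed update URL
    data=body,
    headers=headers,
)
# urllib.request.urlopen(req)  # uncomment to post against a running Solr
```

The same pattern applies whatever HTTP client the framework ends up using: the declared charset and the actual byte encoding must agree.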
Re: Adding data as UTF-8
It is better to use application/xml. See RFC 3023. Using text/xml; charset=UTF-8 will override the XML encoding declaration; application/xml will not.

wunder

On 3/10/07 12:39 PM, Bertrand Delacretaz [EMAIL PROTECTED] wrote:

> On 3/10/07, Morten Fangel [EMAIL PROTECTED] wrote:
>> ...I send a document like the following: --- <?xml version="1.0" encoding="UTF-8"?>...
>
> I assume you're using your own code to send the document? Currently you need to include a Content-type: text/xml; charset=UTF-8 header in your HTTP POST request, and (as you're doing) the XML needs to be encoded in UTF-8. See the source code of src/java/org/apache/solr/util/SimplePostTool.java for an example.
>
> -Bertrand
Re: Adding data as UTF-8
On Saturday 10 March 2007 22:18, Walter Underwood wrote:

> It is better to use application/xml. See RFC 3023. Using text/xml; charset=UTF-8 will override the XML encoding declaration. application/xml will not.

Thanks for the info. I've changed the header accordingly.

-fangel
Re: Adding data as UTF-8
On 3/10/07, Walter Underwood [EMAIL PROTECTED] wrote:

> It is better to use application/xml. See RFC 3023. Using text/xml; charset=UTF-8 will override the XML encoding declaration. application/xml will not...

I agree, but did you try this with our example setup, started with "java -jar start.jar"? It doesn't seem to work here: if I change our example/exampledocs/post.sh to use

  curl $URL --data-binary @$f -H 'Content-type:application/xml'

instead of

  curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8'

the encoding declaration of my posted XML is ignored, and characters are interpreted according to my JVM encoding (-Dfile.encoding makes a difference in that case).

Are you seeing something different, or do you know why this is so?

-Bertrand
Re: Adding data as UTF-8
If it does something different, that is a bug. RFC 3023 is clear.

--wunder

On 3/10/07 1:49 PM, Bertrand Delacretaz [EMAIL PROTECTED] wrote:

> I agree, but did you try this with our example setup, started with "java -jar start.jar"? It doesn't seem to work here: if I change our example/exampledocs/post.sh to use Content-type:application/xml instead of Content-type:text/xml; charset=utf-8, the encoding declaration of my posted XML is ignored, and characters are interpreted according to my JVM encoding. ...
>
> -Bertrand
Re: Adding data as UTF-8
On 3/10/07, Walter Underwood [EMAIL PROTECTED] wrote:

> If it does something different, that is a bug. RFC 3023 is clear. --wunder...

Sure - just wanted to confirm what I'm seeing, thanks!

-Bertrand
Question About Boosting.
How can I boost some tokens over others in the same field (at index time)? If this is not supported directly, what's the best way around this problem (what's the hack to solve this :) )?

Thanks,
Shai
Re: Question About Boosting.
What are you trying to achieve? Let's start with the problem instead of picking one solution that Solr doesn't support.

--wunder

On 3/10/07 5:08 PM, shai deljo [EMAIL PROTECTED] wrote:

> How can i boost some tokens over others in the same field (at Index time)? If this is not supported directly, what's the best way around this problem (what's the hack to solve this :) ). Thanks, Shai
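For reference, while per-token boosts aren't available in Solr's update XML, index-time boosts at the document and field level are. One common workaround (an assumption on the poster's intent, not something confirmed in this thread) is to copy the important tokens into a separate field and boost that field instead:

```xml
<add>
  <!-- boost="..." on doc and field is a Solr index-time boost -->
  <doc boost="2.0">
    <field name="title" boost="3.0">the important tokens go here</field>
    <field name="body">the rest of the text</field>
  </doc>
</add>
```

Whether this fits depends on the actual problem, which is why the question above asks for it first.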
Re: Federated Search
Venkatesh Seetharam wrote:

> The hash idea sounds really interesting, and if I had a fixed number of indexes it would be perfect. I'm in fact looking around for a reverse-hash algorithm where, given a docId, I should be able to find which partition contains the document, so I can save cycles on broadcasting slaves.

Many large databases partition their data either by load or in another logical manner, like by alphabet. I hear that Hotmail, for instance, partitions its users alphabetically. Having a broker will certainly abstract this mechanism, and of course your application(s) want to be able to bypass a broker when necessary.

> I mean, even if you use a DB, how have you solved the problem of distribution when a new server is added into the mix?

http://www8.org/w8-papers/2a-webserver/caching/paper2.html

I saw this link on the memcached list, and the thread surrounding it certainly covered some similar ground. Some of the ideas discussed:

- high availability of memcached, redundant entries
- scaling out clusters and facing the need to rebuild the entire cache on all nodes, depending on your bucketing.

I see some similarities between maintaining multiple indices/Lucene partitions and running a memcached deployment: if you are hashing your keys to partitions (or buckets, or machines), then you might be faced with a) availability issues if there's a machine/partition outage, and b) rebuilding partitions if adding a partition/bucket changes the hash mapping.

The ways I can think of to scale out new indexes would be to have your application maintain two sets of bucket mappings from ids to indexes, or to key your documents and partition them by date. The former method would allow you to rebuild a second set of repartitioned indexes and buckets, then update your application to use the new bucket mapping (once all the indexes have been rebuilt).
The latter method would only apply if you could organize your document ids by date, and only added new documents at the 'now' end or evenly across most dates. You'd have to add a new partition onto the end as time progressed, and you'd rarely rebuild old indexes unless your documents grow unevenly.

Interesting topic! I don't yet need to run multiple Lucene partitions, but I have a few memcached servers, and I expect that increasing their number will force my site to take a performance hit as I am forced to rebuild the caches. I can see, similarly, that if I had multiple Lucene partitions and had to fission some of them, rebuilding the resulting partitions would be time-intensive, and I'd want procedures in place for availability, scaling out, and changing application code as necessary. Just having one fail-over Solr index is so easy in comparison.

Jed
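The remapping cost described above - adding a bucket changes the hash mapping for almost every key - is exactly what consistent hashing, as in the paper linked earlier, is meant to limit. A minimal sketch (an illustration of the technique, not code from this thread; partition names are made up):

```python
import bisect
import hashlib

def _hash(key):
    # Stable hash so key placement doesn't change between runs.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class ConsistentHashRing:
    """Map doc ids to partitions; adding a partition moves only ~1/n keys."""

    def __init__(self, partitions, replicas=100):
        self._ring = []  # sorted list of (point, partition)
        self.replicas = replicas
        for p in partitions:
            self.add(p)

    def add(self, partition):
        # Each partition owns many points on the ring, smoothing the load.
        for i in range(self.replicas):
            point = _hash("%s:%d" % (partition, i))
            bisect.insort(self._ring, (point, partition))

    def lookup(self, doc_id):
        # A key belongs to the first ring point at or after its hash.
        point = _hash(doc_id)
        i = bisect.bisect(self._ring, (point, ""))
        if i == len(self._ring):
            i = 0  # wrap around the ring
        return self._ring[i][1]
```

With a modulo scheme, going from 3 to 4 partitions remaps roughly 3 out of 4 keys; with a ring like this, only the keys that fall into the new partition's arcs move, so most existing partitions never need rebuilding.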