Re: production solr - app server choice ?

2007-03-10 Thread Erik Hatcher


On Mar 9, 2007, at 6:46 AM, rubdabadub wrote:

On 3/9/07, Erik Hatcher [EMAIL PROTECTED] wrote:

We use Jetty on a few applications with no problem.  I recommend it
unless and until you outgrow it (but I doubt you will).  Resin, in
my past experience with it, is fantastic, but no need to even go
there until you outgrow Jetty, I don't think.  lucenebook.com, for
example, is entirely driven by Jetty.


Is it the Collex/NINES site, where you have more than 4 million docs,
that you are running on Jetty?


No.  At NINES - http://www.nines.org/collx - we have just over 60k
documents currently (see the number in the footer).  The index of the
UVa library (3.7M records) is not currently deployed anywhere other
than on my laptop.


The number of documents shouldn't matter as far as which app server
you use.  Though I'm not really sure what the variables would be in
determining which app server is best with Solr, I don't think
you'll go wrong with Jetty, Tomcat, or Resin - all will respond from
Solr quite rapidly, provided you take care of the core Solr caching
concerns and set the JVM properties with enough heap and such to
operate smoothly.
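To make the heap advice above concrete, here is a dry-run sketch that just prints the launch command for the stock Jetty example; the -Xms/-Xmx values are purely illustrative assumptions and must be tuned to your own index and cache sizes.

```shell
#!/bin/bash
# Illustrative values only -- tune -Xms/-Xmx to your index and caches.
JAVA_OPTS="-Xms512m -Xmx1024m"
# From Solr's example/ directory you would then run the command below;
# here we only print it instead of starting a server.
echo "java ${JAVA_OPTS} -jar start.jar"
```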



I have a lot of docs (20 million, with about 25 fields per doc),
which is why I worry.  But I don't think my QPS will be as high as I
hoped, so Jetty should be just fine.


Testing is the best way to find out, and it's fairly easy to switch
app servers and re-test.  Again, I'd be surprised if the choice of
app server has much bearing on performance in your case.


Erik





Re: production solr - app server choice ?

2007-03-10 Thread Bertrand Delacretaz

On 3/9/07, rubdabadub [EMAIL PROTECTED] wrote:


...The site is a local portal and the traffic is very high and I am not
sure if Jetty is enough maybe it is


Just an additional note on this: asking four people what "very high
traffic" means might also give you five different answers ;-)

FWIW, I've been testing Solr on the plain Jetty example config at more
than 100 semi-random queries per second and it ran just fine, on a
medium-range server (dual Xeon 2Ghz IIRC).

But this is with our data and our type of queries - I agree with Erik
that testing is the only way to find out how your setup will perform
with your own data and queries.

Simply generating a lot of semi-random requests from a collection of
possible query parameters, and feeding the resulting URLs to multiple
instances of curl or wget to generate some load, will tell you a lot
about how your setup performs, and where the hotspots are.
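A minimal sketch of the load-generation idea above; the query terms, rows values, host, and port are all assumptions, so substitute parameters that actually exist in your schema.

```shell
#!/bin/bash
# Build semi-random query URLs from pools of possible parameters.
TERMS=(billede upload email formular solr)
ROWS=(10 20 50)
: > urls.txt
for i in $(seq 1 20); do
  q=${TERMS[$((RANDOM % ${#TERMS[@]}))]}
  r=${ROWS[$((RANDOM % ${#ROWS[@]}))]}
  echo "http://localhost:8983/solr/select?q=${q}&rows=${r}" >> urls.txt
done
# With a running Solr, feed the URLs to several curl workers, e.g.:
#   xargs -n 1 -P 8 curl -s -o /dev/null < urls.txt
wc -l < urls.txt
```

While the load runs, watching cache hit rates (for example on Solr's admin statistics page) should point at the hotspots Bertrand mentions.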

-Bertrand


Re: production solr - app server choice ?

2007-03-10 Thread James liu

I use Jetty and Tomcat 6 under Win2003.

They both work well.




2007/3/10, Bertrand Delacretaz [EMAIL PROTECTED]:


On 3/9/07, rubdabadub [EMAIL PROTECTED] wrote:

 ...The site is a local portal and the traffic is very high and I am not
 sure if Jetty is enough maybe it is

Just an additional note on this: asking four people about what very
high traffic means might also give you five different answers ;-)

FWIW, I've been testing Solr on the plain Jetty example config at more
than 100 semi-random queries per second and it ran just fine, on a
medium-range server (dual Xeon 2Ghz IIRC).

But this is with our data and our type of queries - I agree with Erik
that testing is the only way to find out how your setup will perform
with your own data and queries.

Simply generating a lot of semi-random requests from a collection of
possible query parameters, and feeding the resulting URLs to multiple
instances of curl or wget to generate some load, will tell you a lot
about how your setup performs, and where the hotspots are.

-Bertrand





--
regards
jl


Re: production solr - app server choice ?

2007-03-10 Thread rubdabadub

Thanks for the feedback! I was planning to test, but I wanted to know what
others were using. I have been using Tomcat extensively but got tired of it (no
technical reason).

Jetty sounds almost too simple, so I thought I'd ask :-) I've never tried Resin,
but it has a good reputation.

The local portal is using Tomcat and it serves approximately 20 req/second at
peak times. I don't know how high a load that is, as I have no other
reference. I know for sure the local portal is no Google :-)

I think, as Erik mentioned, it's probably the Solr config that will increase
or decrease performance. I am currently reading up on and testing against the
performance pages. Any other advice is always welcome.

Thanks again for all the input.

On 3/10/07, James liu [EMAIL PROTECTED] wrote:

I use jetty and tomcat 6 under win2003.

They all work well.




2007/3/10, Bertrand Delacretaz [EMAIL PROTECTED]:

 On 3/9/07, rubdabadub [EMAIL PROTECTED] wrote:

  ...The site is a local portal and the traffic is very high and I am not
  sure if Jetty is enough maybe it is

 Just an additional note on this: asking four people about what very
 high traffic means might also give you five different answers ;-)

 FWIW, I've been testing Solr on the plain Jetty example config at more
 than 100 semi-random queries per second and it ran just fine, on a
 medium-range server (dual Xeon 2Ghz IIRC).

 But this is with our data and our type of queries - I agree with Erik
 that testing is the only way to find out how your setup will perform
 with your own data and queries.

 Simply generating a lot of semi-random requests from a collection of
 possible query parameters, and feeding the resulting URLs to multiple
 instances of curl or wget to generate some load, will tell you a lot
 about how your setup performs, and where the hotspots are.

 -Bertrand




--
regards
jl



Adding data as UTF-8

2007-03-10 Thread Morten Fangel
Hi,

I've been working on adding some Solr-integration into my current project, but 
have run into a problem with non-ascii characters.

I send a document like the following:

---
<?xml version="1.0" encoding="UTF-8"?>
<add><doc>
  <field name="question_id">228</field>
  <field name="question_title">Vedhæft billede til min formular</field>
  <field name="userid">26</field>
  <field name="question_text">Jeg har lavet en side som skal info om
værkstedet Badsetuen i Odense, som er under kraftig omlægning af kommunen -
dvs nedskæring.
Jeg har her oprettet en formular hvor brugere kan sende en tekst på email om
deres håndværk udført på stedet.
Jeg mangler et felt til at vedhæfte billede http://www.badstuen.dannyboyd.dk/

Nogle ideer ?</field>
  <field name="question_date">2006-05-17T08:44:23Z</field>
  <field name="question_tags">Upload</field>
  <field name="question_tags">HTML</field>
  <field name="question_tags">Email</field>
  <field name="question_tags">Vedhæftning</field>
</doc></add>
---

But when I do a search like /solr/select/?q=billede (the default search
field is "text", which is a multiValued copyField from question_title and
question_text),

I will get the document back as

---
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
 ...
</lst>
<result name="response" numFound="1" start="0">
 <doc>
  <date name="question_date">2006-05-17T08:44:23Z</date>
  <int name="question_id">228</int>
  <arr name="question_tags"><str>Upload</str><str>HTML</str><str>Email</str>
<str>Vedhæftning</str></arr>
  <str name="question_text">Jeg har lavet en side som skal info om værkstedet
Badsetuen i Odense, som er under kraftig omlægning af kommunen - dvs
nedskæring.
Jeg har her oprettet en formular hvor brugere kan sende en tekst på email om
deres håndværk udført på stedet.
Jeg mangler et felt til at vedhæfte billede http://www.badstuen.dannyboyd.dk/

Nogle ideer ?</str>
  <str name="question_title">Vedhæft billede til min formular</str>
  <str name="userid">26</str>
 </doc>
</result>
</response>
---

Which is basically the same text, but displayed as ISO-8859-1. How can this be?
Do I have to send some header saying it is UTF-8, or should I just send
the data as UTF-8? (That produces the correct encoding in answers, but sounds
like a silly way of doing it.)

Any ideas?

Btw, the install script listed at http://wiki.apache.org/solr/SolrTomcat is a
bit wrong. Should I just contribute the fixes (new solr dir and name to
fetch) to the wiki, or would any of you rather do it yourselves?

Regards
 -fangel


Re: Adding data as UTF-8

2007-03-10 Thread Morten Fangel
On Saturday 10 March 2007 21:39, Bertrand Delacretaz wrote:
 On 3/10/07, Morten Fangel [EMAIL PROTECTED] wrote:
  ...I send a document like the following:
 
  ---
  <?xml version="1.0" encoding="UTF-8"?>...

 I assume you're using your own code to send the document?
Indeed. Solr will be integrated (almost) transparently into my framework. ;)

It'll work pretty much like the acts_as_solr RoR plugin, if I'm not
totally mistaken about that particular implementation..

 Currently you need to include a Content-type: text/xml;
 charset=UTF-8 header in your HTTP POST request, and (as you're doing)
 the XML needs to be encoded in UTF-8.
Super. Indeed that fixed it, yes...

-fangel
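For reference, a minimal sketch of the fix described above: write a UTF-8 document and post it with an explicit charset in the Content-type header. The field values reuse the example from earlier in the thread; the URL and port assume the stock Jetty example setup.

```shell
#!/bin/bash
# Write a small UTF-8 add-document (note the non-ASCII characters).
cat > doc.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<add><doc>
  <field name="question_id">228</field>
  <field name="question_title">Vedhæft billede til min formular</field>
</doc></add>
EOF
# With Solr running, post it with the charset stated explicitly:
#   curl http://localhost:8983/solr/update --data-binary @doc.xml \
#        -H 'Content-type: text/xml; charset=UTF-8'
grep -c 'field' doc.xml   # sanity check: the file was written
```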



Re: Adding data as UTF-8

2007-03-10 Thread Walter Underwood
It is better to use application/xml. See RFC 3023.
Using "text/xml; charset=UTF-8" will override the XML
encoding declaration; "application/xml" will not.

wunder

On 3/10/07 12:39 PM, Bertrand Delacretaz [EMAIL PROTECTED] wrote:

 On 3/10/07, Morten Fangel [EMAIL PROTECTED] wrote:
 
 ...I send a document like the following:
 
 ---
 ?xml version=1.0 encoding=UTF-8?...
 
 I assume you're using your own code to send the document?
 
 Currently you need to include a Content-type: text/xml;
 charset=UTF-8 header in your HTTP POST request, and (as you're doing)
 the XML needs to be encoded in UTF-8.
 
 See the source code of
 src/java/org/apache/solr/util/SimplePostTool.java for example.
 
 -Bertrand



Re: Adding data as UTF-8

2007-03-10 Thread Morten Fangel
On Saturday 10 March 2007 22:18, Walter Underwood wrote:
 It is better to use application/xml. See RFC 3023.
 Using text/xml; charset=UTF-8 will override the XML
 encoding declaration. application/xml will not.
Thanks for the info. I've changed the header accordingly.

-fangel


Re: Adding data as UTF-8

2007-03-10 Thread Bertrand Delacretaz

On 3/10/07, Walter Underwood [EMAIL PROTECTED] wrote:

It is better to use application/xml. See RFC 3023.
Using text/xml; charset=UTF-8 will override the XML
encoding declaration. application/xml will not...


I agree, but did you try this with our example setup, started with
java -jar start.jar?

It doesn't seem to work here: If I change our example/exampledocs/post.sh to use

  curl $URL --data-binary @$f -H 'Content-type:application/xml'

instead of

 curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8'

the encoding declaration of my posted XML is ignored, characters are
interpreted according to my JVM encoding (-Dfile.encoding makes a
difference in that case).

Are you seeing something different, or do you know why this is so?

-Bertrand


Re: Adding data as UTF-8

2007-03-10 Thread Walter Underwood
If it does something different, that is a bug. RFC 3023 is clear. --wunder

On 3/10/07 1:49 PM, Bertrand Delacretaz [EMAIL PROTECTED] wrote:

 On 3/10/07, Walter Underwood [EMAIL PROTECTED] wrote:
 It is better to use application/xml. See RFC 3023.
 Using text/xml; charset=UTF-8 will override the XML
 encoding declaration. application/xml will not...
 
 I agree, but did you try this with our example setup, started with
 java -jar start.jar?
 
 It doesn't seem to work here: If I change our example/exampledocs/post.sh to
 use
 
curl $URL --data-binary @$f -H 'Content-type:application/xml'
 
 instead of
 
   curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8'
 
 the encoding declaration of my posted XML is ignored, characters are
 interpreted according to my JVM encoding (-Dfile.encoding makes a
 difference in that case).
 
 Are you seeing something different, or do you know why this is so?
 
 -Bertrand



Re: Adding data as UTF-8

2007-03-10 Thread Bertrand Delacretaz

On 3/10/07, Walter Underwood [EMAIL PROTECTED] wrote:

If it does something different, that is a bug. RFC 3023 is clear. --wunder..


Sure - just wanted to confirm what I'm seeing, thanks!

-Bertrand


Question About Boosting.

2007-03-10 Thread shai deljo

How can I boost some tokens over others in the same field at index
time? If this is not supported directly, what's the best way around
the problem (what's the hack to solve this :) )?
Thanks,
Shai
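Not an answer to the index-time part, but one common query-time approximation, sketched under loud assumptions: "important_terms" is a hypothetical field you would populate at index time with only the hand-picked tokens, and the ^4 boost factor is invented for illustration.

```shell
#!/bin/bash
# Matches in the hand-picked "important_terms" field are weighted four
# times higher than matches in the full-text field.
QUERY='important_terms:billede^4 OR question_text:billede'
# With a running Solr you would send it as:
#   curl --get --data-urlencode "q=${QUERY}" http://localhost:8983/solr/select
echo "q=${QUERY}"
```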


Re: Question About Boosting.

2007-03-10 Thread Walter Underwood
What are you trying to achieve? Let's start with the problem
instead of picking one solution that Solr doesn't support. --wunder

On 3/10/07 5:08 PM, shai deljo [EMAIL PROTECTED] wrote:

 How can i boost some tokens over others in the same field (at Index
 time) ? If this is not supported directly, what's the best way around
 this problem (what's the hack to solve this :) ).
 Thanks,
 Shai



Re: Federated Search

2007-03-10 Thread Jed Reynolds

Venkatesh Seetharam wrote:



The hash idea sounds really interesting, and if I had a fixed number of
indexes it would be perfect.
I'm in fact looking around for a reverse-hash algorithm where, given a
docId, I should be able to find which partition contains the document,
so I can save cycles on broadcasting to slaves.


Many large databases partition their data either by load or in another 
logical manner, like by alphabet. I hear that Hotmail, for instance, 
partitions its users alphabetically. Having a broker will certainly 
abstract this mechanism, and of course your application(s) want to be 
able to bypass the broker when necessary.



I mean, even if you use a DB, how have you solved the problem of
distribution when a new server is added into the mix?


http://www8.org/w8-papers/2a-webserver/caching/paper2.html

I saw this link on the memcached list, and the thread surrounding it 
certainly covered some similar ground. Some of the ideas discussed:

- high availability of memcached, redundant entries
- scaling out clusters and facing the need to rebuild the entire cache 
on all nodes, depending on your bucketing

I see some similarities between maintaining multiple indices/Lucene 
partitions and running a memcached deployment: mostly, if you are hashing 
your keys to partitions (or buckets, or machines), then you might be faced 
with a) availability issues if there's a machine/partition outage, and b) 
rebuilding partitions if adding a partition/bucket changes the hash mapping.
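The hash-bucketing problem above can be made concrete with a tiny sketch (the docIds and partition count are invented, and cksum stands in for whatever hash you would actually use): a docId maps to a partition by hash modulo partition count, which is exactly why changing the partition count remaps most documents.

```shell
#!/bin/bash
# Hypothetical docIds; cksum gives a cheap deterministic hash.
PARTITIONS=4
for id in doc-1 doc-2 doc-42; do
  h=$(printf '%s' "$id" | cksum | cut -d' ' -f1)
  echo "$id -> partition $((h % PARTITIONS))"
done
# Bump PARTITIONS to 5 and most ids land on a different partition --
# the rebuild problem discussed above.
```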


Two ways I can think of to scale out new indexes: the first is to have your 
application maintain two sets of bucket mappings from ids to indexes, and 
the second is to key your documents and partition them by date. 
The former method would let you rebuild a second set of 
repartitioned indexes and buckets, then update your 
application to use the new bucket mapping once all the indexes have been 
rebuilt. The latter method would only apply if you could organize your 
document ids by date and only added new documents at the 'now' end, or 
evenly across most dates. You'd have to add a new partition onto the end 
as time progressed, and would rarely rebuild old indexes unless your 
documents grow unevenly.


Interesting topic! I don't yet need to run multiple Lucene partitions, 
but I have a few memcached servers, and I expect that increasing their 
number will force my site to take a performance hit as I am 
forced to rebuild the caches. Similarly, if I had multiple 
Lucene partitions and had to split some of them, rebuilding the 
resulting partitions would be time-intensive, and I'd want to have 
procedures in place for availability, scaling out, and changing 
application code as necessary. Just having one fail-over Solr index is 
so easy in comparison.


Jed