Re: Embedded about 50% faster for indexing
On 24-Aug-07, at 2:29 PM, Wu, Daniel wrote:

> > One thing I'd like to avoid is everyone trying to embed just for
> > performance gains. If there is really that much difference, then we
> > need a better way for people to get that without resorting to Java
> > code.
> >
> > -Yonik
>
> Theoretically and practically, embedded solution will be faster than
> going through http/xml.

This is only true if the HTTP interface adds significant overhead to the cost of indexing a document, and I don't see why it should, since indexing is relatively heavyweight. Setting up the connection could be expensive, but that can be greatly mitigated by sending more than one doc per HTTP request, using persistent connections, and threading.

-Mike
RE: Embedded about 50% faster for indexing
The embedded approach is at http://wiki.apache.org/solr/EmbeddedSolr

For my testing I have a tunable setting for records to submit, and did 10 per batch. Both approaches committed after every 1000 records, also tunable. A custom Lucene implementation I helped build was even faster than embedded, using a RAM drive as a double buffer; however, that required a much larger memory footprint. The embedded classes have little to no documentation and almost look like stub implementations, but they work well.

While this project owes its success in large part to how easy it is to integrate with non-Java clients, I would actually like to see it become more Java friendly, e.g. with a reference indexing implementation. There are a lot of tools that could be more widely useful, like SimplePostTool. With a few API changes it could serve the demo as well as being a useful library. Instead, I extended it, then had to abandon that and resort to cut-and-paste reuse in the end. The functionality was 95% there; it just needed API tweaks to make it reusable.

It also seems unusual to expose fields directly instead of using accessors in the Java code. Accessors give a lot of flexibility that field access doesn't have. It would also be nice to be able to get Java objects back besides XML and JSON, like an embedded equivalent for search. That way you could integrate more easily with Spring MVC, etc. There may also be some performance gains there.

Paul Sundling

-----Original Message-----
From: Yonik Seeley
Subject: Re: Embedded about 50% faster for indexing

> On 8/24/07, Sundling, Paul wrote:
> > Created two indexer implementations to test HTTP Post versus Embedded
> > and the performance was 54.6% faster on embedded.
> >
> > Thought others might find that interesting that are using Java.
>
> Paul, were the documents posted one-per-message, or did you try
> multiple (like 50 to 100) per message? If one per message, the best way
> to increase performance is to have multiple threads adding docs. I'd be
> curious to know how a single CSV file would clock in as well...
>
> One thing I'd like to avoid is everyone trying to embed just for
> performance gains. If there is really that much difference, then we
> need a better way for people to get that without resorting to Java
> code.
>
> -Yonik
RE: Embedded about 50% faster for indexing
> -----Original Message-----
> From: Yonik Seeley
> Subject: Re: Embedded about 50% faster for indexing
>
> One thing I'd like to avoid is everyone trying to embed just for
> performance gains. If there is really that much difference, then we
> need a better way for people to get that without resorting to Java
> code.
>
> -Yonik

Theoretically and practically, an embedded solution will be faster than going through HTTP/XML.

I would like to see Solr have some sort of document-source adapter architecture that iterates through all the documents available in a document source. That way, if the documents come from a database, for example, it could simply be a SQL query in the Solr configuration file.

Daniel
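Daniel's adapter idea can be sketched as an iterator driven by a configurable SQL query. The database file, table, columns, and the hypothetical `add_to_index` call below are all invented for illustration; the point is only that a document source reduces to one configurable query string:

```python
import sqlite3

def iter_documents(db_path, sql):
    """Yield each row of the configured SQL query as a {column: value} dict,
    ready to be turned into index documents."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # rows become name-addressable
    try:
        for row in conn.execute(sql):
            yield dict(row)
    finally:
        conn.close()

# Hypothetical usage -- the indexing call is not a real API:
# for doc in iter_documents("catalog.db", "SELECT id, title FROM products"):
#     add_to_index(doc)
```

Only the SQL string changes per document source, which is exactly what a configuration-file-driven adapter would expose.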
Re: Embedded about 50% faster for indexing
On 8/24/07, Sundling, Paul wrote:
> Created two indexer implementations to test HTTP Post versus Embedded
> and the performance was 54.6% faster on embedded.
>
> Thought others might find that interesting that are using Java.

Paul, were the documents posted one-per-message, or did you try multiple (like 50 to 100) per message? If one per message, the best way to increase performance is to have multiple threads adding docs. I'd be curious to know how a single CSV file would clock in as well...

One thing I'd like to avoid is everyone trying to embed just for performance gains. If there is really that much difference, then we need a better way for people to get that without resorting to Java code.

-Yonik
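Sending multiple documents per request, as Yonik suggests, amounts to wrapping a batch of documents in a single `<add>` element before posting. A minimal sketch; the batch size of 100 and the update URL are assumptions, not values from the thread:

```python
from xml.sax.saxutils import escape

def docs_to_add_xml(docs):
    """Render a list of {field: value} dicts as one Solr <add> payload."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append('<field name="%s">%s</field>'
                         % (escape(name), escape(str(value))))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)

def post_batches(docs, batch_size=100,
                 url="http://localhost:8983/solr/update"):
    """POST docs in batches so each HTTP round trip carries many documents."""
    import urllib.request
    for i in range(0, len(docs), batch_size):
        payload = docs_to_add_xml(docs[i:i + batch_size]).encode("utf-8")
        req = urllib.request.Request(
            url, data=payload,
            headers={"Content-Type": "text/xml; charset=utf-8"})
        urllib.request.urlopen(req).read()
```

Combined with persistent connections and a few posting threads, this amortizes the per-request overhead that makes one-doc-per-POST look slow.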
Re: Embedded about 50% faster for indexing
Sorry, I'm new to the topic; can you point me to the embedded approach?

thanks,
-Hui

On 8/24/07, Sundling, Paul wrote:
>
> Created two indexer implementations to test HTTP Post versus Embedded
> and the performance was 54.6% faster on embedded.
>
> Thought others might find that interesting that are using Java.
>
> Paul Sundling

--
Regards,
-Hui
RE: clear index
If that happens, then using that specific query should be added to the FAQ as the way to clear an index.

Paul Sundling

-----Original Message-----
From: Chris Hostetter
Subject: RE: clear index

: I'm just seeing if there's an easy/performant way of doing it with Solr.
: For a solution with raw Lucene, creating a new index with the same
: directory cleared out an old index (even on Windows with its file
: locking) quickly.

there has been talk of optimizing delete-by-query in the case of *:* to just reopen the index with the create flag set to true ... there just hasn't been a patch yet.

-Hoss
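The "specific query" in question is the match-all query *:*, so clearing an index is two posts to the update handler: a delete-by-query followed by a commit. A sketch; the URL is an assumption about a default setup:

```python
import urllib.request

def clear_index_payloads():
    """The two update messages that wipe a Solr index: delete everything
    matching the match-all query, then commit so searchers see it."""
    return "<delete><query>*:*</query></delete>", "<commit/>"

def clear_index(url="http://localhost:8983/solr/update"):
    """POST the delete and the commit to the update handler in order."""
    for payload in clear_index_payloads():
        req = urllib.request.Request(
            url, data=payload.encode("utf-8"),
            headers={"Content-Type": "text/xml; charset=utf-8"})
        urllib.request.urlopen(req).read()
```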
Embedded about 50% faster for indexing
Created two indexer implementations to test HTTP Post versus Embedded, and indexing was 54.6% faster with the embedded approach. I thought others who are using Java might find that interesting.

Paul Sundling
Re: Index HotSwap
On 8/21/07, Chris Hostetter wrote:
>
> : I'm wondering what's the best way to completely change a big index
> : without losing any requests.
>
> use the snapinstaller script -- or adopt the same atomic copying
> approach it uses.

I'm having a look :)

> : - Between the two mv's, the directory dir does not exist, which can
> : cause some solr failure.
>
> this shouldn't cause any failure unless you tell Solr to try and
> reload during the move (ie: you send it a commit) ... either way an
> atomic copy in place of a mv should work much better.

Why, does the reloading of the searcher trigger a reloading of the files from disk?

Thx!

> -Hoss

--
Jerome Eteve.
http://jerome.eteve.free.fr/
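The atomic-swap idea can be mimicked with a symlink: build the new index in its own directory, point a temporary symlink at it, then rename() the symlink over the live one. rename() is atomic on POSIX, so readers always see either the old or the new index, never a missing directory. This is a sketch of the general technique, not what snapinstaller literally does (it installs hard-linked snapshots):

```python
import os

def swap_index(live_link, new_dir):
    """Atomically repoint the symlink `live_link` at `new_dir` (POSIX only:
    rename() over an existing symlink replaces it in one step)."""
    tmp = live_link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(os.path.abspath(new_dir), tmp)
    os.rename(tmp, live_link)  # atomic replacement of the symlink
```

Solr (or raw Lucene) would be configured to open the index through the symlink path, and a commit after the swap picks up the new files.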
Re: Effects of changing schema?
On 8/24/07, David Whalen wrote:
> I'm unclear on whether changing the schema.xml file automatically
> causes a reindex or not. If I'm adding a field to the schema (and
> removing some unused ones), does solr do the reindex? Or do I have
> to kick it off myself?

No... changing the schema does nothing to the index; it only affects how new documents are indexed and how the index is searched. If you make a backward-compatible change (like adding a new field, or adding some more query-side synonyms) then you don't have to reindex. When in doubt, it's best to reindex.

-Yonik
Re: How do I best store and retrieve ISO country codes?
Thanks Yonik! Cheers for the extra advice too.

On 24 Aug 2007, at 17:14, Yonik Seeley wrote:
> Right, type "text" by default in Solr has stopword removal and
> stemming (see the fieldType definition in schema.xml). A string would
> give you exact values with no analysis at all.
Re: How do I best store and retrieve ISO country codes?
On 8/24/07, Simon Peter Nicholls wrote:
> I've just noticed that for ISO 2-character country codes such as "BE"
> and "IT", my queries are not working as expected.
>
> The field is being stored as country_t, dynamically from acts_as_solr
> v0.9, as follows (from schema.xml):
>
> The thing that sprang to my mind was that BE and IT are also valid
> words, and perhaps Solr is doing something I'm not expecting (ignoring
> them, which would make sense mid-text). With this in mind, perhaps an
> _s type of field is needed, since it is indeed a single important
> string rather than text composed of many strings.

Right: type "text" by default in Solr has stopword removal and stemming (see the fieldType definition in schema.xml). A string field would give you exact values with no analysis at all. If you want case-insensitive matches, start with a text field and configure it with a keyword tokenizer followed by a lowercase filter. If the value can have multiple words, a whitespace tokenizer followed by a lowercase filter would fit the bill.

-Yonik
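As a sketch, the recipe Yonik describes (keyword tokenizer plus lowercase filter, so the whole value is one case-insensitive token) looks like this as a schema.xml fieldType; the name string_lc is an invented example:

```xml
<fieldType name="string_lc" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

For the multi-word variant, swap KeywordTokenizerFactory for WhitespaceTokenizerFactory.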
How do I best store and retrieve ISO country codes?
Hi all,

I've just noticed that for ISO 2-character country codes such as "BE" and "IT", my queries are not working as expected. The field is being stored as country_t, dynamically from acts_as_solr v0.9, as follows (from schema.xml):

The thing that sprang to my mind was that BE and IT are also valid words, and perhaps Solr is doing something I'm not expecting (ignoring them, which would make sense mid-text). With this in mind, perhaps an _s type of field is needed, since it is indeed a single important string rather than text composed of many strings.

Am I on the right track here? Can anyone give me some quick advice on how best to store and query enumeration values and ISO codes in Solr? I hope to try these string-field changes when I get back to my dev environment, but that will be next week, and it's preying on my mind. Any help would be gratefully received!

Thanks, Si
Effects of changing schema?
Hi all. I'm unclear on whether changing the schema.xml file automatically causes a reindex or not. If I'm adding a field to the schema (and removing some unused ones), does Solr do the reindex, or do I have to kick it off myself? Ideally, we'd like to avoid a reindex...

Thanks!
DW
solr.py facet.field question
Hi, how can I specify more than one facet.field in the query method from solr.py? These attempts don't work:

res = c.query(q="Klaus", facet="true", facet_limit="-1", facet_field=['Creator','system'])
res = c.query(q="Klaus", facet="true", facet_limit="-1", facet_field='Creator', facet_field='system')

thanks in advance,
Christian
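For what it's worth, the second attempt is a Python syntax error (a keyword argument can't be repeated), and on the wire Solr simply expects facet.field to appear once per field. If the client can't express that, you can build the parameter string yourself; the field names come from the question, everything else is an assumption about the setup:

```python
from urllib.parse import urlencode

def facet_params(q, fields, limit=-1):
    """Build a Solr query string with facet.field repeated per field,
    as the HTTP API expects."""
    params = [("q", q), ("facet", "true"), ("facet.limit", str(limit))]
    params += [("facet.field", f) for f in fields]
    return urlencode(params)

# The resulting string can be appended to the select URL, e.g.:
# http://localhost:8983/solr/select?<facet_params(...)>
```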
Re: How to extract constrained fields from query
On Thu, 2007-08-23 at 10:44 -0700, Chris Hostetter wrote:
> : Probably I'm also interested in PrefixQueries, as they also provide a
> : Term, e.g. parsing "ipod AND brand:apple" gives a PrefixQuery for
> : "brand:apple".
>
> uh? ... it shouldn't, not unless we're talking about some other
> customization you've already made.

My fault, this is returned for something like "brand:appl*" -- but perhaps I would also like to facet on such fields then...

> : I want to do something like "dynamic faceting" -- so that the solr
> : client does not have to request facets via facet.field, but that I
> : can decide in my CustomRequestHandler which facets are returned. But
> : I want to return only facets for fields that are not already
> : constrained, e.g. when the query contains something like
> : "brand:apple" I don't want to return a facet for the field "brand".
>
> Hmmm, i see ... well the easiest way to go is not to worry about it
> when parsing the query; when you go to compute facets for all the
> fields you think might be useful, you'll see that only one value for
> "brand" matches, and you can just skip it.

I would think that this is not the best option in terms of performance.

> that doesn't really work well for range queries -- but you can't
> exactly use the same logic for picking what your facet constraints
> will be on a field that makes sense to do a range query on anyway, so
> it's tricky either way.
>
> the custom QueryParser is still probably your best bet...
>
> : Ok, so I would override getFieldQuery, getPrefixQuery, getRangeQuery
> : and getWildcardQuery(?) and record the field names? And I would use
> : this QueryParser for both parsing of the query (q) and the filter
> : queries (fq)?
>
> yep.

Alright, then I'll choose this door.

> (Also note there is an extractTerms method on Query that can help in
> some cases, but the impl for ConstantScoreQuery (which is used when
> the SolrQueryParser sees a range query or a prefix query) doesn't
> really work at the moment.)

Yep, I had already tried this, but it always failed with an UnsupportedOperationException...

Thanx a lot, cheers,
Martin

--
Martin Grotzke
http://www.javakaffee.de/blog/