Re: Embedded about 50% faster for indexing

2007-08-24 Thread Mike Klaas

On 24-Aug-07, at 2:29 PM, Wu, Daniel wrote:


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf
Of Yonik Seeley
Sent: Friday, August 24, 2007 2:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Embedded about 50% faster for indexing

One thing I'd like to avoid is everyone trying to embed just
for performance gains. If there is really that much
difference, then we need a better way for people to get that
without resorting to Java code.

-Yonik



Theoretically and practically, an embedded solution will be faster than
going through HTTP/XML.


This is only true if the HTTP interface adds significant overhead to
the cost of indexing a document, and I don't see why this should be
so, as indexing is relatively heavyweight.  Setting up the connection
could be expensive, but this can be greatly mitigated by sending more
than one doc per HTTP request, using persistent connections, and
threading.
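(As an illustration of what Mike describes, a minimal Python sketch
that posts many docs per request over one persistent connection; the
URL, field names, and batch size are assumptions, not from this thread:)

import http.client

docs = [{"id": str(i), "name": "doc %d" % i} for i in range(1000)]
BATCH = 100  # docs per HTTP request

def to_add_xml(batch):
    # Build one <add> containing many <doc> elements
    # (no XML escaping, for brevity).
    field = lambda k, v: '<field name="%s">%s</field>' % (k, v)
    doc = lambda d: "<doc>%s</doc>" % "".join(field(k, v) for k, v in d.items())
    return "<add>%s</add>" % "".join(doc(d) for d in batch)

headers = {"Content-Type": "text/xml; charset=utf-8"}
conn = http.client.HTTPConnection("localhost", 8983)  # reused throughout
for i in range(0, len(docs), BATCH):
    conn.request("POST", "/solr/update", to_add_xml(docs[i:i + BATCH]), headers)
    conn.getresponse().read()  # drain the response so the socket can be reused
conn.request("POST", "/solr/update", "<commit/>", headers)
conn.getresponse().read()
conn.close()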


-Mike


RE: Embedded about 50% faster for indexing

2007-08-24 Thread Sundling, Paul
The embedded approach is at http://wiki.apache.org/solr/EmbeddedSolr

For my testing I have a tunable setting for the number of records to
submit per batch and used 10.  Both approaches committed after every
1000 records, also tunable.

A custom Lucene implementation I helped build was even faster than
embedded, using a RAM drive as a double buffer.  However, that did
require a much larger memory footprint.

The embedded classes have little to no documentation and almost look
like stub implementations, but they work well.

While this project will succeed in large part due to how easy it is to
integrate with non-Java clients, I would actually like to see the
project become more Java friendly, e.g. with a reference indexing
implementation.  There are a lot of tools that could be more widely
useful, like SimplePostTool.

With a few API changes it could serve the demo as well as being a
useful library.  Instead, I extended it, then had to abandon that and
resort to cut-and-paste reuse in the end.  The functionality was 95%
there, but just needed API tweaks to make it usable.  It also seems
unusual to expose fields directly instead of using accessors in the
Java code; accessors give a lot of flexibility that direct field
access doesn't have.

It would also be nice to be able to get Java objects back besides XML
and JSON, like an embedded equivalent for search.  That way you could
integrate more easily with Spring MVC, etc.  There may also be some
performance gains there.

Paul Sundling

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Friday, August 24, 2007 2:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Embedded about 50% faster for indexing


On 8/24/07, Sundling, Paul <[EMAIL PROTECTED]> wrote:
> Created two indexer implementations to test HTTP POST versus embedded,
> and the embedded version was 54.6% faster.
>
> Thought others that are using Java might find that interesting.

Paul, were the documents posted one-per-message, or did you try multiple
(like 50 to 100) per message?  If one per message, the best way to
increase performance is to have multiple threads adding docs.

I'd be curious to know how a single CSV file would clock in as
well...

One thing I'd like to avoid is everyone trying to embed just for
performance gains. If there is really that much difference, then we need
a better way for people to get that without resorting to Java code.

-Yonik



RE: Embedded about 50% faster for indexing

2007-08-24 Thread Wu, Daniel


> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf 
> Of Yonik Seeley
> Sent: Friday, August 24, 2007 2:07 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Embedded about 50% faster for indexing
> 
> One thing I'd like to avoid is everyone trying to embed just 
> for performance gains. If there is really that much 
> difference, then we need a better way for people to get that 
> without resorting to Java code.
> 
> -Yonik
> 

Theoretically and practically, an embedded solution will be faster than
going through HTTP/XML.  I would like to see Solr have some sort of
document-source adapter architecture that iterates through all the
documents available in the document source.  That way, if the documents
come from a database, for example, the source could be as simple as a
SQL query in the Solr configuration file.
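(A rough Python sketch of the kind of adapter Daniel is asking for; the
database, table, and column names are hypothetical, and the SQL string
is the only part that would live in configuration:)

import http.client
import sqlite3  # stand-in for whatever database holds the documents

db = sqlite3.connect("catalog.db")
solr = http.client.HTTPConnection("localhost", 8983)
headers = {"Content-Type": "text/xml; charset=utf-8"}

# Iterate the document source; each row becomes one Solr document.
# (Batching, as discussed elsewhere in this thread, would apply here too.)
for doc_id, title, body in db.execute("SELECT id, title, body FROM articles"):
    xml = ('<add><doc><field name="id">%s</field>'
           '<field name="title">%s</field>'
           '<field name="body">%s</field></doc></add>') % (doc_id, title, body)
    solr.request("POST", "/solr/update", xml, headers)
    solr.getresponse().read()

solr.request("POST", "/solr/update", "<commit/>", headers)
solr.getresponse().read()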

Daniel


Re: Embedded about 50% faster for indexing

2007-08-24 Thread Yonik Seeley
On 8/24/07, Sundling, Paul <[EMAIL PROTECTED]> wrote:
> Created two indexer implementations to test HTTP POST versus embedded,
> and the embedded version was 54.6% faster.
>
> Thought others that are using Java might find that interesting.

Paul, were the documents posted one-per-message, or did you try
multiple (like 50 to 100) per message?  If one per message, the best
way to increase performance is to have multiple threads adding docs.
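(A Python sketch of that multi-threaded approach, with a stock Solr
assumed at localhost:8983; one connection per request keeps it short:)

import http.client
from concurrent.futures import ThreadPoolExecutor

def post(xml):
    conn = http.client.HTTPConnection("localhost", 8983)
    conn.request("POST", "/solr/update", xml,
                 {"Content-Type": "text/xml; charset=utf-8"})
    status = conn.getresponse().status
    conn.close()
    return status

# One doc per message here; combining threads with batching is better still.
batches = ['<add><doc><field name="id">%d</field></doc></add>' % i
           for i in range(100)]
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(post, batches))
post("<commit/>")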

I'd be curious to know how a single CSV file would clock in as well...
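(Solr 1.2 ships a CSV loader; a sketch of feeding it a file, where the
file name is made up, the handler is assumed to be mapped at /update/csv
as in the example solrconfig, and commit=true is assumed to be honored:)

import http.client

with open("books.csv", "rb") as f:  # header row of field names, then one doc per line
    body = f.read()

conn = http.client.HTTPConnection("localhost", 8983)
conn.request("POST", "/solr/update/csv?commit=true", body,
             {"Content-Type": "text/plain; charset=utf-8"})
print(conn.getresponse().status)
conn.close()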

One thing I'd like to avoid is everyone trying to embed just for
performance gains.
If there is really that much difference, then we need a better way for
people to get that without resorting to Java code.

-Yonik


Re: Embedded about 50% faster for indexing

2007-08-24 Thread Yu-Hui Jin
Sorry, I'm new to the topic; can you point me to the embedded approach?


thanks,

-Hui

On 8/24/07, Sundling, Paul <[EMAIL PROTECTED]> wrote:
>
> Created two indexer implementations to test HTTP POST versus embedded,
> and the embedded version was 54.6% faster.
>
> Thought others that are using Java might find that interesting.
>
> Paul Sundling
>



-- 
Regards,

-Hui


RE: clear index

2007-08-24 Thread Sundling, Paul
If that happens, then that specific query should be added to the FAQ
as the way to clear an index.
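(For reference, that query is the delete-by-query below; a Python sketch
assuming the stock /solr/update URL:)

import http.client

conn = http.client.HTTPConnection("localhost", 8983)
headers = {"Content-Type": "text/xml; charset=utf-8"}
# Delete everything, then commit so searchers see the empty index.
for body in ("<delete><query>*:*</query></delete>", "<commit/>"):
    conn.request("POST", "/solr/update", body, headers)
    conn.getresponse().read()
conn.close()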

Paul Sundling

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, August 21, 2007 6:41 PM
To: solr-user@lucene.apache.org
Subject: RE: clear index



: I'm just seeing if there's an easy/performant way of doing it with
Solr.
: For a solution with raw Lucene, creating a new index with the same
: directory cleared out an old index (even on Windows with its file
: locking) quickly.

there has been talk of optimizing delete by query in the case of *:* to
just reopen the index with the create flag set to true ... there just
hasn't been a patch yet.



-Hoss




Embedded about 50% faster for indexing

2007-08-24 Thread Sundling, Paul
Created two indexer implementations to test HTTP POST versus embedded,
and the embedded version was 54.6% faster.

Thought others that are using Java might find that interesting.
 
Paul Sundling


Re: Index HotSwap

2007-08-24 Thread Jérôme Etévé
On 8/21/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> :  I'm wondering what's the best way to completely change a big index
> : without losing any requests.
>
> use the snapinstaller script -- or adopt the same atomic copying approach
> it uses.

I'm having a look :)

> :   - Between the two mv's, the directory dir does not exist, which can
> : cause some Solr failure.
>
> this shouldn't cause any failure unless you tell Solr to try and reload
> during the move (ie: you send it a commit) ... either way an atomic copy
> in place of a mv should work much better.

Why, does reloading the searcher trigger a reloading of the files
from disk?
Thx!

>
> -Hoss
>
>


-- 
Jerome Eteve.
[EMAIL PROTECTED]
http://jerome.eteve.free.fr/


Re: Effects of changing schema?

2007-08-24 Thread Yonik Seeley
On 8/24/07, David Whalen <[EMAIL PROTECTED]> wrote:
> I'm unclear on whether changing the schema.xml file
> automatically causes a reindex or not.  If I'm adding
> a field to the schema (and removing some unused ones),
> does Solr do the reindex?  Or do I have to kick it
> off myself?

No... changing the schema does nothing to the index; it only affects
how new documents are indexed and how the index is searched.

If you make a backward compatible change (like adding a new field, or
adding some more query-side synonyms) then you don't have to reindex.
When in doubt, it's best to reindex.

-Yonik


Re: How do I best store and retrieve ISO country codes?

2007-08-24 Thread Simon Peter Nicholls

Thanks Yonik! Cheers for the extra advice too.

On 24 Aug 2007, at 17:14, Yonik Seeley wrote:


On 8/24/07, Simon Peter Nicholls <[EMAIL PROTECTED]> wrote:

I've just noticed that for ISO 2 character country codes such as "BE"
and "IT", my queries are not working as expected.

The field is being stored as country_t, dynamically from acts_as_solr
v0.9, as follows (from schema.xml):



The thing that sprang to my mind was that BE and IT are also valid
words, and perhaps Solr is doing something I'm not expecting
(ignoring them, which would make sense mid-text). With this in mind,
perhaps an _s type of field is needed, since it is indeed a single
important string rather than text composed of many strings.


Right, type "text" by default in solr has stopword removal and
stemmers (see the fieldType definition in the schema.xml)

A string would give you exact values with no analysis at all.  If you
want to lowercase (for case insensitive matches) start off with a text
field and configure it with keyword analyzer followed by lowercase
filter).  If it can have multiple words, an analyzer that had a
whitespace analyzer followed by a lowercase filter would fit the bill.

-Yonik






Re: How do I best store and retrieve ISO country codes?

2007-08-24 Thread Yonik Seeley
On 8/24/07, Simon Peter Nicholls <[EMAIL PROTECTED]> wrote:
> I've just noticed that for ISO 2 character country codes such as "BE"
> and "IT", my queries are not working as expected.
>
> The field is being stored as country_t, dynamically from acts_as_solr
> v0.9, as follows (from schema.xml):
>
> 
>
> The thing that sprang to my mind was that BE and IT are also valid
> words, and perhaps Solr is doing something I'm not expecting
> (ignoring them, which would make sense mid-text). With this in mind,
> perhaps an _s type of field is needed, since it is indeed a single
> important string rather than text composed of many strings.

Right, type "text" by default in solr has stopword removal and
stemmers (see the fieldType definition in the schema.xml)

A string would give you exact values with no analysis at all.  If you
want to lowercase (for case insensitive matches) start off with a text
field and configure it with keyword analyzer followed by lowercase
filter).  If it can have multiple words, an analyzer that had a
whitespace analyzer followed by a lowercase filter would fit the bill.
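(A minimal schema.xml sketch of the two setups Yonik describes; the type
names are made up, the factories are the stock Solr ones:)

<fieldType name="string_lc" class="solr.TextField">
  <!-- exact match, case-insensitive: the whole value stays one token -->
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_ws_lc" class="solr.TextField">
  <!-- multiple words, case-insensitive, no stopwords or stemming -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>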

-Yonik


How do I best store and retrieve ISO country codes?

2007-08-24 Thread Simon Peter Nicholls

Hi all,

I've just noticed that for ISO 2 character country codes such as "BE"  
and "IT", my queries are not working as expected.


The field is being stored as country_t, dynamically from acts_as_solr  
v0.9, as follows (from schema.xml):




The thing that sprang to my mind was that BE and IT are also valid  
words, and perhaps Solr is doing something I'm not expecting  
(ignoring them, which would make sense mid-text). With this in mind,  
perhaps an _s type of field is needed, since it is indeed a single  
important string rather than text composed of many strings.


Am I on the right track here? Can anyone give me some quick advice  
about how best to store and query enumeration values and ISO codes in  
Solr? I hope to try these string field changes when I get back to my  
dev environment, but that will be next week, and it's preying on my  
mind. Any help would be gratefully received!


Thanks,
Si



Effects of changing schema?

2007-08-24 Thread David Whalen
Hi All.

I'm unclear on whether changing the schema.xml file
automatically causes a reindex or not.  If I'm adding
a field to the schema (and removing some unused ones),
does Solr do the reindex?  Or do I have to kick it
off myself?

Ideally, we'd like to avoid a reindex...

Thanks!

DW


solr.py facet.field question

2007-08-24 Thread Christian Klinger

Hi,

how can I specify more than one facet.field in
the query method of solr.py?

These tries don't work:

res = c.query(q="Klaus", facet="true", facet_limit="-1", 
facet_field=['Creator','system'])


res = c.query(q="Klaus", facet="true", facet_limit="-1", 
facet_field='Creator', facet_field='system')
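(One approach that does work is to skip the client's keyword arguments
and hit Solr over HTTP directly; a sketch assuming the example server's
URL, where a list of pairs lets facet.field repeat:)

import http.client
import urllib.parse

params = urllib.parse.urlencode(
    [("q", "Klaus"), ("facet", "true"), ("facet.limit", "-1"),
     ("facet.field", "Creator"), ("facet.field", "system")])
conn = http.client.HTTPConnection("localhost", 8983)
conn.request("GET", "/solr/select?" + params)
print(conn.getresponse().read())
conn.close()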


thanks in advance

Christian



Re: How to extract constrained fields from query

2007-08-24 Thread Martin Grotzke
On Thu, 2007-08-23 at 10:44 -0700, Chris Hostetter wrote:
> : Probably I'm also interested in PrefixQueries, as they also provide a
> : Term, e.g. parsing "ipod AND brand:apple" gives a PrefixQuery for
> : "brand:apple".
> 
> uh? ... it shouldn't, not unless we're talking about some other
> customization you've already made.
My fault, this is returned for something like "brand:appl*" - but
perhaps I would also like to facet on such fields then...

> 
> 
> : I want to do something like "dynamic faceting" - so that the Solr client
> : does not have to request facets via facet.field, but so that I can decide
> : in my CustomRequestHandler which facets are returned. But I want to
> : return only facets for fields that are not already constrained, e.g.
> : when the query contains something like "brand:apple" I don't want to
> : return a facet for the field "brand".
> 
> Hmmm, I see ... well the easiest way to go is not to worry about it when
> parsing the query: when you go to compute facets for all the fields you
> think might be useful, you'll see that only one value for "brand" matches,
> and you can just skip it.
I would think that this is not the best option in terms of performance.

> 
> that doesn't really work well for range queries -- but you can't exactly
> use the same logic for picking what your facet constraints will be on a
> field that makes sense to do a range query on anyway, so it's tricky
> either way.
> 
> the custom QueryParser is still probably your best bet...
> 
> : Ok, so I would override getFieldQuery, getPrefixQuery, getRangeQuery and
> : getWildcardQuery(?) and record the field names? And I would use this
> : QueryParser for both parsing of the query (q) and the filter queries
> : (fq)?
> 
> yep.
Alright, then I'll choose this door.

> 
> (Also Note there is also an extractTerms method on Query that can help in
> some cases, but the impl for ConstantScoreQuery (which is used when the
> SolrQueryParser sees a range query or a prefix query) doesn't really work
> at the moment.)
Yep, I had already tried this, but it always failed with an
UnsupportedOperationException...

Thanx a lot,
cheers,
Martin


> 
> -Hoss
> 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/

