Re: Embedded about 50% faster for indexing

2007-08-24 Thread Mike Klaas

On 24-Aug-07, at 2:29 PM, Wu, Daniel wrote:


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf
Of Yonik Seeley
Sent: Friday, August 24, 2007 2:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Embedded about 50% faster for indexing

One thing I'd like to avoid is everyone trying to embed just
for performance gains. If there is really that much
difference, then we need a better way for people to get that
without resorting to Java code.

-Yonik



Theoretically and practically, an embedded solution will be faster than
going through HTTP/XML.


This is only true if the HTTP interface adds significant overhead to
the cost of indexing a document, and I don't see why this should be
so, as indexing is relatively heavyweight.  Setting up the connection
could be expensive, but this can be greatly mitigated by sending more
than one doc per HTTP request, using persistent connections, and
threading.
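(As an illustration of what Mike describes, a minimal Python sketch
that posts many docs per request over one persistent connection; the
URL, field names, and batch size are assumptions, not from this thread:)

import http.client

docs = [{"id": str(i), "name": "doc %d" % i} for i in range(1000)]
BATCH = 100  # docs per HTTP request

def to_add_xml(batch):
    # Build one <add> containing many <doc> elements
    # (no XML escaping, for brevity).
    field = lambda k, v: '<field name="%s">%s</field>' % (k, v)
    doc = lambda d: "<doc>%s</doc>" % "".join(field(k, v) for k, v in d.items())
    return "<add>%s</add>" % "".join(doc(d) for d in batch)

headers = {"Content-Type": "text/xml; charset=utf-8"}
conn = http.client.HTTPConnection("localhost", 8983)  # reused throughout
for i in range(0, len(docs), BATCH):
    conn.request("POST", "/solr/update", to_add_xml(docs[i:i + BATCH]), headers)
    conn.getresponse().read()  # drain the response so the socket can be reused
conn.request("POST", "/solr/update", "<commit/>", headers)
conn.getresponse().read()
conn.close()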


-Mike


RE: Embedded about 50% faster for indexing

2007-08-24 Thread Sundling, Paul
The embedded approach is at http://wiki.apache.org/solr/EmbeddedSolr

For my testing I have a tunable setting for the number of records to
submit per batch and used 10.  Both approaches committed after every
1000 records, also tunable.

A custom Lucene implementation I helped build was even faster than
embedded, using a RAM drive as a double buffer.  However, that did
require a much larger memory footprint.

The embedded classes have little to no documentation and almost look
like stub implementations, but they work well.

While this project will succeed in large part due to how easy it is to
integrate with non-Java clients, I would actually like to see the
project become more Java friendly, e.g. with a reference indexing
implementation.  There are a lot of tools that could be more widely
useful, like SimplePostTool.

With a few API changes it could serve the demo as well as being a
useful library.  Instead, I extended it, then had to abandon that and
resort to cut-and-paste reuse in the end.  The functionality was 95%
there, but just needed API tweaks to make it usable.  It also seems
unusual to expose fields directly instead of using accessors in the
Java code; accessors give a lot of flexibility that direct field
access doesn't have.

It would also be nice to be able to get Java objects back besides XML
and JSON, like an embedded equivalent for search.  That way you could
integrate more easily with Spring MVC, etc.  There may also be some
performance gains there.

Paul Sundling

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Friday, August 24, 2007 2:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Embedded about 50% faster for indexing


On 8/24/07, Sundling, Paul <[EMAIL PROTECTED]> wrote:
> Created two indexer implementations to test HTTP POST versus embedded,
> and the embedded version was 54.6% faster.
>
> Thought others that are using Java might find that interesting.

Paul, were the documents posted one-per-message, or did you try multiple
(like 50 to 100) per message?  If one per message, the best way to
increase performance is to have multiple threads adding docs.

I'd be curious to know how a single CSV file would clock in as
well...

One thing I'd like to avoid is everyone trying to embed just for
performance gains. If there is really that much difference, then we need
a better way for people to get that without resorting to Java code.

-Yonik



RE: Embedded about 50% faster for indexing

2007-08-24 Thread Wu, Daniel


> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf 
> Of Yonik Seeley
> Sent: Friday, August 24, 2007 2:07 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Embedded about 50% faster for indexing
> 
> One thing I'd like to avoid is everyone trying to embed just 
> for performance gains. If there is really that much 
> difference, then we need a better way for people to get that 
> without resorting to Java code.
> 
> -Yonik
> 

Theoretically and practically, an embedded solution will be faster than
going through HTTP/XML.  I would like to see Solr have some sort of
document-source adapter architecture that iterates through all the
documents available in the document source.  That way, if the documents
come from a database, for example, the source could be as simple as a
SQL query in the Solr configuration file.
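(A rough Python sketch of the kind of adapter Daniel is asking for; the
database, table, and column names are hypothetical, and the SQL string
is the only part that would live in configuration:)

import http.client
import sqlite3  # stand-in for whatever database holds the documents

db = sqlite3.connect("catalog.db")
solr = http.client.HTTPConnection("localhost", 8983)
headers = {"Content-Type": "text/xml; charset=utf-8"}

# Iterate the document source; each row becomes one Solr document.
# (Batching, as discussed elsewhere in this thread, would apply here too.)
for doc_id, title, body in db.execute("SELECT id, title, body FROM articles"):
    xml = ('<add><doc><field name="id">%s</field>'
           '<field name="title">%s</field>'
           '<field name="body">%s</field></doc></add>') % (doc_id, title, body)
    solr.request("POST", "/solr/update", xml, headers)
    solr.getresponse().read()

solr.request("POST", "/solr/update", "<commit/>", headers)
solr.getresponse().read()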

Daniel


Re: Embedded about 50% faster for indexing

2007-08-24 Thread Yonik Seeley
On 8/24/07, Sundling, Paul <[EMAIL PROTECTED]> wrote:
> Created two indexer implementations to test HTTP POST versus embedded,
> and the embedded version was 54.6% faster.
>
> Thought others that are using Java might find that interesting.

Paul, were the documents posted one-per-message, or did you try
multiple (like 50 to 100) per message?  If one per message, the best
way to increase performance is to have multiple threads adding docs.
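(A Python sketch of that multi-threaded approach, with a stock Solr
assumed at localhost:8983; one connection per request keeps it short:)

import http.client
from concurrent.futures import ThreadPoolExecutor

def post(xml):
    conn = http.client.HTTPConnection("localhost", 8983)
    conn.request("POST", "/solr/update", xml,
                 {"Content-Type": "text/xml; charset=utf-8"})
    status = conn.getresponse().status
    conn.close()
    return status

# One doc per message here; combining threads with batching is better still.
batches = ['<add><doc><field name="id">%d</field></doc></add>' % i
           for i in range(100)]
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(post, batches))
post("<commit/>")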

I'd be curious to know how a single CSV file would clock in as well...
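(Solr 1.2 ships a CSV loader; a sketch of feeding it a file, where the
file name is made up, the handler is assumed to be mapped at /update/csv
as in the example solrconfig, and commit=true is assumed to be honored:)

import http.client

with open("books.csv", "rb") as f:  # header row of field names, then one doc per line
    body = f.read()

conn = http.client.HTTPConnection("localhost", 8983)
conn.request("POST", "/solr/update/csv?commit=true", body,
             {"Content-Type": "text/plain; charset=utf-8"})
print(conn.getresponse().status)
conn.close()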

One thing I'd like to avoid is everyone trying to embed just for
performance gains.
If there is really that much difference, then we need a better way for
people to get that without resorting to Java code.

-Yonik


Re: Embedded about 50% faster for indexing

2007-08-24 Thread Yu-Hui Jin
Sorry, I'm new to the topic; can you point me to the embedded approach?


thanks,

-Hui

On 8/24/07, Sundling, Paul <[EMAIL PROTECTED]> wrote:
>
> Created two indexer implementations to test HTTP POST versus embedded,
> and the embedded version was 54.6% faster.
>
> Thought others that are using Java might find that interesting.
>
> Paul Sundling
>



-- 
Regards,

-Hui


RE: clear index

2007-08-24 Thread Sundling, Paul
If that happens, then that specific query should be added to the FAQ
as the way to clear an index.
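(For reference, that query is the delete-by-query below; a Python sketch
assuming the stock /solr/update URL:)

import http.client

conn = http.client.HTTPConnection("localhost", 8983)
headers = {"Content-Type": "text/xml; charset=utf-8"}
# Delete everything, then commit so searchers see the empty index.
for body in ("<delete><query>*:*</query></delete>", "<commit/>"):
    conn.request("POST", "/solr/update", body, headers)
    conn.getresponse().read()
conn.close()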

Paul Sundling

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, August 21, 2007 6:41 PM
To: solr-user@lucene.apache.org
Subject: RE: clear index



: I'm just seeing if there's an easy/performant way of doing it with
Solr.
: For a solution with raw Lucene, creating a new index with the same
: directory cleared out an old index (even on Windows with its file
: locking) quickly.

there has been talk of optimizing delete by query in the case of *:* to
just reopen the index with the create flag set to true ... there just
hasn't been a patch yet.



-Hoss




Embedded about 50% faster for indexing

2007-08-24 Thread Sundling, Paul
Created two indexer implementations to test HTTP POST versus embedded,
and the embedded version was 54.6% faster.

Thought others that are using Java might find that interesting.
 
Paul Sundling


Re: Index HotSwap

2007-08-24 Thread Jérôme Etévé
On 8/21/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> :  I'm wondering what's the best way to completely change a big index
> : without losing any requests.
>
> use the snapinstaller script -- or adopt the same atomic copying approach
> it uses.

I'm having a look :)

> :   - Between the two mv's, the directory dir does not exist, which can
> : cause some Solr failure.
>
> this shouldn't cause any failure unless you tell Solr to try and reload
> during the move (ie: you send it a commit) ... either way an atomic copy
> in place of a mv should work much better.

Why, does reloading the searcher trigger a reloading of the files
from disk?
Thx!

>
> -Hoss
>
>


-- 
Jerome Eteve.
[EMAIL PROTECTED]
http://jerome.eteve.free.fr/


Re: Effects of changing schema?

2007-08-24 Thread Yonik Seeley
On 8/24/07, David Whalen <[EMAIL PROTECTED]> wrote:
> I'm unclear on whether changing the schema.xml file
> automatically causes a reindex or not.  If I'm adding
> a field to the schema (and removing some unused ones),
> does Solr do the reindex?  Or do I have to kick it
> off myself?

No... changing the schema does nothing to the index; it only affects
how new documents are indexed and how the index is searched.

If you make a backward compatible change (like adding a new field, or
adding some more query-side synonyms) then you don't have to reindex.
When in doubt, it's best to reindex.

-Yonik


Re: How do I best store and retrieve ISO country codes?

2007-08-24 Thread Simon Peter Nicholls

Thanks Yonik! Cheers for the extra advice too.

On 24 Aug 2007, at 17:14, Yonik Seeley wrote:


On 8/24/07, Simon Peter Nicholls <[EMAIL PROTECTED]> wrote:

I've just noticed that for ISO 2 character country codes such as "BE"
and "IT", my queries are not working as expected.

The field is being stored as country_t, dynamically from acts_as_solr
v0.9, as follows (from schema.xml):



The thing that sprang to my mind was that BE and IT are also valid
words, and perhaps Solr is doing something I'm not expecting
(ignoring them, which would make sense mid-text). With this in mind,
perhaps an _s type of field is needed, since it is indeed a single
important string rather than text composed of many strings.


Right, type "text" by default in solr has stopword removal and
stemmers (see the fieldType definition in the schema.xml)

A string would give you exact values with no analysis at all.  If you
want to lowercase (for case insensitive matches) start off with a text
field and configure it with keyword analyzer followed by lowercase
filter).  If it can have multiple words, an analyzer that had a
whitespace analyzer followed by a lowercase filter would fit the bill.

-Yonik






Re: How do I best store and retrieve ISO country codes?

2007-08-24 Thread Yonik Seeley
On 8/24/07, Simon Peter Nicholls <[EMAIL PROTECTED]> wrote:
> I've just noticed that for ISO 2 character country codes such as "BE"
> and "IT", my queries are not working as expected.
>
> The field is being stored as country_t, dynamically from acts_as_solr
> v0.9, as follows (from schema.xml):
>
> 
>
> The thing that sprang to my mind was that BE and IT are also valid
> words, and perhaps Solr is doing something I'm not expecting
> (ignoring them, which would make sense mid-text). With this in mind,
> perhaps an _s type of field is needed, since it is indeed a single
> important string rather than text composed of many strings.

Right, type "text" by default in solr has stopword removal and
stemmers (see the fieldType definition in the schema.xml)

A string would give you exact values with no analysis at all.  If you
want to lowercase (for case insensitive matches) start off with a text
field and configure it with keyword analyzer followed by lowercase
filter).  If it can have multiple words, an analyzer that had a
whitespace analyzer followed by a lowercase filter would fit the bill.
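(A minimal schema.xml sketch of the two setups Yonik describes; the type
names are made up, the factories are the stock Solr ones:)

<fieldType name="string_lc" class="solr.TextField">
  <!-- exact match, case-insensitive: the whole value stays one token -->
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_ws_lc" class="solr.TextField">
  <!-- multiple words, case-insensitive, no stopwords or stemming -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>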

-Yonik


How do I best store and retrieve ISO country codes?

2007-08-24 Thread Simon Peter Nicholls

Hi all,

I've just noticed that for ISO 2 character country codes such as "BE"  
and "IT", my queries are not working as expected.


The field is being stored as country_t, dynamically from acts_as_solr  
v0.9, as follows (from schema.xml):




The thing that sprang to my mind was that BE and IT are also valid  
words, and perhaps Solr is doing something I'm not expecting  
(ignoring them, which would make sense mid-text). With this in mind,  
perhaps an _s type of field is needed, since it is indeed a single  
important string rather than text composed of many strings.


Am I on the right track here? Can anyone give me some quick advice  
about how best to store and query enumeration values and ISO codes in  
Solr? I hope to try these string field changes when I get back to my  
dev environment, but that will be next week, and it's preying on my  
mind. Any help would be gratefully received!


Thanks,
Si



Effects of changing schema?

2007-08-24 Thread David Whalen
Hi All.

I'm unclear on whether changing the schema.xml file
automatically causes a reindex or not.  If I'm adding
a field to the schema (and removing some unused ones),
does Solr do the reindex?  Or do I have to kick it
off myself?

Ideally, we'd like to avoid a reindex...

Thanks!

DW


solr.py facet.field question

2007-08-24 Thread Christian Klinger

Hi,

how can I specify more than one facet.field in
the query method of solr.py?

These tries don't work:

res = c.query(q="Klaus", facet="true", facet_limit="-1", 
facet_field=['Creator','system'])


res = c.query(q="Klaus", facet="true", facet_limit="-1", 
facet_field='Creator', facet_field='system')
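(One approach that does work is to skip the client's keyword arguments
and hit Solr over HTTP directly; a sketch assuming the example server's
URL, where a list of pairs lets facet.field repeat:)

import http.client
import urllib.parse

params = urllib.parse.urlencode(
    [("q", "Klaus"), ("facet", "true"), ("facet.limit", "-1"),
     ("facet.field", "Creator"), ("facet.field", "system")])
conn = http.client.HTTPConnection("localhost", 8983)
conn.request("GET", "/solr/select?" + params)
print(conn.getresponse().read())
conn.close()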


thanks in advance

Christian



Re: How to extract constrained fields from query

2007-08-24 Thread Martin Grotzke
On Thu, 2007-08-23 at 10:44 -0700, Chris Hostetter wrote:
> : Probably I'm also interested in PrefixQueries, as they also provide a
> : Term, e.g. parsing "ipod AND brand:apple" gives a PrefixQuery for
> : "brand:apple".
> 
> uh? ... it shouldn't, not unless we're talking about some other
> customization you've already made.
My fault, this is returned for something like "brand:appl*" - but
perhaps I would also like to facet on such fields then...

> 
> 
> : I want to do something like "dynamic faceting" - so that the Solr client
> : does not have to request facets via facet.field, but so that I can decide
> : in my CustomRequestHandler which facets are returned. But I want to
> : return only facets for fields that are not already constrained, e.g.
> : when the query contains something like "brand:apple" I don't want to
> : return a facet for the field "brand".
> 
> Hmmm, I see ... well the easiest way to go is not to worry about it when
> parsing the query: when you go to compute facets for all the fields you
> think might be useful, you'll see that only one value for "brand" matches,
> and you can just skip it.
I would think that this is not the best option in terms of performance.

> 
> that doesn't really work well for range queries -- but you can't exactly
> use the same logic for picking what your facet constraints will be on a
> field that makes sense to do a range query on anyway, so it's tricky
> either way.
> 
> the custom QueryParser is still probably your best bet...
> 
> : Ok, so I would override getFieldQuery, getPrefixQuery, getRangeQuery and
> : getWildcardQuery(?) and record the field names? And I would use this
> : QueryParser for both parsing of the query (q) and the filter queries
> : (fq)?
> 
> yep.
Alright, then I'll choose this door.

> 
> (Also Note there is also an extractTerms method on Query that can help in
> some cases, but the impl for ConstantScoreQuery (which is used when the
> SolrQueryParser sees a range query or a prefix query) doesn't really work
> at the moment.)
Yep, I had already tried this, but it always failed with an
UnsupportedOperationException...

Thanx a lot,
cheers,
Martin


> 
> -Hoss
> 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/

