Re: How do I best store and retrieve ISO country codes?

2007-08-24 Thread Yonik Seeley
On 8/24/07, Simon Peter Nicholls [EMAIL PROTECTED] wrote:
 I've just noticed that for ISO 2 character country codes such as BE
 and IT, my queries are not working as expected.

 The field is being stored as country_t, dynamically from acts_as_solr
 v0.9, as follows (from schema.xml):

 <dynamicField name="*_t" type="text" indexed="true" stored="false"/>

 The thing that sprang to my mind was that BE and IT are also valid
 words, and perhaps Solr is doing something I'm not expecting
 (ignoring them, which would make sense mid-text). With this in mind,
 perhaps an _s type of field is needed, since it is indeed a single
 important string rather than text composed of many strings.

Right, the "text" type by default in Solr has stopword removal and
stemming (see the fieldType definition in schema.xml).

A string field would give you exact values with no analysis at all.  If you
want to lowercase (for case-insensitive matches), start with a text
field and configure it with a keyword tokenizer followed by a lowercase
filter.  If it can have multiple words, an analyzer with a
whitespace tokenizer followed by a lowercase filter would fit the bill.
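For illustration, a minimal schema.xml sketch of such a field type (the type
name "string_ci" and the dynamic field suffix are made up for this example,
not taken from the original message):

  <fieldType name="string_ci" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <dynamicField name="*_code" type="string_ci" indexed="true" stored="false"/>

With that in place, country_code:be and country_code:BE would match the same
documents.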

-Yonik


Re: sort problem

2007-09-02 Thread Yonik Seeley
On 9/2/07, michael ravits [EMAIL PROTECTED] wrote:
 this is the field definition:
<field name="msgid" type="slong" indexed="true" stored="true" required="true" />

 holds message id's, values range from 0 to 127132531
 can I disable this cache?

No, sorting wouldn't work without it.

The cache structure certainly isn't optimal for this (every doc
probably has a different value).
If you could live with a cap of 2B on message id, switching to type
int would decrease the memory usage to 4 bytes per doc (presumably
you don't need range queries?)
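A hedged sketch of what that change could look like in schema.xml (assuming
the example schema's plain "integer" type; the other attributes are copied
from the definition above):

  <field name="msgid" type="integer" indexed="true" stored="true" required="true" />

Sorting still works via the FieldCache, but range queries would no longer
behave correctly on this non-sortable type.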

-Yonik


Re: sort problem

2007-09-03 Thread Yonik Seeley
On 9/3/07, Marcus Stratmann [EMAIL PROTECTED] wrote:
  If you could live with a cap of 2B on message id, switching to type
  int would decrease the memory usage to 4 bytes per doc (presumably
  you don't need range queries?)

 I haven't found exact definitions of the fieldTypes anywhere. Does
 integer span the common range from -2^31 to 2^31-1?
 And there seems to be no unsigned int, am i right?

Right, these map to Java native types, so it's signed.

-Yonik


Re: -field:[* TO *] doesn't seem to work

2007-09-03 Thread Yonik Seeley
Can you provide the full query response (with debugging output)?

-Yonik

On 9/3/07, Jérôme Etévé [EMAIL PROTECTED] wrote:
 Hi all
  I've got a problem here with the '-field:[* TO *]' syntax. It doesn't
 seem to work as expected


Re: Multiple Values -Structured?

2007-09-04 Thread Yonik Seeley
You could index both a compound field and the components separately.
This could be simplified by sending the value in once as the compound format:
  review,1 Jan 2007
  revision,2 Jan 2007
And then use a copyField with a regex tokenizer to extract and index
the date into a separate field.  You could index the type separately
via the same mechanism.
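As a rough sketch of that mechanism (the field names are invented, and it
assumes a pattern-based tokenizer such as solr.PatternTokenizerFactory is
available in your Solr version):

  <!-- the compound value, e.g. "review,1 Jan 2007" -->
  <field name="occurrence" type="string" indexed="true" stored="true" multiValued="true"/>

  <!-- a type whose tokenizer keeps only the part after the first comma -->
  <fieldType name="occurrence_date" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.PatternTokenizerFactory" pattern="^[^,]*,\s*(.*)$" group="1"/>
    </analyzer>
  </fieldType>
  <field name="occurrence_dates" type="occurrence_date" indexed="true" stored="false" multiValued="true"/>

  <copyField source="occurrence" dest="occurrence_dates"/>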

-Yonik

On 9/3/07, Bharani [EMAIL PROTECTED] wrote:

 Hi,

 I have got two sets of document

 1) Primary Document
 2) Occurrences of primary document

 Since there is no such thing as join i can either

 a) Post the primary document with occurrences as multi valued field
  or
 b) Post the primary document for every occurrences i.e. classic
 de-normalized route

 My problem with

 Option a) This works great as long as the occurrence is a single field, but
 if I had a group of fields that describes the occurrence then the search
 returns wrong results because of the nature of text search

 i.e. <date>1 Jan 2007</date>
 <type>review</type>

 <date>2 Jan 2007</date>
 <type>revision</type>

 if I search for "2 Jan 2007" and <date>1 Jan 2007</date> I will get a hit
 (which is wrong) because there is no grouping of fields to associate date
 and type as one unit. If I merge them into one entity then I can't use
 range queries for date

 Option b) This would result in a large number of documents, and even if I
 index only (without storing) I still have to deal with duplicate hits -
 because all I want is the primary document


 Is there a better approach to the problem?

 Thanks
 Bharani


 --
 View this message in context: 
 http://www.nabble.com/Multiple-Values--Structured--tf4370282.html#a12456399
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: solr.py problems with german Umlaute

2007-09-06 Thread Yonik Seeley
On 9/6/07, Brian Carmalt [EMAIL PROTECTED] wrote:
 Try it with title.encode('utf-8').
 As in: kw =
 {'id':'12','title':title.encode('utf-8'),'system':'plone','url':'http://www.google.de'}

It seems like the client library should be responsible for encoding,
not the user.
So try changing
  title = "Übersicht"
into a unicode string via
  title = u"Übersicht"

And that should hopefully get your test program working.
If it doesn't, it's probably a solr.py bug and should be fixed there.

-Yonik


Re: Replication broken.. no helpful errors?

2007-09-06 Thread Yonik Seeley
On 9/6/07, Matthew Runo [EMAIL PROTECTED] wrote:
 The thing is that a new searcher is not opened if I look in the
 stats.jsp page. The index version never changes.

The index version is read from the index... hence if the Lucene index
doesn't change (even if a new snapshot was taken), the version won't
change even if a new searcher was opened.

Is the problem on the master side now since it looks like the slave is
pulling a temp-snapshot?

-Yonik


Re: searching where a value is not null?

2007-09-06 Thread Yonik Seeley
On 9/6/07, David Whalen [EMAIL PROTECTED] wrote:
 Hi all.

 I'm trying to construct a query that in pseudo-code would read
 like this:

 field != ''

 I'm finding it difficult to write this as a solr query, though.
 Stuff like:

 NOT field:()

 doesn't seem to do the trick.

 any ideas?

perhaps field:[* TO *]

-Yonik


Re: caching query result

2007-09-06 Thread Yonik Seeley
On 9/6/07, Jae Joo [EMAIL PROTECTED] wrote:
 I have 13 million documents and facet by state (50 values). If there is a mechanism to
 cache, I may get faster results back.

How fast are you getting results back with standard field faceting
(facet.field=state)?


Re: FW: Minor mistake on the Wiki

2007-09-07 Thread Yonik Seeley
On 9/7/07, Lance Norskog [EMAIL PROTECTED] wrote:
 In the page http://wiki.apache.org/solr/UpdateXmlMessages

 We find:

 Optional attributes on doc

 *   boost = float - default is 1.0 (See Lucene docs for
 definition of boost.)
 *   NOTE: make sure norms are enabled (omitNorms=false
 in the schema.xml) for any fields where the index-time boost should be
 stored.

 This NOTE appears to be block-copied from the following entry about
 field-level boosts, and makes no sense here.

Perhaps it could be worded better, but there is some sense behind it.
There is no document boost in a Lucene index... a doc boost is simply
multiplied into the boost for each field as the document is indexed.

-Yonik


Re: adding without overriding dups - DirectUpdateHandler2.java does not implement?

2007-09-07 Thread Yonik Seeley
On 9/7/07, Lance Norskog [EMAIL PROTECTED] wrote:
 It appears that DirectUpdateHandler2.java does not actually implement the
 parameters that control whether to override existing documents.

It's been proposed that most of these be deprecated anyway and
replaced with a simple overwrite=true/false.  Are you trying to do
something different than standard overwriting?

-Yonik


Re: adding without overriding dups - DirectUpdateHandler2.java does not implement?

2007-09-07 Thread Yonik Seeley
On 9/7/07, Lance Norskog [EMAIL PROTECTED] wrote:
 No, I'm just doing standard overwriting. It just took a little digging to be
 able to do it :)

Overwriting is the default... you shouldn't have to specify
anything extra when indexing the document.

-Yonik


Re: quirks with sorting

2007-09-10 Thread Yonik Seeley
On 9/10/07, David Whalen [EMAIL PROTECTED] wrote:
 I'm seeing a weird problem with sorting that I can't figure out.

 I have a query that uses two fields -- a source column and a
 date column.  I search on the source and I sort by the date
 descending.

 What I'm seeing is that depending on the value in the source,
 the date sort works in reverse.

 For example, the query:

 content_source:(mv); content_date desc

 returns 2007-09-10T09:25:00.000Z in its first row, which is what
 I expect.

 BUT, the query:

 content_source:(thomson); content_date desc

 returns 2008-08-17T00:00:00.000Z, which is the first date we
 put into SOLR.

Isn't it the last (highest date) since it's 2008?

-Yonik


Re: My Solr index keeps growing

2007-09-10 Thread Yonik Seeley
On 9/10/07, Robin Bonin [EMAIL PROTECTED] wrote:
 I had created a new index over the weekend, and the final size was a
 few hundred megs.
 I just checked and now the index folder is up to 1.7 Gig. Is this due
 to results being cached? Can I set a limit to how large the index will
 grow? Is there anything else that could be affecting this file size?

index normally refers to the index files on the disk... is this what you mean?
If so, it shouldn't grow unless new documents are added.

-Yonik


Re: Solr and KStem

2007-09-10 Thread Yonik Seeley
Some other notes:
I just read the license... it's nice and short, and appears to be ASL
compatible to me.
We could either include the source in Solr and build it, or add it as
a pre-compiled jar into lib.
The FilterFactory should probably have its package changed to
org.apache.solr.analysis (definitely if it will be included in source
form in our repository).


-Yonik

On 9/10/07, Mike Klaas [EMAIL PROTECTED] wrote:
 Hi Harry,

 Thanks for your contribution!  Unfortunately, we can't include it in
 Solr unless the necessary legal hurdles are cleared.

 An issue needs to be opened on http://issues.apache.org/jira/browse/SOLR
 and you have to attach the file and check the "Grant License to ASF"
 button.  It is also important to verify that you have the legal
 right to grant the code to ASF (since it is probably your employer's
 intellectual property).

 Legal issues are a hassle, but are unavoidable, I'm afraid.

 Thanks again,
 -Mike

 On 10-Sep-07, at 10:22 AM, Wagner,Harry wrote:

  Hi Yonik,
  The modified KStemmer source is attached. The original KStemFilter is
  now wrapped (and replaced) by KStemFilterFactory.  I also changed the
  path to avoid any naming collisions with existing Lucene code.
 
  I included the jar file also, for anyone who wants to just drop and
  play:
 
  - put KStem2.jar in your solr/lib directory.
  - change your schema to use: <filter
  class="org.oclc.solr.analysis.KStemFilterFactory" cacheSize="2"/>
  - restart your app server
 
  I don't know if you credit contributions, but if so please include
  OCLC.
  Seems only fair since I did this on their dime :)
 
  Cheers!
  harry
 
 
  -Original Message-
  From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
  Seeley
  Sent: Friday, September 07, 2007 3:59 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Solr and KStem
 
  On 9/7/07, Wagner,Harry [EMAIL PROTECTED] wrote:
  I've implemented a Solr plug-in that wraps KStem for Solr use.  KStem
  is
  considered to be more appropriate for library usage since it is much
  less aggressive than Porter (i.e., searches for organization do NOT
  match on organ!). If there is any interest in feeding this back into
  Solr I would be happy to contribute it.
 
  Absolutely.
  We need to make sure that the license for that k-stemmer is ASL
  compatible of course.
 
  -Yonik
  kstem_solr.tar.gz




Re: Removing lengthNorm from the calculation

2007-09-10 Thread Yonik Seeley
If you aren't using index-time document boosting, or field boosting
for that field specifically, then set omitNorms=true for that field
in the schema, shut down solr, completely remove the index, and then
re-index.

The norms for each field consist of the index-time boost multiplied by
the length normalization.
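For example, in schema.xml (the field name here is a placeholder):

  <field name="body" type="text" indexed="true" stored="true" omitNorms="true"/>

With norms omitted, neither length normalization nor index-time field boosts
apply to that field, which is exactly why the index needs to be rebuilt for
the change to take effect.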

-Yonik


On 9/10/07, Kyle Banerjee [EMAIL PROTECTED] wrote:
 I know I'm missing something really obvious, but I'm spinning my
 wheels figuring out how to eliminate lengthNorm from the calculations.

 The specific problem I'm trying to solve is that naive queries are
 resulting in crummy short records near the top of the list. The
 reality is that the longer records tend to be higher quality, so if
 anything, they need to be emphasized.

 However, I'm missing something simple. Any advice or a pointer to an
 example I could model off would be greatly appreciated. Thanks,

 kyle


Re: largish test data set?

2007-09-17 Thread Yonik Seeley
If you want to see what performance will be like on the next release,
you could try upgrading Solr's internal version of lucene to trunk
(current dev version)... there have been some fantastic improvements
in indexing speed.

For query speed/throughput, Solr 1.2 or trunk should do fine.

-Yonik

On 9/17/07, David Welton [EMAIL PROTECTED] wrote:
 Hi,

 I'm in the process of evaluating solr and sphinx, and have come to
 realize that actually having a large data set to run them against
 would be handy.  However, I'm pretty new to both systems, so thought
 that perhaps asking around might produce something useful.

 What *I* mean by largish is something that won't fit into memory - say
 5 or 6 gigs, which is probably puny for some and huge for others.

 BTW, I would also welcome any input from others who have done the
 above comparison, although what we'll be using it for is specific
 enough that of course I'll need to do my own testing.

 Thanks!
 --
 David N. Welton
 http://www.welton.it/davidw/



Re: EdgeNGramTokenFilter, term position?

2007-09-17 Thread Yonik Seeley
On 9/16/07, Ryan McKinley [EMAIL PROTECTED] wrote:
 Should the EdgeNGramFilter use the same term position for the ngrams
 within a single token?

It feels like that is the right approach.
I don't see value in having them sequential, and I can think of uses
for having them overlap.

-Yonik


Re: Customize the way relevancy is calculated

2007-09-18 Thread Yonik Seeley
On 9/18/07, Amitha Talasila [EMAIL PROTECTED] wrote:
   The 65% of the relevance can be computed while indexing the document and
 posted as a field. But the keyword match is a run-time score. Is there any
 way of getting the relevance score as a combination of this 65% and 35%?

A FunctionQuery can get you the value of a field to use in a relevancy
score.  Put that in a boolean query with the relevancy query and
boost each portion to give the correct weight.

+text:foo^.65  _val_:scorefield^.35

-Yonik


Re: pluggable functions

2007-09-18 Thread Yonik Seeley
On 9/18/07, Jon Pierce [EMAIL PROTECTED] wrote:
 Reflection could be used to look up and invoke the constructor with
 appropriately-typed arguments.  If we assume only primitive types
 and ValueSources are used, I don't think it would be too hard to craft
 a drop-in replacement that works with existing implementations.  In
 any case, the more flexible alternative would probably be to do as
 you're suggesting (if I understand you correctly) -- let the function
 handle the parsing,

The parser is a quick hack I threw together, and any value source
factories should not be exposed to it.  It seems like either
1) a value source factory would expose the types it expects
or
2) a value source factory would take a ListValueSource and throw a
ParseException if it didn't get what it expected

Reflection might be fine if the cost of construction via reflection
ends up being small compared to the parsing itself.

-Yonik


Re: How can i make a distribute search on Solr?

2007-09-19 Thread Yonik Seeley
On 9/19/07, Norberto Meijome [EMAIL PROTECTED] wrote:
 On Wed, 19 Sep 2007 01:46:53 -0400
 Ryan McKinley [EMAIL PROTECTED] wrote:

  Stu is referring to Federated Search - where each index has some of the

It really should be Distributed Search I think (my mistake... I
started out calling it Federated).  I think Federated search is more
about combining search results from different data sources.

  data and results are combined before they are returned.  This is not yet
  supported out of the box

 Maybe this is related. How does this compare to the map-reduce functionality 
 in Nutch/Hadoop ?

map-reduce is more for batch jobs.  Nutch only uses map-reduce for
parallel indexing, not searching.

-Yonik


Re: useColdSearcher = false... not working in 1.2?

2007-09-19 Thread Yonik Seeley
On 9/19/07, Adam Goldband [EMAIL PROTECTED] wrote:
 Anyone else using this, and finding it not working in Solr 1.2?  Since
 we've got an automated release process, I really need to be able to have
 the appserver not see itself as done warming up until the firstSearcher
 is ready to go... but with 1.2 this no longer seems to be the case.

I took a quick peek at the code, and it should still work (it's pretty simple).
false is also the default.

How are you determining that it isn't working?

-Yonik


Re: Getting only size of getFacetCounts , to simulate count(group by( a field) ) using facets

2007-09-19 Thread Yonik Seeley
On 9/19/07, Laurent Hoss [EMAIL PROTECTED] wrote:
 We want to (mis)use facet search to get the number of (unique) field
 values appearing in a document resultset.

We have paging of facets, so just like normal search results, it does
make sense to list the total number of facets matching.

The main problem with implementing this is trying to figure out where
to put the info in a backward compatible manner.  Here is how the info
is currently returned (JSON format):

 "facet_fields":{
   "cat":[
     "camera",1,
     "card",2,
     "connector",2,
     "copier",1,
     "drive",2
   ]
 },


Unfortunately, there's not a good place to put this extra info without
older clients choking on it.  Within "cat" there should have been
another element called "values" or something... then we could easily
add extra fields like "nvalues":

"cat":{
  "nvalues":5042,
  "values":[
    "camera",1,
    "card",2,
    "connector",2,
    "copier",1,
    "drive",2
  ]
}

-Yonik


Re: How can i make a distribute search on Solr?

2007-09-20 Thread Yonik Seeley
On 9/19/07, Norberto Meijome [EMAIL PROTECTED] wrote:
 Maybe I got this wrong...but isn't this what mapreduce is meant to deal with?

Not really... you could force a *lot* of different problems into
map-reduce (that's sort of the point... being able to automatically
parallelize a lot of different problems).  It really isn't the best
fit though, and would end up being much slower than a custom job.

Then there is the issue that the way map-reduce is implemented (like
hadoop) is also tuned for longer running batch jobs on huge data
(temporary files are used, external sorts, initial input, final output
is via files, etc).  Check out the google map-reduce paper - they
don't use it for their search side either.


Things are already progressing in the distributed search area:
https://issues.apache.org/jira/browse/SOLR-303
Hopefully I'll have time to dig into it more myself in a few weeks.

-Yonik


Re: Term extraction

2007-09-20 Thread Yonik Seeley
On 9/19/07, Pieter Berkel [EMAIL PROTECTED] wrote:
 However, I'd like to be able to
 analyze documents more intelligently to recognize phrase keywords such as
 open source, Microsoft Office, Bill Gates rather than splitting each
 word into separate tokens (the field is never used in search queries so
 matching is not an issue).  I've been looking at SynonymFilterFactory as a
 possible solution to this problem but haven't been able to work out the
 specifics of how to configure it for phrase mappings.

SynonymFilter works out-of-the-box with multi-token synonyms...

Microsoft Office => microsoft_office
Bill Gates, William Gates => bill_gates

Just don't use a word-delimiter filter if you use underscore to join words.
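A sketch of how that could be configured (the type name and synonyms file
name are illustrative):

  <fieldType name="keyword_text" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="phrases.txt" ignoreCase="true" expand="false"/>
    </analyzer>
  </fieldType>

where phrases.txt contains mappings like the two lines above.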

-Yonik


Re: Solr and FieldCache

2007-09-20 Thread Yonik Seeley
On 9/20/07, Walter Ferrara [EMAIL PROTECTED] wrote:
 I'm just wondering, as this cached object could be (theoretically)
 pretty big, do I need to be aware of some OOM? I know that FieldCache
 use weakmaps, so I presume the cached array for the older reader(s) will
 be gc-ed when the reader is no longer referenced (i.e. when solr load
 the new one, after its warmup and so on), is that right?

Right.  You will need room for two entries (one for the current
searcher and one for the warming searcher).

-Yonik


Re: Solr and FieldCache

2007-09-20 Thread Yonik Seeley
On 9/20/07, Walter Ferrara [EMAIL PROTECTED] wrote:
 I have an index with several fields, but just one stored: ID (string,
 unique).
 I need to access that ID field for each of the tops nodes docs in my
 results (this is done inside a handler I wrote), code looks like:

   Hits hits = searcher.search(query);
   for (int i = 0; i < nodes; i++) {
     id[i] = hits.doc(i).get("ID");
     score[i] = hits.score(i);
   }

What is the higher level use-case you are trying to address that makes
it necessary to write a plugin?

-Yonik


Re: Problem getting the FacetCount

2007-09-21 Thread Yonik Seeley
On 9/21/07, Amitha Talasila [EMAIL PROTECTED] wrote:
 But when we make a facet query like,
 http://localhost:8983/solr/select?q=ipod&rows=0&facet=true&facet.limit=-1&facet.query=weight:{0m TO 100m},
 the facet count is coming as 0. We are indexing
 it as a string field because if the user searches for 12m he needs to see
 that result. Can anyone suggest a better way of querying this?

In a string field, "12m" is greater than "100m", so it won't be in the range.
You need to index that field as a numeric type where range queries
work: use type "sint" or "sfloat".

As for the "m", you should have a frontend that allows input in the
form desired and converts it to a valid query for Solr.
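A hedged example of the suggested change (the field name comes from the query
above; the exact numeric type name depends on your schema):

  <field name="weight" type="sfloat" indexed="true" stored="true"/>

with the frontend stripping the "m" so the facet query becomes something like
facet.query=weight:[0 TO 100].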

-Yonik


Re: Term extraction

2007-09-21 Thread Yonik Seeley
On 9/21/07, Pieter Berkel [EMAIL PROTECTED] wrote:
 Yonik: This is the approach I had in mind, will it still work if I put the
 SynonymFilter after the word-delimiter filter in the schema config?

SynonymFilter doesn't currently have the capability to handle multiple
tokens at the same position in the input.  You could simply remove the
WordDelimiterFilter unless you need it.

 Ideally
 I want to strip out the underscore char before it gets indexed

Why's that?

You could just define your synonyms like that initially:
Bill Gates, William Gates => billgates

-Yonik


Re: I can't delete, why?

2007-09-25 Thread Yonik Seeley
On 9/25/07, Ben Shlomo, Yatir [EMAIL PROTECTED] wrote:
 I know I can delete multiple docs with the following:
 <delete><query>mediaId:(6720 OR 6721 OR ...)</query></delete>

 My question is can I do something like this?
 <delete><query>languageId:123 AND manufacturer:456</query></delete>
 (It does not work for me and I didn't forget to commit)

Do you get an error, or do you just not see this document deleted?
Does a query identical to this show matching documents after a commit?

Also keep in mind that delete by id is currently more efficient than
delete by query, so if mediaId is your uniqueKeyField, you would be
better served by using that.
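For reference, a delete-by-id message (assuming mediaId is the uniqueKeyField)
would look like:

  <delete><id>6720</id></delete>

followed by a <commit/> once the batch of deletes is done.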

-Yonik


Re: How to get debug information while indexing?

2007-09-26 Thread Yonik Seeley
On 9/26/07, Urvashi Gadi [EMAIL PROTECTED] wrote:
 Hi,

 I am trying to create my own application using SOLR and while trying to
 index my data i get

 Server returned HTTP response code: 400 for URL:
 http://localhost:8983/solr/update or
 Server returned HTTP response code: 500 for URL:
 http://localhost:8983/solr/update

 Is there a way to get more debug information than this (any logs, which file
 is wrong, schema.xml? etc)

Both the HTTP reason and response body should contain more information.
What are you using to communicate with Solr?
Try a bad request with curl and you can see the info that comes back:

[EMAIL PROTECTED] /cygdrive/f/code/lucene
$ curl -i http://localhost:8983/solr/select?q=foo:bar
HTTP/1.1 400 undefined_field_foo
Content-Type: text/html; charset=iso-8859-1
Content-Length: 1398
Server: Jetty(6.1.3)

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 400 </title>
</head>
<body><h2>HTTP ERROR: 400</h2><pre>undefined field foo</pre>
<p>RequestURI=/solr/select</p><p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p><br/>

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

</body>
</html>


Errors should also be logged.

-Yonik


Re: searching for non-empty fields

2007-09-27 Thread Yonik Seeley
On 9/27/07, Pieter Berkel [EMAIL PROTECTED] wrote:
 While in theory -URL: should be valid syntax, the Lucene query parser
 doesn't accept it and throws a ParseException.

I don't have time to work on that now, but I did just open a bug:
https://issues.apache.org/jira/browse/LUCENE-1006

-Yonik


Re: moving index

2007-09-27 Thread Yonik Seeley
On 9/27/07, Jae Joo [EMAIL PROTECTED] wrote:
 I do need to move the index files, but I have concerns about potential problems,
 including performance.
 Do I have to keep the original documents for querying?

I assume you posted XML documents in Solr XML format (like <add><doc>...)?
If so, that is just an example way to get the data into Solr.  Those
XML files aren't needed, and any high-speed indexing will avoid
creating files at all - just create the XML doc in memory and send to
solr via HTTP-POST.

-Yonik


Re: searching for non-empty fields

2007-09-27 Thread Yonik Seeley
On 9/27/07, Yonik Seeley [EMAIL PROTECTED] wrote:
 On 9/27/07, Pieter Berkel [EMAIL PROTECTED] wrote:
  While in theory -URL: should be valid syntax, the Lucene query parser
  doesn't accept it and throws a ParseException.

 I don't have time to work on that now,

OK, I lied :-)  It was simple (and a nice diversion).

-Yonik

 but I did just open a bug:
 https://issues.apache.org/jira/browse/LUCENE-1006


Re: custom sorting

2007-09-27 Thread Yonik Seeley
On 9/27/07, Erik Hatcher [EMAIL PROTECTED] wrote:
 Using something like this, how would the custom SortComparatorSource
 get a parameter from the request to use in sorting calculations?

perhaps hook in via function query:
  dist(10.4,20.2,geoloc)

And either manipulate the score with that and sort by score,

q=+(foo bar)^0 dist(10.4,20.2,geoloc)
sort=score asc

or extend solr's sorting mechanisms to allow specifying a function to sort by.

sort=dist(10.4,20.2,geoloc) asc

-Yonik


Re: Color search

2007-09-28 Thread Yonik Seeley
If it were just a couple of colors, you could have a separate field
for each color and then index the percent in that field.

black:70
grey:20

and then you could use a function query to influence the score (or you
could sort by the color percent).

However, this doesn't scale well to a large index with a large number of colors.
Each field used like that will take up 4 bytes per document in the index.

so if you have 1M documents, that's 1Mdocs * 100colors * 4bytes = 400MB
Doable depending on your index size (use int or float and not
sint or sfloat type for this... it will be better on the memory).

If you needed to be better on the memory, you could encode all of the
colors into a single value (perhaps into a compact string... one
percentile per byte or something) and then have a custom function that
extracts the value for a particular color.  (this involves some java
development)
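A very rough sketch of the simple per-color variant described above (the
field names are invented for this example):

  <field name="color_black" type="float" indexed="true" stored="false"/>
  <field name="color_grey"  type="float" indexed="true" stored="false"/>

and a query that folds the percentage into the score via a function query,
e.g.

  q=dress _val_:color_black

or simply sort on color_black for a pure percentage ordering.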

-Yonik


On 9/28/07, Guangwei Yuan [EMAIL PROTECTED] wrote:
 Hi,

 We're running an e-commerce site that provides product search. We've been
 able to extract colors from product images, and we think it'd be cool and
 useful to search products by color. A product image can have up to 5 colors
 (from a color space of about 100 colors), so we can implement it easily with
 Solr's facet search (thanks all who've developed Solr).

 The problem arises when we try to sort the results by the color relevancy.
 What's different from a normal facet search is that colors are weighted. For
 example, a black dress can have 70% of black, 20% of gray, 10% of brown. A
 search query color:black should return results in which the black dress
 ranks higher than other products with less percentage of black.

 My question is: how to configure and index the color field so that products
 with higher percentage of color X ranks higher for query color:X?

 Thanks for your help!

 - Guangwei



Re: small rsync index question

2007-09-28 Thread Yonik Seeley
On 9/28/07, Brian Whitman [EMAIL PROTECTED] wrote:
 For some reason sending a
 <commit/> is not refreshing the index

It should... are there any errors in the logs?  do you see the commit
in the logs?
Check the stats page to see info about when the current searcher was
last opened too.

-Yonik


Re: Schema version question

2007-09-28 Thread Yonik Seeley
On 9/28/07, Robert Purdy [EMAIL PROTECTED] wrote:
 I was wondering if anyone could help me, I just completed a full index of my
 data (about 4 million documents) and noticed that when I was first setting
 up the schema I set the version number to 1.2 thinking that solr 1.2 uses
 schema version 1.2... ugh... so I am wondering if I can just set the schema
 to 1.1 without having to rebuild the full index? I ask because I am hoping
 that given an invalid schema version number, that version 1.0 is not used by
 default and all my fields are now multiValued. Any help would be greatly
 appreciated. Thanks in advance

Yes, it should be OK to set it back to 1.1 w/o reindexing.
The index format does not differentiate between single and
multi-valued fields so you should be fine there.

-Yonik


Re: Request for graphics

2007-09-28 Thread Yonik Seeley
On 9/28/07, Clay Webster [EMAIL PROTECTED] wrote:
 i'm late for dinner out, so i'm just attaching it here.

Most attachments are stripped :-)

-Yonik


Re: Searching combined English-Japanese index

2007-10-01 Thread Yonik Seeley
On 10/1/07, Maximilian Hütter [EMAIL PROTECTED] wrote:
 When I search using an English term, I get results but the Japanese is
 not encoded correctly in the response. (although it is UTF-8 encoded)

One quick thing to try is the python writer (wt=python) to see the
actual unicode values of what you are getting back (since the python
writer automatically escapes non-ascii).  That can help rule out
incorrect charset handling by clients.

-Yonik


Re: Major CPU performance problems under heavy user load with solr 1.2

2007-10-01 Thread Yonik Seeley
On 10/1/07, Robert Purdy [EMAIL PROTECTED] wrote:
 Hi there, I am having some major CPU performance problems with heavy user
 load with solr 1.2. I currently have approximately 4 million documents in
 the index and I am doing some pretty heavy faceting on multi-valued columns.
 I know that doing facets are expensive on multi-valued columns but the CPU
 seems to max out (400%) with apache bench with just 5 identical concurrent
 requests

One can always max out CPU (unless one is IO bound) with concurrent
requests greater than the number of CPUs on the system.  This isn't a
problem by itself and would exist even if Solr were an order of
magnitude slower or faster.

You should be looking at things like the peak throughput (queries per sec)
you need to support and the latency of the requests (look at the 90th
percentile, or whatever).


 and I have the potential for a lot more concurrent requests then
 that with my large number of users that hit our site per day and I am
 wondering if there are any workarounds. Currently I am running the out of
 the box solr solution (Example jetty application with my own schema.xml and
 solrconfig.xml) on a dual Intel Duo core 64 bit box with 8 gigs of ram
 allocated to the start.jar process dedicated to solr with no slaves.

 I have set up some aggressive caching in the solrconfig.xml for the
 filtercache (class=solr.LRUCachesize=300 initialSize=200) and
 have the HashDocSet set to 1 to help with faceting, but still I am
 getting some pretty poor performance. I have also tried autowarming the
 facets by performing a query that hits all my multivalued facets with no
 facet limits across all the documents in the index. This does seem to reduce
 my query times by a lot because the filtercache grows to about 2.1 million
 lookups and finishes the query in about 70 secs.

OK, that's long.  So focus on the latency of a single request instead
of jumping straight to load testing.

2.1 million is a lot - what's the field with the largest number of
unique values that you are faceting on?

 However I have noticed an
 issue with this because each time I do an optimize or a commit after
 prewarming the facets the cache gets cleared, according to the stats on the
 admin page, but the RSize does not shrink for the process, and the queries
 get slow again, so I prewarm the facets again and the memory usage keeps
 growing like the cache is not being recycled

The old searcher and cache won't be discarded until all requests using
it have completed.

 and as a results the prewarm
 query starts to get slower and slower as each time this occurs (after about
 5 times of prewarms and then commit the query takes about 30 mins... ugh)
 and almost run out of memory.

 Any thoughts on how to help improve this and fix the memory issue?

You could try the minDf param to reduce the number of facets stored in
the cache and reduce memory consumption.

-Yonik


Re: Searching combined English-Japanese index

2007-10-01 Thread Yonik Seeley
On 10/1/07, Maximilian Hütter [EMAIL PROTECTED] wrote:
 Yonik Seeley schrieb:
  On 10/1/07, Maximilian Hütter [EMAIL PROTECTED] wrote:
  When I search using an English term, I get results but the Japanese is
  not encoded correctly in the response. (although it is UTF-8 encoded)
 
  One quick thing to try is the python writer (wt=python) to see the
  actual unicode values of what you are getting back (since the python
  writer automatically escapes non-ascii).  That can help rule out
  incorrect charset handling by clients.
 
  -Yonik
 
 Thanks for the tip, it turns out that the unicode values are wrong... I
 mean the browser displays correctly what is sent. But I don't know how
 Solr gets these values.

OK, so they never got into the index correctly.
The most likely explanation is that the charset wasn't set correctly
when the update message was sent to Solr.

-Yonik


Re: Searching combined English-Japanese index

2007-10-02 Thread Yonik Seeley
On 10/2/07, Maximilian Hütter [EMAIL PROTECTED] wrote:
 Are you sure, they are wrong in the index?

It's not an issue with Jetty output encoding since the python writer
takes the string and converts it to ascii before that.  Since Solr
does no charset encoding itself on output, that must mean that it's in
the index incorrectly.

 When I use the Lucene Index
 Monitor (http://limo.sourceforge.net/) to look at the document in the
 index the Japanese is displayed correctly.

I've never really used limo, but it's possible it's incorrectly
interpreting what's in the index (and by luck doing the reverse
transformation that got the data in there incorrectly).

Try indexing a document with a unicode character specified via an
entity, to remove the issues of input char encodings.  For example if
a Japanese char has a unicode value of \u1234, then in the XML doc,
use &#x1234;
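For example, a minimal update message along those lines (the field names are
illustrative; the entities below spell out actual Japanese characters):

  <add>
    <doc>
      <field name="id">test-utf8</field>
      <field name="title">&#x65E5;&#x672C;&#x8A9E;</field>
    </doc>
  </add>

If this round-trips correctly, indexing and response writing are fine and
the problem is in how the original input was encoded when it was sent.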

-Yonik


Re: Seeing if an entry exists in an index for a set of terms

2007-10-03 Thread Yonik Seeley
On 10/3/07, Ian Holsman [EMAIL PROTECTED] wrote:
 Hi.

 I was wondering if there is an easy way to give Solr a list of things
 and find out which have entries.


 ie I pass it a list

 Bill Clinton
 George Bush
 Mary Papas
 (and possibly 20 others)

 to a solr index which contains news articles about presidents. I would
 like a response saying

 bill Clinton was found in 20 records
 George Bush was found in 15.

 possibly with the links, but thats not too important.

 I know I can do this by doing ~20 individual queries, but I thought
 there may be a more efficient way

How about
facet.query="Bill Clinton"&facet.query="George Bush", etc.

Will give you counts, but not the links

-Yonik


Re: Best way to change weighting based on the presence of a field

2007-10-05 Thread Yonik Seeley
On 10/5/07, Mike Klaas [EMAIL PROTECTED] wrote:
 The other option is to use a function query on the value stored in a
 field (which could represent a range of 'badness').  This can be used
 directly in the dismax handler using the bf (boost function) query
 parameter.

In the near future, you can do a real query-time boost (score multiplication)
by another field or function
https://issues.apache.org/jira/browse/SOLR-334

And even quickly update all the values of the field being used as the boost:
https://issues.apache.org/jira/browse/SOLR-351

-Yonik


Re: Urldecode Problem

2007-10-07 Thread Yonik Seeley
On 10/6/07, Frederik M. Kraus [EMAIL PROTECTED] wrote:
 Looks like we ran into a urldecode problem when having certain query
 strings. This is what happens:

 Client:  Jeffrey's Bay  ->  Jeffrey%26%2339%3Bs+Bay   (php 5.2
 urlencode/rawurlencode)

It looks like the client is doing XML escaping as it replaces ' with &#39;
Then each char of the &#39; is URL encoded.  This is incorrect of
course, urlencoding has nothing to do with XML.

-Yonik


Re: High-Availability deployment

2007-10-08 Thread Yonik Seeley
On 10/8/07, Daniel Alheiros [EMAIL PROTECTED] wrote:
 I'm about to deploy SOLR in a production environment

Cool, can you share exactly what it will be used for?

 and so far I'm a bit
 concerned about availability.

 I have a system that is responsible for fetching data from a database and
 then pushing it to SOLR using its XML/HTTP interface.

 So I'm going to deploy N instances of my application so it's going to be
 redundant enough.

 And I'm deploying SOLR in a Master / Slaves structure, so I'm using the
 slaves nodes as a way to keep my index replicated and to be able to use them
 to serve my queries. But my problem lies on the indexing side of things. Is
 there a good alternative like a Master/Master structure that I could use so
 if my current master dies I can automatically switch to my secondary master
 keeping my index integrity?

In all the setups I've dealt with, master redundancy wasn't an issue.
If something bad happens to corrupt the index, shut off replication to
the slaves and do a complete rebuild on the master.  If the master
hardware dies, reconfigure one of the slaves to be the new master.
These are manual steps and assumes that it's not the end of the world
if your search is stale for a couple of hours.  A schema change that
required reindexing would also cause this window of staleness.

If your index build takes a long time, you could set up a secondary
master to pull from the primary (just like another slave).  But
there's no support for automatically switching over slaves, and the
secondary wouldn't have stuff between the last commit and the primary
crash... so something would need to update it... (query for latest doc
and start from there).

You could also have two search tiers... another copy of the master and
multiple slaves.  If one was down, being upgraded, or being rebuilt,
you could direct search traffic to the other set of servers.

-Yonik


Re: High-Availability deployment

2007-10-08 Thread Yonik Seeley
On 10/8/07, Daniel Alheiros [EMAIL PROTECTED] wrote:
 Well I believe I can live with some staleness at certain moments, but it's
 not good as users are supposed to need it 24x7. So the common practice is to
 make one of the slaves the new master and switch things over to it and
 after the outage put them in sync again and do the proper switch back? OK,
 I'll follow this, but I'm still concerned about the amount of manual steps
 to be done...

That was the plan - never needed it though... (never had a master
completely die that I know of).  Having the collection not be updated
for an hour or so while the ops folks fixed things always worked fine.

 And another important issue is:
 how frequently have you seen indexes getting corrupted?

Just once I think - no idea of the cause (and I think it was quite an
old version of lucene).

 If I try to run a
 commit or optimize on a Solr master instance and its index got corrupted,
 will it run the command?

Almost all of the cases I've seen of a master failing was an OOM
error, often during segment merging (again, older versions of Lucene,
and someone forgot to change the JVM heap size from the default).
This could cause a situation where you added a document but the old
one was not deleted (overwritten).  Not corrupted at the Lucene
level, but if the JVM died at the wrong spot, search results could
possibly return two documents for the same unique key.  We normally
just rebuilt after a crash.

 And more importantly, will it run the
 postOptimize/postCommit scripts generating snapshots and then possibly
 propagating the bad index?

Normally not, I think... the JVM crash/restart left the Lucene write
lock acquired on the index and further attempts to modify it failed.

-Yonik


Re: High-Availability deployment

2007-10-08 Thread Yonik Seeley
On 10/8/07, Daniel Alheiros [EMAIL PROTECTED] wrote:
 Hmm, is there any exception thrown in case the index get corrupted (if it's
 not caused by OOM and the JVM crashes)? The document uniqueness SOLR offers
 is one of the many reasons I'm using it, and it would be excellent to know when
 it's gone. :)
 Does it mean that after recovering from a JVM crash should be recommended to
 rebuild my indexes instead of just re-starting it?

Yes, it's safer to do so.
I think in a future release we will be able to guarantee document
uniqueness even in the face of a crash.

-Yonik


Re: Availability Issues

2007-10-08 Thread Yonik Seeley
On 10/8/07, David Whalen [EMAIL PROTECTED] wrote:
  Have you taken a thread dump to see what is going on?

 We can't do it b/c during the unresponsive time we can't access
 the admin site (/solr/admin) at all.  I don't know how to do a
 thread dump via the command line

kill -3 <pid_of_jvm_running_solr>

Start with the thread dump.
I bet it's multiple queries piling up around some synchronization
points in lucene (sometimes caused by multiple threads generating the
same big filter that isn't yet cached).

-Yonik


Re: Availability Issues

2007-10-08 Thread Yonik Seeley
On 10/8/07, David Whalen [EMAIL PROTECTED] wrote:
 The logs show nothing but regular activity.  We do a tail -f
 on the logfile and we can read it during the unresponsive period
 and we don't see any errors.

You don't see log entries for requests until after they complete.
When a server becomes unresponsive, try shutting off further traffic
to it, and let it finish whatever requests it's working on (assuming
that's the issue) so you can see them in the log.  Do you see any
requests that took a really long time to finish?

-Yonik


Re: Availability Issues

2007-10-08 Thread Yonik Seeley
On 10/8/07, David Whalen [EMAIL PROTECTED] wrote:
  Do you see any requests that took a really long time to finish?

 The requests that take a long time to finish are just simple
 queries.  And the same queries run at a later time come back
 much faster.

 Our logs contain 99% inserts and 1% queries.  We are constantly
 adding documents to the index at a rate of 10,000 per minute,
 so the logs show mostly that.

Oh, so you are using the same boxes for updating and querying?
When you insert, are you using multiple threads?  If so, how many?

What is the full URL of those slow query requests?
Do the slow requests start after a commit?

  Start with the thread dump.
  I bet it's multiple queries piling up around some
  synchronization points in lucene (sometimes caused by
  multiple threads generating the same big filter that isn't
  yet cached).

 What would be my next steps after that?  I'm not sure I'd
 understand enough from the dump to make heads-or-tails of
 it.  Can I share that here?

Yes, post it here.  Most likely a majority of the threads will be
blocked somewhere deep in lucene code, and you will probably need help
from people here to figure it out.

-Yonik


Re: Facets and running out of Heap Space

2007-10-09 Thread Yonik Seeley
On 10/9/07, David Whalen [EMAIL PROTECTED] wrote:
 I run a faceted query against a very large index on a
 regular schedule.  Every now and then the query throws
 an out of heap space error, and we're sunk.

 So, naturally we increased the heap size and things worked
 well for a while and then the errors would happen again.
 We've increased the initial heap size to 2.5GB and it's
 still happening.

 Is there anything we can do about this?

Try facet.enum.cache.minDf param:
http://wiki.apache.org/solr/SimpleFacetParameters
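For example, something along these lines (the field name is a placeholder):

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=category&facet.enum.cache.minDf=25

Terms whose document frequency is below the threshold are counted without
being cached, trading a little speed for a much smaller filterCache footprint.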

-Yonik


Re: Facets and running out of Heap Space

2007-10-09 Thread Yonik Seeley
On 10/9/07, David Whalen [EMAIL PROTECTED] wrote:
  This is only used during the term enumeration method of
  faceting (facet.field type faceting on multi-valued or
  full-text fields).

 What if I'm faceting on just a plain String field?  It's
 not full-text, and I don't have multiValued set for it

Then you will be using the FieldCache counting method, and this param
is not applicable :-)
Are all the fields that you facet on like this?

The FieldCache entry might be taking up too much room, esp if the
number of entries is high, and the entries are big.  The requests
themselves can take up a good chunk of memory temporarily (4 bytes *
nValuesInField).

You could try a memory profiling tool and see where all the memory is
being taken up too.

-Yonik


Re: Facets and running out of Heap Space

2007-10-10 Thread Yonik Seeley
On 10/10/07, Mike Klaas [EMAIL PROTECTED] wrote:
 Have you tried setting multiValued="true" without reindexing?  I'm not
 sure, but I think it will work.

Yes, that will work fine.
One thing that will change is the response format for stored fields
<arr name="foo"><str>val1</str></arr>
instead of
<str name="foo">val1</str>

Hopefully in the future we can specify a faceting method w/o having to
change the schema.

-Yonik


Re: Internal Server Error and waitSearcher=false for commit/optimize

2007-10-11 Thread Yonik Seeley
On 10/10/07, Jason Rennie [EMAIL PROTECTED] wrote:
 We're using solr 1.2 and a nightly build of the solrj client code.  We very
 occasionally see things like this:

 org.apache.solr.client.solrj.SolrServerException: Error executing query
 at org.apache.solr.client.solrj.request.QueryRequest.process(
 QueryRequest.java:86)
 at org.apache.solr.client.solrj.impl.BaseSolrServer.query(
 BaseSolrServer.java:99)
 ...
 Caused by: org.apache.solr.common.SolrException: Internal Server Error

Is there a longer stack trace somewhere concerning the internal server error?

 We also occasionally see solr taking too long to respond.  We currently make
 our commit/optimize calls without any arguments.  I'm wondering whether
 setting waitSearcher=false might allow search queries to be served while a
 commit/optimize is being run.  I found this in an old message from this
 list:

While commit/optimize is being run, requests are served using the old
searcher - there shouldn't be any blocking.

 Is waitSearcher=false designed to
 allow queries to be processed while a commit/optimize is being run?

No, waitSearcher=true was designed such that a client could do a
commit, and wait for a new searcher to be registered such that a new
query request is guaranteed to see the changes.
waitSearcher=true/false only affects the thread calling commit... it
has no effect on other query requests which will continue to use the
previous searcher until the new one is registered.

-Yonik


Re: doubled/halved performance?

2007-10-11 Thread Yonik Seeley
On 10/11/07, Mike Klaas [EMAIL PROTECTED] wrote:
 I'm seeing some interesting behaviour when doing benchmarks of query
 and facet performance.  Note that the query cache is disabled, and
 the index is entirely in the OS disk cache.  filterCache is fully
 primed.

 Often when repeatedly measuring the same query, I'll see pretty
 consistent results (within a few ms), but occasionally a run which is
 almost exactly half the time:

 240ms vs. 120ms:

 solr: DEBUGINFO: /select/
 facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dismax&version=2.2&rows=1 0 239
 solr: DEBUGINFO: /select/
 facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dismax&version=2.2&rows=1 0 237
 solr: DEBUGINFO: /select/
 facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dismax&version=2.2&rows=1 0 120
 solr: DEBUGINFO: /select/
 facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dismax&version=2.2&rows=1 0 120
 solr: DEBUGINFO: /select/
 facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dismax&version=2.2&rows=1 0 237
 solr: DEBUGINFO: /select/
 facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dismax&version=2.2&rows=1 0 238

 The strange thing is that the execution time is halved across _all_
 parts of query processing:

 101.0   total time
1.0  setup/query parsing
68.0 main query
30.0 faceting
0.0  pre fetch
2.0  debug

 201.0   total time
1.0  setup/query parsing
138.0main query
58.0 faceting
0.0  pre fetch
4.0  debug

 I can't really think of a plausible explanation.  Fortuitous
 instruction pipelining?  It is hard to imagine a cause that wouldn't
 exhibit consistency.

So the queries are one at a time, the index isn't changing, and
nothing else happening in the system?
It would be easier to explain an occasional long query than an
occasional short one.
It's weird how the granularity seems to be on the basis of a request
(if the speedup sometimes happened half way through, then you'd get an
average of the times).

You could try -Xbatch to see if it's hotspot somehow, but I doubt that's it.

-Yonik


Re: Instant deletes without committing

2007-10-13 Thread Yonik Seeley
On 10/11/07, BrendanD [EMAIL PROTECTED] wrote:
 Yes, we have some huge performance issues with non-cached queries. So doing a
 commit is very expensive for us. We have our autowarm count for our
 filterCache and queryResultCache both set to 4096. But I don't think that's
 near high enough. We did have it as high as 16384 before, but it took over
 an hour to warm.

Look in the logs... what took an hour to warm?   there are separate
autowarm log messages for the query and filter caches.

 Some of our queries take 30-60 seconds to complete if
 they're not cached.

1) Configure static warming requests for any faceting that's common (see the
sketch below)
2) Configure static warming requests for any filters (fq) that are common
3) size the filter cache larger than what's needed to hold all the
facets (if that's too much memory, try the minDf param... see the
wiki)
4) if indexing performance isn't an issue, lower mergeFactor to lower
the average number of segments in the index (or optimize if you can)
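A minimal solrconfig.xml sketch of points 1 and 2 (the query, filter, and
facet field values are placeholders):

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">solr</str>
        <str name="fq">category:books</str>
        <str name="facet">true</str>
        <str name="facet.field">category</str>
      </lst>
    </arr>
  </listener>

A matching firstSearcher listener covers the case where the server has just
started and there is no previous searcher to autowarm from.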

-Yonik


Re: query syntax performance difference?

2007-10-13 Thread Yonik Seeley
On 10/11/07, BrendanD [EMAIL PROTECTED] wrote:
 Is there a difference in the performance for the following 2 variations on
 query syntax? The first query was a response from Solr by using a single fq
 parameter in the URL. The second query was a response from Solr by using
 separate fq parameter in the URL, one for each field.

 <str name="fq">
 product_is_active:true AND product_status_code:complete AND
 category_id:1001570 AND attribute_id_value_en_pair:1005758\:Elvis
 Presley
 </str>

 vs:
 <arr name="fq">
    <str>product_is_active:true</str>
    <str>product_status_code:complete</str>
    <str>category_id:1001570</str>
    <str>attribute_id_value_en_pair:1005758\:Elvis Presley</str>
 </arr>

 I'm just wondering if the queries get executed differently and whether it's
 better to split out each individual query into it's own statement or combine
 them using the AND operator.

If they almost always appear together, then use an AND and put them in
the same filter.
If they are relatively independent, use different filters.  Having
solr intersect a few filters is normally very fast, so independent
filters is usually fine.

-Yonik


Re: Non-sortable types in sample schema

2007-10-13 Thread Yonik Seeley
On 10/13/07, Lance Norskog [EMAIL PROTECTED] wrote:
 The sample schema in Solr 1.2 supplies two variants of integers, longs,
 floats, doubles. One variant is sortable and one is not.

 What is the point of having both? Why would I choose the non-sorting
 variants? Do they store fewer bytes per record?

They both sort (because sorting uses the un-inverted FieldCache
entry) ... but they don't both do range queries correctly (which
relies on term index order).

One might choose integer for reading a legacy lucene index, or
because they only need it for sorting or for function queries and the
FieldCache entry is smaller.

-Yonik


Re: comment-out a filter?

2007-10-15 Thread Yonik Seeley
On 10/15/07, David Whalen [EMAIL PROTECTED] wrote:
 I want to comment-out a filter in my schema.xml, specifically
 the solr.EnglishPorterFilterFactory filter.

 I want to know -- will this cause me to have to re-build my
 index?  Or will a restart of SOLR get the job done?

Yes, you will need to rebuild because the index will have stemmed
terms and queries will no longer match those terms in the index.
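For illustration, the change in schema.xml would look something like this,
inside the "text" fieldType (do it in both the index-time and query-time
analyzer sections):

  <!-- <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> -->

After restarting, reindex so that the unstemmed query terms line up with the
terms stored in the index.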

-Yonik


Re: Search results problem

2007-10-17 Thread Yonik Seeley
On 10/17/07, Maximilian Hütter [EMAIL PROTECTED] wrote:
 I also found this:

 Controls the maximum number of terms that can be added to a Field for a
 given Document, thereby truncating the document. Increase this number if
 large documents are expected. However, setting this value too high may
 result in out-of-memory errors.

 Coming from: http://www.ibm.com/developerworks/library/j-solr2/index.html

 That might be a problem for me.

 I was thinking about using copyFields, instead of one large fulltext
 field. Would that solve my problem, or would the maxFieldLength still
 apply when using copyFields?

maxFieldLength is a setting on the IndexWriter and applies to all fields.
If you want more tokens indexed, simply increase the value of
maxFieldLength to something like 20 and you should be fine.

There's no penalty for setting it higher than the largest field you
are indexing (no diff between 1M and 2B if all your docs have field
lengths less than 1M tokens anyway).
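maxFieldLength is set in solrconfig.xml; a hedged example of raising it (the
value shown is arbitrary):

  <maxFieldLength>1000000</maxFieldLength>

In the example config it appears in the index settings section(s), so change
it wherever your configuration currently defines it.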

-Yonik


Re: GET_SCORES flag in SolrIndexSearcher

2007-10-20 Thread Yonik Seeley
On 10/19/07, Chris Hostetter [EMAIL PROTECTED] wrote:
 (it doesn't matter that parseSort
 returns null when the sort string is just score ... SolrIndexSearcher
 recognizes a null Sort as being the default sort by score)

Yep... FYI, I did this early on specifically because no sort and
score desc  get you the same results from Lucene's
IndexSearcher.search(), but they take different code paths (the former
being slightly faster).

-Yonik


Re: Performance when indexing or cold cache

2007-10-22 Thread Yonik Seeley
On 10/22/07, Walter Underwood [EMAIL PROTECTED] wrote:
 <lst name="appends">
   <str name="fq">(pushstatus:A AND (type:movie OR type:person))</str>
 </lst>
   </requestHandler>

Perhaps try setting up a static warming query for this filter and any
other common filters?

Also look for correlations between when slow queries happen and the
number of segments in the index (and perhaps lower mergeFactor to
compensate if possible).

-Yonik


Re: Using wildcard with accented words

2007-10-22 Thread Yonik Seeley
On 10/22/07, Erik Hatcher [EMAIL PROTECTED] wrote:
 Perhaps this is a case that Solr could address with a third analyzer
 configuration (it already has query, and index differentiation)
 that could be incorporated for wildcard queries.   Thoughts on that?

I've actually thought about it previously; it would be nice for it
to all work automatically for the user.  Seems like the implementation
should be based on the TokenFilter level, then things like synonym
filters, stemmers, etc, would do nothing.

Perhaps add some new methods to BaseTokenFilterFactory to do prefix,
wildcard, etc, transformations?

Another gotcha is handling multiple tokens.
What happens if someone queries for myfield:foo-bar*
with a letter tokenizer or a word-delimiter filter?  It's not a simple
prefix query at all!

-Yonik


Re: Search results problem

2007-10-22 Thread Yonik Seeley
On 10/19/07, Maximilian Hütter [EMAIL PROTECTED] wrote:
 Yonik Seeley schrieb:
  On 10/17/07, Maximilian Hütter [EMAIL PROTECTED] wrote:
  I also found this:
 
  Controls the maximum number of terms that can be added to a Field for a
  given Document, thereby truncating the document. Increase this number if
  large documents are expected. However, setting this value too high may
  result in out-of-memory errors.
 
  Coming from: http://www.ibm.com/developerworks/library/j-solr2/index.html
 
  That might be a problem for me.
 
  I was thinking about using copyFields, instead of one large fulltext
  field. Would that solve my problem, or would the maxFieldLength still
  apply when using copyFields?
 
  maxFieldLength is a setting on the IndexWriter and applies to all fields.
  If you want more tokens indexed, simply increase the value of
  maxFieldLength to something like 20 and you should be fine.
 
  There's no penalty for setting it higher than the largest field you
  are indexing (no diff between 1M and 2B if all your docs have field
  lengths less than 1M tokens anyway).
 
  -Yonik
 
 Yes, that would be an easy solution, as there is no performance penalty
 as you say.
 I am still unsure, if the maxFieldLength applies to copyFields?

maxFieldLength applies to all fields (it's a Lucene concept, not a Solr one).

copyField and maxFieldLength are not related.

 When using copyFields I get an array back for that field (I copied to).
 So it seems to be different.

???  maxFieldLength only applies to the number of tokens indexed.  You
will always get the complete field back if it's stored, regardless of
what maxFieldLength is.

 Is there a performance penalty for using copyFields when indexing?

copyFields are done as a discrete step before indexing... almost no
cost to do that.
Indexing itself will have a performance impact if there are more
fields to index + store as a result of the copyField commands.

 How
 about the mixed fieldtypes in the source fields? What happens when I
 copy an sint based field and a string based field to a string based field?

copyField is done based on the string values, before any analysis.
Mixed content should be fine.

-Yonik


Re: Search results problem

2007-10-23 Thread Yonik Seeley
On 10/23/07, Maximilian Hütter [EMAIL PROTECTED] wrote:

  ???  maxFieldLength only applies to the number of tokens indexed.  You
  will always get the complete field back if it's stored, regardless of
  what maxFieldLength is.

 What I meant was, that it is different from just having a field with all
 the tokens compared to using copyField to copy all the content to a
 field. CopyField doesn't just copy the contents to the field but seems
 to somehow link them there.

copyField simply creates an additional value for the target...
it would end up the same as if you sent it in yourself.

 So if my maxFieldLength is for example set to 100 and I use copyField
 for 101 other fields, will the 101th get truncated?

copyField and maxFieldLength have nothing to do with each other.

maxFieldLength limits the number of *tokens* in all values of a given
name in a given document.

So if you had

field1: this is a test
and a maxFieldLength of 3, then the test token would be dropped.

if you had
field1: this is
field1: a test
and a maxFieldLength of 3, then the test token would still be dropped.


  Is there a performance penalty for using copyFields when indexing?
 
  copyFields are done as a discrete step before indexing... almost no
  cost to do that.
  Indexing itself will have a performance impact if there are more
  fields to index + store as a result of the copyField commands.

 The documents in my application have something like 400+ fields (many
 multivalued). For easy searching the application copies all the contents
 of the 400+ fields to one field (fulltext field) which is used as
 defaultfield. This field is quite large for many documents (it gets as
 long as 55 tokens). I was thinking about using copyField for copying
 the fields onto that field instead of having the application do it
 before sending it to Solr.

The indexing cost will be identical in either case.  Since copyField
is a little more elegant (why force the user to send the data more
than once), I'd use that.

If you don't need to search on all 400+ fields individually, don't
index them (just index your defaultfield).
And I wouldn't store your defaultfield since it's redundant info.
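A schema.xml sketch of that setup (field names and the *_t pattern are only
illustrative; list the source fields explicitly if they don't share a
dynamicField pattern):

  <field name="fulltext" type="text" indexed="true" stored="false"
         multiValued="true"/>
  <copyField source="*_t" dest="fulltext"/>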

-Yonik


Re: Payloads for multiValued fields?

2007-10-24 Thread Yonik Seeley
On 10/24/07, Alf Eaton [EMAIL PROTECTED] wrote:
 Yonik Seeley wrote:
  Could you perhaps index the captions as
  #1 this is the first caption
  #2 this is the second caption
 
  And then just look for #n in the highlighted results?
  For display, you could also strip out the #n in the captions.
 

 This was working ok for a while, but there's a problem: the highlighter
 doesn't return the whole caption - just the highlighted part - so
 sometimes the #n at the start of the caption field doesn't get returned
 and isn't available. Any other ideas? Perhaps there's a way for the
 response to say which fields of each document were matched?

Perhaps try hl.fragsize=0

http://wiki.apache.org/solr/HighlightingParameters
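For example, a request along these lines (field name illustrative) returns the
entire field value as the highlighted fragment:

  select?q=canada&hl=true&hl.fl=caption&hl.fragsize=0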

-Yonik


Re: Empty field error when boosting a dismax query using bf

2007-10-24 Thread Yonik Seeley
On 10/24/07, Alf Eaton [EMAIL PROTECTED] wrote:
 I'm trying to use the bf parameter to boost a dismax query based on the value 
 of a certain (integer) field. The trouble is that for some of the documents 
 this field is empty (rather than zero), which means that there's an error 
 when using the bf parameter:
 -
 select?q=query+string&qf=body&qt=dismax&bf=intfield
 -

 java.lang.NumberFormatException: For input string: ""

It looks like you are indexing a zero-length string for that field.
Instead, completely leave the field out.

In the future, we should probably have Solr remove (not index) empty
non-string fields.

-Yonik


Re: where did my foreign language go?

2007-10-24 Thread Yonik Seeley
On 10/24/07, Ian Holsman [EMAIL PROTECTED] wrote:
 Hi.

 I'm in the middle of bringing up a new solr server and am using the
 trunk. (where I was using an earlier nightly release of about 2-3 weeks
 ago on my old server)

 now, when I do a search for 日本 (japan) it used to show the kanji in
 the q area, but now it shows gibberish (æ—¥æœ¬) instead


 any hints on where I should start investigating on why this is happening?

My standard answer is to use the python writer (wt=python) to see what
the actual unicode values are when debugging an issue like this.

When I try your URL with the example server from the solr trunk, I get
'q':u'\u65e5\u672c',
And when I try your server, I get
'q':u'\u00e6\u0097\u00a5\u00e6\u009c\u00ac',

So the answer is that your app-server isn't correctly handling UTF-8
encoded URLs.
I see you are using Tomcat... see
http://wiki.apache.org/solr/SolrTomcat


URI Charset Config

If you are going to query Solr using international characters (>127)
using HTTP-GET, you must configure Tomcat to conform to the URI
standard by accepting percent-encoded UTF-8.

Edit Tomcat's conf/server.xml and add the following attribute to the
correct Connector element: URIEncoding="UTF-8".

<Server ...>
 <Service ...>
   <Connector ... URIEncoding="UTF-8"/>
 ...
   </Connector>
 </Service>
</Server>

This is only an issue when sending non-ascii characters in a query
request... no configuration is needed for Solr/Tomcat to return
non-ascii chars in a response, or accept non-ascii chars in an
HTTP-POST body.


-Yonik


Re: My filters are not used

2007-10-24 Thread Yonik Seeley
On 10/24/07, Norskog, Lance [EMAIL PROTECTED] wrote:
 I am creating a filter that is never used. Here is the query sequence:

 q=*:*&fq=contentid:00*&start=0&rows=200

 q=*:*&fq=contentid:00*&start=200&rows=200

 q=*:*&fq=contentid:00*&start=400&rows=200

 q=*:*&fq=contentid:00*&start=600&rows=200

 q=*:*&fq=contentid:00*&start=700&rows=200

 Accd' to the statistics here is my filter cache usage:

 lookups : 1
[...]

 I'm completely confused. I thought this should be 1 insert, 4 lookups, 4
 hits, and a hitratio of 100%.

Solr has a query cache too... the query cache is checked, there's a
hit, and the query process is short circuited.

-Yonik


Re: Forced Top Document

2007-10-25 Thread Yonik Seeley
On 10/25/07, Chris Hostetter [EMAIL PROTECTED] wrote:

 : The typical use case, though, is for the featured document to be on top only
 : for certain queries.  Like in an intranet where someone queries 401K or
 : retirement or similar, you want to feature a document about benefits that
 : would otherwise rank really low for that query.  I have not been able to make
 : sorting strategies work very well.

 this type of question typically falls into two use cases:
   1) targeted ads
   2) sponsored results

 in the targeted ads case, the special matches aren't part of the normal
 flow of results, and don't fit into pagination -- they always appear at
 the top, or to the right, on every page, no matter what the sort ... this
 kind of usage doesn't really need any special logic, it can be solved as
 easily by a second Solr hit as it can by custom request handler logic.

 in the sponsored results use case, the special matches should appear
 in the normal flow of results as the #1 (2, 3, etc) matches, so that they
 don't appear on page#2 ... but that also means that it's extremely
 disconcerting for users if those matches are still at the top when the
 userse resort.  if a user is looking at product listings, sorted by
 relevancy and the top 3 results all say they are sponsered that's fine
 ... but if the user sort by price and those 3 results are still at teh
 top of the list, even though they clearly aren't the chepest, that's just
 going to piss the user off.

 in my professional opinion: don't fuck with your users.  default to
 whatever order you want, but if the user specifically requests to sort the
 results by some option, do it.

 assuming you follow my professional opinion, then boosting docs to have
 an artificially high score will work fine.

 if you absolutely *MUST* have certain docs sorting before others,
 regardless of which sort option the user picks, then it is still possible to
 do ... i'm hesitant to even say how, but if people insist on knowing...



 always sort by score first, then by whatever field the user wants to sort
 by ... but when the user wants to sort on a specific field, move the user's
 main query input into an fq (so it doesn't influence the score) ... and
 use an extremely low boost matchalldocs query along with your special doc
 matching query as the main (scoring) query param.  the key being that
 even though your primary sort is on score, every doc except your special
 matches has identical scores.

That sorts by relevance for your sponsored results, right?
What if you want absolute ordering based on dollars spent on that
result, for example.

 (this may not be possible with dismax because it's not trivial to move
 the query into an fq

Should be easier in trunk:

fq={!dismax}foo bar
  or
fq={!dismax v=$userq}

-Yonik


Re: prefix-search ingnores the lowerCaseFilter

2007-10-25 Thread Yonik Seeley
On 10/25/07, Max Scheffler [EMAIL PROTECTED] wrote:
 Is it possible that the prefix-processing ignores the filters?

Yes, It's a known limitation that we haven't worked out a fix for yet.
The issue is that you can't just run the prefix through the filters
because of things like stop words, stemming, minimum length filters,
etc.

-Yonik


Re: indexing one documents with different populated fields causes a deletion of documents in with other populated fileds

2007-10-25 Thread Yonik Seeley
On 10/25/07, Anton Valdstein [EMAIL PROTECTED] wrote:
 Does solr check automatically for duplicate texts in  other fields and
 delete documents  that have the same text stored  in  other fields?

Solr automatically overwrites (deletes old versions of) documents with
the same uniqueKey field (normally called id).
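For reference, that key is declared in schema.xml, for example:

  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <uniqueKey>id</uniqueKey>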

Both Lucene and Solr lack the ability to change (or add fields to)
existing documents.

-Yonik


Re: SOLR 1.3 Release?

2007-10-25 Thread Yonik Seeley
On 10/25/07, Matthew Runo [EMAIL PROTECTED] wrote:
 Any ideas on when 1.3 might be released? We're starting a new project
 and I'd love to use 1.3 for it - is SVN head stable enough for use?

I think it's stable in the sense of does the right thing and doesn't
crash, but IMO
isn't stable in the sense that new interfaces (internal and external)
added since 1.2 may still be changing.

Lots of new stuff going in (and has gone in), and I wouldn't expect to
see 1.3 super soon.
Just IMO of course.

-Yonik


Re: indexing one documents with different populated fields causes a deletion of documents in with other populated fileds

2007-10-25 Thread Yonik Seeley
On 10/25/07, Anton Valdstein [EMAIL PROTECTED] wrote:
 thanks, that explains a lot (:,
 I have another question: about how the idf is calculated:
 is the document frequency the sum of all documents containing the term in
 one of their fields or just in the field the query contained?

idfs are field (fieldname) specific.  So it's based on the count of
documents containing that word in that field.

Things are done on the basis of term in Lucene, and a term consists
of the fieldname and the word.
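With Lucene's DefaultSimilarity, that per-field docFreq feeds into the score
roughly as idf = 1 + ln(numDocs / (docFreq + 1)), so a word that is rare
within a particular field gets a higher idf for queries against that field.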

-Yonik


Re: CollectionDistribution - Changes reflected immediately on master, but only after tomcat restart on slave

2007-10-26 Thread Yonik Seeley
On 10/26/07, Karen Loughran [EMAIL PROTECTED] wrote:
 But after distribution of this latest snapshot to the slave the collection
 does not show the update (with solr admin query url or via java query client)
 UNLESS I restart tomcat ?

Sounds like a config issue with the scripts... pulling the snapshot is
obviously working, but snapinstaller (calling commit) is broken.

try running bin/commit -V by hand on the slave

-Yonik


Re: prefix-search ingnores the lowerCaseFilter

2007-10-29 Thread Yonik Seeley
On 10/29/07, Martin Grotzke [EMAIL PROTECTED] wrote:
 On Thu, 2007-10-25 at 10:48 -0400, Yonik Seeley wrote:
  On 10/25/07, Max Scheffler [EMAIL PROTECTED] wrote:
   Is it possible that the prefix-processing ignores the filters?
 
  Yes, It's a known limitation that we haven't worked out a fix for yet.
  The issue is that you can't just run the prefix through the filters
  because of things like stop words, stemming, minimum length filters,
  etc.

 What about not having only facet.prefix but additionally
 facet.filtered.prefix that runs the prefix through the filters?
 Would that be possible?

The underlying issue remains - it's not safe to treat the prefix like
any other word when running it through the filters.

-Yonik


Re: Phrase Query Performance Question

2007-10-30 Thread Yonik Seeley
On 10/30/07, Haishan Chen [EMAIL PROTECTED] wrote:
 Thanks a lot for replying Yonik!

 I am running solr on a windows 2003 server (standard version). intel Xeon CPU 
 3.00GHz, with 4.00 GB RAM.
 The index is located on RAID5 with 2 million documents. Is there any way to
 improve query performance without moving to more powerful computer?

 I understand that the query performance of the phrase query (auto repair) has
 to do with the number of documents containing the two words. In fact the
 number of documents that have auto and repair is about 10. It is like 5%
 of the documents containing auto and repair.  It seems to me 937 ms is too
 slow.

Chen, that does seem slow I'm not sure why.
1) was this the first search on the index?  if so, try running some
other searches to warm things up first.
2) was the jvm in server mode?  (start with -server)
3) shut down unlrelated things on the system so that there is more
memory available to the OS to cache the index files

 Would it be faster if I run solr on linux system?

Maybe... Lucene does rely on the OS caching often used parts of the
index, so this can differ the most between Windows and Linux.  If you
have a Linux box lying around, trying it out quick to remove that
variable would be a good idea.

-Yonik


Re: FW: Score customization

2007-10-31 Thread Yonik Seeley
On 10/31/07, Victoria Kaganski [EMAIL PROTECTED] wrote:
 Does FunctionQuery actually override the default similarity function? If
 it does, how can I still access the similarity value?

FunctionQuery returns the *value* of a field (or a function of it) as
the value for a query - it does not use Similarity at all.

If you put a FunctionQuery in a BooleanQuery with other queries (like
normal relevance queries), the scores will be added together.

If you use a BoostedQuery, the FunctionQuery score will be multiplied
by the normal relevance score.

-Yonik


Re: fieldNorm seems to be killing my score

2007-11-01 Thread Yonik Seeley
Hmmm, a norm of 0.0???  That implies that the boost for that field
(text) was set to zero when it was indexed.
How did you index the data (straight HTTP, SolrJ, etc)?  What does
your schema for this field (and copyFields) look like?

-Yonik

On 11/1/07, Robert Young [EMAIL PROTECTED] wrote:
 Hi,

 I've been trying to debug why one of my test cases doesn't work. I
 have an index with two documents in, one talking mostly about apples
 and one talking mostly about oranges (for the sake of this test case)
 both of which have 'test_site' in their site field. If I run the query
 +(apple^4 orange) +(site:test_site) I would expect the document
 which talks about apples to always apear first but it does not.
 Looking at the debug output (below) it looks like fieldNorm is killing
 the first part of the query. Why is this and how can I stop it?

 <?xml version="1.0" encoding="UTF-8"?>
 <response>

 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">4</int>
  <lst name="params">
   <str name="rows">10</str>
   <str name="start">0</str>

   <str name="indent">on</str>
   <str name="q">+(apple^4 orange) +(site:test_site)</str>
   <str name="debugQuery">on</str>
   <str name="version">2.2</str>
  </lst>
 </lst>
 <result name="response" numFound="2" start="0">
  <doc>

   <str name="guid">test_index-test_site-integration:124</str>
   <str name="index">test_index</str>
   <str name="link">/oranges</str>
   <str name="site">test_site</str>
   <str name="snippet">orange orange orange</str>
   <str name="title">orange</str>

  </doc>
  <doc>
   <str name="guid">test_index-test_site-integration:123</str>
   <str name="index">test_index</str>
   <str name="link">/me</str>
   <str name="site">test_site</str>
   <str name="snippet">apple apple apple</str>

   <str name="title">apple</str>
  </doc>
 </result>
 <lst name="debug">
  <str name="rawquerystring">+(apple^4 orange) +(site:test_site)</str>
  <str name="querystring">+(apple^4 orange) +(site:test_site)</str>
  <str name="parsedquery">+(text:appl^4.0 text:orang) +site:test_site</str>
  <str name="parsedquery_toString">+(text:appl^4.0 text:orang)
 +site:test_site</str>

  <lst name="explain">
   <str name="id=test_index-test_site-integration:124,internal_docid=13">
 0.14332592 = (MATCH) sum of:
   0.0 = (MATCH) product of:
 0.0 = (MATCH) sum of:
   0.0 = (MATCH) weight(text:orang in 13), product of:
 0.24034579 = queryWeight(text:orang), product of:
   1.9162908 = idf(docFreq=5)
   0.1254224 = queryNorm
 0.0 = (MATCH) fieldWeight(text:orang in 13), product of:
   2.236068 = tf(termFreq(text:orang)=5)
   1.9162908 = idf(docFreq=5)
   0.0 = fieldNorm(field=text, doc=13)
 0.5 = coord(1/2)
   0.14332592 = (MATCH) weight(site:test_site in 13), product of:
 0.13407566 = queryWeight(site:test_site), product of:
   1.0689929 = idf(docFreq=13)
   0.1254224 = queryNorm
 1.0689929 = (MATCH) fieldWeight(site:test_site in 13), product of:
   1.0 = tf(termFreq(site:test_site)=1)
   1.0689929 = idf(docFreq=13)
   1.0 = fieldNorm(field=site, doc=13)
  </str>
   <str name="id=test_index-test_site-integration:123,internal_docid=14">
 0.14332592 = (MATCH) sum of:
   0.0 = (MATCH) product of:
 0.0 = (MATCH) sum of:
   0.0 = (MATCH) weight(text:appl^4.0 in 14), product of:
 0.96138316 = queryWeight(text:appl^4.0), product of:
   4.0 = boost
   1.9162908 = idf(docFreq=5)
   0.1254224 = queryNorm
 0.0 = (MATCH) fieldWeight(text:appl in 14), product of:
   2.236068 = tf(termFreq(text:appl)=5)
   1.9162908 = idf(docFreq=5)
   0.0 = fieldNorm(field=text, doc=14)
 0.5 = coord(1/2)
   0.14332592 = (MATCH) weight(site:test_site in 14), product of:
 0.13407566 = queryWeight(site:test_site), product of:
   1.0689929 = idf(docFreq=13)
   0.1254224 = queryNorm
 1.0689929 = (MATCH) fieldWeight(site:test_site in 14), product of:
   1.0 = tf(termFreq(site:test_site)=1)
   1.0689929 = idf(docFreq=13)
   1.0 = fieldNorm(field=site, doc=14)
  </str>
   </lst>
  </lst>
 </response>



Re: SOLR 1.3: defaultOperator always defaults to OR although AND is specifed.

2007-11-01 Thread Yonik Seeley
Try the latest... I just fixed this.
-Yonik

On 11/1/07, Britske [EMAIL PROTECTED] wrote:

 experimenting with SOLR 1.3 and discovered that although I specified
 <solrQueryParser defaultOperator="AND"/> in schema.xml

 q=a+b behaves as q=a OR B instead of q=a AND b

 Obviously this is not correct.
 I used the nightly of 29 oct.

 Cheers,
 Geert-Jan

 --
 View this message in context: 
 http://www.nabble.com/SOLR-1.3%3A-defaultOperator-always-defaults-to-OR-although-AND-is-specifed.-tf4731773.html#a13529997
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: overlapping onDeckSearchers message

2007-11-03 Thread Yonik Seeley
On 11/3/07, Brian Whitman [EMAIL PROTECTED] wrote:
 I have a solr index that hasn't had many problems recently but I had
 the logs open and noticed this a lot during indexing:

 [16:23:34.086] PERFORMANCE WARNING: Overlapping onDeckSearchers=2

That means that one searcher hasn't yet finished warming in the
background, and a commit was just done and another searcher started
warming.

-Yonik


Re: FW: Score customization

2007-11-03 Thread Yonik Seeley
On 11/3/07, Victoria Kaganski [EMAIL PROTECTED] wrote:
 I guess I was not clear... I understand that if I use FunctionQuery, its
 result value will be returned as the score, instead of the similarity. Am I right?

Only for the FunctionQuery part... it's not an all or nothing thing.

Let me give you a specific example in Solr Query syntax:

+text:"spider man"~100 _val_:popularity

This query will result in the full-text relevance score (yes, using
similarity) of the first part added to the value of the popularity
field.  Try some examples out and let us know if you don't get what
you expect.

-Yonik




 

 From: [EMAIL PROTECTED] on behalf of Yonik Seeley
 Sent: Wed 10/31/2007 7:21 PM
 To: solr-user@lucene.apache.org
 Subject: Re: FW: Score customization



 On 10/31/07, Victoria Kaganski [EMAIL PROTECTED] wrote:
  Does FunctionQuery actually override the default similarity function? If
  it does, how can I still access the similarity value?

 FunctionQuery returns the *value* of a field (or a function of it) as
 the value for a query - it does not use Similarity at all.

 If you put a FunctionQuery in a BooleanQuery with other queries (like
 normal relevance queries), the scores will be added together.

 If you use a BoostedQuery, the FunctionQuery score will be multiplied
 by the normal relevance score.

 -Yonik






Re: customer request handler doesn't envok the query tokenization chain

2007-11-04 Thread Yonik Seeley
On 11/4/07, Yu-Hui Jin [EMAIL PROTECTED] wrote:
 Let's say we defined a custom field type where, when querying and indexing,
 the solr.LowerCaseFilterFactory is used as the last filter to lower-case all
 letters.  In the Analysis UI, we found tokenization is working correctly.

 We also defined  a custom request handler which always creates a boolean
 query that ANDs all tokens for fielded queries (we overrode the
 getFieldQuery method only).

First, if all you are doing is ANDing all the tokens, you can just
change the default operator to AND (q.op=AND).

Analysis is done during query parsing by the query parser... if you
create your own queries, you need to do that analysis yourself.

-Yonik


Re: customer request handler doesn't envok the query tokenization chain

2007-11-05 Thread Yonik Seeley
On 11/5/07, Yu-Hui Jin [EMAIL PROTECTED] wrote:
 Just curious, does the default operator  ( AND or OR) specify the
 relationship between a field/value component or between the tokens of the
 same field/value component?

between any clauses in a boolean query.

 e.g. for a query like this:

 field1:abc  field2:xyz

 does the operator   connect field1:abc and field2:xyz , or it connects  the
 tokens from abc and xyz for their respective field?

These are two different query clauses (the fieldnames don't matter).
If the default operator is OR, then it will be interpreted as
field1:abc OR field2:xyz  (both optional)
if the default operator is set to AND then it will be
field1:abc AND field2:xyz  (both required)

-Yonik


Re: Phrase Query Performance Question and score threshold

2007-11-05 Thread Yonik Seeley
On 11/5/07, Haishan Chen [EMAIL PROTECTED] wrote:
 As for the first issue: the number of different phrase queries with
 performance issues I have found so far is about 10.

If these are normal phrase queries (no slop), a good solution might be
to simply index and query these phrases as a single token.  One could
do this with a SynonymFilter.
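A sketch of that approach (file name and mapping are illustrative): put a line
like

  auto repair => autorepair

in a synonyms file, and add the filter to the field's index- and query-time
analyzers in schema.xml:

  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="false"/>

so the phrase is collapsed to a single indexed token on both sides.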

Oh, and no, a score threshold won't help performance.

 I believe there will be a lot more I just haven't tried.  It can be solved by
 using faster hardware though.  Also I believe it will help if SOLR has a
 similar distributed search
 architecture like NUTCH so that it can scale out instead of scale up.

It's coming...

-Yonik


Re: specify index location

2007-11-05 Thread Yonik Seeley
On 11/5/07, evol__ [EMAIL PROTECTED] wrote:
 Just a remark:
   <!-- Used to specify an alternate directory to hold all index data
other than the default ./data under the Solr home.
If replication is in use, this should match the replication
 configuration. -->
 Might be a good idea to change this to ./data/index to reflect the location
 that is expected in there.

 ./data is the generic solr data directory... "index" stores the main
 index under the data directory.
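For reference, the setting the comment describes looks like this (the path is
only illustrative):

  <dataDir>/var/data/solr</dataDir>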

-Yonik


Re: value boosts? (boosting a multiValued field's data)

2007-11-05 Thread Yonik Seeley
On 11/6/07, evol__ [EMAIL PROTECTED] wrote:
 Hi. Is the expansion method described in the following year old post still
 the best available way to do this?
 http://www.nabble.com/newbie-Q-regarding-schema-configuration-tf1814271.html#a4956602

 The way I understand it, indexing these
 <field name="foo" boost="1.0">First val</field>
 <field name="foo" boost="0.8">Less important value</field>
 would just make the boost 0.8 field-wide?

Yes... all boost values for multivalued fields are multiplied
together.  Nothing we can do about that... only one norm (boost *
lengthNorm) is stored per document per unique field.

-Yonik


Re: query syntax

2007-11-06 Thread Yonik Seeley
On 11/6/07, Traut [EMAIL PROTECTED] wrote:
  I have in the index a document with field name and its value is
 somename123
  Why can't I find anything with the query
  name:somename123*

This is a prefix query.  No analysis is done on the prefix, so it may
not match analysis that was done when the document was indexed.

For example, if you use WordDelimiterFilter, this may be indexed as
somename 123

  but there are results on query
  name:"somename123*"

This is not a prefix query.  The * will most likely be removed by the
analyzer, leaving you effectively with a query of name:somename123

-Yonik


Re: Can you parse the contents of a field to populate other fields?

2007-11-07 Thread Yonik Seeley
On 11/6/07, Kristen Roth [EMAIL PROTECTED] wrote:
 Yonik - thanks so much for your help!  Just to clarify; where should the
 regex go for each field?

Each field should have a different FieldType (referenced by the type
XML attribute).  Each fieldType can have its own analyzer.  You can
use a different PatternTokenizer (which specifies a regex) for each
analyzer.
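For instance, something along these lines in schema.xml (type name and pattern
are purely illustrative; each type would use its own pattern/group):

  <fieldType name="extractedPart" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.PatternTokenizerFactory"
                 pattern="^([^,]*)," group="1"/>
    </analyzer>
  </fieldType>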

-Yonik


Re: SOLR 1.2 - Duplicate Documents??

2007-11-08 Thread Yonik Seeley
On Nov 7, 2007 12:30 PM, realw5 [EMAIL PROTECTED] wrote:
 We did have Tomcat crash once (JVM OutOfMem) during an indexing process,
 could that be a possible source of the issue?

Yes.
Deletes are buffered and carried out in a different phase.

-Yonik


Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-10 Thread Yonik Seeley
On Nov 10, 2007 4:24 PM, David Neubert [EMAIL PROTECTED] wrote:
 So if I am hitting multiple fields (in the same search request) that invoke 
 different Analyzers -- am I at a dead end, and have to resort to consecutive
 multiple queries instead

Solr handles that for you automatically.

 The app that I am replacing (and trying to enhance) has the ability to search 
 multiple books at once
 with sen/par and case sensitivity settings individually selectable per book

You could easily select case sensitivity or not *per query* across all books.
You should step back and see what the requirements actually are (i.e.
the reasons why one needs to be able to select case
sensitive/insensitive on a book level... it doesn't make sense to me
at first blush).

It could be done on a per-book level in solr with a more complex query
structure though...

(+case:sensitive +(normal relevancy query on the case sensitive fields
goes here)) OR (+case:insensitive +(normal relevancy query on the case
insensitive fields goes here))

-Yonik


Re: solr range query

2007-11-12 Thread Yonik Seeley
On Nov 12, 2007 8:02 AM, Heba Farouk [EMAIL PROTECTED] wrote:
 I would like to use solr to return ranges of searches on an integer
 field. If I write in the url offset:[0 TO 10], it returns documents
 with offset values 0, 1, 10 only, but I want it to return the range 0, 1, 2,
 3, 4, ..., 10. How can I do that with solr?

Use fieldType=sint (sortable int... see the schema.xml), and reindex.
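For example (field name illustrative; the sint type is the one shipped in the
example schema.xml):

  <fieldType name="sint" class="solr.SortableIntField"
             sortMissingLast="true" omitNorms="true"/>
  <field name="offset" type="sint" indexed="true" stored="true"/>

After reindexing, offset:[0 TO 10] matches 0 through 10 numerically rather
than lexicographically.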

-Yonik


Re: no segments* file found

2007-11-12 Thread Yonik Seeley
On Nov 12, 2007 3:46 AM, SDIS M. Beauchamp [EMAIL PROTECTED] wrote:
 If I don't optimize, I've got a "too many files open" error at about 450K files
 and a 3 GB index

You may need to increase the number of filedescriptors in your system.
If you're using Linux, see this:
http://www.cs.uwaterloo.ca/~brecht/servers/openfiles.html
Check the system wide limit and the per-process limit.
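On Linux, a quick way to inspect them might look like this (exact commands and
paths vary by distro, and raising the hard limit may need root or an
/etc/security/limits.conf entry):

  ulimit -n                   # per-process limit for the current shell
  cat /proc/sys/fs/file-max   # system-wide limit
  ulimit -n 65536             # raise the soft limit for this shell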

-Yonik


Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-12 Thread Yonik Seeley
On Nov 12, 2007 2:20 PM, David Neubert [EMAIL PROTECTED] wrote:
 Erik - thanks, I am considering this approach, versus explicit redundant
 indexing -- and am also considering Lucene -

There's not a well defined solution in either IMO.

 - problem is, I am one week into both technologies (though have years in the 
 search space) -- wish I could
 go to Hong Kong -- any discounts available anywhere :)

Unfortunately the OS Summit has been canceled.

-Yonik


Re: Exception in SOLR when querying for fields of type string

2007-11-13 Thread Yonik Seeley
On Nov 13, 2007 6:23 PM, Kasi Sankaralingam [EMAIL PROTECTED] wrote:
 It is not tokenized, it is a string field, so will it still match
 photo for field 'title_s' and book for the default field?

Yes, because the query parser splits up things by whitespace before
analyzers are even applied.
Do you have a default field defined?
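For reference, that's the defaultSearchField declared in schema.xml, e.g.:

  <defaultSearchField>text</defaultSearchField>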

-Yonik


Re: how to load custom valuesource as plugin

2007-11-14 Thread Yonik Seeley
Unfortunately, the function query parser isn't currently pluggable.

-Yonik

On Nov 14, 2007 2:02 PM, Britske [EMAIL PROTECTED] wrote:

 I've created a simple valueSource which is supposed to calculate a weighted
 sum over a list of supplied valuesources.

 How can I let Solr recognise this valuesource?

 I tried to simply upload it as a plugin, and reference it by its name (wsum)
 in a functionquery, but got an "Unknown function wsum in FunctionQuery" error.

 Can anybody tell me what I'm missing here?

 Thanks in advance,
 Geert-Jan


Re: score customization

2007-11-17 Thread Yonik Seeley
On Nov 15, 2007 11:06 AM, Jae Joo [EMAIL PROTECTED] wrote:
 I am looking for the way to get the score - only to the hundredth - ex.
 4.09, something like that.
 Currently, it has 7 decimal digits: <float name="score">1.8032384</float>

If you want to display scores only to the hundredths place, simply do
that in your client.
There's not a good reason to try and add this to solr... saving 5
bytes per document wouldn't be worth it.

-Yonik


Re: Payloads in Solr

2007-11-17 Thread Yonik Seeley
On Nov 17, 2007 2:18 PM, Tricia Williams [EMAIL PROTECTED] wrote:
 I was wondering how Solr people feel about the inclusion of Payload
 functionality in the Solr codebase?

All for it... depending on what one means by payload functionality of course.
We should probably hold off on adding a new lucene version to Solr
until the Payload API has stabilized (it will most likely be changing
very soon).

 From a recent message to the [EMAIL PROTECTED] mailing list:
I'm working on the issue
  https://issues.apache.org/jira/browse/SOLR-380 which is a feature
  request that allows one to index a Structured Document which is
  anything that can be represented by XML in order to provide more
  context to hits in the result set.  This allows us to do things like
  query the index for Canada and be able to not only say that that
  query matched a document titled Some Nonsense but also that the
  query term appeared on page 7 of chapter 1.  We can then take this one
  step further and markup/highlight the image of this page based on our
  OCR and position hit.
  For example:
 
  <book title='Some Nonsense'><chapter title='One'><page name='1'>Some
  text from page one of a book.</page><page name='7'>Some more text from
  page seven of a book. Oh and I'm from Canada.</page></chapter></book>
 
I accomplished this by creating a custom Tokenizer which strips the
  xml elements and stores them as a Payload at each of the Tokens
  created from the character data in the input.  The payload is the
  string that describes the XPath at that location.  So for Canada the
  payload is /book[title='Some
  Nonsense']/chapter[title='One']/page[name='7']

That's a lot of data to associate with every token... I wonder how
others have accomplished this?
One could compress it with a dictionary somewhere.
I wonder if one could index special begin_tag and end_tag tokens, and
somehow use span queries?

 Using Payloads requires me to include lucene-core-2.3-dev.jar  which
 might be a barrier.  Also, using my Tokenizer with Solr specific
 TokenFilter(s) looses the Payload at modified tokens.

Yes, this will be an issue for many custom tokenizers that don't yet
know about payloads but that create tokens.  It's not clear what to do
in some cases when multiple tokens are created from one... should
identical payloads be created for the new tokens... it depends on what
the semantics of those payloads are.

-Yonik

