RE: add CJKTokenizer to solr

2007-06-21 Thread Xuesong Luo
Thanks, Toru and Chris,
I tried both the CJKTokenizer and CJKAnalyzer. Both return some unexpected 
highlight results when I tested with German text. The field value I searched is 
"Ein Mann beißt den Hund". The search term is beißt.

When using CJKAnalyzer, beißt is treated as 2 separate terms (bei and ß); the 
highlight result is:
Ein Mann beißt den Hund 

When using CJKTokenizer, beißt is treated as 3 separate terms; the result is:
Ein Mann beißt den Hund

When using the standard tokenizer, beißt is treated as one word; the result is:
Ein Mann beißt den Hund


I understand why the standard tokenizer treats beißt as a word, but I don't know 
how CJKAnalyzer and CJKTokenizer work; could anyone explain a little bit?


Thanks
Xuesong

-Original Message-
From: Toru Matsuzawa [mailto:[EMAIL PROTECTED] 
Sent: Monday, June 18, 2007 10:29 PM
To: solr-user@lucene.apache.org
Subject: Re: add CJKTokenizer to solr

Sorry, it was not possible to attach it before, so I am sending it again.

> > I got the error below after adding CJKTokenizer to schema.xml.  I
> > checked the constructor of CJKTokenizer; it requires a Reader parameter,
> > and I guess that's why I get this error. I searched the email archive and it
> > seems to work for other users. Does anyone know what the problem is?
> 
> 
> The CJKTokenizerFactory that I am using is attached.
> 
--
package org.apache.solr.analysis.ja;

import java.io.Reader;
import org.apache.lucene.analysis.cjk.CJKTokenizer;

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenizerFactory;

/**
 * CJKTokenizer for Solr
 * @see org.apache.lucene.analysis.cjk.CJKTokenizer
 * @author matsu
 *
 */
public class CJKTokenizerFactory extends BaseTokenizerFactory {

  /**
   * @see org.apache.solr.analysis.TokenizerFactory#create(Reader)
   */
  public TokenStream create(Reader input) {
    return new CJKTokenizer(input);
  }

}
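
For reference, a schema.xml entry wiring this factory in might look like the
following sketch (the fieldtype name here is just an example):

<fieldtype name="text_cjk" class="solr.TextField">
  <analyzer>
    <tokenizer class="org.apache.solr.analysis.ja.CJKTokenizerFactory"/>
  </analyzer>
</fieldtype>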


-- 
Toru Matsuzawa





RE: Faceted Search!

2007-06-21 Thread Chris Hostetter

: generating an XML feed file and feeding it to the Solr server.  However, I was
: also looking into implementing sub-categories within the
: categories, if that makes sense.  For example, on shopper.com we have
: categories for price, manufacturer and so on, and within them there are
: sub-categories (price is sub-categorized into <$100, 100-200, 200-300, etc.).
: I don't have constraints in terms of technology.  If I have to implement
: a db server I won't mind implementing it.  Anyway, please shine a light on
: how you would handle this issue.  Any suggestion will be appreciated.

the shopper.com solution is very VERY specialized and specific to the
data model used to manage the category metadata ... if i had to do it
over again i would do it a lot differently.

way way back there was a thread about "complex faceting" where i included
some ideas on a possible facet configuration xml syntax which could
then be parsed by a request handler, with different types of faceting
(simple query, ranges, based on terms, prefix) delegated to helper
classes.  there was also the idea of being able to group facets or make
facets depend on other facets (ie: don't show the author facet until a
value has been picked from the author_initial facet)
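
in the meantime, something like the shopper.com price buckets can be
approximated with stock facet.query parameters (field names here are
illustrative, assuming a sortable price field):

...&facet=true&facet.field=manufacturer
   &facet.query=price:[* TO 100]
   &facet.query=price:[100 TO 200]
   &facet.query=price:[200 TO 300]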

nothing ever really came of it, but it's how i'd probably approach trying
to tackle something like the shopper.com functionality if CNET threw away
our product metadata data model and started from scratch.

http://www.nabble.com/metadata-about-result-sets--t1243321.html#a3334244



-Hoss



Re: All facet.fields for a given facet.query?

2007-06-21 Thread Chris Hostetter

: > facet.mincount is a way to tell solr not to bother giving you those 0
: > counts ...
:
: An aside: shouldn't that be the default?  All of the people using
: facets that I have seen always have to set facet.mincount=1 (or
: facet.zeros=false)

Hmmm... maybe, but it's a really easy option to turn on, and i think if we
don't have facet.mincount default to 0 new users might get confused
when some constraints don't show up ... returning them with a 0 count
makes it clear Solr knows about them and tried them and found no
intersection with the current results.
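
for the record, turning it on is a single extra parameter, e.g. against the
example index from the tutorial:

http://localhost:8983/solr/select?q=video&facet=true&facet.field=cat&facet.mincount=1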


-Hoss



Re: All facet.fields for a given facet.query?

2007-06-21 Thread Chris Hostetter

: I get your point, but how to know where additional metadata is of value
: if not
: just trying? Currently I start with a generic approach to see what

Manpower.

for simple schemas the brute force facet-on-everything approach can scale
well ... but as soon as you start talking about having hundreds of dynamic
fields where every product might be different, you have to either
accept that you're going to be fighting an uphill performance battle
-- or start explicitly classifying those fields in some way that lets you
know which ones to use in which use cases (or at the very least: which
order to try them in for which use cases, so you can do the most important
ones first and stop when you have some options to give the user).

you can even use the brute force "facet on everything" in Solr approach to
help you find those patterns for classifying your fields ... you might
even be able to completely automate it ... but i'm guessing you're going
to want to do it in batch on the backend and not in real time every time a
user does a search.




-Hoss



Re: Facets & Links

2007-06-21 Thread Chris Hostetter

: solr.zappos.com/select/&fq=brand_exact:VALUE  ?

that will work (just remember to URL escape the brand name, and probably
put it in quotes too if you think it might contain whitespace).
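
for instance, with a hypothetical multi-word brand the link would look like:

solr.zappos.com/select/?q=shoes&fq=brand_exact:%22Nine+West%22

(%22 being the URL-escaped double quote, and + the escaped space)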

: Should I not be sending the same facets the second time to SOLR? Do
: you remove the facet they've just clicked on?

depends on your use case ... if you've got a multivalued field you might
want to let people keep faceting on it (brand probably isn't a good
example, but what if i pick "red" in a facet on color and there are
multitone shoes? .. i might want to see all the options for other colors
that i can have on my shoes).




-Hoss



Re: commit script with solr 1.2 response format

2007-06-21 Thread Chris Hostetter

: I guess we should look for 'status="0"><' ?

that wouldn't quite work.

: Or,  if you get a response code of 200, it's a success unless
: you see status=""

we could always make it an option in the scripts.conf file -- what
substring to match on ... just in case people want to write their own
crazy commit handler and still use the script ... but that may be
overkill.



-Hoss



Re: add CJKTokenizer to solr

2007-06-21 Thread Chris Hostetter

: Regarding reflection - even if reflection is slower, and I'm sure it is,
: I just don't know exactly how much slower it is, couldn't we cache the
: instantiated instances keyed off by name?  Such instances would have to
: be thread-safe, but I imagine most/all Tokenizers already are
: thread-safe.

most instances of Tokenizer and TokenFilter aren't threadsafe -- i'm not
sure how they could be, given that the only real method they have is
"next()" ... every implementation i know of is constructed using a
Reader or TokenStream (depending on whether it's a Tokenizer or
TokenFilter) ... so reuse with new input is a bit hard.

as i mentioned in one of the threads i linked to, the best we can probably
do is resolve the classname into a Class object in the init method of
a ReflectionTokenFilterFactory or ReflectionTokenizerFactory class, but a
new instance really needs to be constructed every time the create() method
is called.
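
a rough sketch of what that could look like (nothing like this exists in
Solr yet; the "tokenizerClass" arg name and the error handling are made up):

package org.apache.solr.analysis;

import java.io.Reader;
import java.lang.reflect.Constructor;
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;

public class ReflectionTokenizerFactory extends BaseTokenizerFactory {

  /** resolved once in init(), reused on every create() call */
  private Constructor ctor;

  public void init(Map<String,String> args) {
    super.init(args);
    String className = args.get("tokenizerClass"); // arg name is an assumption
    try {
      ctor = Class.forName(className).getConstructor(new Class[]{Reader.class});
    } catch (Exception e) {
      throw new RuntimeException("cannot resolve tokenizer class: " + className, e);
    }
  }

  public TokenStream create(Reader input) {
    try {
      // a new instance per call -- the reflective construction cost lands here
      return (TokenStream) ctor.newInstance(new Object[]{input});
    } catch (Exception e) {
      throw new RuntimeException("cannot instantiate tokenizer", e);
    }
  }
}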

like i said though: i'm in favor of factories like this ... i just don't
think we should do anything to hide their use and make referring to
Tokenizer or TokenFilter class names directly use reflection magically.


: http://www.nabble.com/Re%3A-making-schema.xml-nicer-to-read-use-p5939980.html
: http://www.nabble.com/foo-tf1737025.html#a4720545



-Hoss



Facet searching on single field with multiple words value.

2007-06-21 Thread ashwani kabra

Hi friends,

I tried to implement facet searching in some sample code, and testing various
cases I found one case with no results: narrowing by the single field "title"
with a multi-word value (a phrase).

First, the code builds the free-text Lucene query and converts it into a
QueryFilter, then hands the IndexReader to that filter to get a BitSet.
It also prepares a TermQuery from a Lucene field and a given value (the field
and value are obtained dynamically, and the value can be a single word or
multiple words), wraps that term query in a query filter, and again converts
it to a BitSet. So there are two different BitSets, one based on the free-text
Lucene query and the other based on the field to narrow by, and the code does
the following operation:

 secondBitSet.and(firstBitSet);
 int count = secondBitSet.cardinality();

Now, the problem occurs when I pass multiple words in the term query, e.g.:

QueryFilter filter = new QueryFilter(new TermQuery(new Term(fieldName,
fieldValue)));

where the field name and field value are obtained dynamically.
Here is an example value:

fieldName: "Title"
fieldValue: "Software Development" (or it may be "Software AND Development")

In this case I'm not getting any results, i.e. count = 0.

Code is given below.

IndexReader reader = searcher.getIndexReader();
BitSet firstBitSet = firstQueryFilter.bits(reader);

QueryFilter filter = new QueryFilter(new TermQuery(new Term(fieldName,
fieldValue)));
BitSet secondBitSet = filter.bits(reader);

secondBitSet.and(firstBitSet);
int count = secondBitSet.cardinality();
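
The likely mechanics behind the zero count (a guess, for anyone reading along):
a TermQuery is never analyzed, so "Software Development" is looked up as one
literal term, while a field tokenized at index time only contains terms like
"software" and "development". A lookup built from the analyzed terms would be
a PhraseQuery, along these lines:

PhraseQuery pq = new PhraseQuery();
pq.add(new Term("Title", "software"));    // terms as the analyzer indexed them
pq.add(new Term("Title", "development"));
QueryFilter filter = new QueryFilter(pq);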


-- 
Regards
Ashwani Kabra



Re: commit script with solr 1.2 response format

2007-06-21 Thread James liu

Aha, same question I found a few days ago.

I'm sorry, I forgot to submit it.

2007/6/22, Yonik Seeley <[EMAIL PROTECTED]>:


On 6/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> I just started running the scripts and
>
> The commit script seems to run fine, but it says there was an error.  I
> looked into it, and the scripts expect 1.1 style response:
>
>    <result status="0"></result>
>
> 1.2 /update returns:
>
>    <?xml version="1.0" encoding="UTF-8"?>
>    <response>
>    <lst name="responseHeader">
>      <int name="status">0</int>
>      <int name="QTime">44</int>
>    </lst>
>    </response>

I guess we should look for 'status="0"><' ?

Or,  if you get a response code of 200, it's a success unless
you see status=""

-Yonik





--
regards
jl


Re: add CJKTokenizer to solr

2007-06-21 Thread Otis Gospodnetic
Eh, I was looking at these Factories just the other day and wondering about the 
same stuff as Daniel.
Regarding reflection - even if reflection is slower, and I'm sure it is, I just 
don't know exactly how much slower it is, couldn't we cache the instantiated 
instances keyed off by name?  Such instances would have to be thread-safe, but 
I imagine most/all Tokenizers already are thread-safe.

Daniel, I suggest you take that UbberTokenizerFactory code, slap ASL 2.0 on top 
of it, add simple instance caching as mentioned above, and post the code to 
JIRA.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Chris Hostetter <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, June 21, 2007 9:39:20 PM
Subject: Re: add CJKTokenizer to solr


: Why instead of that we don't create an UbberFactory that takes the Tokenizer
: class as a parameter and instantiates the proper Tokenizer?

The idea has come up before ... and there's really no reason why it
wouldn't be okay to include a reflection based factory like this in Solr
-- it just hasn't been done yet.

One of the reasons is that there are some performance costs associated
with the reflection, so we wouldn't want to completely replace the existing
"configuration via factory name" model with a "configure via class name
and an uber factory does the reflection quietly in the background" model,
because it's the kind of approach that would really only make sense for
simple prototypes -- in any system where you are really concerned about
performance, reflection on every analyzer call would probably be pretty
expensive.  (although i'd love to see benchmarks prove me wrong)

Another question in my mind is "why doesn't solr provide an optional jar
with factories for every tokenizer/tokenfilter in the lucene contribs?"
... the only answer to that is that no one has bothered to crank out a
patch that does it.

http://www.nabble.com/Re%3A-making-schema.xml-nicer-to-read-use-p5939980.html
http://www.nabble.com/foo-tf1737025.html#a4720545


-Hoss






Re: Multiple doc types in schema

2007-06-21 Thread Frédéric Glorieux


After further reading (thanks Hoss):


Depending on update patterns and index sizes, you can probably get
better efficiency with multiple indexes, but not really more
functionality (in your case), right?


Maybe I'm coming around to your point of view: "Loose Schema with Dynamic 
Fields" is probably my solution. It feels strange to me to treat a Lucene 
index as a blob, but if it works for outfits bigger than mine, I should 
follow. So it means one fieldtype per analyzer, and the data model logic 
lives only on the collection side. I think I have my plan for September, 
but I would be very glad if you have something to add.


--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


Re: Multiple doc types in schema

2007-06-21 Thread Frédéric Glorieux

Thanks, Yonik, for sharing your thinking.

This doesn't sound like true federated search, 


I'm afraid I don't understand "federated search"; you seem to have a 
precise idea in mind.



since you have a number
of fields that are the same in each index that you search across, and
you treat them all the same.  This is functionally equivalent to
having a single schema and a single index.  You can still have
multiple applications that query the single collection differently.


Short of a pointer or a web example from you, what you describe sounds to 
me like implementing a complete database with a single table (not easy to 
understand and maintain, but possible). In my experience, a collection 
is a schema with thousands or millions of XML documents, maybe 10, 20 
or more fields, and the search configuration is generated from a kind of 
data schema (there's no real standard for explaining, for example, that a 
title or a subject needs one field for exact match and another for word 
search). If an index grew too big (happily I have never hit this limit 
with Lucene), I guess there are solutions. My problem is to maintain 
different collections, each with its own intellectual logic, some shared 
field names, like Dublin Core, or at least "fulltext", but also fields 
specific to each one.



Depending on update patterns and index sizes, you can probably get
better efficiency with multiple indexes, but not really more
functionality (in your case), right?


Maybe "let it understandable" could be accepted as a functionality ? 
Perhaps less now, but it was a time when lucene index could become 
corrupted, so that separate them was important.


I guess that those specific problems will not be Solr priorities, but 
till I have been corrected, I'm still feeling that multiple indexes are 
useful.



--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


Re: DismaxRequestHandler reports sort by score as invalid

2007-06-21 Thread Chris Hostetter

as mentioned, the warning is misleading in the case where you sort by
score. i filed a bug as a reminder to fix it (and so people searching for
it will understand what's going on).
patches welcome! :)

http://issues.apache.org/jira/browse/SOLR-270

: WARNING: Invalid sort "score desc" was specified, ignoring


-Hoss



Re: add CJKTokenizer to solr

2007-06-21 Thread Chris Hostetter

: Why instead of that we don't create an UbberFactory that takes the Tokenizer
: class as a parameter and instantiates the proper Tokenizer?

The idea has come up before ... and there's really no reason why it
wouldn't be okay to include a reflection based factory like this in Solr
-- it just hasn't been done yet.

One of the reasons is that there are some performance costs associated
with the reflection, so we wouldn't want to completely replace the existing
"configuration via factory name" model with a "configure via class name
and an uber factory does the reflection quietly in the background" model,
because it's the kind of approach that would really only make sense for
simple prototypes -- in any system where you are really concerned about
performance, reflection on every analyzer call would probably be pretty
expensive.  (although i'd love to see benchmarks prove me wrong)

Another question in my mind is "why doesn't solr provide an optional jar
with factories for every tokenizer/tokenfilter in the lucene contribs?"
... the only answer to that is that no one has bothered to crank out a
patch that does it.

http://www.nabble.com/Re%3A-making-schema.xml-nicer-to-read-use-p5939980.html
http://www.nabble.com/foo-tf1737025.html#a4720545


-Hoss



Facets & Links

2007-06-21 Thread Matthew Runo

Hello!

Let's say I have a query which is returning facets. Let's say they  
are various brand names, and map back to a brand_exact field in the  
index.


What is the proper format for the link that these facets should have?

something like..

solr.zappos.com/select/&fq=brand_exact:VALUE  ?


Should I not be sending the same facets the second time to SOLR? Do  
you remove the facet they've just clicked on?


++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++




Re: commit script with solr 1.2 response format

2007-06-21 Thread Yonik Seeley

On 6/21/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:

I just started running the scripts and

The commit script seems to run fine, but it says there was an error.  I
looked into it, and the scripts expect 1.1 style response:

   <result status="0"></result>

1.2 /update returns:

   <?xml version="1.0" encoding="UTF-8"?>
   <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">44</int>
   </lst>
   </response>


I guess we should look for 'status="0"><' ?

Or,  if you get a response code of 200, it's a success unless
you see status=""

-Yonik


commit script with solr 1.2 response format

2007-06-21 Thread Ryan McKinley

I just started running the scripts and

The commit script seems to run fine, but it says there was an error.  I 
looked into it, and the scripts expect 1.1 style response:

   <result status="0"></result>

1.2 /update returns:

   <?xml version="1.0" encoding="UTF-8"?>
   <response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">44</int>
   </lst>
   </response>

ryan


Re: Multiple doc types in schema

2007-06-21 Thread Yonik Seeley

On 6/21/07, Frédéric Glorieux <[EMAIL PROTECTED]> wrote:

>> I will also need multiple-index searches,
>
> Do you mean:

> 2) Multiple indexes with different schemas, search will search across
> all or some subset and combine the results (federated search)

Exactly that. I'm coming from a quite old Lucene-based project called SDX.
Sorry for the link; the project is mainly documented in French. The
framework is Cocoon-based, maybe heavyweight now. It allows hosting multiple
"applications", with multiple "bases"; a base is a kind of Solr schema,
circa 2000.

 From this experience, I can say cross-searching between different schemas
is possible, and users may find it important. Take for example a library.
They have different collections, let's say: CSV records obtained from
digitized photos (a light model, with no writes expected), and a complex
librarian model updated every day. These collections share at least a title
and an author field, and should be available behind the same public search
form; but each one should also have its own application, according to its
information model.


This doesn't sound like true federated search, since you have a number
of fields that are the same in each index that you search across, and
you treat them all the same.  This is functionally equivalent to
having a single schema and a single index.  You can still have
multiple applications that query the single collection differently.

Depending on update patterns and index sizes, you can probably get
better efficiency with multiple indexes, but not really more
functionality (in your case), right?

-Yonik


Re: Multiple doc types in schema

2007-06-21 Thread Frédéric Glorieux


Hi Yonik,


I will also need multiple-index searches,


Do you mean:



2) Multiple indexes with different schemas, search will search across
all or some subset and combine the results (federated search)


Exactly that. I'm coming from a quite old Lucene-based project called SDX.
Sorry for the link; the project is mainly documented in French. The
framework is Cocoon-based, maybe heavyweight now. It allows hosting multiple
"applications", with multiple "bases"; a base is a kind of Solr schema,
circa 2000.


From this experience, I can say cross-searching between different schemas
is possible, and users may find it important. Take for example a library.
They have different collections, let's say: CSV records obtained from
digitized photos (a light model, with no writes expected), and a complex
librarian model updated every day. These collections share at least a title
and an author field, and should be available behind the same public search
form; but each one should also have its own application, according to its
information model.


With the "SDX" framework upper, I know real life applications with 30 
lucene indexes. It's possible, because lucene allow it (MultiReader) 
.



--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


> 1) Multiple unrelated indexes with different schemas, that you will
> search separately... but you just want them in the same JVM for some
> reason.
>


3) Multiple indexes with the same schema, each index is a "shard" that
contains part of the total collection.  Search will merge results
across all shards to give appearance of a single large collection
(distributed search).

-Yonik





Re: DismaxRequestHandler reports sort by score as invalid

2007-06-21 Thread Yonik Seeley

A little background:
I originally conceived of query operation chains (based on some of my
previous hacking in mechanical investing stock screens: select all
stocks; take top 10% lowest PE; then take the top 20 highest growth
rate; then sort descending by 13 week relative strength).

So, I thought that the next thing after a query *might* be a sort, so
getSort() shouldn't throw an exception if it wasn't.  I think this
idea is now outdated (we know when we have a sort spec) and an
exception should just be thrown on a syntax error.

-Yonik

On 6/21/07, J.J. Larrea <[EMAIL PROTECTED]> wrote:

Because "score desc" is the default Lucene & Solr behavior when no explicit sort is specified, 
QueryParsing.parseSort() returns a null sort so that the non-sort versions of the query execution routines get 
called.  However the caller SolrPluginUtils.parseSort issues that warning whenever it gets a null sort.  Perhaps 
that interaction should be altered, or perhaps it should be left in as a sort of "are you sure you want to 
tell me what I already know?", er, warning.  But as it stands you can simply ignore it, or else leave the 
sort off entirely when it is "score desc"; if the behavior were different in those two cases it would 
certainly be a bug, but as you noted that's not the case.

- J.J.

At 10:50 AM -0400 6/21/07, gerard sychay wrote:
>Hello all,
>
>This is a minor issue and does not affect Solr operation, but I could not find 
it in the issue tracking.
>
>To reproduce:
>
>- I set up a Solr server with the example docs indexed by following the Solr 
tutorial.
>
>- I clicked on the following example search under the "Sorting" section:
>
>http://localhost:8983/solr/select/?indent=on&q=video&sort=score+desc
>
>- I added a "qt" parameter to try out the DisMax Request Handler:
>
>http://localhost:8983/solr/select/?indent=on&q=video&sort=score+desc&qt=dismax
>
>- In the Solr output, I get:
>
>WARNING: Invalid sort "score desc" was specified, ignoring Jun 21, 2007 
10:33:37 AM org.apache.solr.core.SolrCore execute
>INFO: /select/ sort=score+desc&indent=on&qt=dismax&q=video 0 131
>
>The WARNING line is the issue. It does not seem that it should be there. But 
as I said, it does not appear to affect operation as the results are sorted by 
score descending anyway (because that is the default?).




Re: DismaxRequestHandler reports sort by score as invalid

2007-06-21 Thread J.J. Larrea
Because "score desc" is the default Lucene & Solr behavior when no explicit 
sort is specified, QueryParsing.parseSort() returns a null sort so that the 
non-sort versions of the query execution routines get called.  However the 
caller SolrPluginUtils.parseSort issues that warning whenever it gets a null 
sort.  Perhaps that interaction should be altered, or perhaps it should be left 
in as a sort of "are you sure you want to tell me what I already know?", er, 
warning.  But as it stands you can simply ignore it, or else leave the sort off 
entirely when it is "score desc"; if the behavior were different in those two 
cases it would certainly be a bug, but as you noted that's not the case.
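
Paraphrased (this is not the actual Solr source, just the shape of the
interaction described above):

Sort sort = QueryParsing.parseSort(sortSpec, schema); // null: default relevance order
if (sort == null) {
  // ...but null is also what a genuinely invalid spec produces, hence:
  log.warning("Invalid sort \"" + sortSpec + "\" was specified, ignoring");
}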

- J.J.

At 10:50 AM -0400 6/21/07, gerard sychay wrote:
>Hello all,
>
>This is a minor issue and does not affect Solr operation, but I could not find 
>it in the issue tracking.
>
>To reproduce:
>
>- I set up a Solr server with the example docs indexed by following the Solr 
>tutorial.
>
>- I clicked on the following example search under the "Sorting" section:
>
>http://localhost:8983/solr/select/?indent=on&q=video&sort=score+desc
>
>- I added a "qt" parameter to try out the DisMax Request Handler:
>
>http://localhost:8983/solr/select/?indent=on&q=video&sort=score+desc&qt=dismax
>
>- In the Solr output, I get:
>
>WARNING: Invalid sort "score desc" was specified, ignoring Jun 21, 2007 
>10:33:37 AM org.apache.solr.core.SolrCore execute
>INFO: /select/ sort=score+desc&indent=on&qt=dismax&q=video 0 131
>
>The WARNING line is the issue. It does not seem that it should be there. But 
>as I said, it does not appear to affect operation as the results are sorted by 
>score descending anyway (because that is the default?).



Re: Recent updates to Solrsharp

2007-06-21 Thread Jeff Rodenburg

great, thanks Yonik.

On 6/20/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:


On 6/21/07, Jeff Rodenburg <[EMAIL PROTECTED]> wrote:
> As an aside, it would be nice to record these issues more granularly in
> JIRA.  Could we get a component created for our client library, similar to
> java/php/ruby?

Done.

-Yonik



Re: Multiple doc types in schema

2007-06-21 Thread Yonik Seeley

On 6/21/07, Frédéric Glorieux <[EMAIL PROTECTED]> wrote:

I will also need multiple-index searches,


Do you mean:

1) Multiple unrelated indexes with different schemas, that you will
search separately... but you just want them in the same JVM for some
reason.

2) Multiple indexes with different schemas, search will search across
all or some subset and combine the results (federated search)

3) Multiple indexes with the same schema, each index is a "shard" that
contains part of the total collection.  Search will merge results
across all shards to give appearance of a single large collection
(distributed search).

-Yonik


DismaxRequestHandler reports sort by score as invalid

2007-06-21 Thread gerard sychay

Hello all,

This is a minor issue and does not affect Solr operation, 
but I could not find it in the issue tracking.


To reproduce:

- I set up a Solr server with the example docs indexed by 
following the Solr tutorial.


- I clicked on the following example search under the 
"Sorting" section:


http://localhost:8983/solr/select/?indent=on&q=video&sort=score+desc

- I added a "qt" parameter to try out the DisMax Request 
Handler:


http://localhost:8983/solr/select/?indent=on&q=video&sort=score+desc&qt=dismax

- In the Solr output, I get:

WARNING: Invalid sort "score desc" was specified, ignoring 
Jun 21, 2007 10:33:37 AM org.apache.solr.core.SolrCore execute

INFO: /select/ sort=score+desc&indent=on&qt=dismax&q=video 0 131

The WARNING line is the issue. It does not seem that it 
should be there. But as I said, it does not appear to affect 
operation as the results are sorted by score descending 
anyway (because that is the default?).


Re: Multiple doc types in schema

2007-06-21 Thread Frédéric Glorieux


Otis,

Thanks for the link and the work!
Maybe around September I will need this patch, if it's not already 
committed to the Solr sources.


I will also need multiple-index searches, but I understand that there is 
no simple, fast and generic solution in the Solr context. Maybe I would 
lose Solr caching, but it doesn't seem impossible to design one's own 
custom request handler to query different indexes, as Lucene allows.



SOLR-215 supports multiple indices on a single Solr instance.  It does *not* 
support searching of multiple indices at once (e.g. parallel search) and 
merging of results.





--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


Re: Multiple doc types in schema

2007-06-21 Thread Walter Underwood
I used Solr with indexes on NFS and I do not recommend it.
It was either 100 or 1000 times slower than local disc
for indexing, I forget which. Unusable.

This is not a problem with Solr/Lucene, I have seen the
same NFS performance cost with other search engines.

wunder

On 6/21/07 3:22 AM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:

> SOLR-215 supports multiple indices on a single Solr instance.  It does *not*
> support searching of multiple indices at once (e.g. parallel search) and
> merging of results.
> 
> This has nothing to do with NFS, though.
> 
> Otis
>  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
> 
> - Original Message 
> From: James liu <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, June 21, 2007 3:45:06 AM
> Subject: Re: Multiple doc types in schema
> 
> I see SOLR-215 from this mail.
> 
> Does it now really support multiple indexes, with search returning merged
> data?
> 
> for example:
> 
> i wanna search: aaa, and i have index1, index2, index3, index4. It should
> return the results from index1, index2, index3, index4 and merge the
> results by score, datetime, or something else.
> 
> Does it support NFS, and how is its performance?
> 
> 
> 
> 2007/6/21, Otis Gospodnetic <[EMAIL PROTECTED]>:
>> 
>> This sounds like a potentially good use-case for SOLR-215!
>> See https://issues.apache.org/jira/browse/SOLR-215
>> 
>> Otis
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
>> 
>> - Original Message 
>> From: Chris Hostetter <[EMAIL PROTECTED]>
>> To: solr-user@lucene.apache.org; Jack L <[EMAIL PROTECTED]>
>> Sent: Wednesday, June 6, 2007 6:58:10 AM
>> Subject: Re: Multiple doc types in schema
>> 
>> 
>> : This is based on my understanding that solr/lucene does not
>> : have the concept of document type. It only sees fields.
>> :
>> : Is my understanding correct?
>> 
>> it is.
>> 
>> : It seems a bit unclean to mix fields of all document types
>> : in the same schema though. Or, is there a way to allow multiple
>> : document types in the schema, and specify what type to use
>> : when indexing and searching?
>> 
>> it's really just an issue of semantics ... the schema.xml is where you
>> list all of the fields you need in your index, any notion of doctype is
>> entirely artificial ... you could group all of the
>> fields relating to doctypeA in one section of the schema.xml, then have a
>> big comment line and then list the fields in doctypeB, etc... but
>> what if there are fields you use in both "doctypes" ? .. how much you "mix"
>> them is entirely up to you.
>> 
>> 
>> 
>> -Hoss
>> 
>> 
>> 
>> 
>> 
> 



Multi-language Tokenizers / Filters recommended?

2007-06-21 Thread Daniel Alheiros
Hi

I'm now considering how to improve query results on a set of languages and
would like to hear considerations based on your experience.

I'm using the tokenizer HTMLStripWhitespaceTokenizerFactory with the
WordDelimiterFilterFactory, LowerCaseFilterFactory and
RemoveDuplicatesTokenFilterFactory as my default config.

I need to deal with:
English (OK)
Spanish
Welsh
Chinese Simplified
Russian
Arabic

For Spanish and Russian I'm using the SnowballPorterFilterFactory plus the
defaults. Should I use any specific TokenizerFactory? Which one?
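
Written out, my current Spanish fieldtype is roughly the following (a sketch
from memory, purely illustrative):

<fieldtype name="text_es" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>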

For Chinese I'm going to use a TokenizerFactory that returns the
CJKTokenizer (as I read in a previous discussion about it) plus the default
filters. Is that OK, or are those filters inadequate?

For Welsh I'm using the defaults and would like to know if you have any
recommendation related to that.

For Arabic, should I use the AraMorph analyzer (
http://www.nongnu.org/aramorph/english/lucene.html)? What other processing
should I do to get better query results?

Does anyone have stop-words and synonyms for other languages but English?

I think this discussion could become a documentation topic with examples,
how-to's and stop-words / synonyms for each language, so it would be much
simpler for those who need to deal with non-English content. What do you
think about that?

Regards,
Daniel





Re: add CJKTokenizer to solr

2007-06-21 Thread Daniel Alheiros
Hi

Well, creating a Factory for each new Tokenizer we want to add means
replicating the same code again and again just to bind the Factory (Solr
interface) to the Tokenizer (Lucene interface).

Why don't we instead create an UbberFactory that takes the Tokenizer
class as a parameter and instantiates the proper Tokenizer?

It could be done simply, though it would impact the schema.xml and its
associated parser and config classes. But I think it would make things simpler.

What do you think about it?

A code example follows:
public class UbberTokenizerFactory extends BaseTokenizerFactory {

    /**
     * @see org.apache.solr.analysis.TokenizerFactory#create(Reader)
     */
    public TokenStream create(Reader input) {
        String tokenizerClassName = ""; // get the tokenizer class name from the config
        try {
            return (TokenStream) (Class.forName(tokenizerClassName)
                .getConstructor(new Class[]{Reader.class})
                .newInstance(input));
        } catch (Exception e) {
            throw new IllegalArgumentException("It wasn't possible to instantiate"
                + " the Factory. Verify that the tokenizer class name \""
                + tokenizerClassName
                + "\" is correct and is available in the classpath.", e);
        }
        //return new CJKTokenizer(input);
    }
}
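
In schema.xml it would then hypothetically be declared like this (the
parameter name is made up):

<tokenizer class="solr.UbberTokenizerFactory"
           tokenizerClass="org.apache.lucene.analysis.cjk.CJKTokenizer"/>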

Regards,
Daniel Alheiros

On 19/6/07 18:57, "Mike Klaas" <[EMAIL PROTECTED]> wrote:

> 
> On 18-Jun-07, at 10:28 PM, Toru Matsuzawa wrote:
> 
>> I'm sorry. Because it was not possible to append it,
>> it sends it again.
>> 
 I got the error below after adding CJKTokenizer to schema.xml.  I
 checked the constructor of CJKTokenizer, it requires a Reader
 parameter,
 I guess that's why I get this error, I searched the email
 archive, it
 seems working for other users. Does anyone know what is the problem?
>>> 
>>> 
>>> CJKTokenizerFactory that I am using is appended.
> 
> Would you be interested in contributing this class to solr?
> 
> -MIke





Re: All facet.fields for a given facet.query?

2007-06-21 Thread Thomas Traeger



: Faceting on manufacturers and categories first and than present the
: corresponding facets might be used under some circumstances, but in my case
: the category structure is quite deep, detailed and complex. So when
: the user enters a query I like to say to him "Look, here are the
: manufacturers and categories with matches to your query, choose one if you
: want, but maybe there is another one with products that better fit your
: needs or products that you didn't even know about. So maybe you like to
: filter based on the following attributes." Something like this ;o)

categories was just an example i used because it tends to be a common use
case ... my point is the decision about which facet qualifies for the
"maybe there is another one with products that better fit your needs" part
of the response either requires computing counts for *every* facet
constraint and then looking at them to see which ones provide good
distribution, or knowing something more about your metadata (ie: having
stats that show the majority of people who search on the word "canon" want
to facet on "megapixels") .. this is where custom biz logic comes in,
because in a lot of situations computing counts for every possible facet
may not be practical (even if the syntax to request it was easier)

I get your point, but how do you know where additional metadata is of value
if not by just trying? Currently I start with a generic approach to see
what really is in the product data, to get an overview of the quality of
the data and what happens if I use the data in the new search solution.
Then I can decide what to do to optimize the system, i.e. try to reduce the
number of attributes, get marketing to split overly generic attributes into
more detailed ones, find a way to display the most relevant facets for the
current query first, and so on...

Tom


Re: problems getting data into solr index

2007-06-21 Thread vanderkerkoff

Hi Mike, Brian

Thanks for helping with this, and for clearing up my misunderstanding.  Solr
the python module and Solr the package being two different things, I've got
you.

The issues I have are compounded by the fact that we're hovering between
using the Unicode branch of Django and the older branch that has newforms,
both of which have an impact on what I'm trying to do.

It's getting closer to being resolved, and it's down to your advice, so
thanks again.









Re: Multiple doc types in schema

2007-06-21 Thread Otis Gospodnetic
SOLR-215 supports multiple indices on a single Solr instance.  It does *not* 
support searching of multiple indices at once (e.g. parallel search) and 
merging of results.

This has nothing to do with NFS, though.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: James liu <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, June 21, 2007 3:45:06 AM
Subject: Re: Multiple doc types in schema

I see SOLR-215 from this mail.

Does it now really support multiple indexes, with search returning merged
data?

for example:

i wanna search: aaa, and i have index1, index2, index3, index4. It should
return the results from index1, index2, index3, index4 and merge the results
by score, datetime, or something else.

Does it support NFS, and how is its performance?



2007/6/21, Otis Gospodnetic <[EMAIL PROTECTED]>:
>
> This sounds like a potentially good use-case for SOLR-215!
> See https://issues.apache.org/jira/browse/SOLR-215
>
> Otis
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
>
> - Original Message 
> From: Chris Hostetter <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org; Jack L <[EMAIL PROTECTED]>
> Sent: Wednesday, June 6, 2007 6:58:10 AM
> Subject: Re: Multiple doc types in schema
>
>
> : This is based on my understanding that solr/lucene does not
> : have the concept of document type. It only sees fields.
> :
> : Is my understanding correct?
>
> it is.
>
> : It seems a bit unclean to mix fields of all document types
> : in the same schema though. Or, is there a way to allow multiple
> : document types in the schema, and specify what type to use
> : when indexing and searching?
>
> it's really just an issue of semantics ... the schema.xml is where you
> list all of the fields you need in your index, any notion of doctype is
> entire artificial ... you could group all of the
> fields relating to doctypeA in one section of the schema.xml, then have a
> big  line and then list the fields in doctypeB, etc... but
> wat if there are fields you use in both "doctypes" ? .. how much you "mix"
> them is entirely up to you.
>
>
>
> -Hoss
>
>
>
>
>


-- 
regards
jl





Re: Multi-language indexing and searching

2007-06-21 Thread Daniel Alheiros
Hi Hoss.

I tried that yesterday using the same approach you just described (I
created the base fields for each language with basic analyzers) and it
worked fine.

Thanks again for your time.

Regards,
Daniel


On 20/6/07 21:00, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> 
> : So far it sounds good for my needs, now I'm going to try if my other
> : features still work (I'm worried about highlighting as I'm going to return a
> : different field)...
> 
> i'm not really a highlighting guy so i'm not sure ... but if you're okay
> with *simple* highlighting you can probably just highlight your title
> field (using a whitespace analyzer or something) and get decent results
> without needing to worry about the fact that you are using different
> languages.
> 
> 
> 
> -Hoss
> 





Re: delete changed?

2007-06-21 Thread James liu

Aha, sorry, I missed it.

2007/6/21, Chris Hostetter <[EMAIL PROTECTED]>:


:  curl http://192.168.7.6:8080/solr0/update --data-binary
: '<delete><query>nodeid:20</query></delete>'
:
: i remember it is ok when i use solr 1.1
...
: HTTP Status 400 - missing content stream


please note the "Upgrading from Solr 1.1" section of the 1.2 CHANGES.txt
file, which states...

The Solr "Request Handler" framework has been updated in two key ways:
First, if a Request Handler is registered in solrconfig.xml with a name
starting with "/" then it can be accessed using path-based URL, instead of
using the legacy "/select?qt=name" URL structure.  Second, the Request
Handler framework has been extended making it possible to write Request
Handlers that process streams of data for doing updates, and there is a
new-style Request Handler for XML updates given the name of "/update" in
the example solrconfig.xml.  Existing installations without this "/update"
handler will continue to use the old update servlet and should see no
changes in behavior.  For new-style update handlers, errors are now
reflected in the HTTP status code, Content-type checking is more strict,
and the response format has changed and is controllable via the wt
parameter.



-Hoss





--
regards
jl


Re: delete changed?

2007-06-21 Thread Chris Hostetter
:  curl http://192.168.7.6:8080/solr0/update --data-binary
: '<delete><query>nodeid:20</query></delete>'
:
: i remember it is ok when i use solr 1.1
...
: HTTP Status 400 - missing content stream


please note the "Upgrading from Solr 1.1" section of the 1.2 CHANGES.txt
file, which states...

The Solr "Request Handler" framework has been updated in two key ways:
First, if a Request Handler is registered in solrconfig.xml with a name
starting with "/" then it can be accessed using path-based URL, instead of
using the legacy "/select?qt=name" URL structure.  Second, the Request
Handler framework has been extended making it possible to write Request
Handlers that process streams of data for doing updates, and there is a
new-style Request Handler for XML updates given the name of "/update" in
the example solrconfig.xml.  Existing installations without this "/update"
handler will continue to use the old update servlet and should see no
changes in behavior.  For new-style update handlers, errors are now
reflected in the HTTP status code, Content-type checking is more strict,
and the response format has changed and is controllable via the wt
parameter.
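
In practical terms, that means the delete has to be posted to the new-style
/update handler with an explicit XML content type; a sketch of a working
invocation (untested, adjust host and port to taste):

curl http://192.168.7.6:8080/solr0/update -H 'Content-type: text/xml; charset=utf-8' --data-binary '<delete><query>nodeid:20</query></delete>'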



-Hoss