Re: why query chinese character with bracket become phrase query by default?

2011-05-16 Thread Michael McCandless
On Sun, May 15, 2011 at 7:44 PM, Mark Miller markrmil...@gmail.com wrote:

 Could you please revert your commit, until we've reached some
 consensus on this discussion first?

 Let's reach some consensus, but why revert? This has been the behavior - 
 shouldn't the consensus onus be on changing it to begin with? That's how I 
 see it.

To be clear, I'm asking that Yonik revert his commit from yesterday
(rev 1103444), where he added text_nwd fieldType and dynamic fields
*_nwd to the example schema.xml.

I agree we should reach consensus before changing what's already
committed, that's exactly why I'm asking Yonik to revert -- we were in
the middle of discussing this, and I had posted a patch on SOLR-2519,
when he suddenly committed the text_nwd change, yesterday.

Does anyone disagree that Yonik's commit was inappropriate?  This is
not how we work at Apache.

 I'm going to need to get back up to speed on this issue before I can comment 
 more helpfully. Better out of the box support for other languages is 
 important - I think it makes sense to discuss this issue again myself.

+1

Solr, out of box, is just awful for non-whitespace languages (eg CJK,
and others).  And for every user who comes to the list asking for help
(thank you cyang2010!), I imagine there are many others who simply
gave up and walked away (from Solr) when they tried it on CJK
content.

Lucene has made awesome strides in having natural defaults that work
well across many languages, thanks to the hard work of Robert and
others (StandardAnalyzer now actually follows a standard (UAX #29 --
text segmentation), autophrase off in QP, etc.), and I think we should
take advantage of this in Solr, just like ElasticSearch does.

Really, the best solution (I think) would be to have language-specific
fieldTypes (text_en, text_zh, etc.), but I suspect there's a good
amount of work to reach that so in the meantime I think we should fix
the defaults for the text fieldType to work well across many
languages.

Mike

http://blog.mikemccandless.com


Re: why query chinese character with bracket become phrase query by default?

2011-05-16 Thread Mark Miller

On May 16, 2011, at 5:30 AM, Michael McCandless wrote:

 Does anyone disagree that Yonik's commit was inappropriate?  This is
 not how we work at Apache.

Ah - dunno yet - I obviously missed part of the conversation here. I thought 
you were talking about reversing 'autophrase off' as the default, not these 
'quick' new field types.

Excuse me for a moment while I read...

Yeah - seems a little hasty. Not a fan of 'text_nwd' as a field name either. 
Didn't seem malicious to me, but it does seem we should probably work together 
in JIRA/discussion before just shotgunning changes...

Don't know that I care if it's reverted (if we fall back another 10 steps into 
that BS I quit everything and I'm moving to South America), but we should push 
on here either way.

- Mark Miller
lucidimagination.com

Lucene/Solr User Conference
May 25-26, San Francisco
www.lucenerevolution.org


Re: why query chinese character with bracket become phrase query by default?

2011-05-16 Thread Yonik Seeley
On Sun, May 15, 2011 at 1:48 PM, Michael McCandless
luc...@mikemccandless.com wrote:
 Could you please revert your commit, until we've reached some
 consensus on this discussion first?

Huh?
I thought everyone was in agreement that we needed more field types
for different languages?
I added my best guess about what a generic type for
non-whitespace-delimited might look like.
Since it's a new field type, it doesn't affect anything.  Hopefully it
only improves the situation
for someone trying to use one of these languages.

The only negative would seem to be if it's worse than nothing (i.e. a
very bad example
because it actually doesn't work for non-whitespace-delimited languages).

The issue about changing defaults on TextField and changing what text does in
the example schema by default is not dependent on this.  They are only related
by the fact that if another field is added/changed then _nwd may
become redundant
and can be removed.  For now, it only seems like an improvement?

Anyway... the whole language of revert seems unnecessarily confrontational.
Feel free to improve what's there (or delete *_nwd if people really
feel it adds no/negative value)

-Yonik


Re: why query chinese character with bracket become phrase query by default?

2011-05-16 Thread Yonik Seeley
On Mon, May 16, 2011 at 5:30 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 To be clear, I'm asking that Yonik revert his commit from yesterday
 (rev 1103444), where he added text_nwd fieldType and dynamic fields
 *_nwd to the example schema.xml.

So... your position is that until the text fieldType is changed to
support non-whitespace-delimited languages better, that
no other fieldType should be changed/added to better support
non-whitespace-delimited languages?
Man, that seems political, not technical.

Whatever... I'll revert.

-Yonik


Re: why query chinese character with bracket become phrase query by default?

2011-05-16 Thread Simon Willnauer
On Mon, May 16, 2011 at 3:51 PM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Mon, May 16, 2011 at 5:30 AM, Michael McCandless
 luc...@mikemccandless.com wrote:
 To be clear, I'm asking that Yonik revert his commit from yesterday
 (rev 1103444), where he added text_nwd fieldType and dynamic fields
 *_nwd to the example schema.xml.

 So... your position is that until the text fieldType is changed to
 support non-whitespace-delimited languages better, that
 no other fieldType should be changed/added to better support
 non-whitespace-delimited languages?
 Man, that seems political, not technical.

To me it seems neither. It's rather the process of improving,
aligned with outstanding issues.
It shouldn't feel wrong.

Simon

 Whatever... I'll revert.

 -Yonik



Re: why query chinese character with bracket become phrase query by default?

2011-05-16 Thread Michael McCandless
On Mon, May 16, 2011 at 9:51 AM, Yonik Seeley
yo...@lucidimagination.com wrote:

 To be clear, I'm asking that Yonik revert his commit from yesterday
 (rev 1103444), where he added text_nwd fieldType and dynamic fields
 *_nwd to the example schema.xml.

 So... your position is that until the text fieldType is changed to
 support non-whitespace-delimited languages better, that
 no other fieldType should be changed/added to better support
 non-whitespace-delimited languages?

No, that's not my position at all.

My position is: please don't suddenly commit changes, with your way,
while we're still discussing how to solve the issue.  That's not the
Apache way.

This applies in general, not just this case (fixing Solr's
out-of-the-box behavior with non-whitespace languages).

So, it could very well be, after we iterate on SOLR-2519, that we all
agree your baby step is great, in which case let's go forward with
that.  But we should all come to some consensus about that before you
suddenly commit.

 Man, that seems political, not technical.

I'm sorry you feel that way, but it's important to me that we all
follow the Apache way here.  I feel this will only make our community
stronger.

It's also important that any time another committer is uncomfortable
with what just got committed, and asks for a revert, that it *not* be
a big deal.  It's not political, it was just a mistake and the revert
is quick and painless.

We are commit-then-review here, and if someone is uncomfortable, they
should say so and whoever committed should simply revert it and
re-iterate.  This should be a simple & free tool for all of us to
use.

 Whatever... I'll revert.

Thank you.

Mike


Re: why query chinese character with bracket become phrase query by default?

2011-05-16 Thread Yonik Seeley
On Mon, May 16, 2011 at 10:06 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 On Mon, May 16, 2011 at 9:51 AM, Yonik Seeley
 yo...@lucidimagination.com wrote:

 To be clear, I'm asking that Yonik revert his commit from yesterday
 (rev 1103444), where he added text_nwd fieldType and dynamic fields
 *_nwd to the example schema.xml.

 So... your position is that until the text fieldType is changed to
 support non-whitespace-delimited languages better, that
 no other fieldType should be changed/added to better support
 non-whitespace-delimited languages?

 No, that's not my position at all.

 My position is: please don't suddenly commit changes, with your way,
 while we're still discussing how to solve the issue.  That's not the
 Apache way.

Dude... everyone has always agreed we need more fieldtypes to support
different languages (as you did earlier in this thread too).  There's been a
history of just adding stuff like that (half of the commits to the example
schema have no associated JIRA issue).

What happens to the default text field will have no bearing on that.
We will still need more field types to support more languages.
Would you be against me adding a text_cjk fieldtype too?

My position: it's silly for a lack of consensus on the text field to
block progress on any other fieldtype.

-Yonik


Re: why query chinese character with bracket become phrase query by default?

2011-05-16 Thread Michael McCandless
On Mon, May 16, 2011 at 10:22 AM, Yonik Seeley
yo...@lucidimagination.com wrote:

 My position is: please don't suddenly commit changes, with your way,
 while we're still discussing how to solve the issue.  That's not the
 Apache way.

 Dude... everyone has always agreed we need more fieldtypes to support
 different languages (as you did earlier in this thread too).

+1, and I still agree that'd be best.  In that ideal future we would
have no more text fieldType, only text_zh, text_en, etc.

 There's been a
 history of just adding stuff like that (half of the commits to the example
 schema have no associated JIRA issue).

I wasn't objecting to the lack of a referenced JIRA issue; I was
objecting to you suddenly committing 'your way' while we were still
discussing what to do.

 What happens to the default text field will have no bearing on that.

That's not really true?  I think any changes we make to any default
text* fieldTypes are strongly related.

For example, if we fix the text fieldType to have good all-around
defaults for all languages (ie, the patch on SOLR-2519) then we don't
need separate text_nwd/*_nwd field types.  Instead, maybe we could add
text_autophrase fieldTypes?  Or maybe text_en_autophrase?

 We will still need more field types to support more languages.

Right.

 Would you be against me adding a text_cjk fieldtype too?

text_cjk would be *awesome*, but text_zh, text_ja, text_ko would be
even better!

If we fix text fieldType to be generic for all languages (use
StandardAnalyzer, turn off autophrase), but then
go and add in specific languages over time (say text_en, text_cjk,
etc.), I think that's a great way to iterate towards the ideal future
where we have text_XX coverage for many languages.
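
As a rough illustration of such a language-specific type, a text_cjk
could look something like the fragment below. This is only a sketch: it
assumes solr.CJKTokenizerFactory (which emits overlapping bigrams for
CJK text) is available in the Solr version in use, and the analyzer
chain is a guess for discussion, not an agreed design.

```xml
<!-- Hypothetical text_cjk type; the analyzer chain is an assumption, not a committed design -->
<fieldType name="text_cjk" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- Assumed to be available; produces overlapping bigrams for CJK runs -->
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Keeping autoGeneratePhraseQueries="false" here matters because a
single run of CJK characters analyzes to several tokens, which is
exactly the case where auto phrase would otherwise kick in.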

 My position: it's silly for a lack of consensus on the text field to
 block progress on any other fieldtype.

I disagree; I think changes to text fieldType are very much tied up
to what other text_* fieldTypes we want to introduce.

This is a *really* important configuration file in Solr and we should
present good defaults with it.  People who first use Solr start with
the schema.xml as their starting point.

People who first start with ElasticSearch today get StandardAnalyzer
and no autophrase as the default, which is the best overall default
Lucene has to offer right now.  I think Solr should do the same.

So to sum up, I think we should:

  1) Fix text fieldType to stop destroying non-whitespace languages,
     and use the best general defaults we have to offer today
     (switch from WhitespaceTokenizer -> StandardTokenizer, and turn
     off autophrase); this is the patch on SOLR-2519.

  2) Add in text_XX specific language field types for as many as we
     can now, iterating over time to add more as we can / people get
     the itch.  We now have a fabulous analysis module (thank you
     Robert!), so we should take advantage of that and at least make
     text_XX for all the matching analyzers in there.
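
Roughly, item 1 amounts to something like the following in the example
schema. This is a sketch for discussion, not the patch on SOLR-2519
itself: the filter list is pared down to a single filter as an
assumption, and the real text fieldType carries more filters than
shown here.

```xml
<!-- Sketch only; NOT the actual SOLR-2519 patch -->
<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- StandardTokenizer follows UAX #29 rather than splitting on whitespace only -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Illustrative filter; the real chain would keep stop words, synonyms, etc. as needed -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```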

Let's continue this on the issue...

Mike

http://blog.mikemccandless.com


Re: why query chinese character with bracket become phrase query by default?

2011-05-16 Thread Chris Hostetter

: Does anyone disagree that Yonik's commit was inappropriate?  This is
: not how we work at Apache.

FWIW: I don't see how Yonik's commit was inappropriate at all

He added some new example configuration to trunk that was unused, and in 
no way un-did or blocked any other attempts at improving the configs.

It had no impact on any existing usage, and only served as an example 
(which could be iterated forward)

I seriously don't see the problem here.

-Hoss


Re: why query chinese character with bracket become phrase query by default?

2011-05-15 Thread Michael McCandless
On Fri, May 6, 2011 at 8:49 AM, Michael McCandless
luc...@mikemccandless.com wrote:

 Shouldn't we  have field types in the eg schema for the different
 languages?  Ie, text_zh, text_th, text_en, text_ja, text_nl, etc.

In fact, until we break out dedicated language field types, shouldn't
we default autophrase to off in Solr?

I think this is what ElasticSearch does (just inherits Lucene's
default for this) -- Shay, or any ElasticSearch users out there... can
you confirm?

Leaving autophrase on is catastrophic for non-whitespace languages
(CJK and others), and at best iffy for whitespace languages (ie,
unexpected that the QueryParser would make a PhraseQuery when user
hadn't asked for one, not clear it really helps relevance for
whitespace languages, definitely hurts performance), so leaving it is
doing far more damage than good, as far as I can tell.

Any objections to turning off autophrase by default in Solr, until we
have per-language field types?

Mike

http://blog.mikemccandless.com


Re: why query chinese character with bracket become phrase query by default?

2011-05-15 Thread Michael McCandless
I opened https://issues.apache.org/jira/browse/SOLR-2519 for this.

Mike

http://blog.mikemccandless.com

On Sun, May 15, 2011 at 8:02 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 On Fri, May 6, 2011 at 8:49 AM, Michael McCandless
 luc...@mikemccandless.com wrote:

 Shouldn't we  have field types in the eg schema for the different
 languages?  Ie, text_zh, text_th, text_en, text_ja, text_nl, etc.

 In fact, until we break out dedicated language field types, shouldn't
 we default autophrase to off in Solr?

 I think this is what ElasticSearch does (just inherits Lucene's
 default for this) -- Shay, or any ElasticSearch users out there... can
 you confirm?

 Leaving autophrase on is catastrophic for non-whitespace languages
 (CJK and others), and at best iffy for whitespace languages (ie,
 unexpected that the QueryParser would make a PhraseQuery when user
 hadn't asked for one, not clear it really helps relevance for
 whitespace languages, definitely hurts performance), so leaving it is
 doing far more damage than good, as far as I can tell.

 Any objections to turning off autophrase by default in Solr, until we
 have per-language field types?

 Mike

 http://blog.mikemccandless.com



Re: why query chinese character with bracket become phrase query by default?

2011-05-15 Thread Yonik Seeley
On Sun, May 15, 2011 at 8:02 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 On Fri, May 6, 2011 at 8:49 AM, Michael McCandless
 luc...@mikemccandless.com wrote:

 Shouldn't we  have field types in the eg schema for the different
 languages?  Ie, text_zh, text_th, text_en, text_ja, text_nl, etc.

 In fact, until we break out dedicated language field types, shouldn't
 we default autophrase to off in Solr?

I've taken a crack at a generic text field for
non-whitespace-delimited fields to the example schema:

   <!-- A general unstemmed text field that is better for non
        whitespace delimited languages (nwd) due to
        autoGeneratePhraseQueries="false" -->
   <fieldType name="text_nwd" class="solr.TextField"
              positionIncrementGap="100" autoGeneratePhraseQueries="false">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true"
               words="stopwords.txt" enablePositionIncrements="true"/>
       <filter class="solr.WordDelimiterFilterFactory"
               generateWordParts="1" generateNumberParts="1" catenateWords="0"
               catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>

   <dynamicField name="*_nwd" type="text_nwd" indexed="true" stored="true"/>

You can try it out on trunk with a query like:
http://localhost:8983/solr/select?q=name_nwd:F-11&debugQuery=true

And verify it generates an OR:

<str name="querystring">name_nwd:F-11</str>
<str name="parsedquery">name_nwd:f name_nwd:11</str>

Can someone verify that the WDF params are OK (i.e. I didn't catenate
since that wouldn't make sense if the word parts were actually whole
words in a non-whitespace-delimited language).  Does that make sense?


As far as Solr defaults... perhaps way way back text should have
been named text_en.
But any changes now should be comprehensive (we need to consider
impacts to the example data, the example schema, the Solr tutorial
which relies on some of the current behavior, and a ton of
documentation on the wiki related to both analysis components
(multi-word synonyms, WDF, etc) and other quickstart guides).

Anyway, changes to the example schema (or the behavior of the example
schema) can have a large impact.
I personally think that adding a new field is much easier and less
disruptive, and given the potential impact we should hear what others
have to say about it too (I'm out the rest of today, and I know a lot
of other people aren't around this weekend either).

-Yonik


Re: why query chinese character with bracket become phrase query by default?

2011-05-15 Thread Michael McCandless
Yonik,

Could you please revert your commit, until we've reached some
consensus on this discussion first?

Maybe, post alternative patches on the issue (SOLR-2519), and we can
iterate there?

Adding a new example field type (text_nwd) is one way to go, and I
agree is least risk/effort, a quick fix, but I don't think we should
use a quick fix here.

I think it's important for Solr to have good out-of-the-box defaults
for all languages, like ElasticSearch, even if that means we have to
do some extra work now (ie, fixing up the wiki/tutorials) to make that
change.

More below:

On Sun, May 15, 2011 at 12:20 PM, Yonik Seeley
yo...@lucidimagination.com wrote:

 As far as Solr defaults... perhaps way way back text should have
 been named text_en.
 But any changes now should be comprehensive (we need to consider
 impacts to the example data, the example schema, the Solr tutorial
 which relies on some of the current behavior, and a ton of
 documentation on the wiki related to both analysis components
 (multi-word synonyms, WDF, etc) and other quickstart guides).

 Anyway, changes to the example schema (or the behavior of the example
 schema) can have a large impact.

I agree: we need to fix the wiki pages/examples that rely on
auto-phrase.

But, really, how much work is this?  Can you point to an example or
two in the wiki/tutorial that advertise/rely on auto phrase?  This
would help me get a sense of how much additional work I'm signing up
for ;)

I just went through the tutorial and didn't see one...

(Also, we should add some CJK docs and queries to the tutorial... a
simple pair is the test case in my patch on SOLR-2519.)

We shouldn't avoid/fear good changes to our defaults just because
fixing it will be more work, especially if someone (me!) is signing up
to do that work.

 I personally think that adding a new field is much easier and less
 disruptive, and given the potential impact

I agree the quick fix is somewhat easier than doing it right, but I
think in this case we should do it right.  Solr really should just
work well out-of-the-box on all (including non-whitespace) languages.

 we should hear what others have to say about it too

+1

Mike

http://blog.mikemccandless.com


Re: why query chinese character with bracket become phrase query by default?

2011-05-15 Thread Mark Miller

On May 15, 2011, at 1:48 PM, Michael McCandless wrote:

 Could you please revert your commit, until we've reached some
 consensus on this discussion first?


Let's reach some consensus, but why revert? This has been the behavior - 
shouldn't the consensus onus be on changing it to begin with? That's how I see 
it.

I'm going to need to get back up to speed on this issue before I can comment 
more helpfully. Better out of the box support for other languages is important 
- I think it makes sense to discuss this issue again myself.

- Mark Miller
lucidimagination.com



Re: why query chinese character with bracket become phrase query by default?

2011-05-06 Thread Michael McCandless
On Thu, May 5, 2011 at 10:00 AM, Yonik Seeley
yo...@lucidimagination.com wrote:

 2011/5/5 Michael McCandless luc...@mikemccandless.com:
 The very first thing every non-whitespace language Solr app should do
 is turn  off autoGeneratePhraseQueries!

 Luckily, this is configurable per FieldType... so if it doesn't exist
 yet, we should come up with a good
 CJK fieldtype to add to the example schema.

+1

Shouldn't we  have field types in the eg schema for the different
languages?  Ie, text_zh, text_th, text_en, text_ja, text_nl, etc.

Mike

http://blog.mikemccandless.com


Re: why query chinese character with bracket become phrase query by default?

2011-05-05 Thread Michael McCandless
Unfortunately, the current out-of-the-box defaults (example config)
for Solr are a disaster for non-whitespace languages (CJK, Thai,
etc.), ie, exactly what you've hit.

This is because Lucene's QueryParser can unexpectedly, dangerously,
create PhraseQuery even when the user did not ask for it (auto
phrase).  Not only does this mean no results for non-whitespace
languages, but it also means worse search performance (PhraseQuery is
usually more costly than TermQuerys).

Lucene leaves this auto phrase behavior off by default, but Solr
defaults it to on.

Robert's email gives a good description of how you can turn it off.

The very first thing every non-whitespace language Solr app should do
is turn  off autoGeneratePhraseQueries!
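
For anyone landing here from a search, a minimal sketch of what that
looks like in schema.xml. It assumes a Solr version where the
autoGeneratePhraseQueries attribute is available on solr.TextField
(3.1 or later, as far as I know); the fieldType name text_zh_example
is hypothetical, and the tokenizer is simply the one from the original
question.

```xml
<!-- Hypothetical field type: the name and analyzer chain are illustrative only -->
<fieldType name="text_zh_example" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- Tokenizer taken from the original question; swap in whatever suits your content -->
    <tokenizer class="solr.ChineseTokenizerFactory"/>
  </analyzer>
</fieldType>
```

With a field of this type, a query such as title_zh_CN:(我活) should
parse to separate term clauses rather than a PhraseQuery; adding
debugQuery=true to the request shows the parsed query so you can
check.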

Mike

http://blog.mikemccandless.com

On Wed, May 4, 2011 at 8:21 PM, cyang2010 ysxsu...@hotmail.com wrote:
 Hi,

 In solr admin query full interface page, the following query with english
 become term query according to debug :

 title_en_US: (blood red)

 <lst name="debug">
 <str name="rawquerystring">title_en_US: (blood red)</str>
 <str name="querystring">title_en_US: (blood red)</str>
 <str name="parsedquery">title_en_US:blood title_en_US:red</str>
 <str name="parsedquery_toString">title_en_US:blood title_en_US:red</str>


 However, using the same syntax with two chinese terms, the query result into
 a phrase query:

 title_zh_CN: (我活)

 <lst name="debug">
 <str name="rawquerystring">title_zh_CN: (我活)</str>
 <str name="querystring">title_zh_CN: (我活)</str>
 <str name="parsedquery">PhraseQuery(title_zh_CN:"我 活")</str>
 <str name="parsedquery_toString">title_zh_CN:"我 活"</str>


 I do have different tokenizer/filter for those two different fields.
 title_en_US is using all those common english specific tokenizer, while
 title_zh_CN uses solr.ChineseTokenizerFactory.

 I don't think those tokenizers determine whether things within brackets become
 term queries or phrase queries.

 I really need to blindly pass user-input text to a solr field without doing
 any parsing, and hope it is all doing term query for each term contained in
 the search text.

 How do i achieve that?

 Thanks,


 cy

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/why-query-chinese-character-with-bracket-become-phrase-query-by-default-tp2901542p2901542.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: why query chinese character with bracket become phrase query by default?

2011-05-05 Thread Yonik Seeley
2011/5/5 Michael McCandless luc...@mikemccandless.com:
 The very first thing every non-whitespace language Solr app should do
 is turn  off autoGeneratePhraseQueries!

Luckily, this is configurable per FieldType... so if it doesn't exist
yet, we should come up with a good
CJK fieldtype to add to the example schema.

-Yonik


Re: why query chinese character with bracket become phrase query by default?

2011-05-05 Thread cyang2010
Nice, it works like a charm.

I am using solr 1.4.1.  Here is my configuration for the chinese field:

  <fieldType name="text_ch" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.ChineseTokenizerFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.ChineseTokenizerFactory"/>
      <filter class="solr.PositionFilterFactory"/>
    </analyzer>
  </fieldType>



Now I get the expected hassle-free parsing on the Solr side:

<lst name="debug">
<str name="rawquerystring">title_zh_CN:(我活)</str>
<str name="querystring">title_zh_CN:(我活)</str>
<str name="parsedquery">title_zh_CN:我 title_zh_CN:活</str>
<str name="parsedquery_toString">title_zh_CN:我 title_zh_CN:活</str>



--
View this message in context: 
http://lucene.472066.n3.nabble.com/why-query-chinese-character-with-bracket-become-phrase-query-by-default-tp2901542p2905784.html
Sent from the Solr - User mailing list archive at Nabble.com.


why query chinese character with bracket become phrase query by default?

2011-05-04 Thread cyang2010
Hi,

In the Solr admin full query interface page, the following query in English
becomes a term query according to the debug output:

title_en_US: (blood red)

<lst name="debug">
<str name="rawquerystring">title_en_US: (blood red)</str>
<str name="querystring">title_en_US: (blood red)</str>
<str name="parsedquery">title_en_US:blood title_en_US:red</str>
<str name="parsedquery_toString">title_en_US:blood title_en_US:red</str>


However, using the same syntax with two Chinese terms, the query results in
a phrase query:

title_zh_CN: (我活)

<lst name="debug">
<str name="rawquerystring">title_zh_CN: (我活)</str>
<str name="querystring">title_zh_CN: (我活)</str>
<str name="parsedquery">PhraseQuery(title_zh_CN:"我 活")</str>
<str name="parsedquery_toString">title_zh_CN:"我 活"</str>


I do have different tokenizer/filter for those two different fields.   
title_en_US is using all those common english specific tokenizer, while
title_zh_CN uses solr.ChineseTokenizerFactory.   

I don't think those tokenizers determine whether things within brackets become
term queries or phrase queries.

I really need to blindly pass user-input text to a solr field without doing
any parsing, and hope it is all doing term query for each term contained in
the search text.

How do i achieve that?

Thanks,


cy

--
View this message in context: 
http://lucene.472066.n3.nabble.com/why-query-chinese-character-with-bracket-become-phrase-query-by-default-tp2901542p2901542.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: why query chinese character with bracket become phrase query by default?

2011-05-04 Thread Ahmet Arslan

Please see Robert's two solutions (autoGeneratePhraseQueries or PositionFilter) 
http://search-lucene.com/m/imED32mqqyp1/
