Re: analysis tool vs. reality

2010-08-16 Thread Chris Hostetter

: Maybe, separate from analysis.jsp (showing only how text is analyzed),
: Solr needs a debug page showing the steps the field's QueryParser goes
: through on a given query, to debug such tricky QueryParser/Analyzer
: interactions?

As mentioned earlier in this thread, I set out to build something exactly
like this a while back, but as part of the DebugComponent instead of a
standalone page. I ran into a lot of problems I couldn't figure out any
way around, so I posted my thoughts in Jira for future reference in case
other folks wanted to follow up with alternate suggestions on how to work
around them (or mitigate the maintenance headaches involved)...

https://issues.apache.org/jira/browse/SOLR-1749



-Hoss



Re: analysis tool vs. reality

2010-08-16 Thread Chris Hostetter

: > even if you change the Lucene QueryParser so that whitespace isn't a meta
: > character it doesn't affect the underlying issue: analysis.jsp is agnostic
: > about QueryParsers.

: analysis.jsp isn't agnostic about queryparsers, it's ignorant of them, and
: your default queryparser is actually a de-facto whitespace tokenizer, don't
: try to sugarcoat it.

If it makes you feel better to use the word ignorant instead of agnostic,
fine -- but I'm not sugarcoating anything.  analysis.jsp's query
analyzer output is ignorant of all the QueryParsers that might be used at
query time in the same way that its index analyzer output is ignorant
of the UpdateProcessors that might be used at index time -- in both
cases it only focuses on analysis, and tells you that given input X, the
analyzer produces output Y.

If you want to change the Lucene QueryParser then go fight that battle in
another thread -- I'm trying to have a meaningful conversation about how
we can better educate users about the distinction between Query Parsing
and Analysis, and about how we can make it clearer what analysis.jsp is
doing.

Even if you convince folks to make every change you think should be made
to the Lucene QueryParser (again: please take that up in a separate
thread) it won't change the fact that people using analysis.jsp should
understand the distinction between Query Parsing and Analysis -- unless
you plan on getting rid of every metacharacter that the Lucene QueryParser
uses to decide what types of Query to build (e.g. '+', '-', '"', '*', '?')
and unless you plan on forcing Solr users to only ever use that one
QueryParser, then no matter what the Lucene QueryParser does with
whitespace, users still need to understand the distinction between Query
Parsing and Analysis so they don't type 'Foo*' into analysis.jsp and then
ask why it says that will match food but it doesn't actually match at
query time. (Surprise, surprise: Query Parsing is not the same as analysis,
and when the QueryParser sees wildcards it doesn't use the analyzer.)


-Hoss
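
To make the wildcard point above concrete, here is a minimal toy sketch in Python. It is not Solr or Lucene code: `analyze`, `parse_term` and `matches` are hypothetical names, the analyzer is assumed to lowercase and strip punctuation, and the parser is assumed to pass trailing-`*` terms through without analysis. Real query parsers may normalize wildcard terms differently depending on their settings.

```python
def analyze(text):
    """Toy analyzer (assumption): split on whitespace, drop punctuation, lowercase."""
    cleaned = ("".join(ch for ch in tok if ch.isalnum()).lower() for tok in text.split())
    return [tok for tok in cleaned if tok]


def parse_term(raw):
    """Toy query parser (assumption): a trailing-* term skips the analyzer entirely."""
    if raw.endswith("*"):
        return ("prefix", raw[:-1])          # raw text, NOT analyzed
    return ("term", analyze(raw)[0])


def matches(query, indexed_tokens):
    """Check a parsed toy query against the tokens stored in the toy index."""
    kind, text = query
    if kind == "prefix":
        return any(tok.startswith(text) for tok in indexed_tokens)
    return text in indexed_tokens


indexed = analyze("Food")                    # what the index-time analyzer stores
print(analyze("Foo*"))                       # what an analysis-only view would report
print(matches(parse_term("Foo*"), indexed))  # the un-analyzed prefix keeps its capital F
print(matches(parse_term("foo*"), indexed))
```

In this toy model the analysis-only view reports a clean lowercase token for 'Foo*', while the parsed query still carries the raw prefix, which is the kind of mismatch the message describes.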



Re: analysis tool vs. reality

2010-08-16 Thread Robert Muir
On Mon, Aug 16, 2010 at 4:20 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

> Even if you convince folks to make every change you think should be made
> to the Lucene QueryParser (again: please take that up in a separate
> thread) it won't change the fact that people using analysis.jsp should
> understand the distinction between Query Parsing and Analysis -- unless
> you plan on getting rid of every metacharacter that the Lucene QueryParser
> uses to decide what types of Query to build (e.g. '+', '-', '"', '*', '?')
> and unless you plan on forcing Solr users to only ever use that one
> QueryParser, then no matter what the Lucene QueryParser does with
> whitespace, users still need to understand the distinction between Query
> Parsing and Analysis so they don't type 'Foo*' into analysis.jsp and then
> ask why it says that will match food but it doesn't actually match at
> query time. (Surprise, surprise: Query Parsing is not the same as analysis,
> and when the QueryParser sees wildcards it doesn't use the analyzer.)


Maybe for once your argument isn't completely bogus: the surprise is
actually key here. There's really nothing documenting the various
hacks/limitations in the queryparsers, such as auto-tokenizing on
whitespace.

I think the 'expanded terms' not being analyzed is similar: it's not really
documented well. That's probably why it comes up on the mailing list, it seems,
at least every week [at this point you have to admit, there is a problem].

If you want to say the analysis tool is agnostic about queryparsers, that's
fine, you can keep saying that. I'm saying it shouldn't be.


-- 
Robert Muir
rcm...@gmail.com


RE: analysis tool vs. reality

2010-08-16 Thread Steven A Rowe
Hi Robert,

You wrote in response to Hoss:
> Maybe for once your argument isn't completely bogus

Attacking people here is really uncalled for.

-1 from me.

Steve



Re: analysis tool vs. reality

2010-08-16 Thread Robert Muir
On Mon, Aug 16, 2010 at 5:23 PM, Steven A Rowe sar...@syr.edu wrote:

> Hi Robert,
>
> You wrote in response to Hoss:
>> Maybe for once your argument isn't completely bogus
>
> Attacking people here is really uncalled for.

actually, he asked for it:

> you're right, we should just fix the bug that the queryparser tokenizes on
> whitespace first. then analysis.jsp will be significantly less confusing.
>
> dude .. not trying to get into a holy war here

> -1 from me.


well, that might be your opinion, but it doesn't change the facts.

-- 
Robert Muir
rcm...@gmail.com


Re: analysis tool vs. reality

2010-08-13 Thread Michael McCandless
Maybe, separate from analysis.jsp (showing only how text is analyzed),
Solr needs a debug page showing the steps the field's QueryParser goes
through on a given query, to debug such tricky QueryParser/Analyzer
interactions?

We could make a wrapper around the analyzer that records each text
fragment sent to it by the QueryParser, as a start.  It'd be great to
also see it spelled out how that then resulted in a particular part of
the query.  So for query ABC12 FOO you'd see that ABC12 was sent to the
analyzer, it returned two tokens (ABC, 12), and then QueryParser made
a PhraseQuery from that, and then FOO was sent, and that turned into a
TermQuery, and default op was AND and so a top-level BooleanQuery with
2 MUST terms was created...

Mike
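
The recording-wrapper idea above can be sketched roughly as follows. This is a toy Python model, not Lucene's Analyzer or QueryParser API: `RecordingAnalyzer`, `word_delimiter_analyze` and `parse` are hypothetical names, the analyzer is assumed to split letter and digit runs, and the parser is assumed to split on whitespace first and to use AND as the default operator.

```python
import re


class RecordingAnalyzer:
    """Wraps an analyzer function and logs every fragment it is asked to analyze."""

    def __init__(self, analyze):
        self._analyze = analyze
        self.calls = []                      # list of (input_text, tokens)

    def __call__(self, text):
        tokens = self._analyze(text)
        self.calls.append((text, tokens))
        return tokens


def word_delimiter_analyze(text):
    """Toy stand-in for a word-delimiter style analyzer: split letter/digit runs."""
    return re.findall(r"[A-Za-z]+|[0-9]+", text)


def parse(query, analyzer):
    """Toy parser: split on whitespace first, then analyze each chunk separately."""
    clauses = []
    for chunk in query.split():
        tokens = analyzer(chunk)
        kind = "PhraseQuery" if len(tokens) > 1 else "TermQuery"
        clauses.append(("MUST", kind, tokens))   # default operator assumed to be AND
    return clauses


recorder = RecordingAnalyzer(word_delimiter_analyze)
parsed = parse("ABC12 FOO", recorder)
for text, tokens in recorder.calls:
    print(f"analyzer received {text!r} -> {tokens}")
for clause in parsed:
    print(clause)
```

The log in `recorder.calls` shows which fragments reached the analyzer, and `parsed` shows what the toy parser built from each one, which is the kind of step-by-step trace the proposal asks for.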




RE: analysis tool vs. reality

2010-08-13 Thread Burton-West, Tom
+1
I just had occasion to debug something where the interaction between the 
queryparser and the analyzer produced *interesting* results.  Having a separate 
jsp that includes the whole chain (i.e. analyzer/tokenizer/filter and qp) would 
be great!

Tom




Re: analysis tool vs. reality

2010-08-12 Thread Chris Hostetter

: Furthermore, I would like to add it's not just the highlight matches
: functionality that is horribly broken here, but the output of the analysis
: itself is misleading.
: 
: Let's say I take 'textTight' from the example, and add the following synonym:
: 
: this is broken => broke
: 
: The query time analysis is wrong, as it clearly shows synonymfilter
: collapsing "this is broken" to "broke", but in reality with the qp for that
: field, you are gonna get 3 separate tokenstreams and this will never
: actually happen (because the qp will divide it up on whitespace first).
: 
: So really the output from 'Query Analyzer' is completely bogus.

analysis.jsp is only intended to explain *analysis* ... it accurately
tells you what the <analyzer type="query" ...> for the specified field (or
fieldType) is going to produce given a hunk of text.

That is what it does, that is all that it does, that is all it has ever
done, and all it has ever purported to do.

You say it's bogus because the qp will divide on whitespace first -- but
you're assuming you know what query parser will be used ... the field
query parser (to name one) doesn't split on whitespace first.  That's my
point: analysis.jsp doesn't make any assumptions about what query parser
*might* be used, it just tells you what your analyzers do with strings.

Saying the output of analysis.jsp is bogus because it doesn't take into
account QueryParsing is like saying the output of stats.jsp is bogus
because those are only the stats of the local solr instance on that
machine, and it doesn't do distributed stats -- yeah that would be nice to
have, but stats.jsp never implies that's what it's giving you.

If there are ways we can make the purpose of analysis.jsp more obvious,
and less misleading for people who don't understand the distinction
between query parsing and analysis, then I am all for it.  If you really
believe getting rid of the highlight check box is going to help, then
fine -- but I have yet to see any evidence that people who don't
understand the relationship between query parsing and analysis are
confused by the blue boxes.

What people seem to be confused by is when they see the same tokens
ultimately produced by both the index analyzer and the query analyzer
-- it doesn't matter if those tokens are in blue or not, if they see that
the tokens in the index analyzer output are a superset of the tokens in
the query analyzer output then they tend to assume that means searching
for the string in the query box will match documents containing the
string in the index text box.

Getting rid of the blue table cell is just going to make it harder to
notice matching tokens in the output -- not reduce the confusion when
those matching tokens exist in the output.

My question is: What can we do to make it more clear what the *purpose* of
analysis.jsp is?  Is there verbiage we can add to the page to make it more
obvious?

NOTE: I'm not just asking Robert, this is a question for the solr-user
community as a whole.  I *know* what analysis.jsp is for, I've never been
confused -- for people who have been confused in the past (or are still
confused) please help us understand what type of changes we could make to
the output of analysis.jsp to make its functionality more understandable.



-Hoss



Re: analysis tool vs. reality

2010-08-12 Thread Robert Muir
On Thu, Aug 12, 2010 at 7:55 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

> You say it's bogus because the qp will divide on whitespace first -- but
> you're assuming you know what query parser will be used ... the field
> query parser (to name one) doesn't split on whitespace first.  That's my
> point: analysis.jsp doesn't make any assumptions about what query parser
> *might* be used, it just tells you what your analyzers do with strings.


you're right, we should just fix the bug that the queryparser tokenizes on
whitespace first. then analysis.jsp will be significantly less confusing.


-- 
Robert Muir
rcm...@gmail.com


Re: analysis tool vs. reality

2010-08-12 Thread Chris Hostetter

: > You say it's bogus because the qp will divide on whitespace first -- but
: > you're assuming you know what query parser will be used ... the field
: > query parser (to name one) doesn't split on whitespace first.  That's my
: > point: analysis.jsp doesn't make any assumptions about what query parser
: > *might* be used, it just tells you what your analyzers do with strings.
: 
: 
: you're right, we should just fix the bug that the queryparser tokenizes on
: whitespace first. then analysis.jsp will be significantly less confusing.

dude .. not trying to get into a holy war here

even if you change the Lucene QueryParser so that whitespace isn't a meta
character it doesn't affect the underlying issue: analysis.jsp is agnostic
about QueryParsers.  Some other QParser the user uses might have other
special behavior and if people don't understand the distinction between
QueryParsing and analysis they can still be confused -- hell, even if the
only QParser anyone ever uses is the lucene QParser, and even if you get
the QueryParser changed so that whitespace isn't a metacharacter, we
are still going to be left with the fact that *other* characters (like '+'
and '-' and '"' and '*' and ...) are metacharacters for that query parser,
and have special meaning.

analysis.jsp isn't going to know about those, or do anything special for
them -- so people can still be easily confused when analysis.jsp says
one thing about how the string +foo* -bar gets analyzed, but that
string as a query means something completely different.

Hence my point: leave arguments about QueryParser out of it -- how do we 
make the function of analysis.jsp more clear?


-Hoss



Re: analysis tool vs. reality

2010-08-12 Thread Robert Muir
On Thu, Aug 12, 2010 at 8:07 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

> : > You say it's bogus because the qp will divide on whitespace first -- but
> : > you're assuming you know what query parser will be used ... the field
> : > query parser (to name one) doesn't split on whitespace first.  That's my
> : > point: analysis.jsp doesn't make any assumptions about what query parser
> : > *might* be used, it just tells you what your analyzers do with strings.
> :
> : you're right, we should just fix the bug that the queryparser tokenizes on
> : whitespace first. then analysis.jsp will be significantly less confusing.
>
> dude .. not trying to get into a holy war here

actually I'm suggesting the practical solution: that we fix the primary
problem that makes it confusing.

> even if you change the Lucene QueryParser so that whitespace isn't a meta
> character it doesn't affect the underlying issue: analysis.jsp is agnostic
> about QueryParsers.

analysis.jsp isn't agnostic about queryparsers, it's ignorant of them, and
your default queryparser is actually a de-facto whitespace tokenizer, don't
try to sugarcoat it.

-- 
Robert Muir
rcm...@gmail.com


analysis tool vs. reality

2010-08-04 Thread Justin Lolofie
Erik: Yes, I did re-index if that means adding the document again.
Here are the exact steps I took:

1. analysis.jsp ABC12 does NOT match title ABC12 (however, ABC or 12 does)
2. changed schema.xml WordDelimiterFilterFactory catenate-all
3. restarted tomcat
4. deleted the document with title ABC12
5. added the document with title ABC12
6. query ABC12 does NOT result in the document with title ABC12
7. analysis.jsp ABC12 DOES match that document now

Is there any way to see, given an ID, how something is indexed internally?

Lance: I understand the index/query sections of analysis.jsp. However,
it operates on text that you enter into the form, not on actual index
data. Since all my documents have a unique ID, I'd like to supply an
ID and a query, and get back the same index/query sections, using
what's actually in the index.


-- Forwarded message --
From: Erik Hatcher erik.hatc...@gmail.com
To: solr-user@lucene.apache.org
Date: Tue, 3 Aug 2010 22:43:17 -0400
Subject: Re: analysis tool vs. reality
Did you reindex after changing the schema?


On Aug 3, 2010, at 7:35 PM, Justin Lolofie wrote:

Hi Erik, thank you for replying. So, turning on debugQuery shows
information about how the query is processed; is there a way to see
how things are stored internally in the index?

My query is ABC12. There is a document whose title field is
ABC12. However, I can only get it to match if I search for ABC or
12. This was also true in the analysis tool up until recently.
However, I changed schema.xml and turned on catenate-all in
WordDelimiterFilterFactory for title fieldtype. Now, in the analysis
tool ABC12 matches ABC12. However, when doing an actual query, it
does not match.

Thank you for any help,
Justin


-- Forwarded message --
From: Erik Hatcher erik.hatc...@gmail.com
To: solr-user@lucene.apache.org
Date: Tue, 3 Aug 2010 16:50:06 -0400
Subject: Re: analysis tool vs. reality
The analysis tool is merely that, but during querying there is also a
query parser involved.  Adding debugQuery=true to your request will
give you the parsed query in the response, offering insight into what
might be going on.   Could be lots of things, from not querying the
fields you think you are to a misunderstanding about some text not
being analyzed (like wildcard clauses).

 Erik
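
As a small illustration of the debugQuery=true suggestion above, this Python sketch only builds the request URL. The host, port, handler path and the `title:ABC12` query are assumptions chosen for illustration; only the `debugQuery=true` parameter comes from the message itself.

```python
from urllib.parse import urlencode

# Assumed host, port and handler path; adjust to the actual Solr deployment.
base_url = "http://localhost:8983/solr/select"

# "title:ABC12" is an illustrative query; debugQuery=true asks Solr to include
# the parsed query in the response.
params = {"q": "title:ABC12", "debugQuery": "true"}

debug_url = base_url + "?" + urlencode(params)
print(debug_url)
```

Fetching that URL against a running instance and looking at the parsed query in the debug section of the response shows what the query parser actually built, as opposed to what the analysis tool reports for the raw text.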

On Aug 3, 2010, at 4:43 PM, Justin Lolofie wrote:

  Hello,

  I have found the analysis tool in the admin page to be very useful in
  understanding my schema. I've made changes to my schema so that a
  particular case I'm looking at matches properly. I restarted solr,
  deleted the document from the index, and added it again. But still,
  when I do a query, the document does not get returned in the results.

  Does anyone have any tips for debugging this sort of issue? What is
  different between what I see in analysis tool and new documents added
  to the index?

  Thanks,
  Justin


Re: analysis tool vs. reality

2010-08-04 Thread Robert Muir
I think I agree with Justin here: the way the analysis tool highlights
'matches' is extremely misleading, especially considering it completely
ignores query parsing.

It would be better if it put your text in a MemoryIndex and actually parsed
the query w/ the queryparser, ran it, and used the highlighter to try to show
any matches.



-- 
Robert Muir
rcm...@gmail.com


analysis tool vs. reality

2010-08-04 Thread Justin Lolofie
Wow, I got to work this morning and my query results now include the
'ABC12' document. I'm not sure what that means. Either I made a
mistake in the process I described in the last email (I don't think
this is the case) or there is some kind of caching of query results
going on that doesn't get flushed on a restart of tomcat.






Re: analysis tool vs. reality

2010-08-04 Thread Shalin Shekhar Mangar
On Wed, Aug 4, 2010 at 7:52 PM, Robert Muir rcm...@gmail.com wrote:

> I think I agree with Justin here, I think the way analysis tool highlights
> 'matches' is extremely misleading, especially considering it completely
> ignores queryparsing.
>
> it would be better if it put your text in a memoryindex and actually parsed
> the query w/ queryparser, ran it, and used the highlighter to try to show
> any matches.


+1

-- 
Regards,
Shalin Shekhar Mangar.


Re: analysis tool vs. reality

2010-08-04 Thread Chris Hostetter

: I think I agree with Justin here, I think the way analysis tool highlights
: 'matches' is extremely misleading, especially considering it completely
: ignores queryparsing.

It really only attempts to identify when there is overlap between
analysis at query time and at indexing time so you can easily spot when
one analyzer or the other breaks things so that they no longer line up
(or when it fixes things so they start to line up).

Even if we eliminated that highlighting as misleading, people would still
do it in their minds, it would just be harder -- it doesn't change the
underlying fact that analysis is only part of the picture.
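
The overlap check described above amounts to something like the following toy Python sketch. It is not the analysis.jsp implementation: `toy_analyze` and `highlight_overlap` are hypothetical names, and the analyzer is assumed to do nothing more than a whitespace split plus lowercasing.

```python
def toy_analyze(text):
    """Hypothetical analyzer: whitespace split plus lowercasing."""
    return [tok.lower() for tok in text.split()]


def highlight_overlap(index_text, query_text, index_analyzer, query_analyzer):
    """Flag each index-time token that also appears among the query-time tokens."""
    query_tokens = set(query_analyzer(query_text))
    return [(tok, tok in query_tokens) for tok in index_analyzer(index_text)]


flags = highlight_overlap("Solr Analysis Tool", "analysis", toy_analyze, toy_analyze)
print(flags)
```

Note that nothing in this comparison involves a query parser; it only compares two analyzer outputs, which is why a flagged token is a hint about analysis lining up and not a guarantee that a real query will match.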

: it would be better if it put your text in a memoryindex and actually parsed
: the query w/ queryparser, ran it, and used the highlighter to try to show
: any matches.

That level of query explanation really only works if the user gives us a
full document (all fields, not just one) and a full query string, and all
of the possible query params -- because the query parser (either implicit
because of config, or explicitly specified by the user) might change its
behavior based on those other params.

I agree with you: debugging functionality along the lines of what you are
describing would be *VASTLY* more useful than what we've got right now,
and is something I briefly looked into doing before as an extension of the
existing DebugComponent...

   https://issues.apache.org/jira/browse/SOLR-1749

...the problems I encountered trying to do it as a debug component on
a real Solr request seem like they would also be problems for a
MemoryIndex based admin tool approach like what you suggest -- but if
you've got ideas on working around them I am 100% interested.

Independent of how we might create a better QueryParser + Analysis
Explanation tool / debug component is the question of what we can do to
make it more clear what exactly the analysis.jsp page is doing and what
people can infer from that page.  As I said, I don't think removing the
match highlighting will actually reduce confusion, but perhaps there is
verbiage/disclaimers that could be added to make it more clear?



-Hoss



Re: analysis tool vs. reality

2010-08-04 Thread Robert Muir
Furthermore, I would like to add it's not just the highlight matches
functionality that is horribly broken here, but the output of the analysis
itself is misleading.

Let's say I take 'textTight' from the example, and add the following synonym:

this is broken => broke

The query time analysis is wrong, as it clearly shows synonymfilter
collapsing "this is broken" to "broke", but in reality with the qp for that
field, you are gonna get 3 separate tokenstreams and this will never
actually happen (because the qp will divide it up on whitespace first).

So really the output from 'Query Analyzer' is completely bogus.
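
The multiword synonym case above can be modelled with a small Python toy. This is not Solr's SynonymFilter or query parser: `SYNONYMS` and `analyze` are hypothetical, the analyzer is assumed to lowercase and apply one phrase-to-token mapping, and the "parser" is assumed to be a plain whitespace split before analysis.

```python
# Models the rule "this is broken => broke" as a phrase-to-replacement mapping.
SYNONYMS = {("this", "is", "broken"): ["broke"]}


def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, collapse known phrases."""
    tokens = text.lower().split()
    out, i = [], 0
    while i < len(tokens):
        for phrase, replacement in SYNONYMS.items():
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                out.extend(replacement)
                i += len(phrase)
                break
        else:
            # No synonym phrase starts at this position; keep the token as-is.
            out.append(tokens[i])
            i += 1
    return out


# Analysis-only view: the whole string reaches the analyzer at once.
whole = analyze("this is broken")

# Parser-first view: each whitespace-separated chunk is analyzed on its own.
split_first = [analyze(chunk) for chunk in "this is broken".split()]

print(whole)
print(split_first)
```

In the first case the analyzer sees the full phrase and the rule fires; in the second it only ever sees one word at a time, so the three-word rule can never match.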

On Wed, Aug 4, 2010 at 1:57 PM, Robert Muir rcm...@gmail.com wrote:



 On Wed, Aug 4, 2010 at 1:45 PM, Chris Hostetter 
 hossman_luc...@fucit.orgwrote:


 it really only attempts to identify when there is overlap between
 analaysis at query time and at indexing time so you can easily spot when
 one analyzer or the other breaks things so that they no longer line up
 (or when it fiexes things so they start to line up)


 It attempts badly, because it only works in the most trivial of cases
 (e.g. doesnt reflect the interaction of queryparser with multiword synonyms
 or worddelimiterfilter).

 Since Solr includes these non-trivial analysis components *in the example*
 it means that this 'highlight matches' doesnt actually even really work at
 all.

 Someone is gonna use this thing when they dont understand why analysis isnt
 doing what they want, i.e. the cases like I outlined above.

 For the trivial cases where it does work the 'highlight matches' isnt
 useful anyway, so in its current state its completely unnecessary.


 Even if we eliminated that highlighting as missleading, people would still
 do it in thier minds, it would just be harder -- it doesn't change the
 underlying fact that analysis is only part of the picture.


 I'm not suggesting that. I'm suggesting fixing the highlighting so its not
 misleading. There are really only two choices:
 1. remove the current highlighting
 2. fix it.

 in its current state its completely useless and misleading, except for very
 trivial cases, in which you dont need it anyway.



 : it would be better if it put your text in a memoryindex and actually
 parsed
 : the query w/ queryparser, ran it, and used the highlighter to try to
 show
 : any matches.

 That level of query explanation really only works if the user gives us a
 full document (all fields, not just one) and a full query string, and all
 of the possible query params -- because the query parser (either implicit
 because of config, or explicitly specified by the user) might change its
 behavior based on those other params.


 that's true, but I don't see why the user couldn't be allowed to provide just
 this.
 I'd bet money a lot of people are using this thing with a specific
 query/document in mind anyway!


 people can infer from that page.  As i said, i don't think removing the
 match highlighting will actually reduce confusion, but perhaps there is
 verbiage/disclaimers that could be added to make it more clear?


 As i said before, I think i disagree with you. I think for stuff like this
 the technicals are less important, what's important is this is a misleading
 checkbox that really confuses users.

 I suggest disabling it entirely, you are only going to remove confusion.


 --
 Robert Muir
 rcm...@gmail.com




-- 
Robert Muir
rcm...@gmail.com


Re: analysis tool vs. reality

2010-08-04 Thread Lance Norskog
there is some kind of caching of query results
going on that doesnt get flushed on a restart of tomcat.

Yes. Solr by default has http caching on if there is no configuration,
and the example solrconfig.xml has it configured on. You should edit
solrconfig.xml to use the alternative described in the comments.

On Wed, Aug 4, 2010 at 7:55 AM, Justin Lolofie jta...@gmail.com wrote:
 Wow, I got to work this morning and my query results now include the
 'ABC12' document. I'm not sure what that means. Either I made a
 mistake in the process I described in the last email (I don't think
 this is the case) or there is some kind of caching of query results
 going on that doesn't get flushed on a restart of tomcat.




 Erik: Yes, I did re-index if that means adding the document again.
 Here are the exact steps I took:

 1. analysis.jsp ABC12 does NOT match title ABC12 (however, ABC or 12 does)
 2. changed schema.xml WordDelimiterFilterFactory catenate-all
 3. restarted tomcat
 4. deleted the document with title ABC12
 5. added the document with title ABC12
 6. query ABC12 does NOT result in the document with title ABC12
 7. analysis.jsp ABC12 DOES match that document now

 Is there any way to see, given an ID, how something is indexed internally?

 Lance: I understand the index/query sections of analysis.jsp. However,
 it operates on text that you enter into the form, not on actual index
 data. Since all my documents have a unique ID, I'd like to supply an
 ID and a query, and get back the same index/query sections- using
 what's actually in the index.


 -- Forwarded message --
 From: Erik Hatcher erik.hatc...@gmail.com
 To: solr-user@lucene.apache.org
 Date: Tue, 3 Aug 2010 22:43:17 -0400
 Subject: Re: analysis tool vs. reality
 Did you reindex after changing the schema?


 On Aug 3, 2010, at 7:35 PM, Justin Lolofie wrote:

    Hi Erik, thank you for replying. So, turning on debugQuery shows
    information about how the query is processed- is there a way to see
    how things are stored internally in the index?

    My query is ABC12. There is a document whose title field is
    ABC12. However, I can only get it to match if I search for ABC or
    12. This was also true in the analysis tool up until recently.
    However, I changed schema.xml and turned on catenate-all in
    WordDelimiterFilterFactory for title fieldtype. Now, in the analysis
    tool ABC12 matches ABC12. However, when doing an actual query, it
    does not match.

    Thank you for any help,
    Justin


    -- Forwarded message --
    From: Erik Hatcher erik.hatc...@gmail.com
    To: solr-user@lucene.apache.org
    Date: Tue, 3 Aug 2010 16:50:06 -0400
    Subject: Re: analysis tool vs. reality
    The analysis tool is merely that, but during querying there is also a
    query parser involved.  Adding debugQuery=true to your request will
    give you the parsed query in the response offering insight into what
    might be going on.   Could be lots of things, like not querying the
    fields you think you are to a misunderstanding about some text not
    being analyzed (like wildcard clauses).

         Erik

    On Aug 3, 2010, at 4:43 PM, Justin Lolofie wrote:

      Hello,

      I have found the analysis tool in the admin page to be very useful in
      understanding my schema. I've made changes to my schema so that a
      particular case I'm looking at matches properly. I restarted solr,
      deleted the document from the index, and added it again. But still,
      when I do a query, the document does not get returned in the results.

      Does anyone have any tips for debugging this sort of issue? What is
      different between what I see in analysis tool and new documents added
      to the index?

      Thanks,
      Justin




-- 
Lance Norskog
goks...@gmail.com


analysis tool vs. reality

2010-08-03 Thread Justin Lolofie
Hello,

I have found the analysis tool in the admin page to be very useful in
understanding my schema. I've made changes to my schema so that a
particular case I'm looking at matches properly. I restarted solr,
deleted the document from the index, and added it again. But still,
when I do a query, the document does not get returned in the results.

Does anyone have any tips for debugging this sort of issue? What is
different between what I see in analysis tool and new documents added
to the index?

Thanks,
Justin


Re: analysis tool vs. reality

2010-08-03 Thread Erik Hatcher
The analysis tool is merely that, but during querying there is also a  
query parser involved.  Adding debugQuery=true to your request will  
give you the parsed query in the response offering insight into what  
might be going on.   Could be lots of things, like not querying the  
fields you think you are to a misunderstanding about some text not  
being analyzed (like wildcard clauses).
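
A minimal sketch of building such a request URL in Python. The host, port,
and /solr/select path are assumptions about a default-style install, and
debug_query_url is a hypothetical helper; only the debugQuery=true parameter
and the title field come from this thread.

```python
from urllib.parse import urlencode

# Assumed values: adjust host, port, and path to your own install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def debug_query_url(query, base_url=SOLR_SELECT_URL):
    """Build a select URL that asks Solr to include debug output."""
    params = {"q": query, "debugQuery": "true"}
    return base_url + "?" + urlencode(params)


print(debug_query_url("title:ABC12"))
```

Fetching that URL should return the normal results plus a debug section;
the parsed query there is the thing to compare against what the analysis
page shows.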


Erik

On Aug 3, 2010, at 4:43 PM, Justin Lolofie wrote:


Hello,

I have found the analysis tool in the admin page to be very useful in
understanding my schema. I've made changes to my schema so that a
particular case I'm looking at matches properly. I restarted solr,
deleted the document from the index, and added it again. But still,
when I do a query, the document does not get returned in the results.

Does anyone have any tips for debugging this sort of issue? What is
different between what I see in analysis tool and new documents added
to the index?

Thanks,
Justin




analysis tool vs. reality

2010-08-03 Thread Justin Lolofie
Hi Erik, thank you for replying. So, turning on debugQuery shows
information about how the query is processed- is there a way to see
how things are stored internally in the index?

My query is ABC12. There is a document whose title field is
ABC12. However, I can only get it to match if I search for ABC or
12. This was also true in the analysis tool up until recently.
However, I changed schema.xml and turned on catenate-all in
WordDelimiterFilterFactory for title fieldtype. Now, in the analysis
tool ABC12 matches ABC12. However, when doing an actual query, it
does not match.
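
For readers following along, a toy Python model of the splitting described
above. It is not the real WordDelimiterFilter, which has many more options;
word_delimiter and its catenate_all flag are hypothetical names standing in
for the schema.xml setting (spelled catenateAll there).

```python
import re


def word_delimiter(token, catenate_all=False):
    """Split a token at letter/digit boundaries; optionally add the joined form."""
    parts = re.findall(r"[A-Za-z]+|[0-9]+", token)
    out = list(parts)
    if catenate_all and len(parts) > 1:
        out.append("".join(parts))
    return out


print(word_delimiter("ABC12"))
print(word_delimiter("ABC12", catenate_all=True))
```

In this model only the second call emits a token for the whole string, and
only documents indexed after the change would carry it.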

Thank you for any help,
Justin


-- Forwarded message --
From: Erik Hatcher erik.hatc...@gmail.com
To: solr-user@lucene.apache.org
Date: Tue, 3 Aug 2010 16:50:06 -0400
Subject: Re: analysis tool vs. reality
The analysis tool is merely that, but during querying there is also a
query parser involved.  Adding debugQuery=true to your request will
give you the parsed query in the response offering insight into what
might be going on.   Could be lots of things, like not querying the
fields you think you are to a misunderstanding about some text not
being analyzed (like wildcard clauses).

   Erik

On Aug 3, 2010, at 4:43 PM, Justin Lolofie wrote:

Hello,

I have found the analysis tool in the admin page to be very useful in
understanding my schema. I've made changes to my schema so that a
particular case I'm looking at matches properly. I restarted solr,
deleted the document from the index, and added it again. But still,
when I do a query, the document does not get returned in the results.

Does anyone have any tips for debugging this sort of issue? What is
different between what I see in analysis tool and new documents added
to the index?

Thanks,
Justin


Re: analysis tool vs. reality

2010-08-03 Thread Lance Norskog
This is the 'index' part of the analysis.jsp page. You can ask how the
text is indexed as well as how it is turned into a query.

On Tue, Aug 3, 2010 at 4:35 PM, Justin Lolofie jta...@gmail.com wrote:
 Hi Erik, thank you for replying. So, turning on debugQuery shows
 information about how the query is processed- is there a way to see
 how things are stored internally in the index?

 My query is ABC12. There is a document whose title field is
 ABC12. However, I can only get it to match if I search for ABC or
 12. This was also true in the analysis tool up until recently.
 However, I changed schema.xml and turned on catenate-all in
 WordDelimiterFilterFactory for title fieldtype. Now, in the analysis
 tool ABC12 matches ABC12. However, when doing an actual query, it
 does not match.

 Thank you for any help,
 Justin


 -- Forwarded message --
 From: Erik Hatcher erik.hatc...@gmail.com
 To: solr-user@lucene.apache.org
 Date: Tue, 3 Aug 2010 16:50:06 -0400
 Subject: Re: analysis tool vs. reality
 The analysis tool is merely that, but during querying there is also a
 query parser involved.  Adding debugQuery=true to your request will
 give you the parsed query in the response offering insight into what
 might be going on.   Could be lots of things, like not querying the
 fields you think you are to a misunderstanding about some text not
 being analyzed (like wildcard clauses).

       Erik

 On Aug 3, 2010, at 4:43 PM, Justin Lolofie wrote:

    Hello,

    I have found the analysis tool in the admin page to be very useful in
    understanding my schema. I've made changes to my schema so that a
    particular case I'm looking at matches properly. I restarted solr,
    deleted the document from the index, and added it again. But still,
    when I do a query, the document does not get returned in the results.

    Does anyone have any tips for debugging this sort of issue? What is
    different between what I see in analysis tool and new documents added
    to the index?

    Thanks,
    Justin




-- 
Lance Norskog
goks...@gmail.com


Re: analysis tool vs. reality

2010-08-03 Thread Erik Hatcher

Did you reindex after changing the schema?


On Aug 3, 2010, at 7:35 PM, Justin Lolofie wrote:


Hi Erik, thank you for replying. So, turning on debugQuery shows
information about how the query is processed- is there a way to see
how things are stored internally in the index?

My query is ABC12. There is a document whose title field is
ABC12. However, I can only get it to match if I search for ABC or
12. This was also true in the analysis tool up until recently.
However, I changed schema.xml and turned on catenate-all in
WordDelimiterFilterFactory for title fieldtype. Now, in the analysis
tool ABC12 matches ABC12. However, when doing an actual query, it
does not match.

Thank you for any help,
Justin


-- Forwarded message --
From: Erik Hatcher erik.hatc...@gmail.com
To: solr-user@lucene.apache.org
Date: Tue, 3 Aug 2010 16:50:06 -0400
Subject: Re: analysis tool vs. reality
The analysis tool is merely that, but during querying there is also a
query parser involved.  Adding debugQuery=true to your request will
give you the parsed query in the response offering insight into what
might be going on.   Could be lots of things, like not querying the
fields you think you are to a misunderstanding about some text not
being analyzed (like wildcard clauses).

  Erik

On Aug 3, 2010, at 4:43 PM, Justin Lolofie wrote:

   Hello,

   I have found the analysis tool in the admin page to be very useful in
   understanding my schema. I've made changes to my schema so that a
   particular case I'm looking at matches properly. I restarted solr,
   deleted the document from the index, and added it again. But still,
   when I do a query, the document does not get returned in the results.

   Does anyone have any tips for debugging this sort of issue? What is
   different between what I see in analysis tool and new documents added
   to the index?

   Thanks,
   Justin