Re: No Analyzer, tokenizer or stemmer works at Solr

2010-01-17 Thread Chris Hostetter

: Imagine there is a query like "harry potter dvd-collection cheap" or "cheap
: Harry Potter dvd-collection".
: How can I customize Solr so that, if the query says something about the
: category cheap, it uses a faceting query on cat:cheap? To do so, I have to
: alter the original query - how can I do that?

TMTOWTDI

One solution would be a QParserPlugin ... it's utilized by the 
QueryComponent to decide how to parse the query string.

Or you could write your own SearchComponent to use in place of the 
QueryComponent; then you could not only modify the way the string is 
parsed, but also modify the DocSet/DocList any way you want.
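
As a rough illustration of the query-rewriting idea, here is a sketch of the logic in Python. It is not QParserPlugin or SearchComponent code (those would be written in Java against Solr's plugin API); it only shows how category keywords could be pulled out of the user's query string and turned into a filter query. The `cat` field comes from the question above; the names `rewrite_query` and `CATEGORY_WORDS` are hypothetical.

```python
# Hypothetical sketch: move category keywords from the user query into a filter.
# CATEGORY_WORDS and rewrite_query are illustrative names, not Solr APIs.
CATEGORY_WORDS = {"cheap", "expensive"}


def rewrite_query(user_query):
    """Split a raw query into Solr-style 'q' and 'fq' request parameters."""
    kept_terms = []
    filters = []
    for term in user_query.split():
        if term.lower() in CATEGORY_WORDS:
            # Category words become a filter on the 'cat' field.
            filters.append("cat:" + term.lower())
        else:
            kept_terms.append(term)
    return {"q": " ".join(kept_terms), "fq": filters}


params = rewrite_query("cheap Harry Potter dvd-collection")
print(params)
```

Inside Solr the same decision would be made in the custom parser or component before the main query is executed; where exactly is a design choice this sketch does not settle.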


-Hoss



Re: No Analyzer, tokenizer or stemmer works at Solr

2010-01-11 Thread MitchK

Hello Hossman,

sorry for my late response.

For this specific case, you are right. It makes more sense to do such work
on the fly.
However, at the moment I am only testing what one can and cannot do with
Solr.

Is the UpdateProcessor something that comes from Lucene itself or from
Solr?

Thanks!



-- 
View this message in context: 
http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27109760.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: No Analyzer, tokenizer or stemmer works at Solr

2010-01-11 Thread Erik Hatcher


On Jan 11, 2010, at 7:33 AM, MitchK wrote:

 Is the UpdateProcessor something that comes from Lucene itself or
 from Solr?


It's at the Solr level - http://lucene.apache.org/solr/api/org/apache/solr/update/processor/UpdateRequestProcessor.html 



Erik



Re: No Analyzer, tokenizer or stemmer works at Solr

2010-01-11 Thread MitchK

Is there any overview that explains which class is responsible for which
stage of processing my data on its way into the index?

My example was: I have categorized whether something is cheap or expensive.
Let's say I didn't do that on the fly, but with the help of the
UpdateRequestProcessor.
Imagine there is a query like "harry potter dvd-collection cheap" or "cheap
Harry Potter dvd-collection".
How can I customize Solr so that, if the query says something about the
category cheap, it uses a faceting query on cat:cheap? To do so, I have to
alter the original query - how can I do that?
 


-- 
View this message in context: 
http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27111504.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: No Analyzer, tokenizer or stemmer works at Solr

2010-01-08 Thread MitchK

Okay, you're right. It really would be cleaner to do such things in the
code that populates the documents sent to Solr.

Is there a way to prepare a document in the described way with Lucene/Solr
before it is analyzed?
My use case is to categorize several documents automatically, which
means that I have to derive data from the given input by doing some
information retrieval.

The problem is that I am really new to Solr and Lucene - as you can see - and
I do not know whether there are classes that fit my needs.

Any idea?



Re: No Analyzer, tokenizer or stemmer works at Solr

2010-01-08 Thread Erick Erickson

Somewhere, you have to create the document XML you
send to SOLR. Just add the calculated data to
your new field there...
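
A minimal sketch of that idea in Python, assuming the documents are posted as Solr's XML `<add>` messages. The field names (`id`, `title`, `price`, `price_category`) and the helper `build_add_xml` are assumptions made up for illustration; the threshold of 500 is the one used in the example from this thread.

```python
import xml.etree.ElementTree as ET

EXPENSIVE_ABOVE = 500  # threshold taken from the example in this thread


def build_add_xml(doc):
    """Return a Solr <add> message for one document, with a computed field."""
    fields = dict(doc)
    # The computed value is added on the client, before Solr sees the document.
    fields["price_category"] = (
        "expensive" if doc["price"] > EXPENSIVE_ABOVE else "cheap"
    )
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field_el = ET.SubElement(doc_el, "field", name=name)
        field_el.text = str(value)
    return ET.tostring(add, encoding="unicode")


xml_message = build_add_xml(
    {"id": "1", "title": "Harry Potter dvd-collection", "price": 650}
)
print(xml_message)
```

The schema would need a matching field for the computed value; whether it is indexed, stored, or both depends on whether you want to search on it or only display it.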

HTH
Erick


Re: No Analyzer, tokenizer or stemmer works at Solr

2010-01-08 Thread Chris Hostetter

: Is there a way to prepare a document the described way with Lucene/Solr,
: before I analyze it?
: My use case is to categorize several documents in an automatic way, which
: includes that I have to create data from the given input doing some
: information retrieval.

As Ryan mentioned earlier: this is what the UpdateRequestProcessor API 
is for -- it allows you to modify Documents (regardless of how they were 
added: csv, xml, dih) prior to Solr processing them...

http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-to27026739.html

Personally, i think you may be looking at your problem from the wrong 
direction...

:  Imagine you would analyze, index and store them like you normally do and
:  afterwards you want to set, whether the document belongs to the expensive
:  item-group or not.
:  If the price for the item is higher than 500$, it belongs to the
:  expensive
:  ones, otherwise not.

...for a situation like that, i wouldn't attempt to classify the docs as 
expensive or cheap when adding them.  instead i would use numeric 
ranges for faceting and filtering to show me how many docs were 
expensive or cheap at query time -- that way when the economy tanks i 
can redefine my definition of expensive on the fly w/o needing to 
reindex a million documents.
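
A hedged sketch of what such query-time ranges could look like, written in Python. It only builds request parameters: `facet`, `facet.query` and `fq` are standard Solr parameters, while the numeric field name `price`, the helper names, and the default threshold are assumptions for illustration. The exclusive-range syntax `{500 TO *}` should be checked against the Solr version in use.

```python
from urllib.parse import urlencode


def price_facet_params(user_query, threshold=500):
    """Build request parameters that count cheap and expensive hits at query time."""
    return [
        ("q", user_query),
        ("facet", "true"),
        # Inclusive range: everything up to and including the threshold.
        ("facet.query", "price:[* TO %d]" % threshold),
        # Exclusive range: everything strictly above the threshold.
        ("facet.query", "price:{%d TO *}" % threshold),
    ]


def expensive_only_filter(threshold=500):
    """Filter query that restricts results to the expensive group."""
    return ("fq", "price:{%d TO *}" % threshold)


params = price_facet_params("harry potter")
params.append(expensive_only_filter())
print(urlencode(params))
```

Because the threshold is just a request parameter here, changing the definition of "expensive" means changing one number in the client, with no reindexing.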



-Hoss



Re: No Analyzer, tokenizer or stemmer works at Solr

2010-01-07 Thread MitchK

Erik,

you mean, everything is okay, but I do not see it?

 Internally for searching the analysis takes place and writes to the
 index in an inverted fashion, but the stored stuff is left alone.

If I use an analyzer, Solr stores its output in two ways?
One public output, which is similar to the original input,
and one hidden or internal output, which is based on the analyzer's work?
Did I understand that right?

If yes, I have got another problem:
I don't want to waste any disk space. Does the copyField directive store the
same data twice?
I mean: I have got originalField and copiedField. originalField gets indexed
with text_analyzer and copiedField with a stemmer. Does this mean I am
storing the original data twice publicly and once analyzed per analyzer?
Or does Solr store the original input only once and make a reference to
the public data of the originalField?

Thank you
Mitch


Erik Hatcher-4 wrote:
 
 Mitch,
 
 Again, I think you're misunderstanding what analysis does.  You must  
 be expecting we think, though you've not provided exact duplication  
 steps to be sure, that the value you get back from Solr is the  
 analyzer processed output.  It's not, it's exactly what you provide.   
 Internally for searching the analysis takes place and writes to the  
 index in an inverted fashion, but the stored stuff is left alone.
 
 There's some thinking going on about implementing it such that analyzed
 output is stored.
 
 You can, however, use the analysis request handler componentry to get  
 analyzed stuff back as you see it in analysis.jsp on a per-document or  
 per-field text basis - if you're looking to leverage the analyzer  
 output in that fashion from a client.
 
   Erik
 
 On Jan 7, 2010, at 1:21 AM, MitchK wrote:
 

 Hello Erick,

 thank you for answering.

 I can do whatever I want - Solr does nothing.
 For example: if I use the predefined textgen fieldtype, nothing
 happens to the text. Even the stopFilter is not working - no stopword from
 stopword.txt was removed. I think that this only affects the index,
 because if I query for "for", it returns nothing, which is quite correct,
 due to the work of the stopFilter.

 Everything works fine on analysis.jsp, but not in reality.

 If you have got any test data you want me to add, please tell me and I
 will show you the saved data afterwards.

 Thank you.

 Mitch


 Erick Erickson wrote:

 Well, I have noticed that Solr isn't using ANY analyzer

 How do you know this? Because it's highly unlikely that SOLR
 is completely broken on that level.

 Erick

 On Wed, Jan 6, 2010 at 3:48 PM, MitchK mitc...@web.de wrote:


 I have tested a lot and all the time I thought I set wrong options for my
 custom analyzer.
 Well, I have noticed that Solr isn't using ANY analyzer, filter or
 stemmer.
 It seems like it only stores the original input.

 I am using the example-configuration of the current Solr 1.4 release.
 What's wrong?

 Thank you!

 
 
 

-- 
View this message in context: 
http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27062080.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: No Analyzer, tokenizer or stemmer works at Solr

2010-01-07 Thread Ryan McKinley


On Jan 7, 2010, at 10:50 AM, MitchK wrote:

 Eric,

 you mean, everything is okay, but I do not see it?

  Internally for searching the analysis takes place and writes to the
  index in an inverted fashion, but the stored stuff is left alone.

 if I use an analyzer, Solr stores its output two ways?
 One public output, which is similar to the original input
 and one hidden or internal output, which is based on the
 analyzer's work?
 Did I understand that right?

yes.

indexed fields and stored fields are different.

Solr results show stored fields in the results (however facets are
based on indexed fields)

Take a look at Lucene in Action for a better description of what is
happening.  The best tool to get your head around what is happening is
probably luke (http://www.getopt.org/luke/)

 If yes, I have got another problem:
 I don't want to waste any diskspace.

You have control over what is stored and what is indexed -- how that
is configured is up to you.


ryan


Re: No Analyzer, tokenizer or stemmer works at Solr

2010-01-07 Thread MitchK

Thank you, Ryan. I will have a look at the Lucene material and Luke.

I think I got it. :)

Sometimes there will be a need to return both the stored value and the
indexed version of the value.
How can I fulfill such needs? By using copyField on indexed-only fields?




-- 
View this message in context: 
http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27063452.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: No Analyzer, tokenizer or stemmer works at Solr

2010-01-07 Thread Ryan McKinley


On Jan 7, 2010, at 12:11 PM, MitchK wrote:

 Thank you, Ryan. I will have a look at the Lucene material and Luke.

 I think I got it. :)

 Sometimes there will be a need to return both the stored value and the
 indexed version of the value.
 How can I fulfill such needs? By using copyField on indexed-only fields?

see erik's response on 'analysis request handler'










Re: No Analyzer, tokenizer or stemmer works at Solr

2010-01-07 Thread Erick Erickson

What is your use case for responding sometimes with the indexed value?
Other than reconstructing a field that hasn't been stored, I can't think of
one.

I still think you're missing the point. Indexing and storing are
orthogonal operations that have (almost) nothing to do with each
other, for all that they happen at the same time on the same field.

You never search against the stored data in a field. You *always*
search against the indexed data.

Contrariwise, you never display the indexed form to the user, you
*always* show the stored data (unless you come up with
a really interesting use case).

Step back and consider what happens when you index data:
it gets broken up all kinds of ways. Stop words are removed,
case may change, etc, etc, etc. It makes no sense to
then display this data for a user. Take, say, a movie title
"The Good, The Bad, and The Ugly". Remove stopwords and
punctuation, lowercase it, and you index three tokens:
"good", "bad", "ugly". Even if you reconstruct this field,
the user would see "good bad ugly". Bad, very bad.

Yet I want to display the original title to the user in
response to searching on "ugly", so I need the
original, unanalyzed data.
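
To make the split concrete, here is a toy model in Python. It is not how Lucene is implemented: `analyze`, `ToyIndex` and the stopword list are invented for illustration. It only shows that the analyzed tokens go into an inverted index used for matching, while the stored value is kept verbatim and is what comes back in results.

```python
STOPWORDS = {"the", "and", "a", "an", "of"}


def analyze(text):
    """Toy analyzer: strip punctuation, lowercase, drop stopwords."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text)
    return [tok for tok in cleaned.lower().split() if tok not in STOPWORDS]


class ToyIndex:
    def __init__(self):
        self.inverted = {}  # token -> set of doc ids (the "indexed" side)
        self.stored = {}    # doc id -> original text (the "stored" side)

    def add(self, doc_id, title):
        self.stored[doc_id] = title  # kept verbatim, never analyzed
        for token in analyze(title):
            self.inverted.setdefault(token, set()).add(doc_id)

    def search(self, query):
        hits = set()
        for token in analyze(query):
            hits |= self.inverted.get(token, set())
        # Results are built from the stored side, not from the tokens.
        return [self.stored[doc_id] for doc_id in sorted(hits)]


index = ToyIndex()
index.add(1, "The Good, The Bad, and The Ugly")
print(analyze("The Good, The Bad, and The Ugly"))  # ['good', 'bad', 'ugly']
print(index.search("ugly"))  # ['The Good, The Bad, and The Ugly']
```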

Perhaps it would help to think of it this way.
1) Take some data and index it in f1
   but do NOT store it in f1. Store it in f2
   but do NOT index it in f2.
2) Take that same data, index AND store
   it in f3.

1) is almost entirely equivalent to 2)
in terms of index resources.

Practically though, 1) is harder to use,
because you have to remember
to use f1 for searching and f2 for getting
the raw data.

HTH
Erick





Re: No Analyzer, tokenizer or stemmer works at Solr

2010-01-07 Thread MitchK

The difference between stored and indexed is clear now.

You are right, if you are responding only to normal users.

Use case:
You have a stored field "The good, the bad and the ugly".
And you have a really fantastic analyzer, which does some magic to this
movie title.
Let's say the analyzer translates the title into md5 or into another
abstract expression.
Instead of running the same magical function on the client's side again and
again, the client only needs to take the prepared data from your response.

Another use case could be:
Imagine you have two categories, cheap and expensive, and your document
has a title, a label, an owner and a price field.
Imagine you would analyze, index and store them like you normally do and
afterwards you want to set whether the document belongs to the expensive
item group or not.
If the price of the item is higher than $500, it belongs to the expensive
ones, otherwise not.
I think this would be a job for a special analyzer - and this only makes
sense if I also store the analyzed data.

I think information retrieval is a really interesting use case.




 
 

-- 
View this message in context: 
http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27065305.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: No Analyzer, tokenizer or stemmer works at Solr

2010-01-07 Thread Erick Erickson

Well, I'd approach either of these use cases
by simply performing my computations on
the input and storing the result in another
(non-indexed unless I wanted to search it)
field. This wouldn't happen in the Analyzer,
but in the code that populated the document
fields.

Which is a much cleaner solution IMO than creating
some sort of "index this but store that" capability.
The purpose of analysis is to produce *searchable*
tokens after all.
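
A minimal sketch of the approach described above, computing the derived values in the code that populates the document fields rather than in an Analyzer. The field names (`title_md5`, `cat`) and the helper `build_document` are hypothetical, and the 500 threshold is taken from the price example in this thread; how the resulting document is sent to Solr is left out.

```python
import hashlib

EXPENSIVE_THRESHOLD = 500  # assumed cutoff, from the "$500" example in the thread


def build_document(title, label, owner, price):
    """Return the fields to send to the indexer, including derived ones."""
    return {
        "title": title,
        "label": label,
        "owner": owner,
        "price": price,
        # Derived while building the document, not inside an Analyzer.
        "title_md5": hashlib.md5(title.encode("utf-8")).hexdigest(),
        "cat": "expensive" if price > EXPENSIVE_THRESHOLD else "cheap",
    }


doc = build_document("The Good, the Bad and the Ugly", "dvd", "mitch", 650)
print(doc["cat"], doc["title_md5"])
```

Whether the derived fields are indexed, stored, or both is then an ordinary schema decision for each field.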

But we're getting into angels dancing on pins here. Do
you actually have a use case you're trying to implement
or is this mostly theoretical?

Erick

On Thu, Jan 7, 2010 at 2:08 PM, MitchK mitc...@web.de wrote:


 The difference between stored and indexed is clear now.

 You are right, if you are responding only to normal users.

 Use case:
 You have a stored field "The good, the bad and the ugly".
 And you have a really fantastic analyzer, which does some magic to this
 movie title.
 Let's say the analyzer translates the title into an MD5 hash or into another
 abstract expression.
 Instead of running the same magical function on the client's side again and
 again, the client only needs to take the prepared data from your response.

 Another use case could be:
 Imagine you have two categories, cheap and expensive, and your document
 has a title, a label, an owner and a price field.
 Imagine you analyze, index and store them as you normally do, and
 afterwards you want to set whether the document belongs to the expensive
 item group or not.
 If the price of the item is higher than $500, it belongs to the expensive
 ones, otherwise not.
 I think this would be a job for a special analyzer - and this only makes
 sense if I also store the analyzed data.

 I think information retrieval is a really interesting use case.


 Erick Erickson wrote:
 
  What is your use case for responding sometimes with the indexed value?
  Other than reconstructing a field that hasn't been stored, I can't think
  of
  one.
 
  I still think you're missing the point. Indexing and storing are
  orthogonal operations that have (almost) nothing to do with each
  other, for all that they happen at the same time on the same field.
 
  You never search against the stored data in a field. You *always*
  search against the indexed data.
 
  Contrariwise, you never display the indexed form to the user, you
  *always* show the stored data (unless you come up with
  a really interesting use case).
 
  Step back and consider what happens when you index data,
  it gets broken up all kinds of ways. Stop words are removed,
  case may change, etc, etc, etc. It makes no sense to
  then display this data for a user. Take, say, a movie title
  "The Good, The Bad, and The Ugly". Remove stopwords and
  punctuation, lowercase it, and you index three tokens:
  "good", "bad", "ugly". Even if you reconstruct this field,
  the user would see "good bad ugly". Bad, very bad.
  
  Yet I want to display the original title to the user in
  response to searching on "ugly", so I need the
  original, unanalyzed data.
 
  Perhaps it would help to think of it this way.
  1) take some data and index it in f1
  but do NOT store it in f1. Store it in f2
  but do NOT index it in f2.
  2) take that same data, index AND store
  it in f3.
  
  (1) is almost entirely equivalent to (2)
  in terms of index resources.
  
  Practically though, (1) is harder to use,
  because you have to remember
  to use f1 for searching and f2 for getting
  the raw data.
 
  HTH
  Erick
 
  On Thu, Jan 7, 2010 at 12:11 PM, MitchK mitc...@web.de wrote:
 
 
  Thank you, Ryan. I will have a look at the Lucene material and Luke.
  
  I think I got it. :)
  
  Sometimes there is a need to return both the original value and
  the indexed version of the value.
  How can I fulfill such needs? By using copyField on indexed-only fields?
 
 
 
  ryantxu wrote:
  
  
   On Jan 7, 2010, at 10:50 AM, MitchK wrote:
  
  
   Eric,
  
   you mean, everything is okay, but I do not see it?
  
   Internally for searching the analysis takes place and writes to the
   index in an inverted fashion, but the stored stuff is left alone.
  
   if I use an analyzer, Solr stores its output two ways?
   One public output, which is similar to the original input
   and one hidden or internal output, which is based on the
   analyzer's work?
   Did I understand that right?
  
   yes.
  
   indexed fields and stored fields are different.
  
   Solr results show stored fields in the results (however facets are
   based on indexed fields)
  
   Take a look at Lucene in Action for a better description of what is
   happening.  The best tool to get your head around what is happening is
   probably luke (http://www.getopt.org/luke/)
  
  
  
   If yes, I have got another problem:
   I don't want to waste any diskspace.
  
   You have control over what is stored and what is indexed -- how that
   is configured is up to you.
  
   ryan
  
  
 

No Analyzer, tokenizer or stemmer works at Solr

2010-01-06 Thread MitchK

I have tested a lot and all the time I thought I set wrong options for my
custom analyzer.
Well, I have noticed that Solr isn't using ANY analyzer, filter or stemmer.
It seems like it only stores the original input.

I am using the example-configuration of the current Solr 1.4 release.
What's wrong?

Thank you!
-- 
View this message in context: 
http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27026959.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: No Analyzer, tokenizer or stemmer works at Solr

2010-01-06 Thread Erick Erickson
"Well, I have noticed that Solr isn't using ANY analyzer"

How do you know this? Because it's highly unlikely that SOLR
is completely broken on that level.

Erick

On Wed, Jan 6, 2010 at 3:48 PM, MitchK mitc...@web.de wrote:


 I have tested a lot and all the time I thought I set wrong options for my
 custom analyzer.
 Well, I have noticed that Solr isn't using ANY analyzer, filter or stemmer.
 It seems like it only stores the original input.

 I am using the example-configuration of the current Solr 1.4 release.
 What's wrong?

 Thank you!




Re: No Analyzer, tokenizer or stemmer works at Solr

2010-01-06 Thread Ryan McKinley


On Jan 6, 2010, at 3:48 PM, MitchK wrote:



I have tested a lot and all the time I thought I set wrong options for my
custom analyzer.
Well, I have noticed that Solr isn't using ANY analyzer, filter or stemmer.
It seems like it only stores the original input.


The stored value is always the original input.

The *indexed* values are transformed by analysis.
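
To make the distinction concrete, here is a toy sketch in plain Python. It is not Solr or Lucene code: the stopword list, the `analyze` function and the `ToyIndex` class are all made up for illustration. It only shows that the stored value stays untouched while the inverted index holds the analyzed tokens.

```python
STOPWORDS = {"the", "and", "a", "of"}  # tiny illustrative list, not Solr's stopword file


def analyze(text):
    """Toy analysis chain: strip punctuation, lowercase, drop stopwords."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text)
    return [tok for tok in cleaned.lower().split() if tok not in STOPWORDS]


class ToyIndex:
    def __init__(self):
        self.stored = {}    # doc id -> original input, untouched
        self.inverted = {}  # analyzed token -> set of doc ids

    def add(self, doc_id, text):
        self.stored[doc_id] = text
        for token in analyze(text):
            self.inverted.setdefault(token, set()).add(doc_id)

    def search(self, term):
        # The query goes through the same analysis as the indexed text.
        hits = set()
        for token in analyze(term):
            hits |= self.inverted.get(token, set())
        # Results are built from the stored values, never from the tokens.
        return [self.stored[doc_id] for doc_id in sorted(hits)]


index = ToyIndex()
index.add(1, "The Good, The Bad, and The Ugly")
print(index.search("ugly"))  # finds the document, shows the original title
print(index.search("the"))   # stopword was never indexed, so no hits
```

Searching for a stopword returns nothing even though the stored text still contains it, which matches what you are seeing.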

If you really need to store the analyzed fields, that may be possible
with an UpdateRequestProcessor. Also see:

https://issues.apache.org/jira/browse/SOLR-314

ryan


Re: No Analyzer, tokenizer or stemmer works at Solr

2010-01-06 Thread MitchK

Hello Erick,

thank you for answering.

Whatever I do, Solr seems to do nothing.
For example: if I use the predefined textgen field type, nothing
happens to the text. Even the StopFilter does not seem to work - no stopword
from stopword.txt was removed. I think that this only affects the index,
because if I query for "for", it returns nothing, which is quite correct,
due to the work of the StopFilter.

Everything works fine on analysis.jsp, but not in reality. 

If you have got any testcase-data you want me to add, please, tell me and I
will show you the saved data afterwards.  

Thank you.

Mitch


Erick Erickson wrote:
 
 Well, I have noticed that Solr isn't using ANY analyzer
 
 How do you know this? Because it's highly unlikely that SOLR
 is completely broken on that level.
 
 Erick
 
 On Wed, Jan 6, 2010 at 3:48 PM, MitchK mitc...@web.de wrote:
 

 I have tested a lot and all the time I thought I set wrong options for my
 custom analyzer.
 Well, I have noticed that Solr isn't using ANY analyzer, filter or
 stemmer.
 It seems like it only stores the original input.

 I am using the example-configuration of the current Solr 1.4 release.
 What's wrong?

 Thank you!


 
 

-- 
View this message in context: 
http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27055510.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: No Analyzer, tokenizer or stemmer works at Solr

2010-01-06 Thread MitchK

Hello Ryan,

thank you for answering.

In my schema.xml I am defining the field as indexed = true.
The problem is that nothing works; even the original predefined analyzers
don't seem to do anything.
Please have a look at my response to Erick.

Mitch

P.S.
Oh, I see what you mean. The field is indexed = true. My language was a
little bit tricky ;).


ryantxu wrote:
 
 
 On Jan 6, 2010, at 3:48 PM, MitchK wrote:
 

 I have tested a lot and all the time I thought I set wrong options  
 for my
 custom analyzer.
 Well, I have noticed that Solr isn't using ANY analyzer, filter or  
 stemmer.
 It seems like it only stores the original input.
 
 The stored value is always the original input.
 
 The *indexed* values are transformed by analysis.
 
 If you really need to store the analyzed fields, that may be possible  
 with an UpdateRequestProcessor.  also see:
 https://issues.apache.org/jira/browse/SOLR-314
 
 ryan
 
 

-- 
View this message in context: 
http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27055512.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: No Analyzer, tokenizer or stemmer works at Solr

2010-01-06 Thread Erik Hatcher

Mitch,

Again, I think you're misunderstanding what analysis does. We think you
are expecting, though you've not provided exact reproduction steps for us
to be sure, that the value you get back from Solr is the
analyzer-processed output. It's not; it's exactly what you provided.
Internally, analysis takes place for searching and the result is written
to the index in an inverted fashion, but the stored value is left alone.


There's some thinking going on about implementing it such that analyzed
output is stored.


You can, however, use the analysis request handler componentry to get  
analyzed stuff back as you see it in analysis.jsp on a per-document or  
per-field text basis - if you're looking to leverage the analyzer  
output in that fashion from a client.
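
As a rough sketch of what such a client call could look like, the snippet below only builds the request URL. It assumes a field analysis handler is registered at `/analysis/field` (as I believe it is in the Solr 1.4 example solrconfig.xml) and that it accepts `analysis.fieldtype` and `analysis.fieldvalue`; the base URL and the helper name `field_analysis_url` are made up, so please check the handler path and parameter names against your own configuration.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, field_type, value):
    """Build a URL asking Solr to analyze `value` with the given field type."""
    params = urlencode({
        # Parameter names are assumptions; verify them for your Solr version.
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": value,
    })
    return f"{base_url}/analysis/field?{params}"


url = field_analysis_url("http://localhost:8983/solr", "textgen", "The Good, The Bad")
print(url)
```

Fetching that URL should return the token stream per analysis stage, similar to what analysis.jsp displays.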


Erik

On Jan 7, 2010, at 1:21 AM, MitchK wrote:



Hello Erick,

thank you for answering.

I can do whatever I want - Solr does nothing.
For example: If I use the textgen-fieldtype which is predefined, nothing
happens to the text. Even the stopFilter is not working - no stopword from
stopword.txt was replaced. I think, that this only affects the index,
because, if I query for for he returns nothing, which is quietly correct,
due to the work of the stopFilter.

Everything works fine on analysis.jsp, but not in reality.

If you have got any testcase-data you want me to add, please, tell me and I
will show you the saved data afterwards.

Thank you.

Mitch


Erick Erickson wrote:


Well, I have noticed that Solr isn't using ANY analyzer

How do you know this? Because it's highly unlikely that SOLR
is completely broken on that level.

Erick

On Wed, Jan 6, 2010 at 3:48 PM, MitchK mitc...@web.de wrote:



I have tested a lot and all the time I thought I set wrong options for my
custom analyzer.
Well, I have noticed that Solr isn't using ANY analyzer, filter or
stemmer.
It seems like it only stores the original input.

I am using the example-configuration of the current Solr 1.4 release.
What's wrong?

Thank you!






