Regex replacement not working!

2011-06-29 Thread samuele.mattiuzzo
Hi, I have these lines in my schema.xml that should do a replacement,
but it doesn't work!

<fieldType name="salary_max_text" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="([0-9]+k?[.,]?[0-9]*).*?([0-9]+k?[.,]?[0-9]*)"
                replacement="$2"/>
  </analyzer>
</fieldType>


I need it to extract only the numbers from a string. The strings can be
anything: only letters (in which case the result should be an empty
string), or letters plus numbers. The numbers can be in one of these formats:

17000  --> ok
17,000 --> should be replaced with 17000
17.000 --> should be replaced with 17000
17k    --> should be replaced with 17000

How can I accomplish this?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Regex-replacement-not-working-tp3120748p3120748.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Regex replacement not working!

2011-06-29 Thread Ahmet Arslan


charFilter definitions should come before the tokenizer definition, i.e.:

<analyzer>
  <charFilter ... />
  <tokenizer ... />
  <filter ... />
</analyzer>


Re: Regex replacement not working!

2011-06-29 Thread samuele.mattiuzzo
<fieldType name="salary_min_text" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*"
                replacement="$1"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*"
                replacement="$1"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
</fieldType>

<fieldType name="salary_max_text" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*"
                replacement="$2"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*"
                replacement="$2"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
</fieldType>

This is the final version of that part of my schema, but what I get is this:

<doc>
  <float name="score">1.0</float>
  <str name="salary">Negotiable</str>
  <str name="salary_max">Negotiable</str>
  <str name="salary_min">Negotiable</str>
</doc>
<doc>
  <float name="score">1.0</float>
  <str name="salary">£7 to £8 per hour</str>
  <str name="salary_max">£7 to £8 per hour</str>
  <str name="salary_min">£7 to £8 per hour</str>
</doc>
<doc>
  <float name="score">1.0</float>
  <str name="salary">£125 to £150 per day</str>
  <str name="salary_max">£125 to £150 per day</str>
  <str name="salary_min">£125 to £150 per day</str>
</doc>

This is not what I'm expecting. The regular expression works in
http://www.fileformat.info/tool/regex.htm without any problem.



Re: Regex replacement not working!

2011-06-29 Thread Ahmet Arslan
I am not good with regular expressions, but the response always contains the
untouched, un-analyzed version of the fields. You can visually test your
fieldType/regex on the admin/analysis.jsp page. It shows the indexed terms step by step.
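
As an illustration of what the analysis page reports, here is a minimal Python sketch that applies the pattern from the schema above to some of the sample salary strings. It is only an approximation: Python's re module stands in for the Java regex engine Solr uses (an assumption, although the pattern only uses syntax both engines share), the helper name indexed_term is hypothetical, and the tokenizer and the LowerCase/Trim filters are ignored.

```python
import re

# Pattern copied from the salary_min_text / salary_max_text field types.
PATTERN = re.compile(r"[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*")


def indexed_term(stored_value, group):
    """Return roughly what the char filter would hand to the tokenizer.

    group=1 mimics replacement="$1", group=2 mimics replacement="$2".
    Text that does not match the pattern is returned unchanged.
    """
    return PATTERN.sub(lambda match: match.group(group), stored_value)


if __name__ == "__main__":
    samples = [
        "Negotiable",
        "£7 to £8 per hour",
        "£125 to £150 per day",
        "£22000 - £25000 per annum + benefits",
    ]
    for stored in samples:
        # The stored value is what a search response shows; the other two
        # columns are what would end up in the index for the min/max fields.
        print(repr(stored), "->", indexed_term(stored, 1), "/", indexed_term(stored, 2))
```

The left-hand column is what a search response keeps returning, which is why the results look untouched even when the regex itself does its job.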


Re: Regex replacement not working!

2011-06-29 Thread samuele.mattiuzzo
Index Analyzer

org.apache.solr.analysis.KeywordTokenizerFactory {luceneMatchVersion=LUCENE_31}
position     1
term text    £22000 - £25000 per annum + benefits
startOffset  0
endOffset    36

org.apache.solr.analysis.PatternReplaceFilterFactory {replacement=$2,
pattern=[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*,
luceneMatchVersion=LUCENE_31}
position     1
term text    25000
startOffset  0
endOffset    36


This is my output for the field salary_max. It seems to be working from the
admin JSP interface.



Re: Regex replacement not working!

2011-06-29 Thread Ahmet Arslan
That's good to know. If you explain your final goal in detail, users can give 
better pointers.


Re: Regex replacement not working!

2011-06-29 Thread samuele.mattiuzzo
I have the string "You may earn 25k dollars per week" stored in the field
salary.

I'm using two copyFields, salary_min and salary_max, with salary as their
source, and these field types:

salary is text
salary_min is salary_min_text
salary_max is salary_max_text

So I was expecting this:

Solr updates its index
Solr copies the value from salary to salary_min and applies the regex to
the value
Solr copies the value from salary to salary_max and applies the regex to
the value


But it's not working: the value is copied from one field to the other, but the
filter isn't applied, even though the filter itself works, as you could see.




Re: Regex replacement not working!

2011-06-29 Thread Ahmet Arslan
OK, that makes sense. copyField just copies the content; it has nothing to do
with analyzers. Two solutions come to mind.

1-) If you are using the DataImportHandler, I think (I am not good with regex)
you can use the RegexTransformer to populate these two fields.

http://wiki.apache.org/solr/DataImportHandler#RegexTransformer

2-) If not, you can populate these two fields in a custom
UpdateRequestProcessor. There is an example to modify and start from here:

http://wiki.apache.org/solr/UpdateRequestProcessor


Re: Regex replacement not working!

2011-06-29 Thread samuele.mattiuzzo
OK, but I'm not applying the filtering on the copyFields.
This is how my schema looks:

<field name="salary" type="text" indexed="true" stored="true" />
<field name="salary_min" type="salary_min_text" indexed="true" stored="true" />
<field name="salary_max" type="salary_max_text" indexed="true" stored="true" />

<copyField source="salary" dest="salary_min" />
<copyField source="salary" dest="salary_max" />

plus the two field types defined before. That's why I thought I could first use
copyField to copy the value and then have the copies indexed with the filtering
of my two field types...



Re: Regex replacement not working!

2011-06-29 Thread Juan Grande
Hi Samuele,

It's not clear to me whether your goal is to search on that field (for example,
salary_min:[100 TO 200]) or whether you want to show the transformed field to
the user (so you want the result of the regex replacement to be included in
the search results).

If your goal is to show the results to the user, then (as Ahmet said in a
previous mail) it won't work, because the content of the documents is stored
verbatim. The analysis only affects the way that documents are searched.

If your goal is to search, could you please show us the query that you're
using to test the use case?

Thanks!

*Juan*





Re: Regex replacement not working!

2011-06-29 Thread Michael Kuhlmann
On 29.06.2011 12:30, samuele.mattiuzzo wrote:
> <fieldType name="salary_min_text" class="solr.TextField">
>   <analyzer type="index">
> ...
>
> this is the final version of my schema part, but what i get is this:
>
> <doc>
>   <float name="score">1.0</float>
>   <str name="salary">Negotiable</str>
>   <str name="salary_max">Negotiable</str>
>   <str name="salary_min">Negotiable</str>
> </doc>
> ...


The mistake is that you assume that the filter is applied to the result.
This is not true. Index filters only affect the index (as the name
says), not the stored contents.

Therefore, if you have copyFields that are stored, they'll always return
the same value as the original field.

Try inspecting your index data with Luke or the admin console. Then
you'll see whether your regex applies.

Greetings,
Kuli
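
The split between stored content and indexed terms can be pictured with a toy model. The following Python sketch is not how Solr or Lucene are implemented; the ToyIndex class and the digits_only analyzer are hypothetical names invented for illustration. It only shows the principle: the analyzer decides which term a document can be found under, while the value handed back is always the stored original.

```python
import re


class ToyIndex:
    """Toy model: one analyzed term per document plus the stored original."""

    def __init__(self, analyzer):
        self.analyzer = analyzer
        self.stored = {}  # doc id -> original value, returned in results
        self.terms = {}   # analyzed term -> set of doc ids, used for matching

    def add(self, doc_id, value):
        self.stored[doc_id] = value                      # kept verbatim
        term = self.analyzer(value)                      # analysis affects the index only
        self.terms.setdefault(term, set()).add(doc_id)

    def search(self, term):
        doc_ids = sorted(self.terms.get(term, set()))
        return [self.stored[doc_id] for doc_id in doc_ids]


def digits_only(value):
    """Hypothetical analyzer: drop every non-digit character."""
    return re.sub(r"\D", "", value)


index = ToyIndex(digits_only)
index.add(1, "£25000 per annum")
index.add(2, "Negotiable")

# The document is found through its analyzed term, but the stored text comes back.
hits = index.search("25000")
print(hits)
```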


Re: Regex replacement not working!

2011-06-29 Thread samuele.mattiuzzo
My goal is/was storing the value in the field, and I understand that I have to
write my own update handler for that.

I tried a query with salary_min:[100 TO 200] and it's actually working.
Since I just need it for searching, I'll stay with this solution.

Is the [100 TO 200] range query a performance killer? I remember reading
something about that, but cannot find it again...



Re: Regex replacement not working!

2011-06-29 Thread Ahmet Arslan
Please be aware that the range query is working on strings, so it will return
unwanted results: string ordering and integer ordering are different.
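
A small Python sketch of why this matters. It only compares plain Python strings and integers; the function names are hypothetical and nothing here calls Solr. The assumption is that a range over a text field compares terms character by character, much like ordinary string comparison.

```python
def in_range_as_strings(value, low, high):
    """Lexicographic comparison, character by character."""
    return low <= value <= high


def in_range_as_ints(value, low, high):
    """Numeric comparison after converting the terms to integers."""
    return int(low) <= int(value) <= int(high)


candidates = ["15", "150", "1000", "25000", "99"]

# As strings, "15" and "1000" both sort between "100" and "200".
string_hits = [v for v in candidates if in_range_as_strings(v, "100", "200")]

# As integers, only 150 lies in the range.
int_hits = [v for v in candidates if in_range_as_ints(v, "100", "200")]

print(string_hits)
print(int_hits)
```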

If you are after range queries, you need to define the salary_min and
salary_max fields as trie-based types (tint, tdouble, etc.) and populate them
with the update processor or on the client side.
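
For the client-side route, a minimal Python sketch of the extraction could look like this. Everything in it is an assumption made for illustration: the function names are hypothetical, "." and "," are treated purely as thousands separators (so decimals such as 7.50 are not handled), a trailing "k" multiplies by 1000, and the smallest and largest numbers found are taken as minimum and maximum. Sending the resulting document to Solr is left out.

```python
import re

# One amount: digits with optional ",000" / ".000" groups and an optional "k" suffix.
NUMBER = re.compile(r"(\d+(?:[.,]\d{3})*)(k?)", re.IGNORECASE)


def parse_amounts(text):
    """Return every amount found in text as an integer."""
    amounts = []
    for digits, k_suffix in NUMBER.findall(text):
        value = int(re.sub(r"[.,]", "", digits))  # "17,000" and "17.000" -> 17000
        if k_suffix:
            value *= 1000                          # "17k" -> 17000
        amounts.append(value)
    return amounts


def salary_fields(salary):
    """Build the fields to index; min/max are omitted when no number is found."""
    doc = {"salary": salary}
    amounts = parse_amounts(salary)
    if amounts:
        doc["salary_min"] = min(amounts)
        doc["salary_max"] = max(amounts)
    return doc


if __name__ == "__main__":
    for text in ["Negotiable", "£7 to £8 per hour", "You may earn 25k dollars per week"]:
        print(salary_fields(text))
```

With integer values like these, salary_min and salary_max can be declared with a numeric type, and a range such as [100 TO 200] then compares numbers rather than strings.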


Re: Regex replacement not working!

2011-06-29 Thread samuele.mattiuzzo
OK, last question on the UpdateProcessor: can you please give me the steps to
implement my own?
I mean, I can push my custom processor into Solr's code, and then what?
I don't understand how I have to change solrconfig.xml and how I can bind
that to the updater I just wrote,
and I also don't understand how I have to change schema.xml.

I'm sorry for this question, but I started working on Solr 5 days ago and
for some things I really need a lot of documentation, and this isn't fully
covered anywhere.



Re: Regex replacement not working!

2011-06-29 Thread Ahmet Arslan
Implementing a conditional copyField example is a good place to start. You can
use it as a template.

You don't need to modify the Solr source code for this. You can write your
class, compile it, and put the resulting jar into the solrHome/lib directory.
How to register your new update processor in solrconfig.xml is explained here:

http://wiki.apache.org/solr/SolrPlugins#UpdateRequestProcessorFactory  


Re: Regex replacement not working!

2011-06-29 Thread Adam Estrada
I have had the same problems with regex and I went with the regular pattern
replace filter rather than the charfilter. When I added it to the very end
of the chain, only then would it work...I am on Solr 3.2. I have also
noticed that the HTML filter factory is not working either. When I dump the
field that it's supposed to be working on, all the hyperlinks and everything
that you would expect to be stripped are still present.

Adam



Re: Regex replacement not working!

2011-06-29 Thread samuele.mattiuzzo
Too bad it is still a TODO; that's why I was asking for some tips on
writing, compiling, registering, calling...




Re: Regex replacement not working!

2011-06-29 Thread Ahmet Arslan

Here is general information about how to customize solr via plugins.
http://wiki.apache.org/solr/SolrPlugins

Here is the registration and code example.
http://wiki.apache.org/solr/UpdateRequestProcessor