Regex replacement not working!
Hi,

I have this bunch of lines in my schema.xml that should do a replacement, but it doesn't work!

  <fieldType name="salary_max_text" class="solr.TextField" omitNorms="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="([0-9]+k?[.,]?[0-9]*).*?([0-9]+k?[.,]?[0-9]*)"
                  replacement="$2"/>
    </analyzer>
  </fieldType>

I need it to extract only the numbers from some other string. The strings can be anything: only letters (in which case the result should be an empty string), or letters + numbers. The numbers can be in one of these formats:

  17000  -- ok
  17,000 -- should be replaced with 17000
  17.000 -- should be replaced with 17000
  17k    -- should be replaced with 17000

How can I accomplish this?

--
View this message in context: http://lucene.472066.n3.nabble.com/Regex-replacement-not-working-tp3120748p3120748.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Regex replacement not working!
> Hi, i have this bunch of lines in my schema.xml that should do a replacement but it doesn't work!
>
>   <fieldType name="salary_max_text" class="solr.TextField" omitNorms="true">
>     <analyzer type="index">
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <charFilter class="solr.PatternReplaceCharFilterFactory"
>                   pattern="([0-9]+k?[.,]?[0-9]*).*?([0-9]+k?[.,]?[0-9]*)"
>                   replacement="$2"/>
>     </analyzer>
>   </fieldType>

charFilter definitions should be above the tokenizer definition, i.e.:

  <analyzer>
    <charFilter/>
    <tokenizer/>
    <filter/>
  </analyzer>
Re: Regex replacement not working!
  <fieldType name="salary_min_text" class="solr.TextField">
    <analyzer type="index">
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*"
                  replacement="$1"/>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.TrimFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*"
                  replacement="$1"/>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.TrimFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="salary_max_text" class="solr.TextField">
    <analyzer type="index">
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*"
                  replacement="$2"/>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.TrimFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*"
                  replacement="$2"/>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.TrimFilterFactory"/>
    </analyzer>
  </fieldType>

This is the final version of my schema part, but what I get is this:

  <doc>
    <float name="score">1.0</float>
    <str name="salary">Negotiable</str>
    <str name="salary_max">Negotiable</str>
    <str name="salary_min">Negotiable</str>
  </doc>
  <doc>
    <float name="score">1.0</float>
    <str name="salary">£7 to £8 per hour</str>
    <str name="salary_max">£7 to £8 per hour</str>
    <str name="salary_min">£7 to £8 per hour</str>
  </doc>
  <doc>
    <float name="score">1.0</float>
    <str name="salary">£125 to £150 per day</str>
    <str name="salary_max">£125 to £150 per day</str>
    <str name="salary_min">£125 to £150 per day</str>
  </doc>

which is not what I'm expecting...

The regular expression works in http://www.fileformat.info/tool/regex.htm without any problem.
Re: Regex replacement not working!
> this is the final version of my schema part, but what i get is this:
> ...
> which is not what i'm expecting...
> the regular expression works in http://www.fileformat.info/tool/regex.htm without any problem

I am not good with regular expressions, but the response always contains the untouched, un-analyzed version of the fields. You can visually test your fieldType/regex on the admin/analysis.jsp page. It shows the indexed terms step by step.
Re: Regex replacement not working!
Index Analyzer

org.apache.solr.analysis.KeywordTokenizerFactory {luceneMatchVersion=LUCENE_31}
  position     1
  term text    £22000 - £25000 per annum + benefits
  startOffset  0
  endOffset    36

org.apache.solr.analysis.PatternReplaceFilterFactory {replacement=$2, pattern=[^\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*, luceneMatchVersion=LUCENE_31}
  position     1
  term text    25000
  startOffset  0
  endOffset    36

This is my output for the field salary_max; it seems to be working from the admin JSP interface.
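The analysis output above can be reproduced outside Solr. The following is only a sketch with plain java.util.regex: the class name SalaryRegexDemo and the helper replace are made up for illustration, and it assumes the filter replaces every match (as Matcher.replaceAll does) rather than only the first one.

```java
import java.util.regex.Pattern;

// Replays the thread's pattern with plain java.util.regex (no Solr involved).
// SalaryRegexDemo and replace() are hypothetical names used for illustration.
final class SalaryRegexDemo {
    // Pattern copied from the schema posted in this thread.
    static final Pattern SALARY =
        Pattern.compile("[^\\d]?([0-9]+[k,.]?[0-9]*)+.*?([0-9]+[k,.]?[0-9]*)+.*");

    // Replaces every match with the given group reference, e.g. "$1" or "$2".
    static String replace(String input, String groupRef) {
        return SALARY.matcher(input).replaceAll(groupRef);
    }

    public static void main(String[] args) {
        // \u00a3 is the pound sign, escaped to avoid source-encoding issues.
        String range = "\u00a322000 - \u00a325000 per annum + benefits";
        System.out.println(replace(range, "$1"));        // prints 22000
        System.out.println(replace(range, "$2"));        // prints 25000
        System.out.println(replace("Negotiable", "$2")); // prints Negotiable (no match)
    }
}
```

Note that a string without digits never matches, so it comes back unchanged instead of being turned into an empty string as the first message asked for.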
Re: Regex replacement not working!
> this is my output for the field salary_max, it seems to be working from the admin jsp interface

That's good to know. If you explain your final goal in detail, users can give better pointers.
Re: Regex replacement not working!
I have the string "You may earn 25k dollars per week" stored in the field salary. I'm using 2 copyFields, salary_min and salary_max, with salary as their source, and these datatypes:

  salary     is text
  salary_min is salary_min_text
  salary_max is salary_max_text

So I was expecting this:

  1. Solr updates its index
  2. Solr copies the value from salary to salary_min and applies the regex to it
  3. Solr copies the value from salary to salary_max and applies the regex to it

But it's not working: it copies the value from one field to the other, but the filter isn't applied, even though the regex itself works, as you could see.
Re: Regex replacement not working!
> but it's not working, it copies the value from one field to another, but the filter isn't applied, even if it's working as you could see

Okay, that makes sense. copyField just copies the content as-is. It has nothing to do with analyzers. Two solutions come to mind:

1-) If you are using the data import handler, I think (I am not good with regex) you can use the regex transformer to populate these two fields. http://wiki.apache.org/solr/DataImportHandler#RegexTransformer

2-) If not, you can populate these two fields in a custom UpdateRequestProcessor. There is an example to modify and start from here: http://wiki.apache.org/solr/UpdateRequestProcessor
Re: Regex replacement not working!
OK, but I'm not applying the filtering on the copyFields. This is how my schema looks:

  <field name="salary" type="text" indexed="true" stored="true"/>
  <field name="salary_min" type="salary_min_text" indexed="true" stored="true"/>
  <field name="salary_max" type="salary_max_text" indexed="true" stored="true"/>

  <copyField source="salary" dest="salary_min"/>
  <copyField source="salary" dest="salary_max"/>

plus the two datatypes defined before. That's why I thought I could first use copyField to copy the value and then index the copies with the filtering from my two datatypes...
Re: Regex replacement not working!
Hi Samuele,

It's not clear to me whether your goal is to search on that field (for example, salary_min:[100 TO 200]) or to show the transformed field to the user (so you want the result of the regex replacement to be included in the search results).

If your goal is to show the results to the user, then (as Ahmet said in a previous mail) it won't work, because the content of the documents is stored verbatim. The analysis only affects the way that documents are searched.

If your goal is to search, could you please show us the query that you're using to test the use case?

Thanks!

*Juan*

On Wed, Jun 29, 2011 at 10:02 AM, samuele.mattiuzzo samum...@gmail.com wrote:

> ok, but i'm not applying the filtering on the copyfields. this is how my schema looks:
> ...
> that's why i thought i could first use copyField to copy the value then index them with my two datatypes filtering...
Re: Regex replacement not working!
On 29.06.2011 12:30, samuele.mattiuzzo wrote:

> <fieldType name="salary_min_text" class="solr.TextField"> <analyzer type="index"> ...
> this is the final version of my schema part, but what i get is this:
>
>   <doc>
>     <float name="score">1.0</float>
>     <str name="salary">Negotiable</str>
>     <str name="salary_max">Negotiable</str>
>     <str name="salary_min">Negotiable</str>
>   </doc>
> ...

The mistake is that you assume that the filter is applied to the result. This is not true. Index filters only affect the index (as the name says), not the stored contents. Therefore, if you have copyFields that are stored, they'll always return the same value as the original field.

Try inspecting your index data with Luke or the admin console. Then you'll see whether your regex applies.

Greetings,
Kuli
Re: Regex replacement not working!
My goal is/was storing the value into the field, and I get that I have to create my own update handler for that.

I was trying a query with salary_min:[100 TO 200] and it's actually working... Since I just need it for searching, I'll stay with this solution.

Is the [100 TO 200] a performance killer? I remember reading something about it, but cannot find it again...
Re: Regex replacement not working!
> i was trying to use query with salary_min:[100 TO 200] and it's actually working... since i just need it to search, i'll stay with this solution
> is the [100 TO 200] a performance killer? i remember reading something around, but cannot find it again...

Please be aware that the range query is working on strings. It will return unwanted results, because string sorting and integer sorting are different. If you are after range queries, you need to define the salary_min and salary_max fields as trie-based types (tint, tdouble, etc.) and populate them with the update processor or on the client side.
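To see why a range over string values returns unwanted results, here is a small sketch in plain Java. It only approximates what a range over string terms does by comparing strings lexicographically; the class and method names are invented for this example.

```java
// Contrasts string (lexicographic) and numeric range checks.
// RangeOrderDemo and its methods are hypothetical names for illustration.
final class RangeOrderDemo {
    // Inclusive range test on strings, compared character by character.
    static boolean inStringRange(String value, String low, String high) {
        return value.compareTo(low) >= 0 && value.compareTo(high) <= 0;
    }

    // Inclusive range test on integers.
    static boolean inIntRange(int value, int low, int high) {
        return value >= low && value <= high;
    }

    public static void main(String[] args) {
        for (String v : new String[] {"150", "15", "1000", "25000"}) {
            System.out.println(v
                + " string=" + inStringRange(v, "100", "200")
                + " int=" + inIntRange(Integer.parseInt(v), 100, 200));
        }
    }
}
```

With the string comparison, "15" and "1000" both fall inside [100 TO 200], while the integer comparison rejects them. A numeric field type avoids this mismatch.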
Re: Regex replacement not working!
OK, last question on the UpdateProcessor: can you please give me the steps to implement my own? I mean, I can push my custom processor into Solr's code, and then what? I don't understand how I have to change solrconfig.xml, or how I can bind that to the updater I just wrote. I also don't understand how I have to change schema.xml.

I'm sorry for this question, but I started working on Solr 5 days ago and for some things I really need a lot of documentation, and this isn't fully covered anywhere.
Re: Regex replacement not working!
> ok, last question on the UpdateProcessor: can you please give me the steps to implement my own? i mean, i can push my custom processor in solr's code, and then what?
> ...

Implementing a conditional copyField example is a good place to start. You can use it as a template. You don't need to modify the Solr source code for this. You can write your class, compile it, and put the resulting jar into the solrHome/lib directory. How to register your new update processor in solrconfig.xml is explained here: http://wiki.apache.org/solr/SolrPlugins#UpdateRequestProcessorFactory
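The thread never shows what such a processor would actually compute. As a rough sketch only, the extraction step that a custom update processor (or client-side code) might call could look like the class below. SalaryExtractor, its method and its pattern are all hypothetical, and the Solr wiring is left out on purpose because it depends on the Solr version. It assumes "." and "," only ever act as thousands separators, as in the formats listed in the first message, so decimal amounts such as 7.50 are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls salary figures out of free text.
final class SalaryExtractor {
    // Digits with optional "," or "." thousands groups, optionally followed by k.
    private static final Pattern NUMBER =
        Pattern.compile("([0-9]+(?:[.,][0-9]{3})*)(k?)", Pattern.CASE_INSENSITIVE);

    // Returns every figure found, in order of appearance; empty if there is none.
    static List<Integer> extract(String text) {
        List<Integer> values = new ArrayList<>();
        Matcher m = NUMBER.matcher(text);
        while (m.find()) {
            // Drop the separators: "17,000" and "17.000" both become "17000".
            int value = Integer.parseInt(m.group(1).replaceAll("[.,]", ""));
            if (!m.group(2).isEmpty()) {
                value *= 1000; // "17k" -> 17000
            }
            values.add(value);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(extract("You may earn 25k dollars per week"));
        System.out.println(extract("\u00a322000 - \u00a325000 per annum + benefits"));
        System.out.println(extract("Negotiable"));
    }
}
```

The first and last elements of the returned list could then be written to numeric salary_min and salary_max fields, and an empty list would mean leaving both fields unset. Very large figures would overflow int, so a real implementation would need bounds checks.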
Re: Regex replacement not working!
I have had the same problems with regex, and I went with the regular pattern replace filter rather than the charFilter. Only when I added it to the very end of the chain would it work... I am on Solr 3.2.

I have also noticed that the HTML filter factory is not working either. When I dump the field that it's supposed to be working on, all the hyperlinks and everything else that you would expect to be stripped are still present.

Adam

On Wed, Jun 29, 2011 at 10:04 AM, samuele.mattiuzzo samum...@gmail.com wrote:

> ok, last question on the UpdateProcessor: can you please give me the steps to implement my own?
> ...
Re: Regex replacement not working!
Too bad it is still in TODO; that's why I was asking for some tips on writing, compiling, registration, calling...
Re: Regex replacement not working!
> too bad it is still in todo, that's why i was asking for some tips on writing, compiling, registration, calling...

Here is general information about how to customize Solr via plugins: http://wiki.apache.org/solr/SolrPlugins

Here is the registration and code example: http://wiki.apache.org/solr/UpdateRequestProcessor