Hello guys,

I think I've found how to do this just by adding a filter. For anyone who is curious:
<fieldType name="emails" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.TypeTokenFilterFactory" types="email_type.txt" useWhitelist="true"/>
  </analyzer>
</fieldType>

Anyway, I still need to run a query like the following to retrieve the documents with at least one E-mail detected:

http://localhost:8080/mysolr/select?q=emails:[* TO *]&start=0&rows=10&sort=mydate desc

And I don't like it, to be honest.

Regards,

2013/7/30 Luis Cappa Banda <luisca...@gmail.com>

> Hello, Jack, Steve,
>
> Thank you for your answers. I've never used UAX29URLEmailTokenizerFactory,
> but I read about it before trying RegExp queries. As far as I know,
> UAX29URLEmailTokenizerFactory tokenizes an input text value into tokens
> that match URLs, E-mails, etc. Reading the documentation, I haven't found
> any way to select just E-mail patterns and not URL ones, for example. I
> feel it would make sense to specify one or more patterns in a
> configuration file that is set in the Tokenizer definition in schema.xml,
> but I found nothing.
>
> I just want to retrieve the indexed documents in which at least one
> E-mail appears inside the text. However, even using
> UAX29URLEmailTokenizerFactory, and supposing that I store that E-mail
> data in a field called 'emails' (I feel creative, hehe), a query like
> the following appears to be... dirty:
>
> http://localhost:8080/mysolr/select?q=emails:[* TO *]&start=0&rows=10&sort=mydate desc
>
> What do you think?
>
> And Andy... I know many RegExps to find E-mail patterns in a text - that
> wasn't my question, and of course there is no perfect one. However, Lucene
> RegExp syntax is different from the classic RegExp one, so it is not as
> easy as copying and pasting any RegExp and, voilà! E-mails everywhere.
>
> Thank you very much in advance,
>
> Best regards,
>
>
> 2013/7/30 Jack Krupansky <j...@basetechnology.com>
>
>> Just use the UAX29URLEmailTokenizerFactory, which recognizes email
>> addresses.
>>
>> Any particular reason that you're trying to reinvent the wheel?
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Luis Cappa Banda
>> Sent: Tuesday, July 30, 2013 10:53 AM
>> To: solr-user@lucene.apache.org
>> Subject: Email regular expression.
>>
>>
>> Hello everyone!
>>
>> Unfortunately I have to search all E-mail addresses found in a text field
>> from each document. I've been reading for a while how to use RegExp's in
>> Solr, but after trying some of them they didn't work. I've noticed that
>> Lucene RegExp syntax sometimes is very different from the classic RegExp
>> syntax, so that may be the reason why they didn't work for me, and maybe
>> someone more expert can help me.
>>
>> The syntax is the following:
>>
>> *E-mail:*
>>
>> text:/[a-z0-9_\|-]+(\.[a-z0-9_**\|-]|)*@[a-z0-9-]|(\.[a-z0-9-]**|)*\.([a-z]{2,4})/
>>
>> Thank you very much in advance!
>>
>> Best regards,
>>
>> --
>> - Luis Cappa
>
> --
> - Luis Cappa

--
- Luis Cappa
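
The range query quoted in this thread contains spaces, brackets, and asterisks, so it has to be URL-encoded when it is sent from code rather than typed into a browser. Below is a minimal Python sketch of one way to build it. It assumes the host, path (`/mysolr/select`), and field names (`emails`, `mydate`) from the example URL; the function name `build_has_email_query` is hypothetical and only for illustration.

```python
from urllib.parse import urlencode

# Base URL taken from the example in this thread; adjust host and path as needed.
SOLR_SELECT_URL = "http://localhost:8080/mysolr/select"


def build_has_email_query(field="emails", sort_field="mydate", start=0, rows=10):
    """Build a select URL matching documents with at least one value in `field`.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    params = {
        "q": f"{field}:[* TO *]",       # open range query: any indexed value in the field
        "start": start,
        "rows": rows,
        "sort": f"{sort_field} desc",
    }
    # urlencode percent-encodes the colon, brackets, and asterisks,
    # and turns spaces into '+'.
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = build_has_email_query()
print(url)
```

The sketch only builds the URL and does not contact Solr. Separately, the whitelist filter in the field type keeps only the token types listed in `email_type.txt`. The thread does not show that file; presumably it contains the tokenizer's e-mail token type, `<EMAIL>`, on a line of its own.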