Custom fieldtype with sharding?
Hi all,

I'm having an issue using a custom fieldtype with distributed search. What I'm looking for may well be achievable in a different way, but this is my first stab at it.

I'm looking to store XML in a field. What I've done, which works fine, is to:
- on ingest, wrap the XML in a CDATA section
- write a simple class that extends org.apache.solr.schema.TextField, which writes an XML node much the way a TextField would, but without escaping the contents

It looks like this:

    public class XMLField extends TextField {
        @Override
        public void write(TextResponseWriter xmlWriter, String name, Fieldable f) throws java.io.IOException {
            Writer writer = xmlWriter.getWriter();
            writer.write("');
            writer.write(f.stringValue(), 0, f.stringValue() == null ? 0 : f.stringValue().length());
            writer.write("");
        }
    }

Like I said, simple. Not especially pretty, but it does the job. It works fine for normal searching; I get back a response like:

When I try to use this with distributed searching, though, it comes back written as a normal textfield, like:

It looks like it doesn't know anything about my custom fieldtype at all, and is defaulting to writing it as a StrField or TextField instead.

So, my questions:
- Is there a better way to do this? I'd be fine if it came back with a 'str' element name, as long as it's not escaped.
- Is there perhaps a different class I should extend to do this with sharded searching?
- Should I just bite the bullet and manually unescape the XML after receiving the response? I'd really prefer not to do this if I can get around it.

Thanks in advance for any help.

Peter
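
On the third option: if the client reads the response with an XML parser, the parser already undoes the escaping of the field's text, so "manually unescaping" mostly amounts to parsing the field's string value a second time. A minimal sketch, assuming the field comes back as ordinary escaped text; StoredXmlField and parse are hypothetical names for illustration, not part of Solr:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical client-side helper; not part of Solr. */
final class StoredXmlField {
    /**
     * Parses the string value of a stored field as XML.
     * The value is assumed to be the text an XML parser hands back for the
     * field, i.e. with &lt; and &amp; already turned back into < and &.
     */
    static Document parse(String fieldValue) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(fieldValue)));
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers need no throws clause.
            throw new IllegalArgumentException("Field value is not well-formed XML", e);
        }
    }
}
```

This keeps the schema on a stock field type, at the cost of a second parse per document on the client.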
Re: facet.offset with facet.sort=lex and shards problem?
On 02/24/2011 02:58 PM, Peter Cline wrote:
> On 02/24/2011 12:37 PM, Yonik Seeley wrote:
>> On Thu, Feb 24, 2011 at 10:57 AM, Peter Cline wrote:
>>> Hi all,
>>>
>>> I'm having a problem using distributed search in conjunction with the facet.offset parameter and lexical facet value sorting. Is there an incompatibility between these?
>>>
>>> I'm using Solr 1.4.1. I have a facet with ~100k values in one index, and I want to page through them alphabetically. Without distributed search, everything works just fine, and very quickly. A query like this works, returning 10 facet values starting at the 50,001st:
>>>
>>> http://server:port/solr/select/?q=*:*&facet.field=subject_full_facet&facet=true&f.subject_full_facet.facet.limit=10&facet.sort=lex&facet.offset=5
>>> # Butterflies - Indiana !
>>>
>>> However, if I enable distributed search, using a single shard (which is the same index), I get no facet values returned.
>>>
>>> http://server:port/solr/select/?q=*:*&facet.field=subject_full_facet&facet=true&f.subject_full_facet.facet.limit=10&facet.sort=lex&facet.offset=5&shards=server:port/solr
>>> # empty list :(
>>>
>>> Doing a little more testing, I'm finding that with sharding I often get an empty list any time facet.offset >= facet.limit. Also, for example, if I do facet.limit=100 and facet.offset=90, I get 10 facet values. Doing the same without sharding, I get the expected (by me, at least) 100 values (starting at what would normally be the 91st).
>>>
>>> Can anybody shed any light on this for me?
>>
>> Sounds like a bug. Have you tried a 3x or trunk development build to see if it's fixed there?
>>
>> -Yonik
>> http://lucidimagination.com
>
> I haven't. I'll try the current trunk and get back to you.
>
> Thanks,
> Peter

I tried today's builds for the 3.x branch and the trunk. The problem persists in both.

Peter
Re: facet.offset with facet.sort=lex and shards problem?
On 02/24/2011 12:37 PM, Yonik Seeley wrote:
> On Thu, Feb 24, 2011 at 10:57 AM, Peter Cline wrote:
>> Hi all,
>>
>> I'm having a problem using distributed search in conjunction with the facet.offset parameter and lexical facet value sorting. Is there an incompatibility between these?
>>
>> I'm using Solr 1.4.1. I have a facet with ~100k values in one index, and I want to page through them alphabetically. Without distributed search, everything works just fine, and very quickly. A query like this works, returning 10 facet values starting at the 50,001st:
>>
>> http://server:port/solr/select/?q=*:*&facet.field=subject_full_facet&facet=true&f.subject_full_facet.facet.limit=10&facet.sort=lex&facet.offset=5
>> # Butterflies - Indiana !
>>
>> However, if I enable distributed search, using a single shard (which is the same index), I get no facet values returned.
>>
>> http://server:port/solr/select/?q=*:*&facet.field=subject_full_facet&facet=true&f.subject_full_facet.facet.limit=10&facet.sort=lex&facet.offset=5&shards=server:port/solr
>> # empty list :(
>>
>> Doing a little more testing, I'm finding that with sharding I often get an empty list any time facet.offset >= facet.limit. Also, for example, if I do facet.limit=100 and facet.offset=90, I get 10 facet values. Doing the same without sharding, I get the expected (by me, at least) 100 values (starting at what would normally be the 91st).
>>
>> Can anybody shed any light on this for me?
>
> Sounds like a bug. Have you tried a 3x or trunk development build to see if it's fixed there?
>
> -Yonik
> http://lucidimagination.com

I haven't. I'll try the current trunk and get back to you.

Thanks,
Peter
facet.offset with facet.sort=lex and shards problem?
Hi all,

I'm having a problem using distributed search in conjunction with the facet.offset parameter and lexical facet value sorting. Is there an incompatibility between these?

I'm using Solr 1.4.1. I have a facet with ~100k values in one index, and I want to page through them alphabetically. Without distributed search, everything works just fine, and very quickly. A query like this works, returning 10 facet values starting at the 50,001st:

http://server:port/solr/select/?q=*:*&facet.field=subject_full_facet&facet=true&f.subject_full_facet.facet.limit=10&facet.sort=lex&facet.offset=5
# Butterflies - Indiana !

However, if I enable distributed search, using a single shard (which is the same index), I get no facet values returned.

http://server:port/solr/select/?q=*:*&facet.field=subject_full_facet&facet=true&f.subject_full_facet.facet.limit=10&facet.sort=lex&facet.offset=5&shards=server:port/solr
# empty list :(

Doing a little more testing, I'm finding that with sharding I often get an empty list any time facet.offset >= facet.limit. Also, for example, if I do facet.limit=100 and facet.offset=90, I get 10 facet values. Doing the same without sharding, I get the expected (by me, at least) 100 values (starting at what would normally be the 91st).

Can anybody shed any light on this for me?

Thanks,
Peter
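
For reference, the behaviour expected above can be written down as a small merge routine: each shard would be asked for its first offset + limit values in lexical order with no offset of its own, and the offset would be applied exactly once, after merging. This is only a sketch of that expectation, not Solr's actual implementation; LexFacetPager and its methods are hypothetical names, and Java's String ordering stands in for the index's term order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of paging lexically sorted facets across shards. */
final class LexFacetPager {
    /** How many values each shard must return so the merged page is complete. */
    static int shardLimit(int offset, int limit) {
        return offset + limit;
    }

    /**
     * Merges per-shard facet counts, sorts the values lexically, and applies
     * the offset once, to the merged list.
     */
    static List<Map.Entry<String, Long>> mergeAndPage(
            List<Map<String, Long>> shardResults, int offset, int limit) {
        // TreeMap keeps keys in String order and lets us sum counts per value.
        TreeMap<String, Long> merged = new TreeMap<>();
        for (Map<String, Long> shard : shardResults) {
            for (Map.Entry<String, Long> e : shard.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        List<Map.Entry<String, Long>> all = new ArrayList<>(merged.entrySet());
        int from = Math.min(offset, all.size());
        int to = Math.min(from + limit, all.size());
        return new ArrayList<>(all.subList(from, to));
    }
}
```

Under this model, facet.limit=100 with facet.offset=90 returns 100 values whenever at least 190 exist. Getting only 10 back would be consistent with the offset being applied twice somewhere along the way, though that is a guess rather than a diagnosis.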
Re: Question about facet.prefix usage
Hi Simon,

I came across your post to the solr users list about using facet prefixes, shown below. I was wondering whether you are still using your modified version of SimpleFacets.java and, if so, whether you could send me a copy. I'll need to implement something similar, and it never hurts to start from existing material.

Thanks,
Peter

Simon Hu wrote:
> I also need the exact same feature. I was not able to find an easy solution and ended up modifying class SimpleFacets to make it accept an array of facet prefixes per field. If you are interested, I can email you the modified SimpleFacets.java.
>
> -Simon
>
> steve berry-2 wrote:
>> Question: Is it possible to pass complex queries to facet.prefix? Example: instead of facet.prefix:foo I want facet.prefix:foo OR facet.prefix:bar
>>
>> My application is for browsing business records that fall into categories. The user is only allowed to see businesses falling into categories which they have access to.
>>
>> I have a series of documents dumped into the following basic structure, which I was hoping would help me deal with this:
>>
>> 123 Business Corp. 28255-0001 .
>> charlotte_2006 Banks
>> charlotte_2007 Banks
>> sanfrancisco_2006 Banks
>> sanfrancisco_2007 Banks
>> ... (lots more market_category entries) ...
>>
>> 124 Factory Corp. 28205-0001 .
>> charlotte_2006 Banks
>> charlotte_2007 Banks
>> austin_2006 Banks
>> austin_2007 Banks
>> ... (lots more market_category entries) ...
>> .
>>
>> The multivalued market_category fields are flattened relational data attributed to that business, and I want to use those values for faceted navigation /but/ I want the facets to be restricted depending on what products the user has access to. For example, a user may have access to sanfrancisco_2007 and sanfrancisco_2006 data but nothing else.
>>
>> So I've created a request using facet.prefix that looks something like this:
>>
>> http://SOLRSERVER:8080/solr/select?q.op=AND&q=docType:gen&facet.field=market_category&facet.prefix=charlotte_2007
>>
>> This ends up producing perfectly suitable facet results that look like this:
>>
>> ..
>> 1 1 1 1 1 1 0 .
>>
>> Bingo! facet.prefix does exactly what I want it to.
>>
>> Now I want to go a step further and pass a compound statement to facet.prefix, along the lines of "facet.prefix:charlotte_2007 OR sanfrancisco_2007" or "facet.prefix:charlotte_2007 OR charlotte_2006", to return more complex facet sets. As far as I can tell from the docs, this won't work.
>>
>> Is this possible using the existing facet.prefix functionality? Anyone have a better idea of how I should accomplish this?
>>
>> Thanks,
>> steve berry
>> American City Business Journals
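
One workaround that needs no server-side change is to fetch the facet counts (either unrestricted, or once per allowed prefix) and keep only the values the user may see on the client. The sketch below illustrates that idea only; it is not Simon's modified SimpleFacets, and PrefixFacetFilter and keepAllowed are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical client-side filter for facet values; not part of Solr. */
final class PrefixFacetFilter {
    /**
     * Returns the facet values (with their counts) that start with at least
     * one of the allowed prefixes, preserving the input order.
     */
    static Map<String, Long> keepAllowed(Map<String, Long> counts, List<String> allowedPrefixes) {
        Map<String, Long> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            for (String prefix : allowedPrefixes) {
                if (e.getKey().startsWith(prefix)) {
                    kept.put(e.getKey(), e.getValue());
                    break; // one matching prefix is enough
                }
            }
        }
        return kept;
    }
}
```

Filtering an unrestricted facet list this way transfers every value for the field, so with many market_category entries one request per prefix, merged on the client, may be the cheaper variant.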
uriEncoding for solr in glassfish
Hi all,

This is a little off-topic, so I apologize. I asked a question not too long ago about URI encoding problems and got a quick and accurate response, so I thought I would try again.

I need to pass UTF-8 encoded characters to Solr instances, so I need the URI decoding to be done in UTF-8. In Tomcat, this was accomplished by setting an attribute of the Connector (thanks Nicolas and Yonik). We're considering moving from Tomcat to Glassfish (for various reasons), so I'm trying to get this working there as well. I found a very similar setting, the uriEncoding property in the http-listener, but it doesn't seem to have any effect--Solr is getting garbled strings.

So, in effect, my question is this: has anybody used Solr in Glassfish and had to address this problem? Seems unlikely, but it's worth a shot.

Thanks,
Peter
Re: Accented search
I'm not sure about a way to boost scores in this case, but you can achieve the basic matching by applying a filter to both the index and the queries. The ISOLatin1Accent filter seems like it may work for you, though I'm not entirely certain it will cover all the accented characters you need. My approach has been to write new filters: one to normalize the Unicode into the "decomposed" form, then one to manually strip out all of the "add-on" characters (those with a decimal codepoint greater than 256). I don't know if this will always work, but it's worked well for me so far.

I would test out adding the ISOLatin1Accent filter to your analyzer. It might do the trick. Once again, with this approach I'm not sure how to boost either score, so someone else may have better ideas. I'm pretty new to all of this stuff.

Peter

climbingrose wrote:
> Hi guys,
>
> I'm running into some problems with accented (UTF-8) languages. I'd love to hear some ideas about how to use Solr with those languages. Basically, I want to achieve what Google does with UTF-8 languages.
>
> My requirements include:
>
> 1) Accent-insensitive search and proper highlighting:
> For example, we have 2 documents:
> Doc A (title:Lập Trình Viên)
> Doc B (title:Lap Trinh Vien)
> If the user enters "Lập Trình Viên", then Doc B is also matched and "Lập Trình Viên" is highlighted. On the other hand, if the query is "Lap Trinh Vien", Doc A is also matched.
>
> 2) Assign proper scores to accented or non-accented searches:
> If the user enters "Lập Trình Viên", then Doc A should be given a higher score than Doc B. If the query is "Lap Trinh Vien", Doc B should be given the higher score.
>
> Any ideas guys? Thanks in advance!
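
The two-step approach described in the reply (decompose, then drop every codepoint above 256) can be sketched as plain string logic. In a real setup this would live inside an analysis filter applied at both index and query time; AccentFolder and fold are hypothetical names, and the sketch covers only the folding itself, not scoring or highlighting.

```java
import java.text.Normalizer;

/** Hypothetical helper showing decompose-then-strip accent folding. */
final class AccentFolder {
    /**
     * Normalizes the input to Unicode NFD (base letter followed by combining
     * marks), then keeps only codepoints up to 256, which drops the marks.
     */
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            if (cp <= 256) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With this folding, "Lập Trình Viên" and "Lap Trinh Vien" produce the same string. One limitation: letters with no decomposition, such as the Vietnamese đ (U+0111), sit above the cutoff and are removed entirely rather than mapped to a base letter, so they would need an explicit mapping.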
Re: Illegal xml/html character; unicode problems near solr
Nicolas and Yonik,

Thank you both for your excellent responses--this fixed my problem. Now it's time to go back and remove all the hacks I was using to pin this thing together without proper UTF-8 support.

Thanks again,
Peter

[EMAIL PROTECTED] wrote:
> I think Tomcat defaults to the operating system default, e.g. cp1252 on a classic Windows. You need to add an attribute URIEncoding="UTF-8" to the Connector you use in the server.xml conf.
>
> Nicolas
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Yonik Seeley
> Sent: Friday, March 7, 2008 18:53
> To: solr-user@lucene.apache.org
> Subject: Re: Illegal xml/html character; unicode problems near solr
>
> On Fri, Mar 7, 2008 at 12:30 PM, Peter Cline <[EMAIL PROTECTED]> wrote:
>> The following is a snippet of a link to use a facet:
>>
>> search-faceted.html?q=[* TO *]&facet=true&rows=25&fq=name_facet:"Brasseur de Bourbourg, abb%C3%A9, 1814-1874, former owner"
>>
>> These characters are correctly specified. When it returns, I get an illegal character error. Examining the XML, I get an fq value of:
>>
>> name_facet:"Brasseur de Bourbourg, abbÃÂ(c), 1814-1874, former owner"
>
> Is this bad XML part of the responseHeader (parameters that are simply being echoed back)? If so, it's most likely the config on whatever servlet container you are using... you need to configure it to accept UTF-8 URLs rather than latin-1 (Tomcat defaults to the old-style latin-1 AFAIK).
>
> -Yonik
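
The effect of the container's URI encoding can be reproduced outside Solr. The sketch below decodes the same percent-encoded bytes from the example above first as UTF-8 and then as ISO-8859-1 (latin-1). It only illustrates the mismatch; UriDecodingDemo is a hypothetical name, and the code says nothing about how any particular servlet container is implemented.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo of decoding one query-string fragment with two charsets. */
final class UriDecodingDemo {
    /** Percent-decodes the value, interpreting the resulting bytes in the given charset. */
    static String decode(String encoded, Charset charset) {
        try {
            return URLDecoder.decode(encoded, charset.name());
        } catch (UnsupportedEncodingException e) {
            // Cannot happen for the standard charsets used here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = "abb%C3%A9";
        // Bytes C3 A9 read as UTF-8 form a single character: e with acute accent.
        System.out.println(decode(encoded, StandardCharsets.UTF_8));
        // The same bytes read as latin-1 become two characters, U+00C3 and U+00A9.
        System.out.println(decode(encoded, StandardCharsets.ISO_8859_1));
    }
}
```

The second result is the two-character garbling seen in the echoed fq parameter; re-encoding those two characters as UTF-8 in the response compounds the damage.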
Illegal xml/html character; unicode problems near solr
Hi all,

I'm new to the list, but I've been struggling with this problem for some time. I'm getting "Illegal xml/html character" errors and I'm trying to track down the source. The characters in question seem to be in the 128-159 (decimal) range, which is illegal in XML. The characters are mostly diacritics and other types of accents.

The original data is encoded in UTF-8. I have verified that the data doesn't contain any of these characters prior to indexing, and when I get the records in question back in a list of results, they display fine. The problem arises when the characters occur in a facet value and I try to pass it through the URL.

As an example, consider a facet value:

Brasseur de Bourbourg, abb%C3%A9, 1814-1874, former owner

The %C3%A9 is an e with a diacritic, so roughly abbe'.

The following is a snippet of a link to use a facet:

search-faceted.html?q=[* TO *]&facet=true&rows=25&fq=name_facet:"Brasseur de Bourbourg, abb%C3%A9, 1814-1874, former owner"

These characters are correctly specified. When it returns, I get an illegal character error. Examining the XML, I get an fq value of:

name_facet:"Brasseur de Bourbourg, abbé, 1814-1874, former owner"

I'm not sure how that will display in the email, but in short, it's not what I put in. Further, it's not legal HTML, and things break.

Does anyone have any thoughts about this? I apologize if this has been asked somewhere in the past, but I did some digging and couldn't come up with anything. I welcome any input.

Regards,
Peter

Peter Cline, Digital Library Applications Programmer
University of Pennsylvania Library
email: pcline at pobox dot upenn dot edu