Re: GIT does not support empty directories
This directory intentionally left empty. --wunder

On Apr 16, 2010, at 12:33 PM, Ted Dunning wrote: Put a readme file in the directory and be done with it.

On Fri, Apr 16, 2010 at 8:40 AM, Robert Muir rcm...@gmail.com wrote: I don't like the idea of complicating lucene/solr's build system any more than it already is, unless it's absolutely necessary. It's already too complicated. Instead of adding more hacks, what is actually broken (git) is what should be fixed, as the link states: Currently the design of the git index (staging area) only permits *files* to be listed, and nobody competent enough to make the change to allow empty directories has cared enough about this situation to remedy it.

On Fri, Apr 16, 2010 at 11:14 AM, Smiley, David W. dsmi...@mitre.org wrote: Seriously. I sympathize with your point that git should support empty directories. But as a practical matter, it's easy to make the ant build tolerant of them. ~ David Smiley

From: Robert Muir [rcm...@gmail.com] Sent: Friday, April 16, 2010 6:53 AM To: solr-dev@lucene.apache.org Subject: Re: GIT does not support empty directories Seriously? We should hack our ant files around the bugs in every crappy source control system that comes out? Fix Git.

On Thu, Apr 15, 2010 at 10:55 PM, Smiley, David W. dsmi...@mitre.org wrote: I've run into this too. I don't think this needs to be documented, I think it needs to be *fixed* -- that is, the relevant ant tasks need to not assume these directories exist and create them if not. ~ David Smiley

-----Original Message----- From: Lance Norskog [mailto:goks...@gmail.com] Sent: Wednesday, April 14, 2010 11:14 PM To: solr-dev Subject: GIT does not support empty directories There are some empty directories in the Solr source tree, both in 1.4 and the trunk: example/work, example/webapp, example/logs. Git does not support empty directories: https://git.wiki.kernel.org/index.php/GitFaq#Can_I_add_empty_directories.3F And so, when you check out from the Apache GIT repository, these empty directories do not appear, and 'ant example' and 'ant run-example' fail. There is no 'how to use the solr git stuff' wiki page; that seems like the right place to document this. I'm not git-smart enough to write that page. -- Lance Norskog goks...@gmail.com

-- Robert Muir rcm...@gmail.com

-- Walter Underwood Venture ASM, Troop 14, Palo Alto
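Either fix from the thread can be sketched in a few shell commands. The `.gitkeep` name below is only a community convention, not a git feature (any tracked file works, including Ted's readme); the ant-side alternative would be a `<mkdir>` task before the example targets.

```shell
# Recreate the directories named in the thread and add a placeholder
# file to each so git will track them (git tracks files, not directories).
mkdir -p example/work example/webapp example/logs
for d in example/work example/webapp example/logs; do
  touch "$d/.gitkeep"
done
ls example/work/.gitkeep example/webapp/.gitkeep example/logs/.gitkeep
```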
[jira] Commented: (SOLR-534) Return all query results with parameter rows=-1
[ https://issues.apache.org/jira/browse/SOLR-534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832351#action_12832351 ] Walter Underwood commented on SOLR-534: --- -1. This adds a denial-of-service vulnerability to Solr. One query can use lots of CPU or memory, or even crash the server. This could also take out an entire distributed system. If this is added, we MUST add a config option to disable it. Let's take this back to the mailing list and find out why they believe all results are needed. There must be a better way to solve this.

Return all query results with parameter rows=-1 --- Key: SOLR-534 URL: https://issues.apache.org/jira/browse/SOLR-534 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.3 Environment: Tomcat 5.5 Reporter: Lars Kotthoff Priority: Minor Attachments: solr-all-results.patch

The searcher should return all results matching a query when the parameter rows=-1 is given. I know that it is a bad idea to do this in general, but as it explicitly requires a special parameter, people using this feature will be aware of what they are doing. The main use case for this feature is probably debugging, but in some cases one might actually need to retrieve all results because they e.g. are to be merged with results from different sources. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
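The config option Walter asks for might look like this sketch (the function and option names here are invented for illustration, not actual Solr parameters): rows=-1 means "everything" only when an operator has explicitly opted in, and maps to a hard cap otherwise.

```python
# Hypothetical guard around rows=-1: unbounded results are opt-in,
# and the default clamps the request to a DoS-safe cap.
def effective_rows(requested_rows, total_docs, allow_unbounded=False, cap=1000):
    if requested_rows == -1:
        if allow_unbounded:
            return total_docs           # truly "all results"
        return min(cap, total_docs)     # safe default
    return requested_rows

print(effective_rows(-1, 5_000_000))                        # 1000
print(effective_rows(-1, 5_000_000, allow_unbounded=True))  # 5000000
```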
Re: Namespaces in response (SOLR-1586)
On Dec 9, 2009, at 11:11 AM, Mattmann, Chris A (388J) wrote: Any parser that does that is so broken that you should stop using it immediately. --wunder Walter, totally agree here. To elaborate my position:

1. Validation is a user option. The XML spec makes that very clear. We've had 10 years to get that right, and anyone who auto-validates is not paying attention. Validation is very useful when you are creating XML, rarely useful when reading it.

2. XML namespaces are string prefixes that use URL syntax. They do not follow URI rules for anything but syntax, and there is no guarantee that they can be resolved. In fact, an XML parser can't do anything standard with the result if it does resolve them. Again, we've had 10 years to figure that out.

Yes, this can be confusing, but if a parser author can't figure it out, don't use their parser, because they are already getting the simple stuff wrong. wunder
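The namespace point is easy to check with any stock parser. In this Python sketch the namespace URI is deliberately one that could never resolve, and parsing still works, because the URI is treated as an opaque identifier rather than something to fetch.

```python
import xml.etree.ElementTree as ET

# The namespace URI below is never fetched; it only qualifies element
# names (the URI and element names are illustrative, not a real Solr response).
doc = """<r:response xmlns:r="http://example.com/ns/does-not-need-to-exist">
  <r:docs numFound="2"/>
</r:response>"""

root = ET.fromstring(doc)
# ElementTree exposes the qualified name as {uri}localname:
print(root.tag)  # {http://example.com/ns/does-not-need-to-exist}response
docs = root.find("{http://example.com/ns/does-not-need-to-exist}docs")
print(docs.get("numFound"))  # 2
```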
Re: Functions, floats and doubles
Float is almost never good enough. The loss of precision is horrific. wunder

On Nov 13, 2009, at 9:58 AM, Yonik Seeley wrote: On Fri, Nov 13, 2009 at 12:52 PM, Grant Ingersoll gsing...@apache.org wrote: Implementing my first function (distance stuff) and noticed that functions seem to have a float bent to them. Not even sure what would be involved, but there are cases for distance where I could see wanting double precision. Thoughts?

It's an issue in general. But for something like gdist(point_a,point_b), the internal calculations can be done in double precision, and if the result is cast to a float at the end, it should be good enough for most uses, right? -Yonik http://www.lucidimagination.com
Re: Functions, floats and doubles
Float is often OK until you try and use it for further calculation. Maybe it is good enough for printing out distance, but maybe not for further use. wunder On Nov 13, 2009, at 10:32 AM, Yonik Seeley wrote: On Fri, Nov 13, 2009 at 1:01 PM, Walter Underwood wun...@wunderwood.org wrote: Float is almost never good enough. The loss of precision is horrific. Are you saying it's not good enough for this case (the final answer of a relative distance calculation)? 7 digits of precision is enough to represent a distance across the US down to the meter... and points closer together would have higher precision of course. For storage of the points themselves, 32 bit floats may also often be enough (~2.4 meter resolution at the equator). Allowing doubles as an option would be nice too - but I expect that doubling the fieldcache may not be worth it for many. Actually, a 32 bit fixed point representation would have a lot more accuracy for this (256 times the resolution at the cost of on-the-fly conversion to a double for calculations). -Yonik http://www.lucidimagination.com
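The concern about further calculation can be made concrete by round-tripping doubles through IEEE-754 single precision (the metre-scale distances here are illustrative):

```python
import struct

def as_float32(x):
    """Round-trip a Python double through IEEE-754 single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]

# At continental magnitudes in metres, float32 spacing is already 0.25 m,
# so differencing two nearby values turns rounding into a large
# *relative* error -- exactly the "further use" problem described above.
a, b = 4_012_345.678, 4_012_345.378   # truly 0.30 m apart
delta32 = as_float32(a) - as_float32(b)
print(delta32)  # not 0.30
```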
Re: Another RC
Please wait for an official release of Lucene. It makes things SO much easier when you need to dig into the Lucene code. It is well worth a week's delay. wunder

On Oct 19, 2009, at 10:27 AM, Yonik Seeley wrote: On Mon, Oct 19, 2009 at 10:59 AM, Grant Ingersoll gsing...@apache.org wrote: Are we ready for a release? +1 I don't think we need to wait for Lucene 2.9.1 - we have all the fixes in our version, and there's little point in pushing things off yet another week. Seems like the next RC should be a *real* one (i.e. no RC label in the version, immediately call a VOTE). -Yonik http://www.lucidimagination.com

I got busy at work and haven't been able to address things as much, but it seems like things are progressing. Shall I generate another RC, or are we waiting for Lucene 2.9.1? If we go w/ the 2.9.1-dev, then we just need to restore the Maven stuff for them. Hopefully, that stuff was just commented out and not completely removed, so as to make it a little easier to restore. -Grant
Re: 8 for 1.4
It might not be proper to use the name Solr, because it is really Apache Solr. At a minimum, it is misleading to use an Apache project name on GPL'ed code. I agree that changing to GPL is a bad idea. I've worked at eight or nine companies since the GPL was created, and GPL'ed code was forbidden at every one of them. GPL is where code goes to die. wunder

On Sep 29, 2009, at 3:34 AM, Grant Ingersoll wrote: On Sep 29, 2009, at 4:00 AM, Matthias Epheser wrote: Grant Ingersoll schrieb: Moving to GPL doesn't seem like a good solution to me, but I don't know what else to propose. Why don't we just hold it from this release, but keep it in trunk and encourage the Drupal guys and others to submit their changes? Perhaps by then Matthias or you or someone else will have stepped up.

Concerning GPL: the message from the Drupal guys is that the code has changed so much from the initial solrjs that they think it's legally acceptable to release their new code under the GPL and only mention that it was inspired by the still-existing Apache License solrjs. Sounds reasonable to me, but I have little experience with this kind of legal issue. So what do you think?

Oh, it's legally fine. The ASL lets you do pretty much whatever you want. But that is pretty much the point. You're taking code with no restrictions on it and putting a whole slew of them back in, preventing Solr from ever distributing it in the future. Something about that stinks to me. There is a pretty large reason why we do our work at the ASF and not under GPL. I won't go into it here, but suffice it to say one can go read volumes of backstory on this elsewhere by searching for GPL vs ASL (or BSD). Furthermore, Matthias, it may be the case in the future that all that work you did for SolrJS may not even be accessible to you, the original author, under the GPL terms, depending on the company (many, many companies explicitly forbid GPL) that you work for. Is that what you want? Also, they can't call it SolrJS, as that is the name of our version.
Re: [PMX:FAKE_SENDER] Re: large OR-boolean query
This would work a lot better if you did the join at index time. For each paper, add a field with all the related drug names (or whatever you want to search for), then search on that field. With the current design, it will never be fast and never scale. Each lookup has a cost, so expanding a query to a thousand terms will always be slow. Distributing the query to multiple shards will only make a bad design slightly faster. This is fundamental to search index design. The schema is flat, fully denormalized, no joins. You tag each document with the terms that you will use to find it. Then you search for those terms directly. wunder

On Sep 25, 2009, at 7:52 AM, Luo, Jeff wrote: We are searching strings, not numbers. The reason we are doing this kind of query is that we have two big indexes, say, a collection of medicine drugs and a collection of research papers. I first run a query against the drugs index and get 102400 unique drug names back. Then I need to find all the research papers where one or more of the 102400 drug names are mentioned, hence the large OR query. This is a kind of JOIN query between 2 indexes, which an article on the Lucid web site comparing databases and search engines briefly touched on. I was able to issue 100 parallel small queries against solr shards and get the results back successfully (even sorted). My custom code is less than 100 lines, mostly in my SearchHandler.handleRequestBody. But I have a problem summing up the correct facet counts, because the facet counts from each shard are not disjunctive. Based on what is suggested by two other responses to my question, I think it is possible that the master can pass the original large query to each shard, and each shard will split the large query into 100 lower-level disjunctive lucene queries, fire them against its Lucene index in parallel and merge the results. Then each shard shall only return 1 (instead of 100) result set to the master with disjunctive facet counts. It seems that the faceting problem can be solved in this way. I would appreciate it if you could let me know if this approach is feasible and correct, and what solr plug-ins are needed (my guess is a custom parser and query component?) Thanks, Jeff

-----Original Message----- From: Grant Ingersoll [mailto:gsing...@apache.org] Sent: Thursday, September 24, 2009 10:01 AM To: solr-dev@lucene.apache.org Subject: [PMX:FAKE_SENDER] Re: large OR-boolean query

On Sep 23, 2009, at 4:26 PM, Luo, Jeff wrote: Hi, We are experimenting with a parallel approach to issuing a large OR-Boolean query, e.g., keywords:(1 OR 2 OR 3 OR ... OR 102400), against several solr shards. The way we are trying is to break the large query into smaller ones, e.g., the example above can be broken into 10 small queries: keywords:(1 OR 2 OR 3 OR ... OR 1024), keywords:(1025 OR 1026 OR 1027 OR ... OR 2048), etc. Now each shard will get 10 requests, and the master will merge the results coming back from each shard, similar to the regular distributed search.

Can you tell us a little bit more about the why/what of this? Are you really searching numbers or are those just for example? Do you care about the score or do you just need to know whether the result is there or not? -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
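Walter's index-time join can be sketched with plain dictionaries standing in for Solr documents (the drug_names field and all data are invented for illustration): denormalize the first-pass results into each paper at indexing time, and the 102400-term OR query collapses to a single-term lookup.

```python
# Papers to index (stand-ins for Solr documents):
papers = [
    {"id": "p1", "title": "Statin efficacy trial"},
    {"id": "p2", "title": "Aspirin and stroke"},
]
# Result of the first-pass query against the drugs index:
drugs_mentioned = {"p1": ["atorvastatin", "simvastatin"], "p2": ["aspirin"]}

# Index time: tag each paper with the terms you will use to find it.
for paper in papers:
    paper["drug_names"] = drugs_mentioned.get(paper["id"], [])

# Search time: one term on drug_names replaces the giant OR query.
hits = [p["id"] for p in papers if "aspirin" in p["drug_names"]]
print(hits)  # ['p2']
```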
Re: Solr Slow in Unix
In particular, are you using local disc or network storage? --wunder On 7/16/09 8:24 AM, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Jul 16, 2009 at 4:18 AM, Anand Kumar Prabhakaranand2...@gmail.com wrote: I'm running a Solr instance in Apache Tomcat 6 in a Solaris Box. The QTimes are high when compared to the same configuration on a Windows machine. Can anyone help with the configurations i can check to improve the performance? What's the hardware actually look like on each machine? -Yonik http://www.lucidimagination.com
Re: lucene releases vs trunk
This is an excellent idea. When I find a problem and want to research the Lucene bugs that might describe it, that is really hard with a trunk build. It's easy with a release build. wunder On 6/25/09 4:18 AM, Yonik Seeley yo...@lucidimagination.com wrote: For the next release cycle (presumably 1.5?) I think we should really try to stick to released versions of Lucene, and not use dev/trunk versions. Early in Solr's lifetime, Lucene trunk was more stable (APIs changed little, even on non-released versions), and Lucene releases were few and far between. Today, the pace of change in Lucene has quickened, and Lucene APIs are much more in flux until a release is made. It's also now harder to support a Lucene dev release given the growth in complexity (particularly for indexing code). Releases are made more often too, making using released versions more practical. Many of our users dislike our use of dev versions of Lucene too. And yes, 1.4 isn't out the door yet - but people often tend to hit the ground running on the next release. -Yonik http://www.lucidimagination.com
[jira] Commented: (SOLR-1216) disambiguate the replication command names
[ https://issues.apache.org/jira/browse/SOLR-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719609#action_12719609 ] Walter Underwood commented on SOLR-1216: sync is a weak name, because it doesn't say whether it is a push or pull synchronization. disambiguate the replication command names -- Key: SOLR-1216 URL: https://issues.apache.org/jira/browse/SOLR-1216 Project: Solr Issue Type: Improvement Components: replication (java) Reporter: Noble Paul Assignee: Noble Paul Fix For: 1.4 Attachments: SOLR-1216.patch There is a lot of confusion in the naming of various commands such as snappull, snapshot etc. This is a vestige of the script based replication we currently have. The commands can be renamed to make more sense * 'snappull' to be renamed to 'sync' * 'snapshot' to be renamed to 'backup' thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1216) disambiguate the replication command names
[ https://issues.apache.org/jira/browse/SOLR-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719625#action_12719625 ] Walter Underwood commented on SOLR-1216: If we choose a name for the thing we are pulling, like image, then we can use makeimage, pullimage, etc. disambiguate the replication command names -- Key: SOLR-1216 URL: https://issues.apache.org/jira/browse/SOLR-1216 Project: Solr Issue Type: Improvement Components: replication (java) Reporter: Noble Paul Assignee: Noble Paul Fix For: 1.4 Attachments: SOLR-1216.patch There is a lot of confusion in the naming of various commands such as snappull, snapshot etc. This is a vestige of the script based replication we currently have. The commands can be renamed to make more sense * 'snappull' to be renamed to 'sync' * 'snapshot' to be renamed to 'backup' thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Streaming Docs, Terms, TermVectors
Don't stream; request chunks of 10 or 100 at a time. It works fine and you don't have to write or test any new code. In addition, it works well with HTTP caches, so if two clients want to get the same data, the second can get it from the cache. We do that at Netflix. Each front-end box does a series of queries to get all the movie titles, then loads them into a local index for autocomplete. wunder

On 5/30/09 11:01 AM, Kaktu Chakarabati jimmoe...@gmail.com wrote: For a streaming-like solution, it is in fact possible to have a working buffer in-memory that emits chunks on an http connection which is kept alive by the server until the full response has been sent. This is quite similar, for example, to how video streaming protocols which can operate on top of HTTP work (cf. a more general discussion at http://ajaxpatterns.org/HTTP_Streaming#In_A_Blink ). Another (non-mutually exclusive) possibility is to introduce a novel binary format for the transmission of such data (i.e. a new wt=.. type) over http (or any other comm. protocol) so that data can be more effectively compressed and made to better fit into memory. One such format which has been widely circulating and already has many open source projects implementing it is Adobe's AMF ( http://osflash.org/documentation/amf ). It is however a proprietary format, so I'm not sure whether it is incorporable under Apache foundation terms. -Chak

On Sat, May 30, 2009 at 9:58 AM, Dietrich Featherston d...@dfeatherston.com wrote: I was actually curious about the same thing. Perhaps an endpoint reference could be passed in the request where the documents can be sent asynchronously, such as a jms topic: solr/query?q=*:*&epr=/my/topic&eprtype=jms Then we would need to consider how to break up the response, how to cancel a running query, etc. Is this along the lines of what you're looking for? I would be interested in looking at how the request/response contract changes and what types of endpoint references would be supported.
Thanks, D On May 30, 2009, at 12:45 PM, Grant Ingersoll gsing...@apache.org wrote: Anyone have any thoughts on what is involved with streaming lots of results out of Solr? For instance, if I wanted to get something like 1M docs out of Solr (or more) via *:* query, how can I tractably do this? Likewise, if I wanted to return all the terms in the index or all the Term Vectors. Obviously, it is impossible to load all of these things into memory and then create a response, so I was wondering if anyone had any ideas on how to stream them. Thanks, Grant
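Walter's chunking suggestion, sketched as a Python generator: fetch() stands in for an HTTP request using Solr's start and rows paging parameters (the in-memory corpus is fake, for illustration only).

```python
def fetch(start, rows):
    """Stand-in for a Solr request with ?start=...&rows=... parameters."""
    corpus = [f"doc{i}" for i in range(257)]   # pretend index
    return corpus[start:start + rows]

def all_docs(rows=100):
    """Page through the whole result set one chunk at a time."""
    start = 0
    while True:
        chunk = fetch(start, rows)
        if not chunk:
            break
        yield from chunk
        start += rows

docs = list(all_docs())
print(len(docs))  # 257
```

Each chunk is an ordinary cacheable HTTP response, which is what makes the second client's pass cheap.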
[jira] Commented: (SOLR-1073) StrField should allow locale sensitive sorting
[ https://issues.apache.org/jira/browse/SOLR-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703893#action_12703893 ] Walter Underwood commented on SOLR-1073: Using the locale of the JVM is very, very bad for a multilingual server. Solr should always use the same, simple locale. It is OK to set a Locale in configuration for single-language installations, but using the JVM locale is a recipe for disaster. You move Solr to a different server and everything breaks. Very, very bad. In a multi-lingual config, locales must be set per-request. Ideally, requests should send an ISO language code as context for the query.

StrField should allow locale sensitive sorting -- Key: SOLR-1073 URL: https://issues.apache.org/jira/browse/SOLR-1073 Project: Solr Issue Type: Improvement Environment: All Reporter: Sachin Attachments: LocaleStrField.java

Currently, StrField does not take a parameter which it can pass to the ctor of SortField, making StrField's sorting rely on the locale of the JVM. Ideally, StrField should allow setting the locale in schema.xml and use it to create a new instance of the SortField in the getSortField() method, something like:

public SortField getSortField(SchemaField field, boolean reverse) {
    ...
    Locale locale = new Locale(lang, country);
    return new SortField(field.getName(), locale, reverse);
}

More details about this issue here: http://www.nabble.com/CJKAnalyzer-and-Chinese-Text-sort-td22374195.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication
[ https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12678601#action_12678601 ] Walter Underwood commented on SOLR-1044: During the Oscars, the HTTP cache in front of our Solr farm had a 90% hit rate. I think a 10X reduction in server load is a testimony to the superiority of the HTTP approach. Use Hadoop RPC for inter Solr communication --- Key: SOLR-1044 URL: https://issues.apache.org/jira/browse/SOLR-1044 Project: Solr Issue Type: New Feature Components: search Reporter: Noble Paul Solr uses http for distributed search . We can make it a whole lot faster if we use an RPC mechanism which is more lightweight/efficient. Hadoop RPC looks like a good candidate for this. The implementation should just have one protocol. It should follow the Solr's idiom of making remote calls . A uri + params +[optional stream(s)] . The response can be a stream of bytes. To make this work we must make the SolrServer implementation pluggable in distributed search. Users should be able to choose between the current CommonshttpSolrServer, or a HadoopRpcSolrServer . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
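The arithmetic behind the 10X claim, worked out: with a 90% cache hit rate, only one request in ten ever reaches the Solr farm.

```python
# Only cache misses reach the origin servers.
hit_rate = 0.90
origin_fraction = 1.0 - hit_rate        # share of traffic hitting Solr
load_reduction = 1.0 / origin_fraction
print(round(load_reduction))  # 10
```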
Re: Is there a built in keyword report (Tag Cloud) feature on Solr ?
That info is already available via Luke, right? --wunder

On 2/26/09 9:55 AM, Robert Douglass r...@robshouse.net wrote: A solution that I'm considering implementing for Drupal's ApacheSolr module is to do a *:* search and then make tag clouds from all of the facets. It's pretty easy to sort all the facet terms into bins based on the number of documents they match, and then to translate bins to font sizes. Tag clouds make a nice alternate representation of facet blocks. Robert Douglass The RobsHouse.net Newsletter: http://robshouse.net/newsletter/robshousenet-newsletter Follow me on Twitter: http://twitter.com/robertDouglass

On Feb 26, 2009, at 6:50 PM, Emmanuel Castro Santana wrote: I am developing a Solr based search application and need to get a kind of a keyword report for tag cloud generation. If there is anyone here who has ever had that necessity and has somehow found a way through, I would really appreciate some help. Thanks in advance -- View this message in context: http://www.nabble.com/Is-there-a-built-in-keyword-report-%28Tag-Cloud%29-feature-on-Solr---tp9677p9677.html Sent from the Solr - Dev mailing list archive at Nabble.com.
Re: Is there a built in keyword report (Tag Cloud) feature on Solr ?
Oops, missed that you wanted it by facet. Never mind. --wunder

On 2/26/09 9:57 AM, Walter Underwood wunderw...@netflix.com wrote: That info is already available via Luke, right? --wunder

On 2/26/09 9:55 AM, Robert Douglass r...@robshouse.net wrote: A solution that I'm considering implementing for Drupal's ApacheSolr module is to do a *:* search and then make tag clouds from all of the facets. It's pretty easy to sort all the facet terms into bins based on the number of documents they match, and then to translate bins to font sizes. Tag clouds make a nice alternate representation of facet blocks. Robert Douglass The RobsHouse.net Newsletter: http://robshouse.net/newsletter/robshousenet-newsletter Follow me on Twitter: http://twitter.com/robertDouglass

On Feb 26, 2009, at 6:50 PM, Emmanuel Castro Santana wrote: I am developing a Solr based search application and need to get a kind of a keyword report for tag cloud generation. If there is anyone here who has ever had that necessity and has somehow found a way through, I would really appreciate some help. Thanks in advance -- View this message in context: http://www.nabble.com/Is-there-a-built-in-keyword-report-%28Tag-Cloud%29-feature-on-Solr---tp9677p9677.html Sent from the Solr - Dev mailing list archive at Nabble.com.
Re: Is there a built in keyword report (Tag Cloud) feature on Solr ?
If you want a tag cloud based on query frequency, start with your HTTP log analysis tools. Most of those generate a list of top queries and top words in queries. wunder

On 2/26/09 2:54 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I may have not made myself clear. When I say keyword report, I mean a kind : of a most popular tag cloud, showing in bigger sizes the most searched : terms. Therefore I need information about how many times specific terms have : been searched and I can't see how I could accomplish that with this : solution

You have to be more explicit about what you ask for. I've never heard anyone refer to a tag cloud as being based on how often a term is searched for -- everyone I know uses the frequency of words in the corpus, sometimes with a decay function to promote words mentioned in more recent docs. Solr doesn't keep any record of the searches performed, so to build a tag cloud based on query popularity you would need to mine your logs. If you want a tag cloud based on the frequency of words in your corpus, the faceting approach mentioned would work -- but a simpler way to get term counts for the whole index (*:*) would be the TermsComponent. You only really need the facet-based solution if you want a cloud based on a subset of documents (ie: a cloud for all documents matching category:computer) -Hoss
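The count-to-font-size binning discussed in this thread can be sketched in a few lines; log-scale bins suit the roughly Zipfian shape of term counts (the pixel sizes, bin count, and sample counts are illustrative).

```python
import math

def font_sizes(term_counts, min_px=10, max_px=32, bins=6):
    """Map term counts to font sizes via log-scale bins."""
    lo = math.log(min(term_counts.values()))
    hi = math.log(max(term_counts.values()))
    step = (max_px - min_px) / (bins - 1)
    sizes = {}
    for term, count in term_counts.items():
        # Position of this count on the log scale, snapped to a bin:
        b = 0 if hi == lo else round((math.log(count) - lo) / (hi - lo) * (bins - 1))
        sizes[term] = min_px + int(b * step)
    return sizes

counts = {"solr": 1200, "lucene": 800, "facet": 90, "luke": 5}
print(font_sizes(counts))
```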
Re: [jira] Issue Comment Edited: (SOLR-844) A SolrServer impl to front-end multiple urls
This would be useful if there were search-specific balancing, like always sending the same query back to the same server. That can make your cache far more effective. wunder

On 1/22/09 1:13 PM, Otis Gospodnetic (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/SOLR-844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666296#action_12666296 ] Otis edited comment on SOLR-844 at 1/22/09 1:12 PM: I'm not sure there is a clear consensus about this functionality being a good thing (also 0 votes). Perhaps we can get more people's opinions? was (Author: otis): I'm not sure there is a clear consensus about this functionality being a good thing. Perhaps we can get more people's opinions?

A SolrServer impl to front-end multiple urls Key: SOLR-844 URL: https://issues.apache.org/jira/browse/SOLR-844 Project: Solr Issue Type: New Feature Components: clients - java Affects Versions: 1.3 Reporter: Noble Paul Assignee: Shalin Shekhar Mangar Fix For: 1.4 Attachments: SOLR-844.patch, SOLR-844.patch, SOLR-844.patch

Currently a {{CommonsHttpSolrServer}} can talk to only one server. This demands that the user have a LoadBalancer or do the round-robin on their own. We must have a {{LBHttpSolrServer}} which must automatically do load balancing between multiple hosts. This can be backed by the {{CommonsHttpSolrServer}}. This can have the following other features: * Automatic failover * Optionally take in a file/url containing the urls of servers so that the server list can be automatically updated by periodically loading the config * Support for adding/removing servers during runtime * Pluggable load-balancing mechanism (round-robin, weighted round-robin, random etc.) * Pluggable failover mechanisms
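The query-affinity balancing Walter suggests reduces to hashing the query string to pick a server. A sketch (host names are invented; md5 is used instead of Python's built-in hash() so the mapping is stable across processes and restarts):

```python
import hashlib

SERVERS = ["solr1:8983", "solr2:8983", "solr3:8983"]  # illustrative hosts

def pick_server(query):
    """Route the same query string to the same server every time."""
    h = int(hashlib.md5(query.encode("utf-8")).hexdigest(), 16)
    return SERVERS[h % len(SERVERS)]

# The repeat query hits the same server, so its cached result is warm:
print(pick_server("star trek") == pick_server("star trek"))  # True
```

A production version would layer failover on top (e.g. consistent hashing so one dead host does not reshuffle every query).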
[jira] Commented: (SOLR-822) CharFilter - normalize characters before tokenizer
[ https://issues.apache.org/jira/browse/SOLR-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642188#action_12642188 ] Walter Underwood commented on SOLR-822: --- Yes, it should be in Lucene. Like this: http://webui.sourcelabs.com/lucene/issues/1343

There are (at least) four kinds of character mapping: Unicode normalization from decomposed to composed forms (always safe). Unicode normalization from compatibility forms to standard forms (may change the look, like fullwidth to halfwidth Latin). Language-specific normalization, like oe to ö (German-only). Mappings that improve search but are linguistically dodgy, like stripping accents and mapping katakana to hiragana. wunder

CharFilter - normalize characters before tokenizer -- Key: SOLR-822 URL: https://issues.apache.org/jira/browse/SOLR-822 Project: Solr Issue Type: New Feature Components: Analysis Reporter: Koji Sekiguchi Priority: Minor Attachments: character-normalization.JPG, sample_mapping_ja.txt, SOLR-822.patch, SOLR-822.patch

A new plugin which can be placed in front of the tokenizer:

{code:xml}
<fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_ja.txt"/>
    <tokenizer class="solr.MappingCJKTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code}

The charFilter can be multiple (chained). I'll post a JPEG file to show a character normalization sample soon. MOTIVATION: In Japan, there are two types of tokenizers -- N-gram (CJKTokenizer) and Morphological Analyzer. When we use a morphological analyzer, because the analyzer uses a Japanese dictionary to detect terms, we need to normalize characters before tokenization. I'll post a patch soon, too. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
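The first two mapping kinds in Walter's list correspond to stock Unicode normalization forms, which Python's unicodedata exposes directly; the last two need custom mapping tables, which is what the proposed CharFilter supplies.

```python
import unicodedata

# 1) Decomposed -> composed (always safe):
#    'e' + COMBINING ACUTE ACCENT becomes the single code point 'é'.
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"

# 2) Compatibility -> standard forms (may change the look):
#    NFKC folds the 'fi' ligature U+FB01 to the two letters 'fi'.
assert unicodedata.normalize("NFKC", "\ufb01") == "fi"

# Kinds 3 and 4 (language-specific mappings, accent stripping,
# katakana<->hiragana) are not normalization and need mapping tables.
print("normalization checks passed")
```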
[jira] Commented: (SOLR-815) Add new Japanese half-width/full-width normalization Filter and Factory
[ https://issues.apache.org/jira/browse/SOLR-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641071#action_12641071 ] Walter Underwood commented on SOLR-815: --- I looked it up, and even found a reason to do it the right way. Latin should be normalized to halfwidth (in the Latin-1 character space). Kana should be normalized to fullwidth. Normalizing Latin characters to fullwidth would mean you could not use the existing accent-stripping filters or probably any other filter that expected Latin-1, like synonyms. Normalizing to halfwidth makes the rest of Solr and Lucene work as expected. See section 12.5: http://www.unicode.org/versions/Unicode5.0.0/ch12.pdf The compatibility forms (the ones we normalize away from) are in the Unicode range U+FF00 to U+FFEF. The correct mappings from those forms are in this doc: http://www.unicode.org/charts/PDF/UFF00.pdf Other charts are here: http://www.unicode.org/charts/

Add new Japanese half-width/full-width normalization Filter and Factory -- Key: SOLR-815 URL: https://issues.apache.org/jira/browse/SOLR-815 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.3 Reporter: Todd Feak Assignee: Koji Sekiguchi Priority: Minor Attachments: SOLR-815.patch Japanese Katakana and Latin alphabet characters exist as both a half-width and full-width version. This new Filter normalizes to the full-width version to allow searching and indexing using both. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
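Python's unicodedata shows NFKC folding widths in exactly the direction recommended here: fullwidth Latin to halfwidth (Latin-1), halfwidth kana to fullwidth.

```python
import unicodedata

fullwidth_a  = "\uFF21"  # fullwidth 'Ａ' (a U+FF00-block compatibility form)
halfwidth_ka = "\uFF76"  # halfwidth katakana 'ｶ'

# Latin folds to halfwidth, so downstream Latin-1 filters keep working:
print(unicodedata.normalize("NFKC", fullwidth_a))   # 'A'
# Kana folds to fullwidth:
print(unicodedata.normalize("NFKC", halfwidth_ka))  # 'カ'
```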
[jira] Commented: (SOLR-814) Add new Japanese Hiragana Filter and Factory
[ https://issues.apache.org/jira/browse/SOLR-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12640605#action_12640605 ] Walter Underwood commented on SOLR-814: --- This seems like a bad idea. Hiragana and katakana are used quite differently in Japanese. They are not interchangeable. I was the engineer for Japanese support in Ultraseek for years and even visited our distributor there, but no one ever asked for this feature. They asked for a lot of things, but never this. It is very useful, maybe essential, to normalize full-width and half-width versions of hiragana, katakana, and ASCII. Add new Japanese Hiragana Filter and Factory Key: SOLR-814 URL: https://issues.apache.org/jira/browse/SOLR-814 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.3 Reporter: Todd Feak Priority: Minor Attachments: SOLR-814.patch Japanese Hiragana and Katakana character sets can be easily translated between. This filter normalizes all Hiragana characters to their Katakana counterpart, allowing for indexing and searching using either. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
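For reference, the hiragana-to-katakana translation the patch performs is a fixed codepoint shift, since the two blocks are parallel in Unicode; a sketch follows (Walter's objection to doing this at all still stands):

```python
def hiragana_to_katakana(s: str) -> str:
    # Hiragana U+3041..U+3096 and katakana U+30A1..U+30F6 are parallel
    # blocks offset by 0x60, so a shift covers the common characters.
    return "".join(
        chr(ord(c) + 0x60) if "\u3041" <= c <= "\u3096" else c
        for c in s
    )

assert hiragana_to_katakana("とうきょう") == "トウキョウ"
```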
[jira] Commented: (SOLR-815) Add new Japanese half-width/full-width normalization Filter and Factory
[ https://issues.apache.org/jira/browse/SOLR-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12640609#action_12640609 ] Walter Underwood commented on SOLR-815: --- If I remember correctly, Latin characters should normalize to half-width, not full-width. Add new Japanese half-width/full-width normalization Filter and Factory -- Key: SOLR-815 URL: https://issues.apache.org/jira/browse/SOLR-815 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.3 Reporter: Todd Feak Priority: Minor Attachments: SOLR-815.patch Japanese Katakana and Latin alphabet characters exist as both a half-width and full-width version. This new Filter normalizes to the full-width version to allow searching and indexing using both. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Offer to submit some custom enhancements
Python marshal format supports everything we need and is easy to implement in Java. It is roughly equivalent to JSON, but binary. http://docs.python.org/library/marshal.html wunder On 10/16/08 8:16 AM, Shalin Shekhar Mangar [EMAIL PROTECTED] wrote: Hi Todd, AFAIK, protocol buffers cannot be used for Solr because it is unable to support the NamedList structure that all Solr components use. The binary protocol (NamedListCodec) that SolrJ uses to communicate with Solr server is extremely optimized for our response format. However it is Java only. There are other projects such as Apache Thrift ( http://incubator.apache.org/thrift/) and Etch (both in incubation) which can be looked at. There are a few issues in Thrift which may help us in the future: https://issues.apache.org/jira/browse/THRIFT-110 https://issues.apache.org/jira/browse/THRIFT-122 On Thu, Oct 16, 2008 at 12:18 AM, Feak, Todd [EMAIL PROTECTED]wrote: Reposting, as I inadvertently thread hijacked on the first one. My bad. Hi all, I have a handful of custom classes that we've created for our purposes here. I'd like to share them if you think they have value for the rest of the community, but I wanted to check here before creating JIRA tickets and patches. Here's what I have: 1. DoubleMetaphoneFilter and Factory. This replaces usage of the PhoneticFilter and Factory allowing access to set maxCodeLength() on the DoubleMetaphone encoder and access to the alternate encodings that the encoder provides for some words. 2. JapaneseHalfWidthFilter and Factory. Some Japanese characters (and Latin alphabet) exist in both a FullWidth and HalfWidth form. This filter normalizes by switching to the FullWidth form for all the characters. I have seen at least one JIRA ticket about this issue. This implementation doesn't rely on Java 1.6. 3. JapaneseHiraganaFilter and Factory. Japanese Hiragana can be translated to Katakana. This filter normalizes to Katakana so that data and queries can come in either way and get hits. 
Also, I have been requested to create a prototype that you may be interested in. I'm to construct a QueryResponseWriter that returns documents using Google's Protocol Buffers. This would rely on an existing patch that exposes the OutputStream, but I would like to start the work soon. Are there license concerns that would block sharing this with you? Is there any interest in this? Thanks for your consideration, Todd Feak
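Walter's marshal suggestion above is a one-liner on the Python side. Note the binary format is Python-version-dependent (marshal makes no cross-version guarantee), which a Java implementation would have to pin down; the payload here is a hypothetical Solr-style response, not an actual NamedList:

```python
import marshal

# A hypothetical Solr-style response payload
payload = {"responseHeader": {"status": 0}, "docs": [{"id": "1", "score": 1.5}]}

blob = marshal.dumps(payload)          # compact binary, roughly JSON-equivalent types
assert marshal.loads(blob) == payload  # lossless round trip
```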
[jira] Commented: (SOLR-777) backward match search, for domain search etc.
[ https://issues.apache.org/jira/browse/SOLR-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12632489#action_12632489 ] Walter Underwood commented on SOLR-777: --- You don't need backwards matching for this, and it doesn't really do the right thing. Split the string on ".", reverse the list, and join successive sublists with ".". Don't index the length-one list, since that is .com, .net, etc. Do the same processing at query time. This is a special analyzer. backward match search, for domain search etc. - Key: SOLR-777 URL: https://issues.apache.org/jira/browse/SOLR-777 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.3 Reporter: Koji Sekiguchi Priority: Minor There is a requirement for searching domains with backward match. For example, using apache.org for a query string, www.apache.org, lucene.apache.org could be returned. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
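The reversed-subdomain scheme Walter describes can be sketched as a tokenizer in a few lines; the names here are illustrative, not an actual Solr analyzer:

```python
def domain_tokens(domain: str):
    # "lucene.apache.org" -> "apache.org", "lucene.apache.org"
    # The length-one suffix ("org") is deliberately not indexed.
    parts = domain.split(".")[::-1]  # ["org", "apache", "lucene"]
    for i in range(2, len(parts) + 1):
        yield ".".join(parts[:i][::-1])

assert list(domain_tokens("lucene.apache.org")) == ["apache.org", "lucene.apache.org"]
# The query "apache.org" is analyzed the same way, so it matches both
# "www.apache.org" and "lucene.apache.org" on their shared "apache.org" token.
```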
Re: replace stax API with Geronimo-stax+Woodstox
We've been using woodstox in production for over a year. No problems. wunder On 9/9/08 8:07 AM, Yonik Seeley [EMAIL PROTECTED] wrote: FYI, I'm testing Solr with woodstox now and will probably do some ad hoc stress testing too. But woodstox is a quality parser. I expect fewer problems than we had with the reference implementation (and it may even be faster too) -Yonik
Re: Solr changes date format?
On 8/12/08 11:42 AM, Chris Hostetter [EMAIL PROTECTED] wrote: : by a point but, as you can see, the separator is converted to a comma when : is accessed : from Solr (i can see this too from Solr web admin) this boggles my mind ... i can't think of *anything* in Solr that would do this .. If a European locale was used when the seconds portion of the date was formatted, it would use a comma for the radix point. wunder
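The diagnosis above is easy to reproduce: under a German (or most other European) locale, the decimal separator in formatted numbers, including fractional seconds, is a comma. A sketch that falls back to the literal value when the locale isn't installed:

```python
import locale

try:
    locale.setlocale(locale.LC_NUMERIC, "de_DE.UTF-8")  # assumes this locale is installed
    formatted = locale.format_string("%.3f", 12.345)
except locale.Error:
    formatted = "12,345"  # what the German locale would produce

print(formatted)  # comma radix point, not a period
```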
Re: [VOTE] Set Solr 1.3 freeze and release date
I would strongly prefer a released version of Lucene. We made some changes to Solr 1.1 that required tweaks inside of Lucene, and it was quite a treasure hunt to a suitable set of Lucene source. It just seems wrong for Solr to release a version of Lucene. wunder On 8/6/08 8:53 AM, Chris Hostetter [EMAIL PROTECTED] wrote: : Yes, it's good that lots of Solr people are also Lucene people. But I : don't think that makes it alright to ship Lucene nightlies or : snapshots. Apache Lucene is a TLP, Apache Solr and Apache Lucene-Java are just individual products/sub-projects of that TLP. If the Apache Lucene PMC votes to release a particular bundle of source code as Apache Solr 1.3 and that bundle includes source (or binary) code from the Lucene-Java subproject that hasn't already been released (via PMC vote) then it is by definition officially released Apache Lucene software. So in a nutshell: yes it is alright for Solr to ship Lucene nightlies -- because once the PMC votes on that Solr release, it doesn't matter where that Lucene-Java jar came from, it's officially released code. I'm told there is even precedence for the PMC of a TLP X to vote and officially release code from a completely separate TLP Y because Y had not had a release and X was ready to go. Where dependencies on snapshots in official releases causes problems is when those snapshots are from third parties and/or are not reproducible -- where the specific version of the dependencies is unknown and as a result the dependee cannot be reproduced. We do not have that problem with any Apache codebase we have a dependency on. We know exactly which svn revision the dependencies come from, and since the SVN repository is public, anyone can recreate it. -Hoss
Re: Solr Logo thought
I kind of like the flaming version at http://www.solrmarc.org/ Not very fired up about the other choices. wunder On 8/1/08 9:45 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hola, Yes, logo, trivial issue (hi Lance). But logos are important, so: I've cast my vote, but I don't really love even the logo I voted for (#2 -- a little too pale/shiny, not very bold, so to speak). Lukas (BCCed) did the logo for Mahout. He made a number of variations and was very open to suggestions during the process. I wonder if we could ask him to give Solr logo a shot if he is not on vacation. Do we have time for another logo, assuming Lukas is willing to contribute? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
[jira] Commented: (SOLR-600) XML parser stops working under heavy load
[ https://issues.apache.org/jira/browse/SOLR-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12605751#action_12605751 ] Walter Underwood commented on SOLR-600: --- It could also be a concurrency bug in Solr that shows up on the IBM JVM because the thread scheduler makes different decisions. XML parser stops working under heavy load - Key: SOLR-600 URL: https://issues.apache.org/jira/browse/SOLR-600 Project: Solr Issue Type: Bug Components: update Affects Versions: 1.3 Environment: Linux 2.6.19.7-ss0 #4 SMP Wed Mar 12 02:56:42 GMT 2008 x86_64 Intel(R) Xeon(R) CPU X5450 @ 3.00GHz GenuineIntel GNU/Linux Tomcat 6.0.16 SOLR nightly 16 Jun 2008, and versions prior JRE 1.6.0 Reporter: John Smith Under heavy load, the following is spat out for every update: org.apache.solr.common.SolrException log SEVERE: java.lang.NullPointerException at java.util.AbstractList$SimpleListIterator.hasNext(Unknown Source) at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:225) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:66) at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:196) at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125) at org.apache.solr.core.SolrCore.execute(SolrCore.java:965) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447) at java.lang.Thread.run(Thread.java:735) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: IDF in Distributed Search
Global IDF does not require another request/response. It is nearly free if you return the right info. Return the total number of docs and the df in the original response. Sum the doc counts and dfs, recompute the idf, and re-rank. See this post for an efficient way to do it: http://wunderwood.org/most_casual_observer/2007/04/progressive_reranking.html This works best if you treat the results from each server as a queue and refill just that queue when it is exhausted. All the good results might be from one server. wunder On 4/11/08 8:50 PM, Yonik Seeley [EMAIL PROTECTED] wrote: On Fri, Apr 11, 2008 at 11:39 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote: So, I'd like to see what it would take to add distributed IDF info to Solr's distributed search. Here are some questions to get the discussion going: - Is anyone already working on it? - Does anyone plan on working on it in the very near future? - Does anyone already have thoughts how and where dist. idf could be plugged in? - There is a mention of dist idf and performance cost up there - any idea how costly dist idf would be? It's relatively easy to implement, but the performance cost is not negligible since it adds another search phase (another request-response). It should be optional of course (globalidf=true), so there is no reason not to add this feature. I also left room for this stage (ResponseBuilder.STAGE_PARSE_QUERY), which is ordered before query execution. -Yonik
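The merge Walter describes can be sketched directly; the per-shard stats and the idf formula here are illustrative assumptions, not Solr's actual distributed API:

```python
import math

# Hypothetical per-shard stats piggybacked on the original response:
# total doc count plus document frequency for each query term.
shards = [
    {"numDocs": 1000, "df": {"solr": 50}},
    {"numDocs": 4000, "df": {"solr": 10}},
]

def global_idf(term: str) -> float:
    n = sum(s["numDocs"] for s in shards)           # corpus size across shards
    df = sum(s["df"].get(term, 0) for s in shards)  # summed document frequency
    return 1.0 + math.log(n / (df + 1))             # Lucene-style idf (assumed formula)

# Re-rank the merged hit list with this corrected idf in place of each
# shard's local idf -- no extra request/response round trip required.
```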
[jira] Commented: (SOLR-127) Make Solr more friendly to external HTTP caches
[ https://issues.apache.org/jira/browse/SOLR-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12567068#action_12567068 ] Walter Underwood commented on SOLR-127: --- Two reasons to do HTTP caching for Solr: First, Solr is HTTP and needs to implement that correctly. Second, caches are much harder to implement and test than the cache information in HTTP. HTTP caches already exist and are well tested, so the implementation cost is zero and deployment is very easy. The HTTP spec already covers which responses should be cached. A 400 response may only be cached if it includes explicit cache control headers which allow that. See RFC 2616. We are using a caching load balancer and caching in Apache front ends to Tomcat. We see an increase of more than 2X in the capacity of our search farm. I would recommend against Solr-specific cache information in the XML part of the responses. Distributed caching is extremely difficult to get right. Around 25% of the HTTP 1.1 spec is devoted to caching and there are still grey areas. Make Solr more friendly to external HTTP caches --- Key: SOLR-127 URL: https://issues.apache.org/jira/browse/SOLR-127 Project: Solr Issue Type: Wish Reporter: Hoss Man Assignee: Hoss Man Fix For: 1.3 Attachments: CacheUnitTest.patch, CacheUnitTest.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch an offhand comment I saw recently reminded me of something that really bugged me about the search solution i used *before* Solr -- it didn't play nicely with HTTP caches that might be sitting in front of it.
at the moment, Solr doesn't put in particularly useful info in the HTTP Response headers to aid in caching (ie: Last-Modified), responds to all HEAD requests with a 400, and doesn't do anything special with If-Modified-Since. At the very least, we can set a Last-Modified based on when the current IndexReader was opened (if not the Date on the IndexReader) and use the same info to determine how to respond to If-Modified-Since requests. (for the record, i think the reason this hasn't occurred to me in the 2+ years i've been using Solr, is because with the internal caching, i've yet to need to put a proxy cache in front of Solr) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: remote solrj using xml versus json
If you want speed, you should use Python marshal format. It handles data structures equivalent to JSON, but in binary. Very easy to convert to Java data types. --wunder On 11/9/07 12:56 PM, Erik Hatcher [EMAIL PROTECTED] wrote: anybody compared/contrasted the two? seems like yonik's noggit parser might have a performance edge on xml parsing ?! Erik
Re: default text type and stop words
I also said, Stopword removal is a reasonable default because it works fairly well for a general text corpus. Ultraseek keeps stopwords but most engines don't. I think it is fine as a default. I also think you have to understand stopwords at some point. wunder On 11/5/07 9:59 PM, Chris Hostetter [EMAIL PROTECTED] wrote: : This isn't a problem in Lucene or Solr. It is a result of the analyzers : you have chosen to use. If you choose to remove stopwords, you will not : be able to match stopwords. I believe paul's point was that this use of stopwords is in the text fieldtype in the example schema.xml ... which many people use as is. I'm personally of the mindset that it's fine like it is. While people who understand that an is a stop word might ask why does 'rating:PG AND name:an' match 40K movies, it should match 0? there is another (probably larger) group of people who won't know how the search is implemented, or that an is a stop word, and they will look at the same results and ask why am i getting 40K results? most of these don't have 'an' in the title? i should only be getting X results. That second group of people aren't going to be any happier if you give them 0 results instead -- at least this way people get some results to work with. -Hoss
Re: HTTP or RMI, Jini, JavaSpaces for distributed search
Please don't switch to RMI. We've spent the past year converting our entire middle tier from RMI to HTTP. We are so glad that we no longer have any RMI servers. The big advantage of HTTP is that there are hundreds, maybe thousands, of engineers working on making it fast, on tools for it, on caches, etc. If you really need more compact responses, I would recommend coding the JSON output in Python marshal format. That is compact, fast, and easy to parse. We used that for a Java client in Ultraseek. wunder On 9/21/07 11:08 AM, Yonik Seeley [EMAIL PROTECTED] wrote: I wanted to take a step back for a second and think about if HTTP was really the right choice for the transport for distributed search. I think the high-level approach in SOLR-303 is the right way to go about it, but I'm unsure if HTTP is the right transport. Pro HTTP: - using HTTP allows one to use an http load-balancer to distribute load across multiple copies of the same shard by assigning a VIP (virtual IP) to each shard. - because you do pretty much everything by hand, you know that there isn't some hidden limitation that will jump out and bite you later. Cons HTTP: - you end up doing everything by hand... connection handling, request serialization, response parsing, etc... - goes through normal servlet channels... every sub-request will be logged to the access logs, slowing things down. - more network bandwidth used unless we come up with a new BinaryResponseWriter and Parser Currently, SOLR-303 uses and parses the XML response format, which has some serious downsides: - response size limits scalability and how deep in responses you can go... If you want to retrieve documents 5000 through 5009, even though the user only requested 10 documents, the top-level searcher needs to get the top 5009 documents from *each* shard... and that can quickly exhaust the network bandwidth of the NIC. XML parsing on the order of nShards*5009 entries won't be any picnic either. 
I'm thinking the load-balancing of HTTP is overrated also, because it's inflexible. Adding another shard requires adding another VIP in the load-balancer, and changing which servers have which shards or adding new copies of a shard also requires load-balancer configuration. Everything points to Solr being able to do the load-balancing itself in the future, and there wouldn't seem to be much benefit to using a load-balancer w/ VIPs for each shard vs having Solr do it. So even if we stuck with HTTP, Solr would need - a binary protocol to minimize network bandwidth use - load balancing across shard copies itself Given that, would it make sense to just go with RMI instead? And perhaps leverage some other higher level services (Jini? JavaSpaces?) I'd like to hear from people with more experience with RMI and friends, and what the potential downsides are to using these technologies. -Yonik
[jira] Commented: (SOLR-127) Make Solr more friendly to external HTTP caches
[ https://issues.apache.org/jira/browse/SOLR-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527694 ] Walter Underwood commented on SOLR-127: --- Last-modified does require monotonic time, but ETags are version stamps without any ordering. The indexVersion should be fine for an ETag. Make Solr more friendly to external HTTP caches --- Key: SOLR-127 URL: https://issues.apache.org/jira/browse/SOLR-127 Project: Solr Issue Type: Wish Reporter: Hoss Man Attachments: HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch an offhand comment I saw recently reminded me of something that really bugged me about the search solution i used *before* Solr -- it didn't play nicely with HTTP caches that might be sitting in front of it. at the moment, Solr doesn't put in particularly useful info in the HTTP Response headers to aid in caching (ie: Last-Modified), responds to all HEAD requests with a 400, and doesn't do anything special with If-Modified-Since. At the very least, we can set a Last-Modified based on when the current IndexReader was opened (if not the Date on the IndexReader) and use the same info to determine how to respond to If-Modified-Since requests. (for the record, i think the reason this hasn't occurred to me in the 2+ years i've been using Solr, is because with the internal caching, i've yet to need to put a proxy cache in front of Solr) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
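The ETag point above can be shown in two lines: an ETag is an opaque validator, so any value that changes whenever the index changes will do, with no monotonic-clock requirement. A sketch with a hypothetical helper, not the actual SOLR-127 patch code:

```python
def etag_for(index_version: int) -> str:
    # Quoted opaque string per RFC 2616; hex of indexVersion is enough,
    # since only equality is compared, never ordering.
    return '"%x"' % index_version

assert etag_for(0x1A2B) == '"1a2b"'
```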
[jira] Commented: (SOLR-277) Character Entity of XHTML is not supported with XmlUpdateRequestHandler .
[ https://issues.apache.org/jira/browse/SOLR-277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508408 ] Walter Underwood commented on SOLR-277: --- This is not a bug. Solr accepts XML, not XHTML. It does not accept XHTML-only entities. The Solr update XML format is a specific Solr XML format, not XML, not DocBook, not anything else. To index XHTML, parse it and convert it to Solr XML update format. Character Entity of XHTML is not supported with XmlUpdateRequestHandler . - Key: SOLR-277 URL: https://issues.apache.org/jira/browse/SOLR-277 Project: Solr Issue Type: Improvement Components: update Affects Versions: 1.3 Reporter: Toru Matsuzawa Attachments: XmlUpdateRequestHandler.patch Character Entity of XHTML is not supported with XmlUpdateRequestHandler . http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent It is necessary to correspond with XmlUpdateRequestHandler because xpp3 cannot use !DOCTYPE. I think it is necessary until StaxUpdateRequestHandler becomes /update. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-216) Improvements to solr.py
[ https://issues.apache.org/jira/browse/SOLR-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499923 ] Walter Underwood commented on SOLR-216: --- GET is the right semantic for a query, since it doesn't change the resource. It also allows HTTP caching. If Solr has URL length limits, that's a bug. Improvements to solr.py --- Key: SOLR-216 URL: https://issues.apache.org/jira/browse/SOLR-216 Project: Solr Issue Type: Improvement Components: clients - python Affects Versions: 1.2 Reporter: Jason Cater Assignee: Mike Klaas Priority: Trivial Attachments: solr.py I've taken the original solr.py code and extended it to include higher-level functions. * Requires python 2.3+ * Supports SSL (https://) schema * Conforms (mostly) to PEP 8 -- the Python Style Guide * Provides a high-level results object with implicit data type conversion * Supports batching of update commands -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: svn commit: r541391 - in /lucene/solr/trunk: CHANGES.txt example/solr/conf/xslt/example_atom.xsl example/solr/conf/xslt/example_rss.xsl
On 5/25/07 10:45 AM, Chris Hostetter [EMAIL PROTECTED] wrote: : I'd slap versions to those 2 XSL files to immediately answer which : version of Atom|RSS does this produce? i'm comfortable calling the example_rss.xsl RSS, since most RSS readers will know what to do with it, but i don't know that i'm comfortable calling it any specific version of RSS, people are more likely to get irate about claiming to be a specific version if one little thing is wrong than they are about not claiming to be anything in particular. Some versions of RSS are quite incompatible, so we MUST say what version we are implementing. RSS 1.0 is completely different from the 0.9 series and 2.0. Atom doesn't have a version number, but RFC 4287 Atom is informally called 1.0. wunder
[jira] Commented: (SOLR-208) RSS feed XSL example
[ https://issues.apache.org/jira/browse/SOLR-208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12496624 ] Walter Underwood commented on SOLR-208: --- I wasn't in the RSS wars, either, but I was on the Atom working group. That was a bunch of volunteers making a clean, testable spec for RSS functionality (http://www.ietf.org/rfc/rfc4287). RSS 2.0 has some bad ambiguities, especially around ampersand and entities in titles. The default has changed over the years and clients do different, incompatible things. GData is just a way to do search result stuff that we would need anyway. It is standard set of URL parameters for query, start-index, and categories, and a few Atom extensions for total results, items per page, and next/previous. http://code.google.com/apis/gdata/reference.html RSS feed XSL example Key: SOLR-208 URL: https://issues.apache.org/jira/browse/SOLR-208 Project: Solr Issue Type: New Feature Components: clients - java Affects Versions: 1.2 Reporter: Brian Whitman Assigned To: Hoss Man Priority: Trivial Attachments: rss.xsl A quick .xsl file for transforming solr queries into RSS feeds. To get the date and time in properly you'll need an XSL 2.0 processor, as in http://wiki.apache.org/solr/XsltResponseWriter . Tested to work with the example solr distribution in the nightly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: dynamic copyFields
That syntax is from the ed editor. I learned it in 1975 on Unix v6/PWB, running on a PDP-11/70. --wunder On 5/2/07 5:04 PM, Mike Klaas [EMAIL PROTECTED] wrote: On 5/2/07, Ryan McKinley [EMAIL PROTECTED] wrote: How about Mike's other suggestion: copyField regexp=s/(.*)_s/\1_t/ / this would keep the glob style for source and dest, but use regex to transform a source -> dest Wow, I didn't even remember suggesting that. I agree (with Hoss) that backward compatibility is important, but I disagree (with myself) that the above syntax is nice. Outside of perl, I'm not sure how common the s/ / / syntax is (is it used in java?) perhaps copyField re_source=(.*)_s dest=\1_t/ ? -Mike
Re: Progressive Query Relaxation
From the name, I thought this was an adaptive precision scheme, where the engine automatically tries broader matching if there are no matches or just a few. We talked about doing that with Ultraseek, but it is pretty tricky. Deciding when to adjust it is harder than making it variable. Instead, this is an old idea that search amateurs seem to like. Show all exact matches, then near matches, etc. This is the kind of thing people suggest when they don't understand that a ranking algorithm combines that evidence in a much more powerful way. I talked customers out of this once or twice each year at Ultraseek. This approach fails for: * common words * misspellings Since both of those happen a lot, this idea fails for a lot of queries. I presume that Oracle implemented this to shut up some big customer, since it isn't a useful feature unless it closes a sale. DisMax gives you something somewhat similar to this, by selecting the best matching field. That is much more powerful and gives much better results. wunder On 4/9/07 12:46 AM, J. Delgado [EMAIL PROTECTED] wrote: Has anyone within the Lucene or Solr community attempted to code a progressive query relaxation technique similar to the one described here for Oracle Text? http://www.oracle.com/technology/products/text/htdocs/prog_relax.html Thanks, -- J.D.
Re: Progressive Query Relaxation
On 4/10/07 10:06 AM, J. Delgado [EMAIL PROTECTED] wrote: Progressive relaxation, at least as Oracle has defined it, is a flexible, developer defined series of queries that are efficiently executed in progression and in one trip to the engine, until minimum of hits required is satisfied. It is not a self adapting precision scheme nor it tries to guess what is the best match. Correct. Search engines are all about the best match. Why would you show anything else? This is an RDBMS flavored approach, not an approach that considers natural language text. Sets of matches, not a ranked list. It fails as soon as one of the sets gets too big, like when someone searches for laserjet at HP.com. That happens a lot. It assumes that all keywords are the same, something that Gerry Salton figured out was false thirty years ago. That is why we use tf.idf instead of sets of matches. I see a lot of design without any talk about what problem they are solving. What queries don't work? How do we make those better? Let's work from real logs and real data. Oracle's hack doesn't solve any problem I've see in real query logs. I'm doing e-commerce search, and our current engine does pretty much what Oracle is offering. The results are not good, and we are replacing it with Solr and DisMax. My off-line relevance testing shows a big improvement. wunder -- Search Guru, Netflix
Re: Progressive Query Relaxation
On 4/10/07 10:38 AM, J. Delgado [EMAIL PROTECTED] wrote: I think you have something personal against Oracle... Hey I have no interest in defending Oracle, but this is no hack. It's true, I don't have much respect for Oracle's text search. When I was working on enterprise search, we never really worried about them because their quality and speed just wasn't competitive. I do not look to them as a reliable source of good ideas for search. Oracle's problem statement has a plausible strawman, but there are lots of better ways to deal with misspellings. Heck, my dev instance of Solr gives Michael Crichton as the first hit for Michel Crichton. It is not true that hits which are a poor match will be mixed in with hits which are a good match. Hmmm, Crichton is much more likely to be misspelled than Michael, so maybe their strawman isn't very good. wunder
[jira] Created: (SOLR-161) Dangling dash causes stack trace
Dangling dash causes stack trace Key: SOLR-161 URL: https://issues.apache.org/jira/browse/SOLR-161 Project: Solr Issue Type: Bug Components: search Affects Versions: 1.1.0 Environment: Java 1.5, Tomcat 5.5.17, Fedora Core 4, Intel Reporter: Walter Underwood I'm running tests from our search logs, and we have a query that ends in a dash. That caused a stack trace. org.apache.lucene.queryParser.ParseException: Cannot parse 'digging for the truth -': Encountered EOF at line 1, column 23. Was expecting one of: ( ... QUOTED ... TERM ... PREFIXTERM ... WILDTERM ... [ ... { ... NUMBER ... at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:127) at org.apache.solr.request.DisMaxRequestHandler.handleRequest(DisMaxRequestHandler.java:272) at org.apache.solr.core.SolrCore.execute(SolrCore.java:595) at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:92) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-161) Dangling dash causes stack trace
[ https://issues.apache.org/jira/browse/SOLR-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473625 ] Walter Underwood commented on SOLR-161: --- The parser can have a rule for this rather than exploding. A trailing dash is never meaningful and can be omitted, whether we're allowing +/- or not. Seems like a grammar bug to me. --wunder Dangling dash causes stack trace Key: SOLR-161 URL: https://issues.apache.org/jira/browse/SOLR-161 Project: Solr Issue Type: Bug Components: search Affects Versions: 1.1.0 Environment: Java 1.5, Tomcat 5.5.17, Fedora Core 4, Intel Reporter: Walter Underwood I'm running tests from our search logs, and we have a query that ends in a dash. That caused a stack trace. org.apache.lucene.queryParser.ParseException: Cannot parse 'digging for the truth -': Encountered EOF at line 1, column 23. Was expecting one of: ( ... QUOTED ... TERM ... PREFIXTERM ... WILDTERM ... [ ... { ... NUMBER ... at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:127) at org.apache.solr.request.DisMaxRequestHandler.handleRequest(DisMaxRequestHandler.java:272) at org.apache.solr.core.SolrCore.execute(SolrCore.java:595) at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:92) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-161) Dangling dash causes stack trace
[ https://issues.apache.org/jira/browse/SOLR-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473628 ] Walter Underwood commented on SOLR-161: --- It is really a Lucene query parser bug, but it wouldn't hurt to do s/ *-$// as a workaround. Assuming my ed(1) syntax is still fresh. Regardless, no query string should ever give a stack trace. --wunder Dangling dash causes stack trace Key: SOLR-161 URL: https://issues.apache.org/jira/browse/SOLR-161 Project: Solr Issue Type: Bug Components: search Affects Versions: 1.1.0 Environment: Java 1.5, Tomcat 5.5.17, Fedora Core 4, Intel Reporter: Walter Underwood I'm running tests from our search logs, and we have a query that ends in a dash. That caused a stack trace. org.apache.lucene.queryParser.ParseException: Cannot parse 'digging for the truth -': Encountered EOF at line 1, column 23. Was expecting one of: ( ... QUOTED ... TERM ... PREFIXTERM ... WILDTERM ... [ ... { ... NUMBER ... at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:127) at org.apache.solr.request.DisMaxRequestHandler.handleRequest(DisMaxRequestHandler.java:272) at org.apache.solr.core.SolrCore.execute(SolrCore.java:595) at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:92) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
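The workaround Walter sketches in ed syntax could look like this in Solr's own language. The class and method names here are illustrative, not anything in the Solr codebase: strip a dangling trailing dash (and trailing whitespace) before the string reaches the Lucene query parser.

```java
// Hedged sketch of the suggested workaround: remove a trailing dash so
// the query parser never sees "digging for the truth -".
class QueryCleaner {
    static String stripTrailingDash(String q) {
        // drop any run of whitespace and dashes at the end of the query
        return q.replaceAll("[\\s-]+$", "");
    }
}
```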
Re: resin and UTF-8 in URLs
Let's not make this complicated for situations that we've never seen in practice. Java is a Unicode language and always has been. Anyone running a Java system with a Shift-JIS default should already know the pitfalls, and know them much better than us (and I know a lot about Shift-JIS). The URI spec says UTF-8, so we can be compliant and tell people to fix their code. If they need to add specific hacks for their broken software, that is OK. We don't need generic design features for a few broken clients. RFC 3986 has been out for two years now. That is long enough for decently-maintained software to get it right. wunder On 2/1/07 2:14 PM, Chris Hostetter [EMAIL PROTECTED] wrote: : If we can do something small that makes the most normal cases work : even if the container is not configured, that seems good. but how do we know the user wants what we consider the normal cases to work? ... if every servlet container lets you configure your default charset differently, we have no easy way to tell if/when they've configured the default properly, to know if we should override it. If someone does everything in Shift-JIS, and sets up their servlet container with Shift-JIS as their default, and installs solr -- i don't want them to think Solr sucks because there is a default in Solr they don't know about (or know how to disable) that assumes UTF-8. On the other hand: if someone really hasn't thought about charsets at all, then it doesn't seem that bad to use whatever default their servlet container says to use -- as I understand it some containers (tomcat included) pick their default based on the JVM's configuration (i assume from the user.language sysproperty) ... that certainly seems like a better default than for us to assume UTF-8 -- even if it is latin1 for en, because most novice users are probably okay with latin1 ...
if you're starting to worry about more complex characters that aren't in the default charset your servlet container picks for you, then reading a little documentation is a good idea. : At the very least, we should change the examples in: : http://wiki.apache.org/solr/SolrResin etc oh absolutely. -Hoss
Re: resin and UTF-8 in URLs
On 2/1/07 3:18 PM, Chris Hostetter [EMAIL PROTECTED] wrote: As for XML, or any other format a user might POST to solr (or ask solr to fetch from a remote source) what possible reason would we have to support only UTF-8? .. why do you suggest that the XML standard specify UTF-8, [so] we should use UTF-8 ... doesn't the XML standard say we should use the charset specified in the content-type if there is one, and if not use the encoding specified in the xml header, ie... <?xml encoding='EUC-JP'?> The XML spec says that XML parsers are only required to support UTF-8, UTF-16, ISO 8859-1, and US-ASCII. If you use a different encoding for XML, there is no guarantee that a conforming parser will accept it. Ultraseek has been indexing XML for the past nine years, and I remember a single customer that had XML in a non-standard encoding. Effectively all real-world XML is in one of the standard encodings. The right spec for XML over HTTP is RFC 3023. For text/xml with no charset spec, the XML must be interpreted as US-ASCII. From section 8.5: Omitting the charset parameter is NOT RECOMMENDED for text/xml. For example, even if the contents of the XML MIME entity are UTF-16 or UTF-8, or the XML MIME entity has an explicit encoding declaration, XML and MIME processors MUST assume the charset is us-ascii. wunder
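The RFC 3023 rules Walter quotes can be condensed into a small decision function. This is a simplified sketch (it assumes the Content-Type header has already been parsed, and ignores BOM sniffing); the names are illustrative:

```java
// Simplified sketch of RFC 3023's charset rules for XML over HTTP:
// an explicit charset parameter always wins; text/xml without one is
// US-ASCII (section 8.5); application/xml may use the XML declaration.
class XmlCharset {
    static String effectiveCharset(String mediaType, String charsetParam,
                                   String xmlDeclEncoding) {
        if (charsetParam != null) return charsetParam;
        if (mediaType.equals("text/xml")) return "US-ASCII";
        return xmlDeclEncoding != null ? xmlDeclEncoding : "UTF-8";
    }
}
```

This is why post.sh (discussed later in this thread) should send application/xml rather than text/xml.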
Re: loading many documents by ID
On 1/31/07 3:39 PM, Chris Hostetter [EMAIL PROTECTED] wrote: : Oh, and there have been numerous people interested in updateable : documents, so it would be nice if that part was in the update handler. We'd have to make it very clear that this only works if all fields are STORED. Isn't there some way to do this automatically instead of relying on documentation? We might need to add something, maybe a required attribute on fields, but a runtime error would be much, much better than a page on the wiki. wunder
Re: loading many documents by ID
On 1/31/07 9:05 PM, Ryan McKinley [EMAIL PROTECTED] wrote: We'd have to make it very clear that this only works if all fields are STORED. Isn't there some way to do this automatically instead of relying on documentation? We might need to add something, maybe a required attribute on fields, but a runtime error would be much, much better than a page on the wiki. what about copyField? With copyField, it is reasonable to have fields that are not stored and are generated from the other stored fields. (this is what my setup looks like). Mine, too. That is why I suggested explicit declarations in the schema to say which fields are required. wunder
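The runtime check Walter is asking for might look something like this. The schema API here is hypothetical, purely to show the shape of the error: fail loudly at update time rather than silently losing field values.

```java
import java.util.Map;
import java.util.Set;

// Sketch of the proposed runtime check (hypothetical schema API): before
// an in-place document update, verify every field the document depends
// on is stored, since unstored fields cannot be re-read for re-indexing.
class UpdateGuard {
    static void assertUpdatable(Map<String, Boolean> fieldIsStored,
                                Set<String> requiredFields) {
        for (String f : requiredFields) {
            if (!Boolean.TRUE.equals(fieldIsStored.get(f))) {
                throw new IllegalStateException(
                    "field '" + f + "' is not stored; updating this "
                    + "document would silently lose its value");
            }
        }
    }
}
```

Fields populated only via copyField would simply be left out of requiredFields, matching the setup both Ryan and Walter describe.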
Re: Can this be achieved? (Was: document support for file system crawling)
On 1/19/07 10:33 AM, Chris Hostetter [EMAIL PROTECTED] wrote: [...] but if your interest is in having an enterprise search solution that people can deploy on a box and have it start working for them, then there is no reason for all of that code to run in a single JVM using a single code base -- i'm going to go out on a limb and guess that the Google Appliances run more than a single process :) Ultraseek does exactly that and is a single multi-threaded process. A single process is much easier for the admin. A multi-process solution is more complicated to start up, monitor, shut down, and upgrade. There is decent demand for a spidering enterprise search engine. Look at the Google Appliance, Ultraseek, and IBM OmniFind. The free IBM OmniFind Yahoo! Edition uses Lucene. I'd love to see the Ultraseek spider connected to Solr, but that depends on Autonomy. wunder -- Walter Underwood Search Guru, Netflix
Re: Java version for solr development (was Re: Update Plugins)
On 1/16/07 8:03 PM, Yonik Seeley [EMAIL PROTECTED] wrote: I think it's a bit soon to move to 1.6 - I don't know how many platforms it's available for yet. It is still in early release from IBM for their PowerPC servers, so requiring 1.6 would be a serious problem for us. wunder -- Walter Underwood Search Guru, Netflix
Re: [jira] Commented: (SOLR-85) [PATCH] Add update form to the admin screen
On 12/18/06 7:52 AM, Thorsten Scherler [EMAIL PROTECTED] wrote: On Fri, 2006-12-15 at 11:16 -0800, Chris Hostetter wrote: : The next thing on my list is to write a small cli based on httpclient to : send the update docs as alternative of the post.sh. You may want to take a look at SOLR-20 and SOLR-30 ... those issues are first stabs at Java Client APIs for query/update which if cleaned up a bit could become the basis for your CLI. Hmm, I had a look at them but actually what I came up with is way smaller and more focused on the update part. https://issues.apache.org/jira/browse/SOLR-86 It is a replacement for post.sh, not much more (yet). I'll take a look at this. I also wrote my own, because I had no idea that the Java client code existed. wunder -- Walter Underwood Search Guru, Netflix
Heavily-populated bit sets
As an aside to SOLR-80, there is a standard trick for compressing a bit set with more than half the bits set. You invert it, make it less than half full, then store that. Basically, store the zeroes instead of the ones. It costs one extra bit to say whether it is inverted or not. wunder -- Walter Underwood Search Guru, Netflix
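A minimal sketch of the trick, using java.util.BitSet. This is illustrative code, not the SOLR-80 patch: when more than half the bits are set, store the complement plus a one-bit "inverted" flag, so the stored set is always at most half full and compresses well.

```java
import java.util.BitSet;

// Store the sparser of a bit set and its complement, plus one flag bit
// recording whether the stored bits are the zeroes or the ones.
class InvertibleBitSet {
    final BitSet stored;     // the sparser of the set and its complement
    final boolean inverted;  // the one extra bit
    final int size;          // logical number of bits

    InvertibleBitSet(BitSet bits, int size) {
        this.size = size;
        if (bits.cardinality() > size / 2) {
            BitSet flipped = (BitSet) bits.clone();
            flipped.flip(0, size);          // store the zeroes instead
            stored = flipped;
            inverted = true;
        } else {
            stored = (BitSet) bits.clone();
            inverted = false;
        }
    }

    boolean get(int i) {
        // XOR: flip the stored bit back when the set is inverted
        return inverted != stored.get(i);
    }
}
```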
[jira] Commented: (SOLR-73) schema.xml and solrconfig.xml use CNET-internal class names
[ http://issues.apache.org/jira/browse/SOLR-73?page=comments#action_12454159 ] Walter Underwood commented on SOLR-73: -- I think the aliases are harder to read. You need to go elsewhere to figure them out. I read documentation, but I didn't find the part of the wiki that explained them and I had to ask the mailing list. The javadoc uses the full class name. Google and Yahoo searches should work better with the full class name (Yahoo is working much better than Google for that right now). The aliases save typing, but I don't think they improve usability. Full class names are simple and unambiguous. If we want usability for non-programmers, we can't have them editing an XML file. schema.xml and solrconfig.xml use CNET-internal class names --- Key: SOLR-73 URL: http://issues.apache.org/jira/browse/SOLR-73 Project: Solr Issue Type: Bug Components: search Reporter: Walter Underwood The configuration files in the example directory still use the old CNET-internal class names, like solr.LRUCache instead of org.apache.solr.search.LRUCache. This is confusing to new users and should be fixed before the first release. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (SOLR-73) schema.xml and solrconfig.xml use CNET-internal class names
[ http://issues.apache.org/jira/browse/SOLR-73?page=comments#action_12454190 ] Walter Underwood commented on SOLR-73: -- The context required to resolve the ambiguity is a wiki page that I didn't know existed. Since I didn't know about it, I tried to figure it out by reading the code, and then by sending e-mail to the list. In my case, I was writing two tiny classes, but the issue would be the same if I was a non-programmer adding some simple plug-ins. With a full class name, there is no ambiguity. Again, this saves typing at the cost of requiring an indirection through some unspecified documentation. I saw every customer support e-mail for eight years with Ultraseek, so I'm pretty familiar with the problems that search engine admins run into. One of the things we learned was that documentation doesn't fix an unclear product. You fix the product instead of documenting how to understand it. Requiring users to edit an XML file is a separate issue, but I think it is a serious problem, especially because any error messages show up in the server logs. schema.xml and solrconfig.xml use CNET-internal class names --- Key: SOLR-73 URL: http://issues.apache.org/jira/browse/SOLR-73 Project: Solr Issue Type: Bug Components: search Reporter: Walter Underwood The configuration files in the example directory still use the old CNET-internal class names, like solr.LRUCache instead of org.apache.solr.search.LRUCache. This is confusing to new users and should be fixed before the first release. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Cocoon-2.1.9 vs. SOLR-20 SOLR-30
On 11/20/06 5:51 PM, Yonik Seeley [EMAIL PROTECTED] wrote: : If you really want to handle failure in an error response, write that : to a string and if that fails, send a hard-coded string. Hmmm... i could definitely get on board an idea like that. I took pains to make things streamable.. I'd hate to discard that. How do other servers handle streaming back a response and hitting an error? You found the design tradeoff! We can stream the results or we can give reliable error codes for errors that happen during result processing. We can't do both. Ultraseek does streaming, but we were generating HTML, so we could print reasonable errors in-line. Streaming is very useful for HTML pages, because it allows the first pixels to be painted as soon as possible. It isn't as important on the back end, unless someone has gone to the considerable trouble of making their entire front-end able to stream the back-end results to HTML. If we aren't calling Writer.flush occasionally, then the streaming is just filling up a buffer smoothly. The client won't see anything until TCP decides to send it. Does Lucene fetch information from disk while we iterate through the search results? If that happens a few times, then streaming might make a difference. If it is mostly CPU-bound, then streaming probably doesn't help. wunder -- Walter Underwood Search Guru, Netflix
Re: Cocoon-2.1.9 vs. SOLR-20 SOLR-30
On 11/20/06 7:22 PM, Fuad Efendi [EMAIL PROTECTED] wrote: This is just a sample... 1. What is an Error? 2. What is a Mistake? 3. What is an application bug? 4. What is a 'system crash'? These are not HTTP concepts. The request on a URI can succeed or fail or result in other codes. Mistakes and crashes are outside of the HTTP protocol. Of course, an XML-over-HTTP engine is not the same as HTML-over-HTTP... However... Walter noticed 'crawling'... I can't imagine a company which will put SOLR as a front-end accessible to crawlers... (To crawl an indexing service instead of source documents!?) XML-over-HTTP is exactly the same as HTML-over-HTTP. In HTML, we could return detailed error information in a meta tag. No difference. If something is on HTTP, a good crawler can find it. All it takes is one link, probably to the admin URL. Once found, that crawler will happily pound on errors returned by 200. XSLT support means you could build the search UI natively on Solr, so that might happen. Even without a crawler, we must work with caches and load balancers. I will be using Solr with a load balancer in production. If Solr is a broken HTTP server, we will have to build something else. I am sure that mixing an XML-based interface with HTTP status codes is not an attractive 'architecture', we should separate concerns and leave HTTP code handling to a servlet container as much as possible... We don't need to use HTTP response codes deep in Solr, but we do need to separate bad parameters, retryable errors, non-retryable errors, and so on. We can call them whatever we want internally, but we need to report them properly over HTTP. wunder -- Walter Underwood Search Guru, Netflix
Re: Cocoon-2.1.9 vs. SOLR-20 SOLR-30
On 11/20/06 5:51 PM, Yonik Seeley [EMAIL PROTECTED] wrote: Now that I think about it though, one nice change would be to get rid of the long stack trace for 400 exceptions... it's not needed, right? That is correct. A client error (400) should not be reported with a server stack trace. --wunder
Phonetic Token Filter
I've written a simple phonetic token filter (and factory) based on the Double Metaphone implementation in Jakarta Commons Codec to contribute. Three questions: 1. Does this sound like a generally useful addition? 2. Should we have a Jira issue first? 3. This adds a dependency on the codecs jar. How do we add that to the distro? The code is very simple, but I need to learn the contribution process and build some tests, so this won't happen in one day. wunder -- Walter Underwood Search Guru, Netflix
Re: Cocoon-2.1.9 vs. SOLR-20 SOLR-30
One way to think about this is to assume caches, proxies, and load balancing in the HTTP path, then think about their behavior. A 500 response may make the load balancer drop this server from the pool, for example. A 200 OK can be cached, so temporary errors shouldn't be sent with that code. On 11/20/06 10:51 AM, Chris Hostetter [EMAIL PROTECTED] wrote: ...there's kind of a chicken/egg problem with this discussion ... the egg being what should the HTTP response look like in an 'error' situation the chicken being what is the internal API to allow a RequestHandler to denote an 'error' situation ... talking about specific cases only gets us so far since those cases may not be errors in all RequestHandlers. We can get most of the benefit with a few kinds of errors: 400, 403, 404, 500, and 503. Roughly: 400 - error in the request, fix it and try again 403 - forbidden, don't try again 404 - not found, don't try again unless you think it is there now 500 - server error, don't try again 503 - server error, try again These can be mapped from internal error types. the problem gets even more complicated when you try to answer the question: what should Solr do if an OutputWriter encounters an error? ... we can't generate a valid JSON response denoting an error if the JSONOutputWriter is failing :) Write the response to a string before sending the headers. This can be slower than writing the response out as it is computed, but the response codes can be accurate. Also, it allows optimal buffering, so it might scale better. If you really want to handle failure in an error response, write that to a string and if that fails, send a hard-coded string. wunder -- Walter Underwood Search Guru, Netflix
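A hedged sketch of the "write the response to a string before sending the headers" approach, with a hard-coded fallback body as Walter describes. The Renderer interface here is a stand-in for Solr's actual output writers, not its real API:

```java
import java.io.StringWriter;

// Render the whole body into memory first, so a failure partway through
// rendering can still be reported with an accurate status code instead
// of a half-written 200.
interface Renderer {
    void render(StringWriter out) throws Exception;
}

class BufferedResponder {
    // returns {statusCode, body}
    static String[] respond(Renderer r) {
        StringWriter buf = new StringWriter();
        try {
            r.render(buf);
            return new String[] { "200", buf.toString() };
        } catch (Exception e) {
            // discard the partial buffer; fall back to a hard-coded body
            return new String[] { "500", "internal server error" };
        }
    }
}
```

The cost is the buffering Walter mentions; the benefit is that caches and load balancers in the path see an honest status code.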
Re: Cocoon-2.1.9 vs. SOLR-20 SOLR-30
On 11/17/06 2:50 PM, Fuad Efendi [EMAIL PROTECTED] wrote: We should probably separate business-related end-user errors (such as when user submits empty query) and make it XML-like (instead of HTTP 400) Speaking as a former web spider maintainer, it is very important to keep the HTTP response codes accurate. Never return an error with a 200. If we want more info, return an entity (body) with the 400 response. wunder -- Walter Underwood Search Guru, Netflix
Re: Adding Phonetic Search to Solr
On 11/8/06 10:30 AM, Chris Hostetter [EMAIL PROTECTED] wrote: : Also, the phonetic matches are ranked a bit high, so I'm trying a : sub-1.0 boost. I was expecting the lower idf to fix that automatically. : The metaphone will almost always have a lower idf because multiple : words are mapped to one metaphone, so the encoded term occurs in more : documents than the surface terms. That all makes sense, and yet it's not what you are observing ... which leads me to believe you (and I since i want to agree with you) are missing something subtle what does the Explanation look like for two documents where you feel like one should score higher than the other but they don't? That is my next step. Maybe create some test documents in my corpus and spend some quality time with Explain and grokking DisMax. I need to customize Similarity anyway. wunder -- Walter Underwood Search Guru, Netflix
Re: [jira] Commented: (SOLR-66) bulk data loader
On 11/7/06 11:22 AM, Yonik Seeley (JIRA) [EMAIL PROTECTED] wrote: Yes, posting queries work because it's all form-data (query args). But, what if we want to post a complete file, *and* some extra info/parameters about how that file should be handled? One approach is the Atom Publishing Protocol. That is pretty clear about content and metainformation. It isn't designed to solve every problem, but it handles a broad range of publishing, so it could be a good fit for many uses of Solr. APP is nearly finished. The latest draft is here (second URL also has HTML versions). http://www.ietf.org/internet-drafts/draft-ietf-atompub-protocol-11.txt http://tools.ietf.org/wg/atompub/draft-ietf-atompub-protocol/ wunder -- Walter Underwood Search Guru, Netflix
Re: Adding Phonetic Search to Solr
On 11/7/06 2:30 PM, Mike Klaas [EMAIL PROTECTED] wrote: On 11/7/06, Walter Underwood [EMAIL PROTECTED] wrote: 1. Adding fuzzy to the DisMax specs. What do you envisage the implementation looking like? Probably continue with the template-like patterns already there. title^2.0 (search title field with boost of 2.0) title~ (search title field with fuzzy matching) 2. Adding a phonetic token filter and relying on the per-field analyzer support. I'm not sure why any modification to solr would be necessary. You could add a field with a phonetic analyzer and use copyField to copy your search fields to it. Search will use the modified analyzer automatically. Ah, I missed the analyzer example with a stock Lucene analyzer. Oops. I still need to write an Analyzer, because there is no standard phonetic search in Lucene today. There are some patches and addons floating around. Still, it seems like others might want to use a phonetic token filter with the filter specs. I'd be glad to contribute that, if others think it would be useful. wunder -- Walter Underwood Search Guru, Netflix
Re: Adding Phonetic Search to Solr
On 11/7/06 3:26 PM, Mike Klaas [EMAIL PROTECTED] wrote: Is the state of the art in phonetic token generation reasonable? I've been rather disappointed with some implementations (eg. SOUNDEX in MySQL, MSSQL). SOUNDEX is excellent technology for its time, but its time was 1920. Double Metaphone is far more complex and works fairly well. There is an Apache commons codec implementation available. It is certainly good enough for matching proper names, like Moody and Mudie or Cathy and Kathie. There are some commercial phonetic coders, but I don't have any experience with those. wunder -- Walter Underwood Search Guru, Netflix
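For comparison, here is a minimal implementation of the 1920s-era Soundex algorithm Walter mentions (a simple variant, assuming a non-empty ASCII name). Double Metaphone is considerably more involved; a maintained implementation ships in Apache Commons Codec as org.apache.commons.codec.language.DoubleMetaphone. Even simple Soundex conflates Moody and Mudie:

```java
// Minimal Soundex coder: keep the first letter, map remaining consonants
// to digit classes, drop vowels and h/w/y, collapse adjacent duplicates,
// pad to four characters.
class Soundex {
    // digit class for each letter a..z ('0' means the letter is dropped)
    private static final String CODES = "01230120022455012623010202";

    static String encode(String name) {
        String s = name.toUpperCase();
        StringBuilder out = new StringBuilder();
        out.append(s.charAt(0));
        char last = code(s.charAt(0));
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char c = code(s.charAt(i));
            if (c != '0' && c != last) out.append(c);
            last = c;
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }

    private static char code(char c) {
        return (c < 'A' || c > 'Z') ? '0' : CODES.charAt(c - 'A');
    }
}
```

Note that Soundex keeps the first letter verbatim, so it misses Cathy/Kathie (C300 vs K300), one of the pairs Walter cites Double Metaphone as handling.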
Re: [jira] Created: (SOLR-60) Remove overwritePending, overwriteCommitted flags?
+1 as well. --wunder On 11/1/06 11:17 AM, Mike Klaas [EMAIL PROTECTED] wrote: +1 On 11/1/06, Yonik Seeley (JIRA) [EMAIL PROTECTED] wrote: Remove overwritePending, overwriteCommitted flags? -- Key: SOLR-60 URL: http://issues.apache.org/jira/browse/SOLR-60 Project: Solr Issue Type: Improvement Components: update Reporter: Yonik Seeley Priority: Minor The overwritePending, overwriteCommitted, allowDups flags seem needlessly complex and don't add much value. Do people need/use separate control over pending vs committed documents? Perhaps all most people need is overwrite=true/false? overwritePending, overwriteCommitted were originally added because it was a (mis)feature that another internal search tool had. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Copying the request parameters to Solr's response
Returning the query parameters is really useful. I'm not sure it needs to be optional, they are small and options multiply the test cases. It can even be useful to return the values of the defaults. All those go into the key for any client-side caching, for example. wunder On 10/24/06 1:55 AM, Erik Hatcher [EMAIL PROTECTED] wrote: I think it's a good idea, but it probably should be made optional. Clients can keep track of the state themselves, and keeping the response size as small as possible is valuable. But it would be helpful in some situations for the client to get the original query context sent back too. Erik On Oct 24, 2006, at 4:20 AM, Bertrand Delacretaz wrote: Hi, I need to implement paging of Solr result sets, and (unless I have overlooked something that already exists) it would be useful to copy the request parameters to the output. I'm thinking of adding something like this to the XML output: <responseHeader> <lst name="queryParameters"> <str name="q">author:Leonardo</str> <str name="start">24</str> <str name="rows">12</str> etc... I don't think the SolrParams class provides an Iterator to retrieve all parameters, I'll add one to implement this. WDYT? -Bertrand
Re: Solr NightlyBuild
I agree that a release would be useful for marketing, but I also think it would help exercise the community and the release process. I just discovered Solr on Friday and I've been telling people about it, but every e-mail includes you need to be OK with nightly builds. Being OK with nightly builds means that you need to run your own QA on the whole build every time you change. Kinda expensive. wunder -- Walter Underwood Search Guru, Netflix
Re: double curl calls in post.sh?
Also, do not use text/xml. Even with a charset parameter. In a correct implementation, that will override the XML declaration of charset. With text/xml, the charset parameter must be correct. When it is omitted, the content MUST be interpreted as US-ASCII (yuk). Instead, use a media type of application/xml, so that the server is allowed to sniff the content to discover the character encoding. For the gory details, see RFC 3023: http://www.ietf.org/rfc/rfc3023.txt wunder == Walter Underwood Search Guru, Netflix On 9/17/06 1:00 PM, Chris Hostetter [EMAIL PROTECTED] wrote: am i smoking crack or is post.sh mistakenly sending every doc twice in a row? ... for f in $FILES; do echo Posting file $f to $URL curl $URL --data-binary @$f curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8' echo done ...is there any reason not to delete that first execution of curl? -Hoss