Re: GIT does not support empty directories

2010-04-16 Thread Walter Underwood
This directory intentionally left empty. --wunder

On Apr 16, 2010, at 12:33 PM, Ted Dunning wrote:

 Put a readme file in the directory and be done with it.
 
 On Fri, Apr 16, 2010 at 8:40 AM, Robert Muir rcm...@gmail.com wrote:
 
 I don't like the idea of complicating lucene/solr's build system any more
 than it already is, unless it's absolutely necessary. It's already too
 complicated.
 
 Instead of adding more hacks, what is actually broken (git) is what should
 be fixed, as the link states:
 
 Currently the design of the git index (staging area) only permits *files*
 to
 be listed, and nobody competent enough to make the change to allow empty
 directories has cared enough about this situation to remedy it.
 
 On Fri, Apr 16, 2010 at 11:14 AM, Smiley, David W. dsmi...@mitre.org
 wrote:
 
 Seriously.
 I sympathize with your point that git should support empty directories.
 But as a practical matter, it's easy to make the ant build tolerant of
 them.
 
 ~ David Smiley
 
 From: Robert Muir [rcm...@gmail.com]
 Sent: Friday, April 16, 2010 6:53 AM
 To: solr-dev@lucene.apache.org
 Subject: Re: GIT does not support empty directories
 
 Seriously? We should hack our ant files around the bugs in every crappy
 source control system that comes out?
 
 Fix Git.
 
 On Thu, Apr 15, 2010 at 10:55 PM, Smiley, David W. dsmi...@mitre.org
 wrote:
 
 I've run into this too.  I don't think this needs to be documented, I
 think
 it needs to be *fixed* -- that is, the relevant ant tasks need to not
 assume
 these directories exist and create them if not.
 
 ~ David Smiley
 
 -Original Message-
 From: Lance Norskog [mailto:goks...@gmail.com]
 Sent: Wednesday, April 14, 2010 11:14 PM
 To: solr-dev
 Subject: GIT does not support empty directories
 
 There are some empty directories in the Solr source tree, both in 1.4
 and the trunk.
 
 example/work
 example/webapp
 example/logs
 
 Git does not support empty directories:
 
 
 
 https://git.wiki.kernel.org/index.php/GitFaq#Can_I_add_empty_directories.3F
 
 And so, when you check out from the Apache GIT repository, these empty
 directories do not appear and 'ant example' and 'ant run-example'
 fail. There is no 'how to use the solr git stuff' wiki page; that
 seems like the right place to document this. I'm not git-smart enough
 to write that page.
 --
 Lance Norskog
 goks...@gmail.com
 
 
 
 
 --
 Robert Muir
 rcm...@gmail.com
 
 
 
 
 --
 Robert Muir
 rcm...@gmail.com
 

--
Walter Underwood
Venture ASM, Troop 14, Palo Alto





[jira] Commented: (SOLR-534) Return all query results with parameter rows=-1

2010-02-10 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832351#action_12832351
 ] 

Walter Underwood commented on SOLR-534:
---

-1

This adds a denial of service vulnerability to Solr. One query can use lots of 
CPU or memory, or even crash the server.

This could also take out an entire distributed system.

If this is added, we MUST add a config option to disable it.

Let's take this back to the mailing list and find out why they believe all 
results are needed. There must be a better way to solve this.
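
If it does go in, the disable switch could be as small as a guard like this
(class and option names made up, just to illustrate; not a patch):

    // Sketch only, made-up names -- the idea is that rows=-1 stays rejected
    // unless an operator explicitly enables it in the config.
    public class RowsGuard {
        private final boolean allowUnlimitedRows; // from solrconfig, default false

        public RowsGuard(boolean allowUnlimitedRows) {
            this.allowUnlimitedRows = allowUnlimitedRows;
        }

        int effectiveRows(int requestedRows, int maxDocs) {
            if (requestedRows >= 0) return requestedRows;
            if (!allowUnlimitedRows) {
                throw new IllegalArgumentException("rows=-1 is disabled on this server");
            }
            return maxDocs; // "all results"
        }
    }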

 Return all query results with parameter rows=-1
 ---

 Key: SOLR-534
 URL: https://issues.apache.org/jira/browse/SOLR-534
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
 Environment: Tomcat 5.5
Reporter: Lars Kotthoff
Priority: Minor
 Attachments: solr-all-results.patch


 The searcher should return all results matching a query when the parameter 
 rows=-1 is given.
 I know that it is a bad idea to do this in general, but as it explicitly 
 requires a special parameter, people using this feature will be aware of what 
 they are doing. The main use case for this feature is probably debugging, but 
 in some cases one might actually need to retrieve all results because they 
 e.g. are to be merged with results from different sources.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Namespaces in response (SOLR-1586)

2009-12-09 Thread Walter Underwood
On Dec 9, 2009, at 11:11 AM, Mattmann, Chris A (388J) wrote:

 
 Any parser that does that is so broken that you should stop using it
 immediately. --wunder
 
 Walter, totally agree here.

To elaborate my position:

1. Validation is a user option. The XML spec makes that very clear. We've had 
10 years to get that right, and anyone who auto-validates is not paying 
attention. Validation is very useful when you are creating XML, rarely useful 
when reading it.

2. XML namespaces are string prefixes that use the URL syntax. They do not 
follow URI rules for anything but syntax and there is no guarantee that they 
can be resolved. In fact, an XML parser can't do anything standard with the 
result if they do resolve. Again, we've had 10 years to figure that out.

Yes, this can be confusing, but if a parser author can't figure it out, don't 
use their parser because they are already getting the simple stuff wrong.

wunder






Re: Functions, floats and doubles

2009-11-13 Thread Walter Underwood
Float is almost never good enough. The loss of precision is horrific.

wunder

On Nov 13, 2009, at 9:58 AM, Yonik Seeley wrote:

 On Fri, Nov 13, 2009 at 12:52 PM, Grant Ingersoll gsing...@apache.org wrote:
 Implementing my first function (distance stuff) and noticed that functions 
 seem to have a float bent to them.  Not even sure what would be involved, 
 but there are cases for distance that I could see wanting double precision.  
 Thoughts?
 
 
 It's an issue in general.
 
 But for something like gdist(point_a,point_b), the internal
 calculations can be done in double precision and if the result is cast
 to a float at the end, it should be good enough for most uses, right?
 
 -Yonik
 http://www.lucidimagination.com
 



Re: Functions, floats and doubles

2009-11-13 Thread Walter Underwood
Float is often OK until you try to use it for further calculation. Maybe it is 
good enough for printing out distance, but maybe not for further use.
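
Quick way to see the drift once float results feed further arithmetic
(plain JDK, nothing Solr-specific):

    // Accumulating float values is where the precision loss shows up.
    public class FloatDrift {
        public static void main(String[] args) {
            float fsum = 0f;
            double dsum = 0d;
            for (int i = 0; i < 1000000; i++) {
                fsum += 0.1f;
                dsum += 0.1d;
            }
            System.out.println("float sum:  " + fsum);  // noticeably off from 100000
            System.out.println("double sum: " + dsum);  // ~100000.000001
        }
    }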

wunder

On Nov 13, 2009, at 10:32 AM, Yonik Seeley wrote:

 On Fri, Nov 13, 2009 at 1:01 PM, Walter Underwood wun...@wunderwood.org 
 wrote:
 Float is almost never good enough. The loss of precision is horrific.
 
 Are you saying it's not good enough for this case (the final answer of
 a relative distance calculation)?
 7 digits of precision is enough to represent a distance across the US
 down to the meter... and points closer together would have higher
 precision of course.
 
 For storage of the points themselves, 32 bit floats may also often be
 enough (~2.4 meter resolution at the equator).  Allowing doubles as an
 option would be nice too - but I expect that doubling the fieldcache
 may not be worth it for many.
 Actually, a 32 bit fixed point representation would have a lot more
 accuracy for this (256 times the resolution at the cost of on-the-fly
 conversion to a double for calculations).
 
 -Yonik
 http://www.lucidimagination.com
 



Re: Another RC

2009-10-19 Thread Walter Underwood
Please wait for an official release of Lucene. It makes things SO much  
easier when you need to dig into the Lucene code.


It is well worth a week delay.

wunder

On Oct 19, 2009, at 10:27 AM, Yonik Seeley wrote:

On Mon, Oct 19, 2009 at 10:59 AM, Grant Ingersoll  
gsing...@apache.org wrote:

Are we ready for a release?


+1

I don't think we need to wait for Lucene 2.9.1 - we have all the fixes
in our version, and there's little point in pushing things off yet
another week.

Seems like the next RC should be a *real* one (i.e. no RC label in the
version, immediately call a VOTE).

-Yonik
http://www.lucidimagination.com


 I got busy at work and haven't been able to
address things as much, but it seems like things are progressing.

Shall I generate another RC or are we waiting for Lucene 2.9.1?  If
we go w/ the 2.9.1-dev, then we just need to restore the Maven stuff for them.
 Hopefully, that stuff was just commented out and not completely removed so
as to make it a little easier to restore.

-Grant






Re: 8 for 1.4

2009-09-29 Thread Walter Underwood
It might not be proper to use the name Solr, because it is really  
Apache Solr. At a minimum, it is misleading to use an Apache project  
name on GPL'ed code.


I agree that changing to GPL is a bad idea. I've worked at eight or  
nine companies since the GPL was created, and GPL'ed code was  
forbidden at every one of them. GPL is where code goes to die.


wunder

On Sep 29, 2009, at 3:34 AM, Grant Ingersoll wrote:



On Sep 29, 2009, at 4:00 AM, Matthias Epheser wrote:


Grant Ingersoll schrieb:
Moving to GPL doesn't seem like a good solution to me, but I don't  
know what else to propose.  Why don't we just hold it from this  
release, but keep it in trunk and encourage the Drupal guys and  
others to submit their changes?  Perhaps by then Matthias or you  
or someone else will have stepped up.

concerning GPL:

The message from the drupal guys is that the code altered that much  
from initial solrjs that they think it's legally acceptable to get  
their new code out under GPL and only mention that it was  
inspired by the still existing Apache License solrjs.


 Sounds reasonable to me, but I have little experience with this kind  
of legal issue. So what do you think?


 Oh, it's legally fine.  The ASL lets you do pretty much whatever  
you want.  But that is pretty much the point.  You're taking code  
with no restrictions on it and putting a whole slew of them back in,  
preventing Solr from ever distributing it in the future.  Something  
about that stinks to me.   There is a pretty large reason why we do  
our work at the ASF and not under GPL.  I won't go into it here, but  
suffice it to say one can go read volumes of backstory on this  
elsewhere by searching for GPL vs ASL (or BSD).  Furthermore,  
Matthias, it may be the case in the future that all that work you  
did for SolrJS may not even be accessible to you, the original  
author, under the GPL terms, depending on the company (many, many  
companies explicitly forbid GPL), etc. that you work for.  Is that  
what you want?


Also, they can't call it SolrJS, though, as that is the name of our  
version.






Re: [PMX:FAKE_SENDER] Re: large OR-boolean query

2009-09-25 Thread Walter Underwood
This would work a lot better if you did the join at index time. For  
each paper, add a field with all the related drug names (or whatever  
you want to search for), then search on that field.


With the current design, it will never be fast and never scale. Each  
lookup has a cost, so expanding a query to a thousand terms will  
always be slow. Distributing the query to multiple shards will only  
make a bad design slightly faster.


This is fundamental to search index design. The schema is flat, fully- 
denormalized, no joins. You tag each document with the terms that you  
will use to find it. Then you search for those terms directly.
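
Rough sketch of the index-time version with SolrJ (1.x-era client classes;
the URL and the drug_names/title field names are made up for illustration):

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Sketch only: tag each paper with the drug names it mentions at index
    // time, then query papers on that field directly at search time.
    public class IndexTimeJoin {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr"); // example URL

            SolrInputDocument paper = new SolrInputDocument();
            paper.addField("id", "paper-42");
            paper.addField("title", "Interactions of example compounds");
            // multi-valued field holding every related drug name for this paper
            paper.addField("drug_names", "aspirin");
            paper.addField("drug_names", "ibuprofen");

            server.add(paper);
            server.commit();
            // query time: drug_names:aspirin -- no 100,000-term OR query needed
        }
    }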


wunder

On Sep 25, 2009, at 7:52 AM, Luo, Jeff wrote:

 We are searching strings, not numbers. The reason we are doing this kind
of query is that we have two big indexes, say, a collection of medicine
drugs and a collection of research papers. I first run a query against
the drugs index and get 102400 unique drug names back. Then I need to
find all the research papers where one or more of the 102400 drug names
are mentioned, hence the large OR query. This is a kind of JOIN query
between 2 indexes, which an article on the Lucid web site comparing
databases and search engines briefly touched on.

I was able to issue 100 parallel small queries against solr shards and
get the results back successfully (even sorted). My custom code is less
than 100 lines, mostly in my SearchHandler.handleRequestBody. But I have
a problem summing up the correct facet counts because the faceting counts
from each shard are not disjunctive.

Based on what is suggested by two other responses to my question, I
think it is possible that the master can pass the original large query
to each shard, and each shard will split the large query into 100 lower
level disjunctive lucene queries, fire them against its Lucene index in
a parallel way and merge the results. Then each shard shall only return
1 (instead of 100) result set to the master with disjunctive faceting
counts. It seems that the faceting problem can be solved in this way. I
would appreciate it if you could let me know if this approach is
feasible and correct, and what solr plug-ins are needed (my guess is a
custom parser and query-component?).

Thanks,

Jeff



-Original Message-
From: Grant Ingersoll [mailto:gsing...@apache.org]
Sent: Thursday, September 24, 2009 10:01 AM
To: solr-dev@lucene.apache.org
Subject: [PMX:FAKE_SENDER] Re: large OR-boolean query


On Sep 23, 2009, at 4:26 PM, Luo, Jeff wrote:


Hi,

We are experimenting with a parallel approach to issue a large OR-Boolean
query, e.g., keywords:(1 OR 2 OR 3 OR ... OR 102400), against several
solr shards.

The way we are trying is to break the large query into smaller ones,
e.g.,
the example above can be broken into 10 small queries: keywords:(1
OR 2
OR 3 OR ... OR 1024), keywords:(1025 OR 1026 OR 1027 OR ... OR 2048),
etc

Now each shard will get 10 requests and the master will merge the
results coming back from each shard, similar to the regular
distributed
search.



Can you tell us a little bit more about the why/what of this?  Are you
really searching numbers or are those just for example?  Do you care
about the score or do you just need to know whether the result is
there or not?


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search





Re: Solr Slow in Unix

2009-07-16 Thread Walter Underwood
In particular, are you using local disc or network storage? --wunder

On 7/16/09 8:24 AM, Yonik Seeley yo...@lucidimagination.com wrote:

 On Thu, Jul 16, 2009 at 4:18 AM, Anand Kumar
 Prabhakaranand2...@gmail.com wrote:
 I'm running a Solr instance in Apache Tomcat 6 in a Solaris Box. The QTimes
 are high when compared to the same configuration on a Windows machine. Can
 anyone help with the configurations i can check to improve the performance?
 
 What's the hardware actually look like on each machine?
 
 -Yonik
 http://www.lucidimagination.com



Re: lucene releases vs trunk

2009-06-25 Thread Walter Underwood
This is an excellent idea.

When I find a problem and want to research the Lucene bugs that might
describe it, that is really hard with a trunk build. It's easy with a
release build.

wunder

On 6/25/09 4:18 AM, Yonik Seeley yo...@lucidimagination.com wrote:

 For the next release cycle (presumably 1.5?) I think we should really
 try to stick to released versions of Lucene, and not use dev/trunk
 versions.
 Early in Solr's lifetime, Lucene trunk was more stable (APIs changed
 little, even on non-released versions), and Lucene releases were few
 and far between.
 Today, the pace of change in Lucene has quickened, and Lucene APIs are
 much more in flux until a release is made.  It's also now harder to
 support a Lucene dev release given the growth in complexity
 (particularly for indexing code).  Releases are made more often too,
 making using released versions more practical.
 Many of our users dislike our use of dev versions of Lucene too.
 
 And yes, 1.4 isn't out the door yet - but people often tend to hit the
 ground running on the next release.
 
 -Yonik
 http://www.lucidimagination.com



[jira] Commented: (SOLR-1216) disambiguate the replication command names

2009-06-15 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719609#action_12719609
 ] 

Walter Underwood commented on SOLR-1216:


'sync' is a weak name, because it doesn't say whether it is a push or pull 
synchronization.


 disambiguate the replication command names
 --

 Key: SOLR-1216
 URL: https://issues.apache.org/jira/browse/SOLR-1216
 Project: Solr
  Issue Type: Improvement
  Components: replication (java)
Reporter: Noble Paul
Assignee: Noble Paul
 Fix For: 1.4

 Attachments: SOLR-1216.patch


 There is a lot of confusion in the naming of various commands such as 
 snappull, snapshot etc. This is a vestige of the script based replication we 
 currently have. The commands can be renamed to make more sense
 * 'snappull' to be renamed to 'sync'
 * 'snapshot' to be renamed to 'backup'
 thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1216) disambiguate the replication command names

2009-06-15 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719625#action_12719625
 ] 

Walter Underwood commented on SOLR-1216:


If we choose a name for the thing we are pulling, like 'image', then we can use 
'makeimage', 'pullimage', etc.


 disambiguate the replication command names
 --

 Key: SOLR-1216
 URL: https://issues.apache.org/jira/browse/SOLR-1216
 Project: Solr
  Issue Type: Improvement
  Components: replication (java)
Reporter: Noble Paul
Assignee: Noble Paul
 Fix For: 1.4

 Attachments: SOLR-1216.patch


 There is a lot of confusion in the naming of various commands such as 
 snappull, snapshot etc. This is a vestige of the script based replication we 
 currently have. The commands can be renamed to make more sense
 * 'snappull' to be renamed to 'sync'
 * 'snapshot' to be renamed to 'backup'
 thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Streaming Docs, Terms, TermVectors

2009-05-30 Thread Walter Underwood
Don't stream, request chunks of 10 or 100 at a time. It works fine and
you don't have to write or test any new code. In addition, it works
well with HTTP caches, so if two clients want to get the same data,
the second can get it from the cache.

We do that at Netflix. Each front-end box does a series of queries
to get all the movie titles, then loads them into a local index for
autocomplete.
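
Rough sketch of the chunked fetch with SolrJ (1.x-era client classes;
the URL and the id field are examples):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;

    // Sketch only: page through all matches in fixed-size chunks.
    public class ChunkedFetch {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr"); // example URL
            int rows = 100;                     // chunk size
            long numFound = Long.MAX_VALUE;
            for (int start = 0; start < numFound; start += rows) {
                SolrQuery q = new SolrQuery("*:*");
                q.setStart(start);
                q.setRows(rows);
                QueryResponse rsp = server.query(q);
                SolrDocumentList page = rsp.getResults();
                numFound = page.getNumFound();  // total hits, reported on every page
                for (SolrDocument doc : page) {
                    System.out.println(doc.getFieldValue("id"));
                }
            }
        }
    }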

wunder

On 5/30/09 11:01 AM, Kaktu Chakarabati jimmoe...@gmail.com wrote:

 For a streaming-like solution, it is in fact possible to have a working
 buffer in-memory that emits chunks on an http connection which is kept alive
 by the server until the full response has been sent.
 This is quite similar for example to how video streaming protocols which can
 operate on top of HTTP work ( cf. a more general discussion on
 http://ajaxpatterns.org/HTTP_Streaming#In_A_Blink ).
 Another (non-mutually exclusive) possibility is to introduce a novel binary
 format for the transmission of such data ( i.e a new wt=.. type ) over
 http (or any other comm. protocol) so that data can be more effectively
 compressed and made to better fit into memory.
 One such format which has been widely circulating and already has many open
 source projects implementing it is Adobe's AMF (
 http://osflash.org/documentation/amf ). It is however a proprietary format
 so i'm not sure whether it is incorporable under apache foundation terms.
 
 -Chak
 
 
 On Sat, May 30, 2009 at 9:58 AM, Dietrich Featherston
 d...@dfeatherston.comwrote:
 
 I was actually curious about the same thing.  Perhaps an endpoint reference
 could be passed in the request where the documents can be sent
 asynchronously, such as a jms topic.
 
 solr/query?q=*:*&epr=/my/topic&eprtype=jms
 
 Then we would need to consider how to break up the response, how to cancel
 a running query, etc.
 
 Is this along the lines of what you're looking for?  I would be interested
 in looking at how the request/response contract changes and what types of
 endpoint references would be supported.
 
 Thanks,
 D
 
 On May 30, 2009, at 12:45 PM, Grant Ingersoll gsing...@apache.org wrote:
 
  Anyone have any thoughts on what is involved with streaming lots of
 results out of Solr?
 
 For instance, if I wanted to get something like 1M docs out of Solr (or
 more) via *:* query, how can I tractably do this?  Likewise, if I wanted to
 return all the terms in the index or all the Term Vectors.
 
 Obviously, it is impossible to load all of these things into memory and
 then create a response, so I was wondering if anyone had any ideas on how to
 stream them.
 
 Thanks,
 Grant
 
 



[jira] Commented: (SOLR-1073) StrField should allow locale sensitive sorting

2009-04-28 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703893#action_12703893
 ] 

Walter Underwood commented on SOLR-1073:


Using the locale of the JVM is very, very bad for a multilingual server. Solr 
should always use the same, simple locale. It is OK to set a Locale in 
configuration for single-language installations, but using the JVM locale is a 
recipe for disaster. You move Solr to a different server and everything breaks. 
Very, very bad.  

In a multi-lingual config, locales must be set per-request.

Ideally, requests should send an ISO language code as context for the query.




 StrField should allow locale sensitive sorting
 --

 Key: SOLR-1073
 URL: https://issues.apache.org/jira/browse/SOLR-1073
 Project: Solr
  Issue Type: Improvement
 Environment: All
Reporter: Sachin
 Attachments: LocaleStrField.java


 Currently, StrField does not take a parameter which it can pass to ctor of 
 SortField making the StrField's sorting rely on the locale of the JVM.  
 Ideally, StrField should allow setting the locale in the schema.xml and use 
 it to create a new instance of the SortField in getSortField() method, 
 something like:
 snip:
   public SortField getSortField(SchemaField field,boolean reverse)
   {
 ...
   Locale locale = new Locale(lang,country);
   return new SortField(field.getName(), locale, reverse);
  }
 More details about this issue here:
 http://www.nabble.com/CJKAnalyzer-and-Chinese-Text-sort-td22374195.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1044) Use Hadoop RPC for inter Solr communication

2009-03-03 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678601#action_12678601
 ] 

Walter Underwood commented on SOLR-1044:


During the Oscars, the HTTP cache in front of our Solr farm had a 90% hit rate. 
I think a 10X reduction in server load is a testimony to the superiority of the 
HTTP approach.


 Use Hadoop RPC for inter Solr communication
 ---

 Key: SOLR-1044
 URL: https://issues.apache.org/jira/browse/SOLR-1044
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Noble Paul

 Solr uses http for distributed search . We can make it a whole lot faster if 
 we use an RPC mechanism which is more lightweight/efficient. 
 Hadoop RPC looks like a good candidate for this.  
 The implementation should just have one protocol. It should follow the Solr's 
 idiom of making remote calls . A uri + params +[optional stream(s)] . The 
 response can be a stream of bytes.
 To make this work we must make the SolrServer implementation pluggable in 
 distributed search. Users should be able to choose between the current 
 CommonshttpSolrServer, or a HadoopRpcSolrServer . 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Is there a built in keyword report (Tag Cloud) feature on Solr ?

2009-02-26 Thread Walter Underwood
That info is already available via Luke, right? --wunder

On 2/26/09 9:55 AM, Robert Douglass r...@robshouse.net wrote:

 A solution that I'd considering implementing for Drupal's ApacheSolr
 module is to do a *:* search and then make tag clouds from all of the
 facets. Pretty easy to sort all the facet terms into bins based on the
 number of documents they match, and then to translate bins to font
 sizes. Tag clouds make a nice alternate representation of facet blocks.
 
 Robert Douglass
 
 The RobsHouse.net Newsletter:
 http://robshouse.net/newsletter/robshousenet-newsletter
 Follow me on Twitter: http://twitter.com/robertDouglass
 
 On Feb 26, 2009, at 6:50 PM, Emmanuel Castro Santana wrote:
 
 
 I am developing a Solr based search application and need to get a
 kind of a
 keyword report for tag cloud generation. If there is anyone here who
 has
 ever had that necessity and has somehow found the way through, I would
 really appreciate some help.
 Thanks in advance
 -- 
 View this message in context:
 http://www.nabble.com/Is-there-a-built-in-keyword-report-%28Tag-Cloud%29-feature-on-Solr---tp9677p9677.html
 Sent from the Solr - Dev mailing list archive at Nabble.com.
 
 



Re: Is there a built in keyword report (Tag Cloud) feature on Solr ?

2009-02-26 Thread Walter Underwood
Oops, missed that you wanted it by facet. Never mind. --wunder

On 2/26/09 9:57 AM, Walter Underwood wunderw...@netflix.com wrote:

 That info is already available via Luke, right? --wunder
 
 On 2/26/09 9:55 AM, Robert Douglass r...@robshouse.net wrote:
 
 A solution that I'd considering implementing for Drupal's ApacheSolr
 module is to do a *:* search and then make tag clouds from all of the
 facets. Pretty easy to sort all the facet terms into bins based on the
 number of documents they match, and then to translate bins to font
 sizes. Tag clouds make a nice alternate representation of facet blocks.
 
 Robert Douglass
 
 The RobsHouse.net Newsletter:
 http://robshouse.net/newsletter/robshousenet-newsletter
 Follow me on Twitter: http://twitter.com/robertDouglass
 
 On Feb 26, 2009, at 6:50 PM, Emmanuel Castro Santana wrote:
 
 
 I am developing a Solr based search application and need to get a
 kind of a
 keyword report for tag cloud generation. If there is anyone here who
 has
 ever had that necessity and has somehow found the way through, I would
 really appreciate some help.
 Thanks in advance
 -- 
 View this message in context:
 
 http://www.nabble.com/Is-there-a-built-in-keyword-report-%28Tag-Cloud%29-feature-on-Solr---tp9677p9677.html
 Sent from the Solr - Dev mailing list archive at Nabble.com.
 
 
 



Re: Is there a built in keyword report (Tag Cloud) feature on Solr ?

2009-02-26 Thread Walter Underwood
If you want a tag cloud based on query frequency, start with your
HTTP log analysis tools. Most of those generate a list of top
queries and top words in queries.
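
Rough sketch of the counting step, assuming a one-query-per-line log file
(file name and the font-size scaling are made up):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch only: count query strings from a hypothetical one-query-per-line
    // log; the counts, bucketed into font sizes, become the tag cloud.
    public class QueryCloud {
        public static void main(String[] args) throws Exception {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            BufferedReader in = new BufferedReader(new FileReader("queries.log"));
            String line;
            while ((line = in.readLine()) != null) {
                String q = line.trim().toLowerCase();
                if (q.length() == 0) continue;
                Integer c = counts.get(q);
                counts.put(q, c == null ? 1 : c + 1);
            }
            in.close();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                int fontSize = 10 + (int) (4 * Math.log(e.getValue())); // crude scale
                System.out.println(e.getKey() + " -> " + fontSize + "px");
            }
        }
    }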

wunder

On 2/26/09 2:54 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

 
 : I may have not made myself clear. When I say keyword report, I mean a kind
 : of a most popular tag cloud, showing in bigger sizes the most searched
 : terms. Therefore I need information about how many times specific terms have
 : been searched and I can't see how I could accomplish that with this
 : solution 
 
 you have to be more explicit about what you ask for.  I've never heard
 anyone refer to a tag cloud as being based on how often a term is searched
 for -- everyone i know uses the frequency of words in the corpus,
 sometimes with a decay function to promote words mentioned in more recent
 docs.
 
 Solr doesn't keep any record of the searches performed, so to build a tag
 cloud based on query popularity you would need to mine your logs.
 
 if you want a tag cloud based on the frequency of words in your corpus,
 the faceting approach mentioned would work -- but a simpler way to get
 term counts for the whole index (*:*) would be the TermsComponent.  you
 only really need the facet based solution if you want a cloud based on a
 subset of documents, (ie: a cloud for all documents matching
 category:computer)
 
 
 
 -Hoss
 



Re: [jira] Issue Comment Edited: (SOLR-844) A SolrServer impl to front-end multiple urls

2009-01-22 Thread Walter Underwood
This would be useful if there was search-specific balancing,
like always send the same query back to the same server. That
can make your cache far more effective.

wunder

On 1/22/09 1:13 PM, Otis Gospodnetic (JIRA) j...@apache.org wrote:

 
 [ 
 https://issues.apache.org/jira/browse/SOLR-844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666296#action_12666296 ]
 
 otis edited comment on SOLR-844 at 1/22/09 1:12 PM:
 
 
 I'm not sure there is a clear consensus about this functionality being a good
 thing (also 0 votes).  Perhaps we can get more people's opinions?
 
 
   was (Author: otis):
 I'm not sure there is a clear consensus about this functionality being a
 good thing.  Perhaps we can get more people's opinions?
 
   
 A SolrServer impl to front-end multiple urls
 
 
 Key: SOLR-844
 URL: https://issues.apache.org/jira/browse/SOLR-844
 Project: Solr
  Issue Type: New Feature
  Components: clients - java
Affects Versions: 1.3
Reporter: Noble Paul
Assignee: Shalin Shekhar Mangar
 Fix For: 1.4
 
 Attachments: SOLR-844.patch, SOLR-844.patch, SOLR-844.patch
 
 
 Currently a {{CommonsHttpSolrServer}} can talk to only one server. This
 demands that the user have a LoadBalancer or do the roundrobin on their own.
 We must have a {{LBHttpSolrServer}} which must automatically do a
 Loadbalancing between multiple hosts. This can be backed by the
 {{CommonsHttpSolrServer}}
 This can have the following other features
 * Automatic failover
 * Optionally take in  a file /url containing the the urls of servers so that
 the server list can be automatically updated  by periodically loading the
 config
 * Support for adding removing servers during runtime
 * Pluggable Loadbalancing mechanism. (round-robin, weighted round-robin,
 random etc)
 * Pluggable Failover mechanisms



[jira] Commented: (SOLR-822) CharFilter - normalize characters before tokenizer

2008-10-23 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642188#action_12642188
 ] 

Walter Underwood commented on SOLR-822:
---

Yes, it should be in Lucene. LIke this: 
http://webui.sourcelabs.com/lucene/issues/1343

There are (at least) four kinds of character mapping:

Unicode normalization from decomposed to composed forms (always safe).

Unicode normalization from compatibility forms to standard forms (may change 
the look, like fullwidth to halfwidth Latin).

Language-specific normalization, like oe to ö (German-only).

Mappings that improve search but are linguistically dodgy, like stripping 
accents and mapping katakana to hiragana.
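
The first two are what stock NFC/NFKC normalization gives you (java.text.Normalizer
in Java 6); a tiny illustration, not the proposed CharFilter:

    import java.text.Normalizer;

    // Illustration only: JDK Unicode normalization, not a Solr CharFilter.
    public class NormalizeDemo {
        public static void main(String[] args) {
            // decomposed "e" + combining acute -> composed e-acute (always safe, NFC)
            System.out.println(Normalizer.normalize("e\u0301", Normalizer.Form.NFC));

            // compatibility forms: fullwidth '\uFF21' -> "A" under NFKC
            System.out.println(Normalizer.normalize("\uFF21", Normalizer.Form.NFKC));

            // language-specific mappings and accent stripping are NOT covered by
            // NFC/NFKC; they need explicit mapping rules like the mapping file
            // this issue proposes.
        }
    }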

wunder


 CharFilter - normalize characters before tokenizer
 --

 Key: SOLR-822
 URL: https://issues.apache.org/jira/browse/SOLR-822
 Project: Solr
  Issue Type: New Feature
  Components: Analysis
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: character-normalization.JPG, sample_mapping_ja.txt, 
 SOLR-822.patch, SOLR-822.patch


 A new plugin which can be placed in front of tokenizer/.
 {code:xml}
 fieldType name=textCharNorm class=solr.TextField 
 positionIncrementGap=100 
   analyzer
 charFilter class=solr.MappingCharFilterFactory 
 mapping=mapping_ja.txt /
 tokenizer class=solr.MappingCJKTokenizerFactory/
 filter class=solr.StopFilterFactory ignoreCase=true 
 words=stopwords.txt/
 filter class=solr.LowerCaseFilterFactory/
   /analyzer
 /fieldType
 {code}
 charFilter/ can be multiple (chained). I'll post a JPEG file to show 
 character normalization sample soon.
 MOTIVATION:
 In Japan, there are two types of tokenizers -- N-gram (CJKTokenizer) and 
 Morphological Analyzer.
 When we use morphological analyzer, because the analyzer uses Japanese 
 dictionary to detect terms,
 we need to normalize characters before tokenization.
 I'll post a patch soon, too.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-815) Add new Japanese half-width/full-width normalizaton Filter and Factory

2008-10-20 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641071#action_12641071
 ] 

Walter Underwood commented on SOLR-815:
---

I looked it up, and even found a reason to do it the right way.

Latin should be normalized to halfwidth (in the Latin-1 character space).

Kana should be normalized to fullwidth.

Normalizing Latin characters to fullwidth would mean you could not use the 
existing accent-stripping filters or probably any other filter that expected 
Latin-1, like synonyms. Normalizing to halfwidth makes the rest of Solr and 
Lucene work as expected.

See section 12.5: http://www.unicode.org/versions/Unicode5.0.0/ch12.pdf

The compatibility forms (the ones we normalize away from) are in the Unicode 
range U+FF00 to U+FFEF.
The correct mappings from those forms are in this doc: 
http://www.unicode.org/charts/PDF/UFF00.pdf

Other charts are here: http://www.unicode.org/charts/


 Add new Japanese half-width/full-width normalizaton Filter and Factory
 --

 Key: SOLR-815
 URL: https://issues.apache.org/jira/browse/SOLR-815
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Todd Feak
Assignee: Koji Sekiguchi
Priority: Minor
 Attachments: SOLR-815.patch


 Japanese Katakana and  Latin alphabet characters exist as both a half-width 
 and full-width version. This new Filter normalizes to the full-width 
 version to allow searching and indexing using both.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-814) Add new Japanese Hiragana Filter and Factory

2008-10-17 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640605#action_12640605
 ] 

Walter Underwood commented on SOLR-814:
---

This seems like a bad idea. Hiragana and katakana are used quite differently in 
Japanese. They are not interchangeable.

I was the engineer for Japanese support in Ultraseek for years and even visited 
our distributor there, but no one ever asked for this feature. They asked for a 
lot of things, but never this.

It is very useful, maybe essential, to normalize full-width and half-width 
versions of hiragana, katakana, and ASCII.


 Add new Japanese Hiragana Filter and Factory
 

 Key: SOLR-814
 URL: https://issues.apache.org/jira/browse/SOLR-814
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Todd Feak
Priority: Minor
 Attachments: SOLR-814.patch


 Japanese Hiragana and Katakana character sets can be easily translated 
 between. This filter normalizes all Hiragana characters to their Katakana 
 counterpart, allowing for indexing and searching using either.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-815) Add new Japanese half-width/full-width normalizaton Filter and Factory

2008-10-17 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640609#action_12640609
 ] 

Walter Underwood commented on SOLR-815:
---

If I remember correctly, Latin characters should normalize to half-width, not 
full-width.


 Add new Japanese half-width/full-width normalizaton Filter and Factory
 --

 Key: SOLR-815
 URL: https://issues.apache.org/jira/browse/SOLR-815
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Todd Feak
Priority: Minor
 Attachments: SOLR-815.patch


 Japanese Katakana and  Latin alphabet characters exist as both a half-width 
 and full-width version. This new Filter normalizes to the full-width 
 version to allow searching and indexing using both.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Offer to submit some custom enhancements

2008-10-16 Thread Walter Underwood
Python marshal format supports everything we need and is easy to implement
in Java. It is roughly equivalent to JSON, but binary.

http://docs.python.org/library/marshal.html

wunder

On 10/16/08 8:16 AM, Shalin Shekhar Mangar [EMAIL PROTECTED] wrote:

 Hi Todd,
 
 AFAIK, protocol buffers cannot be used for Solr because it is unable to
 support the NamedList structure that all Solr components use.
 
 The binary protocol (NamedListCodec) that SolrJ uses to communicate with
 Solr server is extremely optimized for our response format. However it is
 Java only.
 
 There are other projects such as Apache Thrift (
 http://incubator.apache.org/thrift/) and Etch (both in incubation) which can
 be looked at. There are a few issues in Thrift which may help us in the
 future:
 
 https://issues.apache.org/jira/browse/THRIFT-110
 https://issues.apache.org/jira/browse/THRIFT-122
 
 On Thu, Oct 16, 2008 at 12:18 AM, Feak, Todd [EMAIL PROTECTED]wrote:
 
 Reposting, as I inadvertently thread hijacked on the first one. My bad.
 
 Hi all,
 
 I have a handful of custom classes that we've created for our purposes
 here. I'd like to share them if you think they have value for the rest
 of the community, but I wanted to check here before creating JIRA
 tickets and patches.
 
 Here's what I have:
 
 1. DoubleMetaphoneFilter and Factory. This replaces usage of the
 PhoneticFilter and Factory allowing access to set maxCodeLength() on the
 DoubleMetaphone encoder and access to the alternate encodings that the
 encoder provides for some words.
 
 2. JapaneseHalfWidthFilter and Factory. Some Japanese characters (and
 Latin alphabet) exist in both a FullWidth and HalfWidth form. This
 filter normalizes by switching to the FullWidth form for all the
 characters. I have seen at least one JIRA ticket about this issue. This
 implementation doesn't rely on Java 1.6.
 
 3. JapaneseHiraganaFilter and Factory. Japanese Hiragana can be
 translated to Katakana. This filter normalizes to Katakana so that data
 and queries can come in either way and get hits.
 
 
 Also, I have been requested to create a prototype that you may be
 interested in. I'm to construct a QueryResponseWriter that returns
 documents using Google's Protocol Buffers. This would rely on an
 existing patch that exposes the OutputStream, but I would like to start
 the work soon. Are there license concerns that would block sharing this
 with you? Is there any interest in this?
 
 Thanks for your consideration,
 Todd Feak
 
 
 



[jira] Commented: (SOLR-777) backword match search, for domain search etc.

2008-09-18 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632489#action_12632489
 ] 

Walter Underwood commented on SOLR-777:
---

You don't need backwards matching for this, and it doesn't really do the right 
thing.

Split the string on ".", reverse the list, and join successive sublists with 
".". Don't index the length-one list, since that is .com, .net, etc. Do the 
same processing at query time.

This is a special analyzer.
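
A rough sketch of that token logic in plain Java (illustrative only, not an
existing Solr analyzer):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;

    // Sketch of the reverse-domain trick: index and query use the same method.
    public class ReverseDomainTokens {

        // "lucene.apache.org" -> [org.apache, org.apache.lucene]
        static List<String> tokens(String domain) {
            List<String> parts = new ArrayList<String>(Arrays.asList(domain.split("\\.")));
            Collections.reverse(parts);              // [org, apache, lucene]
            List<String> out = new ArrayList<String>();
            StringBuilder sb = new StringBuilder(parts.get(0));
            for (int i = 1; i < parts.size(); i++) { // skip the length-one token (.org, ...)
                sb.append('.').append(parts.get(i));
                out.add(sb.toString());
            }
            return out;
        }

        public static void main(String[] args) {
            System.out.println(tokens("lucene.apache.org")); // [org.apache, org.apache.lucene]
            System.out.println(tokens("apache.org"));        // [org.apache] -- so it matches
        }
    }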



 backword match search, for domain search etc.
 -

 Key: SOLR-777
 URL: https://issues.apache.org/jira/browse/SOLR-777
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Koji Sekiguchi
Priority: Minor

 There is a requirement for searching domains with backward match. For 
 example, using apache.org for a query string, www.apache.org, 
 lucene.apache.org could be returned.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: replace stax API with Geronimo-stax+Woodstox

2008-09-09 Thread Walter Underwood
We've been using woodstox in production for over a year.
No problems.

wunder

On 9/9/08 8:07 AM, Yonik Seeley [EMAIL PROTECTED] wrote:

 FYI, I'm testing Solr with woodstox now and will probably do some ad
 hoc stress testing too.
 But woodstox is a quality parser.  I expect fewer problems than we had
 with the reference implementation (and it may even be faster too)
 
 -Yonik



Re: Solr changes date format?

2008-08-12 Thread Walter Underwood
On 8/12/08 11:42 AM, Chris Hostetter [EMAIL PROTECTED] wrote:

 : by a point but, as you can see, the separator is converted to a comma when
 : is accesed
 : from Solr (i can see this too from Solr web admin)
 
 this boggles my mind ... i can't think of *anything* in Solr that would do
 this .. 

If a European locale was used when the seconds portion of the date
was formatted, it would use a comma for the radix point.
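
Easy to reproduce with the JDK alone:

    import java.util.Locale;

    public class RadixDemo {
        public static void main(String[] args) {
            double seconds = 12.345;
            System.out.println(String.format(Locale.US, "%.3f", seconds));      // 12.345
            System.out.println(String.format(Locale.GERMANY, "%.3f", seconds)); // 12,345
        }
    }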

wunder



Re: [VOTE] Set Solr 1.3 freeze and release date

2008-08-06 Thread Walter Underwood
I would strongly prefer a released version of Lucene. We made some changes
to Solr 1.1 that required tweaks inside of Lucene, and it was quite a
treasure hunt to find a suitable set of Lucene source.

It just seems wrong for Solr to release a version of Lucene.

wunder 

On 8/6/08 8:53 AM, Chris Hostetter [EMAIL PROTECTED] wrote:

 
 : Yes, it's good that lots of Solr people are also Lucene people. But I
 : don't think that makes it alright to ship Lucene nightlies or
 : snapshots.
 
 Apache Lucene is a TLP, Apache Solr and Apache Lucene-Java are just
 individual products/sub-projects of that TLP.
 
 If the Apache Lucene PMC votes to release a particular bundle of source
 code as Apache Solr 1.3 and that bundle includes source (or binary) code
 from the Lucene-Java subproject that hasn't already been released (via PMC
 vote) then it is by definition officially released Apache Lucene software.
 
 So in a nutshell: yes it is alright for Solr to ship Lucene nightlies --
 because once the PMC votes on that Solr release, it doesn't matter where
 that Lucene-Java jar came from, it's officially released code.
 
 I'm told there is even precedent for the PMC of a TLP X to vote
 and officially release code from a completely separate TLP Y because Y had
 not had a release and X was ready to go.
 
 Where dependencies on snapshots in official releases cause problems is
 when those snapshots are from third parties and/or are not reproducible --
 where the specific version of the dependencies is unknown and as a result
 the dependee can not be reproduced.  We do not have that problem
 with any Apache codebase we have a dependency on.  We know exactly which
 svn revision the dependencies come from, and since the SVN repository is
 public, anyone can recreate it.
 
 
 -Hoss
 



Re: Solr Logo thought

2008-08-01 Thread Walter Underwood
I kind of like the flaming version at http://www.solrmarc.org/
Not very fired up about the other choices.

wunder

On 8/1/08 9:45 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote:

 Hola,
 
 Yes, logo, trivial issue (hi Lance).  But logos are important, so:
 
 I've cast my vote, but I don't really love even the logo I voted for (#2 -- a
 little too pale/shinny, not very bold, so to speak).  Lukas (BCCed) did the
 logo for Mahout.  He made a number of variations and was very open to
 suggestions during the process.  I wonder if we could ask him to give Solr
 logo a shot if he is not on vacation.  Do we have time for another logo,
 assuming Lukas is willing to contribute?
 
 
 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




[jira] Commented: (SOLR-600) XML parser stops working under heavy load

2008-06-17 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605751#action_12605751
 ] 

Walter Underwood commented on SOLR-600:
---

It could also be a concurrency bug in Solr that shows up on the IBM JVM because 
the thread scheduler makes different decisions. 

 XML parser stops working under heavy load
 -

 Key: SOLR-600
 URL: https://issues.apache.org/jira/browse/SOLR-600
 Project: Solr
  Issue Type: Bug
  Components: update
Affects Versions: 1.3
 Environment: Linux 2.6.19.7-ss0 #4 SMP Wed Mar 12 02:56:42 GMT 2008 
 x86_64 Intel(R) Xeon(R) CPU X5450 @ 3.00GHz GenuineIntel GNU/Linux
 Tomcat 6.0.16
 SOLR nightly 16 Jun 2008, and versions prior
 JRE 1.6.0
Reporter: John Smith

 Under heavy load, the following is spat out for every update:
 org.apache.solr.common.SolrException log
 SEVERE: java.lang.NullPointerException
 at java.util.AbstractList$SimpleListIterator.hasNext(Unknown Source)
 at 
 org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:225)
 at 
 org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:66)
 at 
 org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:196)
 at 
 org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
 at 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:965)
 at 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
 at 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
 at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
 at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
 at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
 at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
 at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
 at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
 at 
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
 at 
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
 at 
 org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
 at java.lang.Thread.run(Thread.java:735)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: IDF in Distributed Search

2008-04-11 Thread Walter Underwood
Global IDF does not require another request/response.
It is nearly free if you return the right info.

Return the total number of docs and the df in the original
response. Sum the doc counts and dfs, recompute the idf,
and re-rank.

See this post for an efficient way to do it:

  
http://wunderwood.org/most_casual_observer/2007/04/progressive_reranking.html

This works best if you treat the results from each server as
a queue and refill just that queue when it is exhausted. All the
good results might be from one server.
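
Rough sketch of the merge arithmetic, using Lucene's classic
idf = 1 + ln(numDocs / (docFreq + 1)); the shard numbers are made up:

    // Illustration only: recompute a global idf from per-shard statistics.
    public class GlobalIdf {
        static double idf(long numDocs, long docFreq) {
            return 1.0 + Math.log((double) numDocs / (docFreq + 1));
        }

        public static void main(String[] args) {
            long[] shardDocs = { 1000000, 250000 }; // total docs per shard
            long[] shardDf   = { 12000, 300 };      // df of the term per shard

            long totalDocs = 0, totalDf = 0;
            for (int i = 0; i < shardDocs.length; i++) {
                totalDocs += shardDocs[i];
                totalDf   += shardDf[i];
            }

            double localIdf  = idf(shardDocs[0], shardDf[0]);
            double globalIdf = idf(totalDocs, totalDf);
            // a score computed with the local idf can be rescaled by
            // globalIdf / localIdf (squared where idf enters both query and doc weight)
            System.out.println("local idf = " + localIdf + ", global idf = " + globalIdf);
        }
    }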

wunder

On 4/11/08 8:50 PM, Yonik Seeley [EMAIL PROTECTED] wrote:

 On Fri, Apr 11, 2008 at 11:39 PM, Otis Gospodnetic
 [EMAIL PROTECTED] wrote:
  So, I'd like to see what it would take to add distributed IDF info to Solr's
 distributed search.
  Here are some questions to get the discussion going:
  - Is anyone already working on it?
  - Does anyone plan on working on it in the very near future?
  - Does anyone already have thoughts how and where dist. idf could be plugged
 in?
  - There is a mention of dist idf and performance cost up there - any idea
 how costly dist idf would
 
 It's relatively easy to implement, but the performance cost is is not
 negligible since it adds another search phase (another
 request-response).  It should be optional of course (globalidf=true),
 so there is no reason not to add this feature.
 
 I also left room for this stage (ResponseBuilder.STAGE_PARSE_QUERY),
 which is ordered before query execution.
 
 -Yonik



[jira] Commented: (SOLR-127) Make Solr more friendly to external HTTP caches

2008-02-08 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567068#action_12567068
 ] 

Walter Underwood commented on SOLR-127:
---

Two reasons to do HTTP caching for Solr: First, Solr is HTTP and needs to 
implement that correctly. Second, caches are much harder to implement and test 
than the cache information in HTTP. HTTP caches already exist and are well 
tested, so the implementation cost is zero and deployment is very easy.

The HTTP spec already covers which responses should be cached.  A 400 response 
may only be cached if it includes explicit cache control headers which allow 
that. See RFC 2616.

We are using a caching load balancer and caching in Apache front ends to 
Tomcat. We see an increase of more than 2X in the capacity of our search farm.

I would recommend against Solr-specific cache information in the XML part of 
the responses. Distributed caching is extremely difficult to get right. Around 
25% of the HTTP 1.1 spec is devoted to caching and there are still grey areas.
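
For reference, this is the kind of conditional request a cache makes on Solr's
behalf; rough sketch with plain HttpURLConnection (URL and ETag are examples):

    import java.net.HttpURLConnection;
    import java.net.URL;

    // Sketch only: revalidate with If-None-Match and reuse the cached copy on 304.
    public class ConditionalGet {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:8983/solr/select?q=ipod"); // example
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("If-None-Match", "\"1234567890\""); // ETag from a prior response
            if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
                System.out.println("304: serve the cached response, no work for Solr");
            } else {
                System.out.println("200: read the body, remember ETag/Last-Modified");
            }
        }
    }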

 Make Solr more friendly to external HTTP caches
 ---

 Key: SOLR-127
 URL: https://issues.apache.org/jira/browse/SOLR-127
 Project: Solr
  Issue Type: Wish
Reporter: Hoss Man
Assignee: Hoss Man
 Fix For: 1.3

 Attachments: CacheUnitTest.patch, CacheUnitTest.patch, 
 HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
 HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
 HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
 HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
 HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
 HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch


 an offhand comment I saw recently reminded me of something that really bugged 
 me about the serach solution i used *before* Solr -- it didn't play nicely 
 with HTTP caches that might be sitting in front of it.
 at the moment, Solr doesn't put in particularly usefull info in the HTTP 
 Response headers to aid in caching (ie: Last-Modified), responds to all HEAD 
 requests with a 400, and doesn't do anything special with If-Modified-Since.
 t the very least, we can set a Last-Modified based on when the current 
 IndexReder was open (if not the Date on the IndexReader) and use the same 
 info to determing how to respond to If-Modified-Since requests.
 (for the record, i think the reason this hasn't occured to me in the 2+ years 
 i've been using Solr, is because with the internal caching, i've yet to need 
 to put a proxy cache in front of Solr)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: remote solrj using xml versus json

2007-11-09 Thread Walter Underwood
If you want speed, you should use Python marshal format. It handles
data structures equivalent to JSON, but in binary. Very easy to
convert to Java data types. --wunder

On 11/9/07 12:56 PM, Erik Hatcher [EMAIL PROTECTED] wrote:

 anybody compared/contrasted the two?   seems like yonik's noggit
 parser might have a performance edge on xml parsing ?!
 
 Erik




Re: default text type and stop words

2007-11-05 Thread Walter Underwood
I also said, Stopword removal is a reasonable default because it works
fairly well for a general text corpus. Ultraseek keeps stopwords but
most engines don't. I think it is fine as a default. I also think you
have to understand stopwords at some point.

wunder

On 11/5/07 9:59 PM, Chris Hostetter [EMAIL PROTECTED] wrote:

 
 : This isn't a problem in Lucene or Solr. It is a result of the analyzers
 : you have chosen to use. If you choose to remove stopwords, you will not
 : be able to match stopwords.
 
 I believe paul's point was that this use of stopwords is in the text
 fieldtype in the example schema.xml ... which many people use as is.
 
 I'm personally of the mindset that it's fine like it is.  While people who
 understand that an is a stop word might ask why does 'rating:PG AND
 name:an' match 40K movies, it should match 0? there is another (probably
 larger) group of people who won't know how the search is implemented, or
 that an is a stop word, and they will look at the same results and ask
 why am i getting 40K results? most of these don't have 'an' in the title?
 i should only be getting X results.
 
 That second group of people aren't going to be any happier if you
 give them 0 results instead -- at least this way people get some results
 to work with.
 
 -Hoss




Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Walter Underwood
Please don't switch to RMI. We've spent the past year converting
our entire middle tier from RMI to HTTP. We are so glad that we
no longer have any RMI servers.

The big advantage of HTTP is that there are hundreds, maybe
thousands, of engineers working on making it fast, on tools for it,
on caches, etc.

If you really need more compact responses, I would recommend
coding the JSON output in Python marshal format. That is compact,
fast, and easy to parse. We used that for a Java client in Ultraseek.

wunder

On 9/21/07 11:08 AM, Yonik Seeley [EMAIL PROTECTED] wrote:

 I wanted to take a step back for a second and think about if HTTP was
 really the right choice for the transport for distributed search.
 
 I think the high-level approach in SOLR-303 is the right way to go
 about it, but I'm unsure if HTTP is the right transport.
 
 Pro HTTP:
   - using HTTP allows one to use an http load-balancer to distribute
 load across multiple copies of the same shard by assigning a VIP
 (virtual IP) to each shard.
   - because you do pretty much everything by hand, you know that there
 isn't some hidden limitation that will jump out and bite you later.
 
 Cons HTTP:
  - you end up doing everything by hand... connection handling, request
 serialization, response parsing, etc...
  - goes through normal servlet channels... every sub-request will be
 logged to the access logs, slowing things down.
 - more network bandwidth used unless we come up with a new
 BinaryResponseWriter and Parser
 
 Currently, SOLR-303 uses and parses the XML response format, which has
 some serious downsides:
 - response size limits scalability and how deep in responses you can go...
   If you want to retrieve documents 5000 through 5009, even though the
 user only requested 10 documents, the top-level searcher needs to get
 the top 5009 documents from *each* shard... and that can quickly
 exhaust the network bandwidth of the NIC.  XML parsing on the order of
 nShards*5009 entries won't be any picnic either.
 
 I'm thinking the load-balancing of HTTP is overrated also, because
 it's inflexible.  Adding another shard requires adding another VIP in
 the load-balancer, and changing which servers have which shards or
 adding new copies of a shard also requires load-balancer
 configuration.  Everything points to Solr being able to do the
 load-balancing itself in the future, and there wouldn't seem to be
 much benefit to using a load-balancer w/ VIPS for each shard vs having
 Solr do it.
 
 So even if we stuck with HTTP, Solr would need
  - a binary protocol to minimize network bandwidth use
  - load balancing across shard copies itself
 
 Given that, would it make sense to just go with RMI instead?
 And perhaps leverage some other higher level services (Jini? JavaSpaces?)
 
 I'd like to hear from people with more experience with RMI & friends,
 and what the potential downsides are to using these technologies.
 
 -Yonik



[jira] Commented: (SOLR-127) Make Solr more friendly to external HTTP caches

2007-09-14 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527694
 ] 

Walter Underwood commented on SOLR-127:
---

Last-modified does require monotonic time, but ETags are version stamps without 
any ordering. The indexVersion should be fine for an ETag.

 Make Solr more friendly to external HTTP caches
 ---

 Key: SOLR-127
 URL: https://issues.apache.org/jira/browse/SOLR-127
 Project: Solr
  Issue Type: Wish
Reporter: Hoss Man
 Attachments: HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
 HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch


 an offhand comment I saw recently reminded me of something that really bugged 
 me about the search solution i used *before* Solr -- it didn't play nicely 
 with HTTP caches that might be sitting in front of it.
 at the moment, Solr doesn't put particularly useful info in the HTTP 
 Response headers to aid in caching (ie: Last-Modified), responds to all HEAD 
 requests with a 400, and doesn't do anything special with If-Modified-Since.
 at the very least, we can set a Last-Modified based on when the current 
 IndexReader was opened (if not the Date on the IndexReader) and use the same 
 info to determine how to respond to If-Modified-Since requests.
 (for the record, i think the reason this hasn't occurred to me in the 2+ years 
 i've been using Solr, is because with the internal caching, i've yet to need 
 to put a proxy cache in front of Solr)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-277) Character Entity of XHTML is not supported with XmlUpdateRequestHandler .

2007-06-26 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508408
 ] 

Walter Underwood commented on SOLR-277:
---

This is not a bug. Solr accepts XML, not XHTML. It does not accept XHTML-only 
entities. 

The Solr update XML format is a specific Solr XML format, not generic XML, not DocBook, 
not anything else.

To index XHTML, parse it and convert it to Solr XML update format.


 Character Entity of XHTML is not supported with XmlUpdateRequestHandler .
 -

 Key: SOLR-277
 URL: https://issues.apache.org/jira/browse/SOLR-277
 Project: Solr
  Issue Type: Improvement
  Components: update
Affects Versions: 1.3
Reporter: Toru Matsuzawa
 Attachments: XmlUpdateRequestHandler.patch


 Character Entity of XHTML is not supported with XmlUpdateRequestHandler .
 http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
 http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
 http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
 This needs to be handled in XmlUpdateRequestHandler itself because xpp3 
 cannot use !DOCTYPE.
 I think it is necessary until StaxUpdateRequestHandler becomes /update.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-216) Improvements to solr.py

2007-05-29 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499923
 ] 

Walter Underwood commented on SOLR-216:
---

GET is the right semantic for a query, since it doesn't change the resource. It 
also allows HTTP caching.

If Solr has URL length limits, that's a bug.


 Improvements to solr.py
 ---

 Key: SOLR-216
 URL: https://issues.apache.org/jira/browse/SOLR-216
 Project: Solr
  Issue Type: Improvement
  Components: clients - python
Affects Versions: 1.2
Reporter: Jason Cater
Assignee: Mike Klaas
Priority: Trivial
 Attachments: solr.py


 I've taken the original solr.py code and extended it to include higher-level 
 functions.
   * Requires python 2.3+
   * Supports SSL (https://) schema
   * Conforms (mostly) to PEP 8 -- the Python Style Guide
   * Provides a high-level results object with implicit data type conversion
   * Supports batching of update commands

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: svn commit: r541391 - in /lucene/solr/trunk: CHANGES.txt example/solr/conf/xslt/example_atom.xsl example/solr/conf/xslt/example_rss.xsl

2007-05-25 Thread Walter Underwood
On 5/25/07 10:45 AM, Chris Hostetter [EMAIL PROTECTED] wrote:
 
 : I'd slap versions to those 2 XSL files to immediately answer which
 : version of Atom|RSS does this produce?
 
 i'm comfortable calling the example_rss.xsl RSS, since most RSS
 readers will know what to do with it, but i don't know that i'm
 comfortable calling it any specific version of RSS, people are more likely
 to get irate about claiming to be a specific version if one little thing
 is wrong than they are about not claiming to be anything in particular.

Some versions of RSS are quite incompatible, so we MUST say what
version we are implementing. RSS 1.0 is completely different from
the 0.9 series and 2.0.

Atom doesn't have a version number, but RFC 4287 Atom is informally
called 1.0. 

wunder



[jira] Commented: (SOLR-208) RSS feed XSL example

2007-05-17 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12496624
 ] 

Walter Underwood commented on SOLR-208:
---

I wasn't in the RSS wars, either, but I was on the Atom working group. That was 
a bunch of volunteers making a clean, testable spec for RSS functionality 
(http://www.ietf.org/rfc/rfc4287). RSS 2.0 has some bad ambiguities, especially 
around ampersand and entities in titles. The default has changed over the years 
and clients do different, incompatible things.

GData is just a way to do search result stuff that we would need anyway. It is a 
standard set of URL parameters for query, start-index, and categories, and a 
few Atom extensions for total results, items per page, and next/previous.

http://code.google.com/apis/gdata/reference.html


 RSS feed XSL example
 

 Key: SOLR-208
 URL: https://issues.apache.org/jira/browse/SOLR-208
 Project: Solr
  Issue Type: New Feature
  Components: clients - java
Affects Versions: 1.2
Reporter: Brian Whitman
 Assigned To: Hoss Man
Priority: Trivial
 Attachments: rss.xsl


 A quick .xsl file for transforming solr queries into RSS feeds. To get the 
 date and time in properly you'll need an XSL 2.0 processor, as in 
 http://wiki.apache.org/solr/XsltResponseWriter .  Tested to work with the 
 example solr distribution in the nightly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: dynamic copyFields

2007-05-02 Thread Walter Underwood
That syntax is from the ed editor. I learned it in 1975
on Unix v6/PWB, running on a PDP-11/70. --wunder

On 5/2/07 5:04 PM, Mike Klaas [EMAIL PROTECTED] wrote:

 On 5/2/07, Ryan McKinley [EMAIL PROTECTED] wrote:
 
 How about Mike's other suggestion:
   copyField regexp=s/(.*)_s/\1_t/ /
 
 this would keep the glob style for source and dest, but use regex
 to transform a source -> dest
 
 Wow, I didn't even remember suggesting that.  I agree (with Hoss) that
 backward compatibility is important, but I disagree (with myself) that
 the above syntax is nice.  Outside of perl, I'm not sure how common
 the s/ / / syntax is (is it used in java?)
 
 perhaps
 
 copyField re_source=(.*)_s dest=\1_t/
 
 ?
 
 -Mike
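
Below is a minimal Java sketch of what the proposed regex mapping would do to field
names. The re_source/dest attributes are only the proposal above, not a shipped Solr
feature, and Java's replacement syntax uses $1 where the s/// form uses \1.

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class CopyFieldRegexDemo {
      public static void main(String[] args) {
          Pattern source = Pattern.compile("(.*)_s");  // the proposed re_source pattern
          String dest = "$1_t";                        // the proposed dest template (\1 in s/// form)

          for (String field : new String[] {"title_s", "author_s", "price_f"}) {
              Matcher m = source.matcher(field);
              if (m.matches()) {
                  // title_s -> title_t, author_s -> author_t; price_f does not match and is skipped
                  System.out.println(field + " -> " + m.replaceAll(dest));
              }
          }
      }
  }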



Re: Progressive Query Relaxation

2007-04-10 Thread Walter Underwood
From the name, I thought this was an adaptive precision scheme,
where the engine automatically tries broader matching if there
are no matches or just a few. We talked about doing that with
Ultraseek, but it is pretty tricky. Deciding when to adjust it is
harder than making it variable.

Instead, this is an old idea that search amateurs seem to like.
Show all exact matches, then near matches, etc. This is the
kind of thing people suggest when they don't understand that
a ranking algorithm combines that evidence in a much more
powerful way. I talked customers out of this once or twice
each year at Ultraseek.

This approach fails for:

* common words
* misspellings

Since both of those happen a lot, this idea fails for a lot
of queries.

I presume that Oracle implemented this to shut up some big customer,
since it isn't a useful feature unless it closes a sale.

DisMax gives you something somewhat similar to this, by
selecting the best matching field. That is much more powerful
and gives much better results.

wunder

On 4/9/07 12:46 AM, J. Delgado [EMAIL PROTECTED] wrote:

 Has anyone within the Lucene or Solr community attempted to code a
 progressive query relaxation technique similar to the one described
 here for Oracle Text?
 http://www.oracle.com/technology/products/text/htdocs/prog_relax.html
 
 Thanks,
 
 -- J.D.



Re: Progressive Query Relaxation

2007-04-10 Thread Walter Underwood
On 4/10/07 10:06 AM, J. Delgado [EMAIL PROTECTED] wrote:

 Progressive relaxation, at least as Oracle has defined it, is a
 flexible, developer defined series of queries that are efficiently
 executed in progression and in one trip to the engine, until minimum
 of hits required is satisfied. It is not a self adapting precision
 scheme nor it tries to guess what is the best match.

Correct. Search engines are all about the best match. Why would
you show anything else?

This is an RDBMS flavored approach, not an approach that considers
natural language text. Sets of matches, not a ranked list. It fails
as soon as one of the sets gets too big, like when someone searches
for "laserjet" at HP.com. That happens a lot.

It assumes that all keywords are the same, something that Gerry
Salton figured out was false thirty years ago. That is why we
use tf.idf instead of sets of matches.
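
For reference, the classic tf.idf weight alluded to here, in one common formulation
(Lucene's actual Similarity adds length normalization and other factors):

  w(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}

where tf(t, d) is the frequency of term t in document d, df(t) is the number of
documents containing t, and N is the total number of documents. A term that appears
in many documents gets a small idf, so it contributes little evidence instead of
defining a whole result set.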

I see a lot of design without any talk about what problem it is
solving. What queries don't work? How do we make those better?
Let's work from real logs and real data. Oracle's hack doesn't
solve any problem I've seen in real query logs.

I'm doing e-commerce search, and our current engine does pretty
much what Oracle is offering. The results are not good, and we
are replacing it with Solr and DisMax. My off-line relevance testing
shows a big improvement.

wunder
--
Search Guru, Netflix




Re: Progressive Query Relaxation

2007-04-10 Thread Walter Underwood
On 4/10/07 10:38 AM, J. Delgado [EMAIL PROTECTED] wrote:

 I think you have something personal against Oracle... Hey I have no
 interest in defending Oracle, but this is no hack.

It's true, I don't have much respect for Oracle's text search.
When I was working on enterprise search, we never really worried
about them because their quality and speed just wasn't competitive.
I do not look to them as a reliable source of good ideas for search.

Oracle's problem statement has a plausible strawman, but there are
lots of better ways to deal with misspellings. Heck, my dev instance
of Solr gives Michael Crichton as the first hit for "Michel Crichton".
It is not true that hits which are a poor match will be mixed in
with hits which are a good match.

Hmmm, Crichton is much more likely to be misspelled than Michael,
so maybe their strawman isn't very good.

wunder



[jira] Created: (SOLR-161) Dangling dash causes stack trace

2007-02-15 Thread Walter Underwood (JIRA)
Dangling dash causes stack trace


 Key: SOLR-161
 URL: https://issues.apache.org/jira/browse/SOLR-161
 Project: Solr
  Issue Type: Bug
  Components: search
Affects Versions: 1.1.0
 Environment: Java 1.5, Tomcat 5.5.17, Fedora Core 4, Intel
Reporter: Walter Underwood


I'm running tests from our search logs, and we have a query that ends in a 
dash. That caused a stack trace.

org.apache.lucene.queryParser.ParseException: Cannot parse 'digging for the 
truth -': Encountered EOF at line 1, column 23.
Was expecting one of:
"(" ...
<QUOTED> ...
<TERM> ...
<PREFIXTERM> ...
<WILDTERM> ...
"[" ...
"{" ...
<NUMBER> ...

at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:127)
at 
org.apache.solr.request.DisMaxRequestHandler.handleRequest(DisMaxRequestHandler.java:272)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:595)
at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:92)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-161) Dangling dash causes stack trace

2007-02-15 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473625
 ] 

Walter Underwood commented on SOLR-161:
---

The parser can have a rule for this rather than exploding. A trailing dash is 
never meaningful and can be omitted, whether we're allowing +/- or not. Seems 
like a grammar bug to me. --wunder

 Dangling dash causes stack trace
 

 Key: SOLR-161
 URL: https://issues.apache.org/jira/browse/SOLR-161
 Project: Solr
  Issue Type: Bug
  Components: search
Affects Versions: 1.1.0
 Environment: Java 1.5, Tomcat 5.5.17, Fedora Core 4, Intel
Reporter: Walter Underwood

 I'm running tests from our search logs, and we have a query that ends in a 
 dash. That caused a stack trace.
 org.apache.lucene.queryParser.ParseException: Cannot parse 'digging for the 
 truth -': Encountered EOF at line 1, column 23.
 Was expecting one of:
 "(" ...
 <QUOTED> ...
 <TERM> ...
 <PREFIXTERM> ...
 <WILDTERM> ...
 "[" ...
 "{" ...
 <NUMBER> ...
 
   at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:127)
   at 
 org.apache.solr.request.DisMaxRequestHandler.handleRequest(DisMaxRequestHandler.java:272)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:595)
   at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:92)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-161) Dangling dash causes stack trace

2007-02-15 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473628
 ] 

Walter Underwood commented on SOLR-161:
---

It is really a Lucene query parser bug, but it wouldn't hurt to do s/(.*)-// 
as a workaround. Assuming my ed(1) syntax is still fresh. Regardless, no query 
string should ever give a stack trace. --wunder
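
A minimal sketch of that kind of pre-parse cleanup, purely illustrative (not the
actual Lucene or Solr fix):

  public final class QueryCleanup {
      /** Drop a dangling trailing dash or plus before handing the string to the query parser. */
      static String stripDanglingOperator(String q) {
          String trimmed = q.trim();
          while (trimmed.endsWith("-") || trimmed.endsWith("+")) {
              trimmed = trimmed.substring(0, trimmed.length() - 1).trim();
          }
          return trimmed;
      }

      public static void main(String[] args) {
          // prints: digging for the truth
          System.out.println(stripDanglingOperator("digging for the truth -"));
      }
  }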

 Dangling dash causes stack trace
 

 Key: SOLR-161
 URL: https://issues.apache.org/jira/browse/SOLR-161
 Project: Solr
  Issue Type: Bug
  Components: search
Affects Versions: 1.1.0
 Environment: Java 1.5, Tomcat 5.5.17, Fedora Core 4, Intel
Reporter: Walter Underwood

 I'm running tests from our search logs, and we have a query that ends in a 
 dash. That caused a stack trace.
 org.apache.lucene.queryParser.ParseException: Cannot parse 'digging for the 
 truth -': Encountered EOF at line 1, column 23.
 Was expecting one of:
 "(" ...
 <QUOTED> ...
 <TERM> ...
 <PREFIXTERM> ...
 <WILDTERM> ...
 "[" ...
 "{" ...
 <NUMBER> ...
 
   at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:127)
   at 
 org.apache.solr.request.DisMaxRequestHandler.handleRequest(DisMaxRequestHandler.java:272)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:595)
   at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:92)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: resin and UTF-8 in URLs

2007-02-01 Thread Walter Underwood
Let's not make this complicated for situations that we've never
seen in practice. Java is a Unicode language and always has been.
Anyone running a Java system with a Shift-JIS default should already
know the pitfalls, and know them much better than us (and I know a
lot about Shift-JIS).

The URI spec says UTF-8, so we can be compliant and tell people
to fix their code. If they need to add specific hacks for their
broken software, that is OK. We don't need generic design features
for a few broken clients.

RFC 3986 has been out for two years now. That is long enough for
decently-maintained software to get it right.

wunder

On 2/1/07 2:14 PM, Chris Hostetter [EMAIL PROTECTED] wrote:

 
 : If we can do something small that makes the most normal cases work
 : even if the container is not configured, that seems good.
 
 but how do we know the user wants what we consider a normal case to
 work? ... if every servlet container lets you configure your default
 charset differently, we have no easy way to tell if/when they've
 configured the default properly, to know if we should override it.
 
 If someone does everything in Shift-JIS, and sets up their servlet
 container with Shift-JIS as their default, and installs solr -- i don't
 want them to think Solr sucks because there is a default in Solr they
 don't know about (or know how to disable) that assumes UTF-8.
 
 On the other hand: if someone really hasn't thought about charsets at all,
 then it doesn't seem that bad to use whatever default their servlet
 container says to use -- as I understand it some containers (tomcat
 included) pick their default based on the JVM's
 configuration (i assume from the user.language sysproperty) ... that
 certainly seems like a better default than for us to assume UTF-8 -- even
 if it is latin1 for en, because most novice users are probably okay
 with latin1 ... if you're starting to worry about more complex characters
 that aren't in the default charset your servlet container picks for you,
 then reading a little documentation is a good idea.
 
 
 : At the very least, we should change the examples in:
 : http://wiki.apache.org/solr/SolrResin etc
 
 oh absolutely.
 
 
 
 
 -Hoss
 



Re: resin and UTF-8 in URLs

2007-02-01 Thread Walter Underwood
On 2/1/07 3:18 PM, Chris Hostetter [EMAIL PROTECTED] wrote:

 As for XML, or any other format a user might POST to solr (or ask solr
 to fetch from a remote source) what possible reason would we have for only
 supporting UTF-8? .. why do you suggest that the XML standard specify
 UTF-8, [so] we should use UTF-8 ... doesn't the XML standard say we
 should use the charset specified in the content-type if there is one, and
 if not use the encoding specified in the xml header, ie...
 
 ?xml encoding='EUC-JP'?

The XML spec says that XML parsers are only required to support
UTF-8, UTF-16, ISO 8859-1, and US-ASCII. If you use a different
encoding for XML, there is no guarantee that a conforming parser
will accept it.

Ultraseek has been indexing XML for the past nine years, and
I remember a single customer that had XML in a non-standard
encoding. Effectively all real-world XML is in one of the
standard encodings.

The right spec for XML over HTTP is RFC 3023. For text/xml
with no charset spec, the XML must be interpreted as US-ASCII.
From section 8.5:

   Omitting the charset parameter is NOT RECOMMENDED for text/xml.  For
   example, even if the contents of the XML MIME entity are UTF-16 or
   UTF-8, or the XML MIME entity has an explicit encoding declaration,
   XML and MIME processors MUST assume the charset is us-ascii.

wunder




Re: loading many documents by ID

2007-01-31 Thread Walter Underwood
On 1/31/07 3:39 PM, Chris Hostetter [EMAIL PROTECTED] wrote:
 
 : Oh, and there have been numerous people interested in updateable
 : documents, so it would be nice if that part was in the update handler.
 
 We'd have to make it very clear that this only works if all fields are
 STORED.

Isn't there some way to do this automatically instead of relying
on documentation? We might need to add something, maybe a
required attribute on fields, but a runtime error would be
much, much better than a page on the wiki.

wunder



Re: loading many documents by ID

2007-01-31 Thread Walter Underwood
On 1/31/07 9:05 PM, Ryan McKinley [EMAIL PROTECTED] wrote:
 
 We'd have to make it very clear that this only works if all fields are
 STORED.
 
 Isn't there some way to do this automatically instead of relying
 on documentation? We might need to add something, maybe a
 required attribute on fields, but a runtime error would be
 much, much better than a page on the wiki.
 
 what about copyField?
 
 With copyField, it is reasonable to have fields that are not stored
 and are generated from the other stored fields.  (this is what my
 setup looks like).

Mine, too. That is why I suggested explicit declarations in the
schema to say which fields are required.

wunder



Re: Can this be achieved? (Was: document support for file system crawling)

2007-01-19 Thread Walter Underwood
On 1/19/07 10:33 AM, Chris Hostetter [EMAIL PROTECTED] wrote:

 [...] but if your interest is in
 having an enterprise search solution that people can deploy on a box
 and have it start working for them, then there is no reason for all of that
 code to run in a single JVM using a single code base -- i'm going to go
 out on a limb and guess that the Google Appliances run more than a
 single process :)

Ultraseek does exactly that and is a single multi-threaded process.
A single process is much easier for the admin. A multi-process solution
is more complicated to start up, monitor, shut down, and upgrade.

There is decent demand for a spidering enterprise search engine.
Look at the Google Appliance, Ultraseek, and IBM OmniFind. The
free IBM OmniFind Yahoo! Edition uses Lucene.

I'd love to see the Ultraseek spider connected to Solr, but that
depends on Autonomy.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: Java version for solr development (was Re: Update Plugins)

2007-01-16 Thread Walter Underwood
On 1/16/07 8:03 PM, Yonik Seeley [EMAIL PROTECTED] wrote:

 I think it's a bit soon to move to 1.6 - I don't know how many
 platforms it's available for yet.

It is still in early release from IBM for their PowerPC
servers, so requiring 1.6 would be a serious problem for us.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: [jira] Commented: (SOLR-85) [PATCH] Add update form to the admin screen

2006-12-18 Thread Walter Underwood
On 12/18/06 7:52 AM, Thorsten Scherler
[EMAIL PROTECTED] wrote:

 On Fri, 2006-12-15 at 11:16 -0800, Chris Hostetter wrote:
 : The next thing on my list is to write a small cli based on httpclient to
 : send the update docs as alternative of the post.sh.
 
 You may want to take a look at SOLR-20 and SOLR-30 ... those issues are
 first stabs at Java Client APIs for query/update which if cleaned up a bit
 could become the basis for your CLI.
 
 Hmm, I had a look at them but actually what I came up with is way
 smaller and more focused on the update part.
 
 https://issues.apache.org/jira/browse/SOLR-86
 
 It is a replacement of the post.sh not much more (yet).

I'll take a look at this. I also wrote my own, because
I had no idea that the Java client code existed.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Heavily-populated bit sets

2006-12-12 Thread Walter Underwood
As an aside to SOLR-80, there is a standard trick for compressing a bit
set with more than half the bits set. You invert it, make it less than
half full, then store that. Basically, store the zeroes instead of the
ones. It costs one extra bit to say whether it is inverted or not.

wunder
-- 
Walter Underwood
Search Guru, Netflix
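
A minimal sketch of the inversion trick using java.util.BitSet, illustrative only
(not Solr's DocSet code):

  import java.util.BitSet;

  /** Stores a dense bit set inverted when more than half of its bits are set. */
  class CompactBits {
      final BitSet bits;      // the (possibly inverted) set actually stored
      final boolean inverted; // the one extra bit of bookkeeping
      final int size;         // logical number of bits

      CompactBits(BitSet original, int size) {
          this.size = size;
          if (original.cardinality() > size / 2) {
              BitSet flipped = (BitSet) original.clone();
              flipped.flip(0, size);   // store the zeroes instead of the ones
              this.bits = flipped;
              this.inverted = true;
          } else {
              this.bits = original;
              this.inverted = false;
          }
      }

      boolean get(int i) {
          return bits.get(i) ^ inverted;  // un-invert on read
      }
  }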




[jira] Commented: (SOLR-73) schema.xml and solrconfig.xml use CNET-internal class names

2006-11-28 Thread Walter Underwood (JIRA)
[ 
http://issues.apache.org/jira/browse/SOLR-73?page=comments#action_12454159 ] 

Walter Underwood commented on SOLR-73:
--

I think the aliases are harder to read. You need to go elsewhere to figure them 
out. I read documentation, but I didn't find the part of the wiki that 
explained them and I had to ask the mailing list.

The javadoc uses the full class name. Google and Yahoo searches should work 
better with the full class name (Yahoo is working much better than Google for 
that right now).

The aliases save typing, but I don't think they improve usability. Full class 
names are simple and unambiguous.

If we want usability for non-programmers, we can't have them editing an XML 
file. 


 schema.xml and solrconfig.xml use CNET-internal class names
 ---

 Key: SOLR-73
 URL: http://issues.apache.org/jira/browse/SOLR-73
 Project: Solr
  Issue Type: Bug
  Components: search
Reporter: Walter Underwood

 The configuration files in the example directory still use the old 
 CNET-internal class names, like solr.LRUCache instead of 
 org.apache.solr.search.LRUCache.  This is confusing to new users and should 
 be fixed before the first release.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (SOLR-73) schema.xml and solrconfig.xml use CNET-internal class names

2006-11-28 Thread Walter Underwood (JIRA)
[ 
http://issues.apache.org/jira/browse/SOLR-73?page=comments#action_12454190 ] 

Walter Underwood commented on SOLR-73:
--

The context required to resolve the ambiguity is a wiki page that I didn't know 
existed. Since I didn't know about it, I tried to figure it out by reading the 
code, and then by sending e-mail to the list. In my case, I was writing two 
tiny classes, but the issue would be the same if I was a non-programmer adding 
some simple plug-ins.

With a full class name, there is no ambiguity. Again, this saves typing at the 
cost of requiring an indirection through some unspecified documentation.

I saw every customer support e-mail for eight years with Ultraseek, so I'm 
pretty familiar with the problems that search engine admins run into. 
One of the things we learned was that documentation doesn't fix an unclear 
product. You fix the product instead of documenting how to understand it.

Requiring users to edit an XML file is a separate issue, but I think it is a 
serious problem, especially because any error messages show up in the server 
logs. 


 schema.xml and solrconfig.xml use CNET-internal class names
 ---

 Key: SOLR-73
 URL: http://issues.apache.org/jira/browse/SOLR-73
 Project: Solr
  Issue Type: Bug
  Components: search
Reporter: Walter Underwood

 The configuration files in the example directory still use the old 
 CNET-internal class names, like solr.LRUCache instead of 
 org.apache.solr.search.LRUCache.  This is confusing to new users and should 
 be fixed before the first release.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Cocoon-2.1.9 vs. SOLR-20 SOLR-30

2006-11-22 Thread Walter Underwood
On 11/20/06 5:51 PM, Yonik Seeley [EMAIL PROTECTED] wrote:
 : If you really want to handle failure in an error response, write that
 : to a string and if that fails, send a hard-coded string.
 
 Hmmm... i could definitely get on board an idea like that.
 
 I took pains to make things streamable.. I'd hate to discard that.
 How do other servers handle streaming back a response and hitting an error?

You found the design tradeoff! We can stream the results or we can
give reliable error codes for errors that happen during result processing.
We can't do both. Ultraseek does streaming, but we were generating
HTML, so we could print reasonable errors in-line.

Streaming is very useful for HTML pages, because it allows the first
pixels to be painted as soon as possible. It isn't as important on the
back end, unless someone has gone to the considerable trouble of making
their entire front-end able to stream the back-end results to HTML.

If we aren't calling Writer.flush occasionally, then the streaming is
just filling up a buffer smoothly. The client won't see anything until
TCP decides to send it.

Does Lucene fetch information from disk while we iterate
through the search results? If that happens a few times, then
streaming might make a difference. If it is mostly CPU-bound,
then streaming probably doesn't help.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: Cocoon-2.1.9 vs. SOLR-20 SOLR-30

2006-11-21 Thread Walter Underwood
On 11/20/06 7:22 PM, Fuad Efendi [EMAIL PROTECTED] wrote:
 This is just a sample...
 
 1. What is an Error?
 2. What is a Mistake?
 3. What is an application bug?
 4. What is a 'system crash'?

These are not HTTP concepts. The request on a URI can succeed or fail
or result in other codes. Mistakes and crashes are outside of the HTTP
protocol.

 Of course, an XML-over-HTTP engine is not the same as HTML-over-HTTP...
 However... Walter noticed 'crawling'... I can't imagine a company which will
 put SOLR as a front-end accessible to crawlers... (To crawl an indexing
 service instead of source documents!?)

XML-over-HTTP is exactly the same as HTML-over-HTTP. In HTML, we
could return detailed error information in a meta tag. No difference.

If something is on HTTP, a good crawler can find it. All it takes is
one link, probably to the admin URL. Once found, that crawler will
happily pound on errors returned by 200.

XSLT support means you could build the search UI natively on Solr,
so that might happen.

Even without a crawler, we must work with caches and load balancers.
I will be using Solr with a load balancer in production. If Solr is
a broken HTTP server, we will have to build something else.

 I am sure that mixing an XML-based interface with HTTP status codes is not an
 attractive 'architecture'; we should separate concerns and leave HTTP code
 handling to a servlet container as much as possible...

We don't need to use HTTP response codes deep in Solr, but we do need
to separate bad parameters, retryable errors, non-retryable errors, and
so on. We can call them what ever we want internally, but we need to
report them properly over HTTP.

wunder
-- 
Walter Underwood
Search Guru, Netflix

 



Re: Cocoon-2.1.9 vs. SOLR-20 SOLR-30

2006-11-21 Thread Walter Underwood
On 11/20/06 5:51 PM, Yonik Seeley [EMAIL PROTECTED] wrote:

 Now that I think about it though, one nice change would be to get rid
 of the long stack trace for 400 exceptions... it's not needed, right?

That is correct. A client error (400) should not be reported with a
server stack trace. --wunder



Phonetic Token Filter

2006-11-21 Thread Walter Underwood
I've written a simple phonetic token filter (and factory) based
on the Double Metaphone implementation in Jakarta Codecs to
contribute. Three questions:

1. Does this sound like a generally useful addition?

2. Should we have a Jira issue first?

3. This adds a dependency on the codecs jar. How do we add that
to the distro?

The code is very simple, but I need to learn the contribution
process and build some tests, so this won't happen in one day.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: Cocoon-2.1.9 vs. SOLR-20 SOLR-30

2006-11-21 Thread Walter Underwood
One way to think about this is to assume caches, proxies, and load balancing
in the HTTP path, then think about their behavior. A 500 response may make
the load balancer drop this server from the pool, for example. A 200 OK
can be cached, so temporary errors shouldn't be sent with that code.

On 11/20/06 10:51 AM, Chris Hostetter [EMAIL PROTECTED] wrote:
 
 ...there's kind of a chicken/egg problem with this discussion ... the egg
 being what should the HTTP response look like in an 'error' situation
 the chicken being what is the internal API to allow a RequestHandler to
 denote an 'error' situation ... talking about specific cases only gets us
 so far since those cases may not be errors in all RequestHandlers.

We can get most of the benefit with a few kinds of errors: 400, 403, 404,
500, and 503. Roughly:

400 - error in the request, fix it and try again
403 - forbidden, don't try again
404 - not found, don't try again unless you think it is there now
500 - server error, don't try again
503 - server error, try again

These can be mapped from internal error types.
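
A minimal sketch of such a mapping; the enum and names are illustrative, not Solr's
actual error classes:

  enum ErrorType { BAD_REQUEST, FORBIDDEN, NOT_FOUND, SERVER_ERROR, UNAVAILABLE }

  final class HttpStatusMapper {
      static int toHttpStatus(ErrorType t) {
          switch (t) {
              case BAD_REQUEST: return 400;  // error in the request, fix it and try again
              case FORBIDDEN:   return 403;  // forbidden, don't try again
              case NOT_FOUND:   return 404;  // not found, retry only if it may exist now
              case UNAVAILABLE: return 503;  // server error, try again
              case SERVER_ERROR:
              default:          return 500;  // server error, don't try again
          }
      }
  }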

 the problem gets even more complicated when you try to answer the
 question: what should Solr do if an OutputWriter encounters an error? ...
 we can't generate a valid JSON response denoting an error if the
 JSONOutputWriter is failing :)

Write the response to a string before sending the headers. This can be
slower than writing the response out as it is computed, but the response
codes can be accurate. Also, it allows optimal buffering, so it might
scale better.

If you really want to handle failure in an error response, write that
to a string and if that fails, send a hard-coded string.

wunder
-- 
Walter Underwood
Search Guru, Netflix
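
A minimal sketch of the write-to-a-string-first idea; the Renderer interface is
illustrative, not Solr's actual response writer API:

  import java.io.IOException;
  import java.io.PrintWriter;
  import java.io.StringWriter;
  import javax.servlet.http.HttpServletResponse;

  public class BufferedResponder {
      public interface Renderer {
          void render(PrintWriter out) throws Exception;
      }

      /** Render the whole response first, then pick the status code based on the outcome. */
      public static void respond(HttpServletResponse resp, Renderer renderer)
              throws IOException {
          StringWriter buffer = new StringWriter();
          PrintWriter out = new PrintWriter(buffer);
          try {
              renderer.render(out);
              out.flush();
              resp.setStatus(HttpServletResponse.SC_OK);  // 200 only if rendering worked
              resp.getWriter().write(buffer.toString());
          } catch (Exception e) {
              // the headers have not been sent yet, so the error code can be accurate
              resp.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR, e.toString());
          }
      }
  }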




Re: Cocoon-2.1.9 vs. SOLR-20 SOLR-30

2006-11-17 Thread Walter Underwood
On 11/17/06 2:50 PM, Fuad Efendi [EMAIL PROTECTED] wrote:

 We should probably separate business-related end-user errors (such as when
 a user submits an empty query) and make it XML-like (instead of HTTP 400)

Speaking as a former web spider maintainer, it is very important to keep
the HTTP response codes accurate. Never return an error with a 200.

If we want more info, return an entity (body) with the 400 response.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: Adding Phonetic Search to Solr

2006-11-08 Thread Walter Underwood

On 11/8/06 10:30 AM, Chris Hostetter [EMAIL PROTECTED] wrote:

 : Also, the phonetic matches are ranked a bit high, so I'm trying a
 : sub-1.0 boost. I was expecting the lower idf to fix that automatically.
 : The metaphone will almost always have a lower idf because multiple
 : words are mapped to one metaphone, so the encoded term occurs in more
 : documents than the surface terms.
 
 That all makes sense, and yet it's not what you are observing ... which
 leads me to believe you (and I since i want to agree with you) are missing
 something subtle ... what does the Explanation look like for two
 documents where you feel like one should score higher than the other but
 they don't?

That is my next step. Maybe create some test documents in my corpus and
spend some quality time with Explain and grokking DisMax. I need to
customize Similarity anyway.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: [jira] Commented: (SOLR-66) bulk data loader

2006-11-07 Thread Walter Underwood
On 11/7/06 11:22 AM, Yonik Seeley (JIRA) [EMAIL PROTECTED] wrote:

 Yes, posting queries work because it's all form-data (query args).
 But, what if we want to post a complete file, *and* some extra info/parameters
 about how that file should be handled?

One approach is the Atom Publishing Protocol. That is pretty clear
about content and metainformation. It isn't designed to solve every
problem, but it handles a broad range of publishing, so it could be
a good fit for many uses of Solr.

APP is nearly finished. The latest draft is here (second URL also
has HTML versions).

 http://www.ietf.org/internet-drafts/draft-ietf-atompub-protocol-11.txt
 http://tools.ietf.org/wg/atompub/draft-ietf-atompub-protocol/

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: Adding Phonetic Search to Solr

2006-11-07 Thread Walter Underwood
On 11/7/06 2:30 PM, Mike Klaas [EMAIL PROTECTED] wrote:
 On 11/7/06, Walter Underwood [EMAIL PROTECTED] wrote:
 
 1. Adding fuzzy to the DisMax specs.
 
 What do you envisage the implementation looking like?

Probably continue with the template-like patterns already there.

  "title^2.0"   (search title field with boost of 2.0)
  "title~"  (search title field with fuzzy matching)

 2. Adding a phonetic token filter and relying on the per-field analyzer
 support.
 
 I'm not sure why any modification to solr would be necessary.  You
 could add a field with a phonetic analyzer and use copyField to copy
 your search fields to it.  Search will use the modified analyzer
 automatically.

Ah, I missed the analyzer example with a stock Lucene analyzer.
Oops. I still need to write an Analyzer, because there is no standard
phonetic search in Lucene today. There are some patches and addons
floating around.

Still, it seems like others might want to use a phonetic token
filter with the filter specs. I'd be glad to contribute that,
if others think it would be useful.

wunder
-- 
Walter Underwood
Search Guru, Netflix




Re: Adding Phonetic Search to Solr

2006-11-07 Thread Walter Underwood
On 11/7/06 3:26 PM, Mike Klaas [EMAIL PROTECTED] wrote:

 Is the state of the art in phonetic token generation reasonable?  I've
 been rather disappointed with some implementations (eg. SOUNDEX in
 MySQL, MSSQL).

SOUNDEX is excellent technology for its time, but its time was 1920.

Double Metaphone is far more complex and works fairly well. There is
an Apache commons codec implementation available. It is certainly
good enough for matching proper names, like "Moody" and "Mudie" or
"Cathy" and "Kathie".

There are some commercial phonetic coders, but I don't have any
experience with those.

wunder
-- 
Walter Underwood
Search Guru, Netflix
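
A minimal sketch of the encoding step such a token filter would perform, using the
commons-codec DoubleMetaphone class mentioned above (the token filter plumbing
itself is omitted):

  import org.apache.commons.codec.language.DoubleMetaphone;

  public class PhoneticDemo {
      public static void main(String[] args) {
          DoubleMetaphone encoder = new DoubleMetaphone();
          // each pair below should map to the same phonetic code
          for (String term : new String[] {"Moody", "Mudie", "Cathy", "Kathie"}) {
              System.out.println(term + " -> " + encoder.doubleMetaphone(term));
          }
      }
  }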




Re: [jira] Created: (SOLR-60) Remove overwritePending, overwriteCommitted flags?

2006-11-01 Thread Walter Underwood
+1 as well. --wunder

On 11/1/06 11:17 AM, Mike Klaas [EMAIL PROTECTED] wrote:

 +1
 
 On 11/1/06, Yonik Seeley (JIRA) [EMAIL PROTECTED] wrote:
 Remove overwritePending, overwriteCommitted flags?
 --
 
  Key: SOLR-60
  URL: http://issues.apache.org/jira/browse/SOLR-60
  Project: Solr
   Issue Type: Improvement
   Components: update
 Reporter: Yonik Seeley
 Priority: Minor
 
 
 The overwritePending, overwriteCommitted, allowDups flags seem needlessly
 complex and don't add much value.  Do people need/use separate control over
 pending vs committed documents?
 
 Perhaps all most people need is overwrite=true/false?
 
 overwritePending, overwriteCommitted were originally added because it was a
 (mis)feature that another internal search tool had.
 
 --
 This message is automatically generated by JIRA.
 -
 If you think it was sent incorrectly contact one of the administrators:
 http://issues.apache.org/jira/secure/Administrators.jspa
 -
 For more information on JIRA, see: http://www.atlassian.com/software/jira
 
 
 



Re: Copying the request parameters to Solr's response

2006-10-24 Thread Walter Underwood
Returning the query parameters is really useful. I'm not sure it
needs to be optional; they are small, and options multiply the test
cases.

It can even be useful to return the values of the defaults.

All those go into the key for any client side caching, for example.

wunder

On 10/24/06 1:55 AM, Erik Hatcher [EMAIL PROTECTED] wrote:

 I think it's a good idea, but it probably should be made optional.
 Clients can keep track of the state themselves, and keeping the
 response size as small as possible is valuable.  But it would be
 helpful in some situations for the client to get the original query
 context sent back too.
 
 Erik
 
 
 On Oct 24, 2006, at 4:20 AM, Bertrand Delacretaz wrote:
 
 Hi,
 
 I need to implement paging of Solr result sets, and (unless I have
 overlooked something that already exists) it would be useful to copy
 the request parameters to the output.
 
 I'm thinking of adding something like this to the XML output:
 
  responseHeader
  lst name=queryParameters
str name=qauthor:Leonardo/str
str name=start24/str
str name=rows12/str
   etc...
 
 I don't think the SolrParams class provides an Iterator to retrieve
 all parameters, I'll add one to implement this.
 
 WDYT?
 
 -Bertrand
 



Re: Solr NightlyBuild

2006-09-20 Thread Walter Underwood
I agree that a release would be useful for marketing, but I also
think it would help exercise the community and the release process.

I just discovered Solr on Friday and I've been telling people about
it, but every e-mail includes "you need to be OK with nightly builds."

Being OK with nightly builds means that you need to run your own
QA on the whole build every time you change. Kinda expensive.

wunder
--
Walter Underwood
Search Guru, Netflix



Re: double curl calls in post.sh?

2006-09-18 Thread Walter Underwood
Also, do not use text/xml. Even with a charset parameter. In a correct
implementation, that will override the XML declaration of charset.
With text/xml, the charset parameter must be correct. When it is
omitted, the content MUST be interpreted as US-ASCII (yuk).

Instead, use a media type of application/xml, so that the server
is allowed to sniff the content to discover the character encoding.

For the gory details, see RFC 3023:

  http://www.ietf.org/rfc/rfc3023.txt

wunder
==
Walter Underwood
Search Guru, Netflix

On 9/17/06 1:00 PM, Chris Hostetter [EMAIL PROTECTED] wrote:

 
 am i smoking crack, or is post.sh mistakenly sending every doc twice in a
 row? ...
 
 for f in $FILES; do
   echo Posting file $f to $URL
   curl $URL --data-binary @$f
   curl $URL --data-binary @$f -H 'Content-type:text/xml; charset=utf-8'
   echo
 done
 
 
 ...is there any reason not to delete that first execution of curl?
 
 
 
 -Hoss
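

For reference, a minimal Java sketch of posting an update document with the
application/xml media type recommended above; the URL and document body are
placeholders:

  import java.io.OutputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class PostXmlDemo {
      public static void main(String[] args) throws Exception {
          URL url = new URL("http://localhost:8983/solr/update");  // placeholder update URL
          byte[] body = "<add><doc><field name=\"id\">1</field></doc></add>".getBytes("UTF-8");

          HttpURLConnection conn = (HttpURLConnection) url.openConnection();
          conn.setRequestMethod("POST");
          conn.setDoOutput(true);
          // application/xml lets the receiver sniff the encoding per RFC 3023;
          // text/xml without a charset parameter defaults to US-ASCII
          conn.setRequestProperty("Content-Type", "application/xml");

          OutputStream out = conn.getOutputStream();
          out.write(body);
          out.close();

          System.out.println("HTTP " + conn.getResponseCode());
      }
  }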