Re: multi-field highlighting

2005-05-06 Thread markharw00d
Phrase highlighting (and spans) would certainly be useful, as would multi-field. Before we leap into adding code into the highlighter though I think it's worth considering what we are trying to fix here in a more general sense. As a basic principle I think highlighting should attempt to show the

Re: multi-field highlighting

2005-05-06 Thread markharw00d
Hi Martin, welcome to the group. >>You can see it in action here: Very nice work! I like the forward/backward links between hits. Unfortunately, it involves significant additions to the Lucene core. In essence it relies on an amped-up span system that is capable of scoring the spans, as well as rec

Re: multi-field highlighting

2005-05-10 Thread markharw00d
Doug Cutting wrote: Shouldn't the search code already take care of that? No, the search may return documents that happen to contain "Doug Cutting" and Google - the current highlighter implementation uses all query terms (ignoring any AND/OR() operators) and looks for matches. Ideally "Doug Cut

Term.compareTerm and MemoryIndex

2005-06-29 Thread markharw00d
Anyone have any objections to committing this addition to Term.java? http://www.mail-archive.com/java-dev@lucene.apache.org/msg00618.html It's a simple addition to avoid fieldName.intern() overheads by safely constructing new Term objects from existing Term objects and re-using its pre-inte

[VOTE] Wolfgang as committer

2005-07-01 Thread markharw00d
I'd like to propose Wolfgang Hoschek should be given commit rights to maintain his MemoryIndex contribution. Thoughts?

Re: regex-based query contribution

2005-10-13 Thread markharw00d
Sounds like a very useful addition but as yet another variant of "term expanding" queries (fuzzy/prefix/range/wildcard) now might be a good time to re-raise the scoring issue I originally identified here with all such queries: http://issues.apache.org/jira/browse/LUCENE-329 The issue is that "

Re: Highlighter using spans

2005-11-13 Thread markharw00d
Hi Erik, I posted what I thought would be the best approach to fixing this here along with pointers to some existing code: http://marc.theaimsgroup.com/?l=lucene-user&m=112496111224218&w=2 Unfortunately I've been way too caught up in other work lately to implement this and this particular "

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread markharw00d
I was thinking about the challenges of holding a score per document recently whilst trying to optimize the Lucene-embedded-in-Derby/HSQLDB code. I found myself actually wanting to visualize the problem and to see the distribution of scores for a query in a graphical form eg how sparse the resu

Re: "Advanced" query language

2005-12-02 Thread markharw00d
Whatever is generating the XML could just as easily create/instantiate the query objects. Yes, it is easier using the existing Java objects to construct queries but they are inappropriate when you consider the scenarios 1 to 3 I outlined earlier (query persistence, support for clients wri

Re: "Advanced" query language

2005-12-03 Thread markharw00d
Erik Hatcher wrote: Rest assured that human-readable query expressions aren't going away at all. I don't think Mark even implied that. That's right. The proposal is *not* to replace what is already there - QueryParser will always have a useful role to play supporting the "Google-like" que

Re: "Advanced" query language

2005-12-04 Thread markharw00d
Paul Elschot wrote: Would it be possible to provide such a GUI automatically (by introspection) given a set of Query classes of which objects can be mixed to form a query? Certainly possible - I've seen app servers with automatic GUI test clients which can introspect an EJB interface and l

Re: "Advanced" query language

2005-12-04 Thread markharw00d
I think I'm with Erik on this - I generally don't see end users keen to type anything other than "words with spaces" as queries. I do see them commonly using GUI forms with multiple inputs and behind the scenes application code assembling the query - the same way just about every web app in the

Re: "Advanced" query language

2005-12-05 Thread markharw00d
Erik's scenario pretty much nails it for me. I prefer the Ant-like XML approach over a Spring one because all the messy classnames are removed from document instances. ( I wasn't suggesting we use either technology, merely citing them as object assembly languages). Haven't seen HiveMind/Digest

"Advanced" query language

2005-12-15 Thread markharw00d
After our recent discussions on this topic I've found some time to put together a first cut of a SAX based Query parser, see here: http://www.inperspective.com/lucene/LXQueryV0_1.zip I've implemented just a few queries (Boolean, Term, FilteredQuery, BoostingQuery ...) but other queries are fai

Re: "Advanced" query language

2006-01-02 Thread markharw00d
I thought you said you "didn't really want to have to design a general API for parsing XML as part of this project" ? :) Having grown tired of messing with my own solution I tried using commons Digester with my example XML but ran into issues so I'm back looking at a custom solution. I'

Re: Preventing "killer" queries

2006-02-07 Thread markharw00d
[Answering my own question] I think a reasonable solution is to have a generic analyzer for use at query-time that can wrap my application's choice of analyzer and automatically filter out what it sees as stop words. It would initialize itself from an IndexReader and create a StopFilter for th

XML based Query Parser

2006-02-21 Thread markharw00d
Further to our discussions some time ago I've had some time to put together an XML-based query parser with support for many "advanced" query types not supported in the current Query parser. More details and code here: http://www.inperspective.com/lucene/LXQuery2.htm Cheers Mark

Re: XML based Query Parser

2006-02-27 Thread markharw00d
Hi Chris, Thanks for taking the time to look at this in detail. 1) The factory classes should have "removeBuilder" methods so people subclassing parsers can flat out remove support for a particular tag, not just replace it. Can do. 2) This DOM version definitely seems easier to follow

BooleanFilter proposal

2006-03-24 Thread markharw00d
What do folks feel about a BooleanFilter which is the equivalent of BooleanQuery ie a filter which contains other filters, combined with the same "must", "should" or "must not" logic. I know we already have ChainedFilter in the "misc" section of the contrib area but its methods do not echo the
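
A minimal sketch of the "must" / "should" / "must not" combination described above, using plain java.util.BitSet instead of Lucene's Filter API. The class and method names are hypothetical, and treating an empty "should" list as "all documents are candidates" is an assumption; the post does not specify that case.

```java
import java.util.BitSet;
import java.util.List;

public class BooleanFilterSketch {

    /** Convenience helper: builds a BitSet with the given doc ids set. */
    static BitSet bits(int... docIds) {
        BitSet b = new BitSet();
        for (int id : docIds) {
            b.set(id);
        }
        return b;
    }

    /**
     * Combines per-clause doc-id sets the way BooleanQuery combines clauses:
     * "should" sets are ORed, "must" sets are ANDed, "must not" sets are removed.
     * Assumption: with no "should" clauses, every doc in [0, maxDoc) is a candidate.
     */
    static BitSet combine(int maxDoc, List<BitSet> must, List<BitSet> should, List<BitSet> mustNot) {
        BitSet result = new BitSet(maxDoc);
        if (should.isEmpty()) {
            result.set(0, maxDoc);          // start from "all documents"
        } else {
            for (BitSet s : should) {
                result.or(s);               // union of optional clauses
            }
        }
        for (BitSet m : must) {
            result.and(m);                  // every required clause must match
        }
        for (BitSet n : mustNot) {
            result.andNot(n);               // prohibited clauses remove docs
        }
        return result;
    }
}
```

Because each clause is just a set of doc ids, nesting falls out naturally: the result of one combine call can be passed as a clause to another.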

SpanQuery doesn't implement extractTerms

2006-03-28 Thread markharw00d
Is there any particular reason why SpanQuery introduced public Collection getTerms() when we have: public void extractTerms(Set terms) in Query? I am changing the highlighter to make use of extractTerms and can either add this to SpanQuery: public void extractTerms(Set terms) { term

Re: [VOTE] 2.0 release this Friday?

2006-05-22 Thread markharw00d
+1

Re: Lucene and Java 1.5

2006-05-30 Thread markharw00d
Besides what has already been covered: Lucene Query and Filter objects are marked as Serializable so a remote client can serialize a request to a server which then rewrites and executes the request. This allows for a Webstart or applet-based architecture where the client can construct and send

Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)

2006-06-16 Thread markharw00d
>1.5 IS the Java version that the majority Lucene users use, not 1.4! >Does this mean we can now start accepting 1.5 code? This isn't simply about which JVM gets used the most wins. This is about "how many Lucene users will we inconvenience or lose by moving to 1.5?" Right now the survey sampl

Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)

2006-06-19 Thread markharw00d
>>One point that I feel keeps getting ignored is that we are talking about the _future_ releases. >>My guess is that we won't see a major new Lucene release before 2007, and by that time the latest JVM will probably be 1.6. I think that's a non-argument as it is common practice for people to w

Re: highlight - scoring fragments with more of the same token

2006-09-26 Thread markharw00d
If you were to score repeated terms then I suspect it would have to be done so that the repetitions didn't score as highly as the first occurrence - otherwise f2 could be selected as a better fragment than f3 for the query q1 in your example. Repetitions of a term in a fragment could be scored a
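
One way to picture repetitions scoring lower than the first occurrence is a geometric decay. This is only a sketch under stated assumptions: the class name, the unit weight per term and the 0.5 decay factor are all hypothetical and are not taken from the highlighter. With this decay, any number of repeats of one term contributes less than 2, so repeats alone never outscore a fragment containing two distinct query terms of equal weight.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RepeatAwareFragmentScore {

    /** Hypothetical decay: each repeat of a term is worth half the previous occurrence. */
    static final double DECAY = 0.5;

    /**
     * Scores a fragment: the first occurrence of a query term adds 1.0,
     * the second 0.5, the third 0.25, and so on. Non-query tokens add nothing.
     */
    static double score(List<String> fragmentTokens, Set<String> queryTerms) {
        Map<String, Integer> seen = new HashMap<>();
        double total = 0.0;
        for (String token : fragmentTokens) {
            if (!queryTerms.contains(token)) {
                continue;
            }
            // merge returns the updated count, so subtract 1 to get prior occurrences
            int priorOccurrences = seen.merge(token, 1, Integer::sum) - 1;
            total += Math.pow(DECAY, priorOccurrences);
        }
        return total;
    }
}
```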

Re: highlight - scoring fragments with more of the same token

2006-09-26 Thread markharw00d
I was somewhat surprised to find that highlighting scoring simply counts how many unique query terms appear in the fragment. Guess was expecting a See QueryScorer(Query query, IndexReader reader, String fieldName) constructor - this will factor IDF into weighting for terms. Query boosts are aut

Re: Duplicate MoreLikeThis.java

2006-11-24 Thread markharw00d
I believe they are the same but the one to keep is in contrib/queries. The "queries" directory was suggested as a better location for organising contrib code - see here: http://www.gossamer-threads.com/lists/lucene/java-dev/32872#32872 I chose to copy MoreLikeThis to contrib/queries and not

Re: Lius into apache incubator

2007-01-31 Thread markharw00d
I would prefer to see a good open-source framework pulling together a collection of document parsers but which isn't tied directly to Lucene (that binding would be via *another* project). If the parser framework extracted document text in a standard document-and-application-neutral form (XML/Jav

Re: [jira] Created: (LUCENE-798) Factory for RangeFilters that caches sections of ranges to reduce disk reads

2007-02-12 Thread markharw00d
Thanks for the pointer, Robert. Was that the "MultiSegmentQueryFilter enhancement for interactive indexes?" thread? If so, was that not a solution to caching sets in the event of variable index content rather than my approach which is for caching sets in the event of variable query ranges? M

Re: [jira] Created: (LUCENE-798) Factory for RangeFilters that caches sections of ranges to reduce disk reads

2007-02-13 Thread markharw00d
Chris Hostetter wrote: i haven't read the patch, but based on the jira description i don't think this is attempting to reuse cached info across updates -- i think it's just trying to address the issue of eliminating redundant TermEnum/TermDoc iterating for similar ranges. Correct. I have conve

Re: BloomFilter-s with Lucene

2009-01-30 Thread markharw00d
Andrzej Bialecki wrote: Funny, I was having vague thoughts about this today too having been concerned about some of the big arrays that can end up in a typical Lucene app. Aside from providing space-efficient lookups, another application for BloomFilters is in similarity measures e.g. ANDing 2

Re: WebLuke - include Jetty in Lucene binary distribution?

2008-04-24 Thread markharw00d
indows.jar contains the Java2Javascript compiler necessary for building and alone accounts for 10 mb. Including Jetty adds another ~6 mb on top of that. OK with this? -Grant On Dec 9, 2007, at 4:03 PM, markharw00d wrote: I've got a web-based version of Luke I'm happy to commit

Re: Moving SweetSpotSimilarity out of contrib

2008-09-03 Thread markharw00d
>>Another important driver is the "out-of-the-box experience". >>we need a "standard distro" ...which would be the core plus cherry-pick certain important contrib modules (highlighter, >> SweetSpotSimilarity,snowball, spellchecker, etc.) and bundle them together. Is that not Solr, or at least

Adding dependency to servlet-api

2008-11-04 Thread markharw00d
I'd like to add a web-based demo for the XML QueryParser but unlike the existing web demo I'd prefer to use some Java code that gets compiled rather than doing it all in JSP files that aren't part of the build. Doing it this way will add a dependency on servlet-api.jar which will need to be add

Re: [jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-08 Thread markharw00d
The problem with that is that in most cases you still need a "string" based syntax that "people" can enter... The XML syntax includes a tag for embedding user input of this type. I guess you can always have an "advanced search" page that builds and submits the XML query behind the scenes.

Re: [jira] Created: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2008-12-10 Thread markharw00d
>>If wildcards and fuzzies are supported, why not range? Because ranges don't rewrite to a BooleanQuery full of TermQueries that I can easily inspect. Unlike fuzzy/wildcard/boolean I suspect they are not that generally useful as part of phrase query expressions. Feel free to tinker with t

Re: [jira] Updated: (LUCENE-725) NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all "boilerplate" text

2007-03-21 Thread markharw00d
The Analyzer keeps a window of (by default) the last 300 documents. Every token created in these cached documents is stored for reference and as new documents arrive their token sequences are examined to see if any of the sequences was seen before, in which case the analyzer does not emit them

Re: Packaging Lucene 2.1.0 for Debian; found 2 junit errors

2007-04-16 Thread markharw00d
Sami Siren wrote: I also saw those when I did my maven trials. I didn't dig any deeper. Fixed the highlighter problem in this report - see change here: http://svn.apache.org/viewvc?view=rev&revision=529417. Cheers, Mark

How to handle servlet-api.jar in build?

2007-06-12 Thread markharw00d
As part of the documentation push I was considering putting together an updated demo web app which showed a number of things (indexing, search, highlighting, XML Query templates etc) and was wondering what that might mean to the build system if I was dependent on the servlet API. Are there any

Re: Fwd: Decouple Filter from BitSet: API change and xml query parser

2007-08-10 Thread markharw00d
Right, the only thing left is then how to get a Matcher from this iterator. I think the iterator *is* the equivalent of the Matcher as you've described it - a Scorer without the scores used once by a single thread to iterate across a set of doc ids. I suppose the Filter criterion is a

DuplicatesFilter - one for contrib?

2007-09-30 Thread markharw00d
I've put together a new Filter and Junit test for eliminating duplicates from search results. The typical usage scenario is where multiple documents exist in the index which share an untokenized field value (e.g. the same primary key or URL). It is desirable to keep copies in the index becaus
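
A minimal sketch of the idea, independent of Lucene's Filter API: given the untokenized key value of each document, keep one document per key and drop the rest. The class and method names are hypothetical, and keeping the last (highest doc id) copy is an assumption; the post does not say which copy is retained.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class DuplicatesFilterSketch {

    /**
     * Returns a BitSet with exactly one doc id set per distinct key.
     * keyByDoc[i] is the key field value (e.g. primary key or URL) of doc i.
     * Assumption: the last document seen for a key is the one to keep.
     */
    static BitSet keepLastPerKey(String[] keyByDoc) {
        Map<String, Integer> lastDocForKey = new HashMap<>();
        for (int doc = 0; doc < keyByDoc.length; doc++) {
            lastDocForKey.put(keyByDoc[doc], doc);   // later docs overwrite earlier ones
        }
        BitSet keep = new BitSet(keyByDoc.length);
        for (int doc : lastDocForKey.values()) {
            keep.set(doc);
        }
        return keep;
    }
}
```

For keys `{"a", "b", "a", "c", "b"}` this keeps docs 2, 3 and 4; the earlier copies of "a" and "b" stay in the index but are filtered out of results.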

Re: Geographical indexing in Lucene

2007-10-01 Thread markharw00d
Great work, Evgeny! I'm certainly interested in this area and will be dissecting this in some detail. I've done similar work before but making use of JTS (Java Topology Suite), using the OpenGIS standards for spatial features/queries and 2-pass spatial queries (first rough pass is MBB only,

Possibility of introducing non-Apache license DTDDoc into build system?

2007-10-02 Thread markharw00d
I've put together DTDs documenting the full XML query syntax and used comments/examples that a DTDDoc ant task can turn into useful hyperlinked HTML help documents. Anyone know what the procedure/licensing implications are if I wanted to add a build dependency on DTDDoc? It looks like historic

Re: Possibility of introducing non-Apache license DTDDoc into build system?

2007-10-03 Thread markharw00d
>>it's just a tool for generating HTML docs from the DTD right? Yep. >>the generated HTML docs could be committed to the repository It would be more convenient to build the docs automatically on the servers rather than upload generated copies manually but I can see that may not be possible. I

Re: Web-based Luke

2007-11-14 Thread markharw00d
Are there any licensing issues with GWT? (I've never used it) OK on that score I think : http://code.google.com/webtoolkit/terms.html

First cut at web-based Luke for contrib

2007-11-28 Thread markharw00d
Any takers to test this contrib layout before I commit it? http://www.inperspective.com/lucene/webluke.zip This is a (17MB) zip file which you can unzip to a new "webluke" directory under your copy of lucene/contrib and then run the usual Lucene Ant build ( or at least "ant build-contrib"). Y

Re: First cut at web-based Luke for contrib

2007-11-28 Thread markharw00d
The 17 MB bundle I provided is essentially the source plus dependencies, the bulk of which is jars, mainly the compile-time dependency gwt-dev-windows.jar weighing in at 10MB. The built WAR file is only 1.5 meg. The WAR file bundled with Jetty (as a convenience) is 8 meg. It may be possible to

WebLuke - include Jetty in Lucene binary distribution?

2007-12-09 Thread markharw00d
I've got a web-based version of Luke I'm happy to commit to contrib now. This version includes some tidy up for developers working on Luke. Eclipse .project and .classpath files have build path variables defined to cater for different install locations for GWT in development environments. Ful

Re: Out of memory - CachingWrapperFilter and multiple threads

2008-02-19 Thread markharw00d
I now think the main issue here is that a busy JVM gets into trouble trying to find large free blocks of memory for large bitsets. In my index of 64 million documents, ~8meg of contiguous free memory must be found for each bitset allocated. The terms I was trying to cache had 14 million entries
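
The ~8meg figure follows from one bit per document. A small sketch of the arithmetic, assuming a bitset backed by an array of 64-bit words (as java.util.BitSet is); the class and method names are hypothetical.

```java
public class BitSetFootprint {

    /**
     * Bytes of backing storage needed for a bitset holding one bit per document,
     * assuming storage in 64-bit words allocated as one contiguous array.
     */
    static long bytesForDocs(long maxDoc) {
        long words = (maxDoc + 63) / 64;   // round up to whole 64-bit words
        return words * 8;                  // 8 bytes per word
    }

    public static void main(String[] args) {
        // 64 million docs -> 1,000,000 words -> 8,000,000 bytes
        System.out.println(bytesForDocs(64_000_000L));   // prints 8000000
    }
}
```

Each cached bitset needs that many bytes as a single contiguous array regardless of how many bits are set, which is why a fragmented heap can fail the allocation even when total free memory is sufficient.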