Phrase highlighting (and spans) would certainly be useful, as would
multi-field.
Before we leap into adding code into the highlighter though I think it's
worth considering what we are trying to fix here in a more general sense.
As a basic principle I think highlighting should attempt to show the
Hi Martin, welcome to the group.
>>You can see it in action here:
Very nice work! I like the forward/backward links between hits.
Unfortunately, it involves significant additions to the Lucene core. In essence it relies on an amped-up span system that is capable of scoring the spans, as well as rec
Doug Cutting wrote:
Shouldn't the search code already take care of that?
No, the search may return documents that happen to contain "Doug
Cutting" and Google - the current highlighter implementation uses all
query terms (ignoring any AND/OR() operators) and looks for matches.
Ideally "Doug Cut
Anyone have any objections to committing this addition to Term.java?
_http://www.mail-archive.com/java-dev@lucene.apache.org/msg00618.html_
It's a simple addition to avoid fieldName.intern() overheads by safely
constructing new Term objects from existing Term objects and re-using
its pre-inte
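The optimisation can be sketched in plain Java. This is a hypothetical illustration of the intern-reuse idea, not the actual Term.java patch: field names are interned so comparisons can use ==, and deriving a new Term from an existing one lets us skip a second String.intern() call.

```java
// Hypothetical sketch (not the real Lucene source): a Term whose field
// is always interned, with a factory method that reuses the interned
// field string instead of paying intern() again.
public class Term {
    private final String field;  // always interned
    private final String text;

    // Public constructor: must pay the String.intern() cost.
    public Term(String field, String text) {
        this.field = field.intern();
        this.text = text;
    }

    // Private constructor trusts the caller to pass an interned field.
    private Term(String internedField, String text, boolean trusted) {
        this.field = internedField;
        this.text = text;
    }

    // Derive a Term for the same field, reusing the interned string.
    public Term createTerm(String newText) {
        return new Term(field, newText, true);
    }

    public String field() { return field; }
    public String text() { return text; }

    public static void main(String[] args) {
        Term t1 = new Term(new String("contents"), "lucene");
        Term t2 = t1.createTerm("highlighter");
        // Same interned field instance; no second intern() call made.
        System.out.println(t1.field() == t2.field());  // true
    }
}
```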
I'd like to propose Wolfgang Hoschek should be given commit rights to
maintain his MemoryIndex contribution.
Thoughts?
Sounds like a very useful addition but as yet another variant of "term
expanding" queries (fuzzy/prefix/range/wildcard) now might be a good
time to re-raise the scoring issue I originally identified here with all
such queries: http://issues.apache.org/jira/browse/LUCENE-329
The issue is that "
Hi Erik,
I posted what I thought would be the best approach to fixing this here
along with pointers to some existing code:
http://marc.theaimsgroup.com/?l=lucene-user&m=112496111224218&w=2
Unfortunately I've been way too caught up in other work lately to
implement this and this particular "
I was thinking about the challenges of holding a score per document
recently whilst trying to optimize the Lucene-embedded-in-Derby/HSQLDB
code.
I found myself actually wanting to visualize the problem and to see the
distribution of scores for a query in a graphical form, e.g. how sparse the
resu
Whatever is generating the xml could just as easily create/instantiate the
query objects.
Yes, it is easier using the existing Java objects to construct queries
but they are inappropriate when you consider the scenarios 1 to 3 I
outlined earlier (query persistence, support for clients wri
Erik Hatcher wrote:
Rest assured that human-readable query expressions aren't going away
at all. I don't think Mark even implied that.
That's right. The proposal is *not* to replace what is already there -
QueryParser will always have a useful role to play supporting the
"Google-like" que
Paul Elschot wrote:
Would it be possible to provide such a GUI automatically
(by introspection) given a set of Query classes of which objects
can be mixed to form a query?
Certainly possible - I've seen app servers with automatic GUI test
clients which can introspect an EJB interface and l
I think I'm with Erik on this - I generally don't see end users keen to
type anything other than "words with spaces" as queries.
I do see them commonly using GUI forms with multiple inputs and behind
the scenes application code assembling the query - the same way just
about every web app in the
Erik's scenario pretty much nails it for me.
I prefer the Ant-like XML approach over a Spring one because all the
messy classnames are removed from document instances. (I wasn't
suggesting we use either technology, merely citing them as object
assembly languages). Haven't seen HiveMind/Digest
After our recent discussions on this topic I've found some time to put
together a first cut of a SAX based Query parser, see here:
http://www.inperspective.com/lucene/LXQueryV0_1.zip
I've implemented just a few queries (Boolean, Term, FilteredQuery,
BoostingQuery ...) but other queries are fai
I thought
you said you "didn't really want to have to design a general API for
parsing XML as part of this project" ? :)
Having grown tired of messing with my own solution I tried using commons
Digester with my example XML but ran into issues so I'm back looking at
a custom solution.
I'
[Answering my own question]
I think a reasonable solution is to have a generic analyzer for use at
query-time that can wrap my application's choice of analyzer and
automatically filter out what it sees as stop words. It would initialize
itself from an IndexReader and create a StopFilter for th
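The wrapping-analyzer idea can be sketched in plain Java. This is a simplified, hypothetical illustration: in the real thing the document frequencies would come from an IndexReader, so here a plain Map stands in for it, and any term appearing in more than a given fraction of documents is treated as a stop word at query time.

```java
import java.util.*;

// Sketch: derive stop words from the index's own term statistics rather
// than a fixed list. Terms whose docFreq/numDocs exceeds the threshold
// are silently filtered from queries.
public class AutoStopWords {
    private final Set<String> stopWords = new HashSet<>();

    public AutoStopWords(Map<String, Integer> docFreqs,
                         int numDocs, float maxDocFreqFraction) {
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if ((float) e.getValue() / numDocs > maxDocFreqFraction) {
                stopWords.add(e.getKey());  // too common: treat as stop word
            }
        }
    }

    // Query-time filtering: drop the auto-detected stop words.
    public List<String> filter(List<String> queryTerms) {
        List<String> kept = new ArrayList<>();
        for (String t : queryTerms) {
            if (!stopWords.contains(t)) kept.add(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new HashMap<>();
        df.put("the", 950); df.put("lucene", 40); df.put("index", 120);
        AutoStopWords asw = new AutoStopWords(df, 1000, 0.4f);
        System.out.println(asw.filter(Arrays.asList("the", "lucene", "index")));
        // "the" (95% of docs) is dropped; "lucene" and "index" survive
    }
}
```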
Further to our discussions some time ago I've had some time to put
together an XML-based query parser with support for many "advanced"
query types not supported in the current Query parser.
More details and code here: http://www.inperspective.com/lucene/LXQuery2.htm
Cheers
Mark
Hi Chris,
Thanks for taking the time to look at this in detail.
1) The factory classes should have "removeBuilder" methods so people
subclassing parsers can flat out remove support for a particular
tag, not just replace it.
Can do.
2) This DOM version definitely seems easier to follow
What do folks feel about a BooleanFilter which is the equivalent of
BooleanQuery ie a filter which contains other filters, combined with the
same "must", "should" or "must not" logic.
I know we already have ChainedFilter in the "misc" section of the
contrib area but its methods do not echo the
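The combination logic being proposed can be sketched with java.util.BitSet. This is only an illustration of the must/should/must-not semantics, assuming each wrapped filter has already produced a document bitset (a real Lucene Filter would build its bits from an IndexReader):

```java
import java.util.BitSet;

// Sketch of a BooleanFilter: combine per-filter doc bitsets with the
// same MUST / SHOULD / MUST_NOT logic that BooleanQuery uses on clauses.
public class BooleanFilterDemo {
    public static BitSet combine(int numDocs, BitSet[] must,
                                 BitSet[] should, BitSet[] mustNot) {
        BitSet result = new BitSet(numDocs);
        if (should.length > 0) {
            for (BitSet s : should) result.or(s);   // union of SHOULDs
        } else {
            result.set(0, numDocs);                 // no SHOULD: start full
        }
        for (BitSet m : must) result.and(m);        // intersect MUSTs
        for (BitSet n : mustNot) result.andNot(n);  // subtract MUST_NOTs
        return result;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet(); a.set(0); a.set(1); a.set(2);
        BitSet b = new BitSet(); b.set(1); b.set(2); b.set(3);
        BitSet c = new BitSet(); c.set(2);
        BitSet docs = combine(8, new BitSet[]{a, b},
                              new BitSet[]{}, new BitSet[]{c});
        System.out.println(docs);  // {1}: in a AND b, not in c
    }
}
```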
Is there any particular reason why SpanQuery introduced
public Collection getTerms()
when we have:
public void extractTerms(Set terms)
in Query?
I am changing the highlighter to make use of extractTerms and can either
add this to SpanQuery:
public void extractTerms(Set terms) {
term
+1
Besides what has already been covered:
Lucene Query and Filter objects are marked as Serializable so a remote
client can serialize a request to a server which then rewrites and
executes the request. This allows for a Webstart or applet-based
architecture where the client can construct and send
>1.5 IS the Java version that the majority of Lucene users use, not 1.4!
>Does this mean we can now start accepting 1.5 code?
This isn't simply about which JVM gets used the most wins.
This is about "how many Lucene users will we inconvenience or lose by
moving to 1.5?"
Right now the survey sampl
>>One point that I feel keeps getting ignored is that we are talking
about the _future_ releases.
>>My guess is that we won't see a major new Lucene release before 2007,
and by that time the latest JVM will probably be 1.6.
I think that's a non-argument as it is common practice for people to
w
If you were to score repeated terms then I suspect it would have to be
done so that the repetitions didn't score as highly as the first
occurrence - otherwise f2 could be selected as a better fragment than f3
for the query q1 in your example.
Repetitions of a term in a fragment could be scored a
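One way the diminishing-repetition idea could look, sketched in plain Java (a hypothetical scoring scheme, not the highlighter's actual code): the first occurrence of a query term scores 1.0 and each repeat scores half the previous one, so a fragment matching two distinct terms always beats one matching a single term any number of times.

```java
import java.util.*;

// Sketch: geometric decay for repeated query terms in a fragment.
// Occurrences of one term score 1 + 0.5 + 0.25 + ... (capped below 2),
// so term variety outweighs repetition.
public class FragmentScorer {
    public static double score(List<String> fragmentTokens,
                               Set<String> queryTerms) {
        Map<String, Integer> seen = new HashMap<>();
        double total = 0.0;
        for (String tok : fragmentTokens) {
            if (!queryTerms.contains(tok)) continue;
            int n = seen.merge(tok, 1, Integer::sum);  // occurrence count
            total += Math.pow(0.5, n - 1);             // 1, 0.5, 0.25, ...
        }
        return total;
    }

    public static void main(String[] args) {
        Set<String> q = new HashSet<>(Arrays.asList("lucene", "search"));
        // one term repeated three times: 1 + 0.5 + 0.25 = 1.75
        System.out.println(score(Arrays.asList("lucene", "lucene", "lucene"), q));
        // two distinct terms: 1 + 1 = 2.0, so this fragment wins
        System.out.println(score(Arrays.asList("lucene", "search"), q));
    }
}
```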
I was somewhat surprised to find that highlighting scoring simply counts
how many unique query terms appear in the fragment. Guess I was expecting a
See QueryScorer(Query query, IndexReader reader, String fieldName) constructor
- this will factor IDF into weighting for terms. Query boosts are aut
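For reference, the IDF weighting being factored in follows the classic Lucene DefaultSimilarity formula; a small self-contained check of what it does (rarer terms get larger weights, so fragments containing rare query terms outscore those containing only common ones):

```java
// Classic Lucene DefaultSimilarity idf: log(numDocs / (docFreq + 1)) + 1.
// Rarer terms (small docFreq) yield larger weights.
public class IdfDemo {
    public static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        int numDocs = 1_000_000;
        System.out.println(idf(999_999, numDocs)); // 1.0: near-ubiquitous term
        System.out.println(idf(9, numDocs));       // ~12.5: rare term
    }
}
```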
I believe they are the same but the one to keep is in contrib/queries.
The "queries" directory was suggested as a better location for
organising contrib code - see here:
http://www.gossamer-threads.com/lists/lucene/java-dev/32872#32872
I chose to copy MoreLikeThis to contrib/queries and not
I would prefer to see a good open-source framework pulling together a
collection of document parsers but which isn't tied directly to Lucene
(that binding would be via *another* project).
If the parser framework extracted document text in a standard
document-and-application-neutral form (XML/Jav
Thanks for the pointer, Robert.
Was that the "MultiSegmentQueryFilter enhancement for interactive
indexes?" thread?
If so, was that not a solution to caching sets in the event of variable
index content rather than my approach which is for caching sets in the
event of variable query ranges? M
Chris Hostetter wrote:
i haven't read the patch, but based on the jira description i don't think
this is attempting to reuse cached info across updates -- i think it's
just trying to address the issue of eliminating redundant TermEnum/TermDoc
iterating for similar ranges.
Correct. I have conve
Andrzej Bialecki wrote:
Funny, I was having vague thoughts about this today too having been
concerned about some of the big arrays that can end up in a typical
Lucene app. Aside from providing space-efficient lookups, another
application for BloomFilters is in similarity measures e.g. ANDing 2
gwt-dev-windows.jar contains the Java2Javascript
compiler necessary for building and alone accounts for 10 mb. Including
Jetty adds another ~6 mb on top of that.
OK with this?
-Grant
On Dec 9, 2007, at 4:03 PM, markharw00d wrote:
I've got a web-based version of Luke I'm happy to commit
>>Another important driver is the "out-of-the-box experience".
>>We need a "standard distro" ...which would be the core plus
>>cherry-picking certain important contrib modules (highlighter,
>>SweetSpotSimilarity, snowball, spellchecker, etc.) and bundling them
>>together.
Is that not Solr, or at least
I'd like to add a web-based demo for the XML QueryParser but unlike the
existing web demo I'd prefer to use some Java code that gets compiled
rather than doing it all in JSP files that aren't part of the build.
Doing it this way will add a dependency on servlet-api.jar which will
need to be add
The problem with that is that in most cases you still need a "string"
based syntax that "people" can enter...
The XML syntax includes a tag for embedding user input of
this type.
I guess you can always have an "advanced search" page that builds and
submits the XML query behind the scenes.
>>If wildcards and fuzzyies are supported, why not range ?
Because Ranges don't rewrite to a BooleanQuery full of TermQueries that I
can easily inspect them. Unlike fuzzy/wildcard/boolean I suspect they
are not that generally useful as part of phrase query expressions. Feel
free to tinker with t
The Analyzer keeps a window of (by default) the last 300 documents.
Every token created in these cached documents is stored for reference
and as new documents arrive their token sequences are examined to see if
any of the sequences was seen before, in which case the analyzer does
not emit them
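A greatly simplified sketch of that mechanism in plain Java (all names hypothetical, and shingles of three tokens standing in for the full sequence matching): remember the token shingles of the last N documents and suppress any token falling inside a shingle already seen, so boilerplate repeated across documents is not emitted again.

```java
import java.util.*;

// Simplified sketch of duplicate-sequence suppression: a sliding window
// of the last N documents' 3-token shingles; tokens covered by a
// previously-seen shingle are dropped from the token stream.
public class DedupAnalyzer {
    private final int windowDocs;
    private final Deque<Set<String>> recent = new ArrayDeque<>();

    public DedupAnalyzer(int windowDocs) { this.windowDocs = windowDocs; }

    public List<String> analyze(List<String> tokens) {
        Set<String> seen = new HashSet<>();
        for (Set<String> doc : recent) seen.addAll(doc);

        boolean[] drop = new boolean[tokens.size()];
        Set<String> myShingles = new HashSet<>();
        for (int i = 0; i + 3 <= tokens.size(); i++) {
            String shingle = String.join(" ", tokens.subList(i, i + 3));
            myShingles.add(shingle);
            if (seen.contains(shingle)) {
                drop[i] = drop[i + 1] = drop[i + 2] = true; // repeated run
            }
        }
        recent.addLast(myShingles);                 // cache this document
        if (recent.size() > windowDocs) recent.removeFirst();

        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (!drop[i]) out.add(tokens.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        DedupAnalyzer a = new DedupAnalyzer(300);
        List<String> boiler = Arrays.asList("click", "here", "to", "unsubscribe");
        System.out.println(a.analyze(boiler));  // first sighting: emitted
        System.out.println(a.analyze(boiler));  // repeat: fully suppressed
    }
}
```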
Sami Siren wrote:
I also saw those when I did my maven trials. I didn't dig any deeper.
Fixed the highlighter problem in this report - see change here:
http://svn.apache.org/viewvc?view=rev&revision=529417.
Cheers,
Mark
--
As part of the documentation push I was considering putting together an
updated demo web app which showed a number of things (indexing, search,
highlighting, XML Query templates etc) and was wondering what that might
mean to the build system if I was dependent on the servlet API. Are
there any
Right, the only thing left is then how to get a Matcher from this iterator.
I think the iterator *is* the equivalent of the Matcher as you've
described it - a Scorer without the scores used once by a single thread
to iterate across a set of doc ids.
I suppose the Filter criterion is a
I've put together a new Filter and Junit test for eliminating duplicates
from search results.
The typical usage scenario is where multiple documents exist in the
index which share an untokenized field value (e.g. the same primary key
or URL). It is desirable to keep copies in the index becaus
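The core of such a filter can be sketched in plain Java (an illustration of the idea only, with the key field values supplied as an array rather than read from an IndexReader): keep one document per shared key so search results show a single hit per logical record while all copies stay in the index.

```java
import java.util.*;

// Sketch of duplicate elimination: keys[i] is the untokenized key field
// (primary key, URL) of document i; the returned bitset allows only one
// document per key through the filter (last occurrence wins here).
public class DuplicateFilterDemo {
    public static BitSet keepLastPerKey(String[] keys) {
        Map<String, Integer> lastDoc = new HashMap<>();
        for (int doc = 0; doc < keys.length; doc++) {
            lastDoc.put(keys[doc], doc);   // later docs overwrite earlier
        }
        BitSet bits = new BitSet(keys.length);
        for (int doc : lastDoc.values()) bits.set(doc);
        return bits;
    }

    public static void main(String[] args) {
        String[] urls = {"a.html", "b.html", "a.html", "c.html"};
        System.out.println(keepLastPerKey(urls));  // {1, 2, 3}: doc 0 dropped
    }
}
```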
Great work, Evgeny!
I'm certainly interested in this area and will be dissecting this in
some detail.
I've done similar work before but making use of JTS (Java Topology
Suite), using the OpenGIS standards for spatial features/queries and
2-pass spatial queries (first rough pass is MBB only,
I've put together DTDs documenting the full XML query syntax and used
comments/examples that a DTDDoc ant task can turn into useful
hyperlinked HTML help documents.
Anyone know what the procedure/licensing implications are if I wanted to
add a build dependency on DTDDoc? It looks like historic
>>it's just a tool for generating HTML docs from the DTD right?
Yep.
>>the generated HTML docs could be commited to the repository
It would be more convenient to build the docs automatically on the
servers rather than upload generated copies manually but I can see that
may not be possible.
I
Are there any licensing issues with GWT? (I've never used it)
OK on that score, I think:
http://code.google.com/webtoolkit/terms.html
Any takers to test this contrib layout before I commit it?
http://www.inperspective.com/lucene/webluke.zip
This is a (17MB) zip file which you can unzip to a new "webluke"
directory under your copy of lucene/contrib and then run the usual
Lucene Ant build ( or at least "ant build-contrib").
Y
The 17 MB bundle I provided is essentially the source plus dependencies,
the bulk of which is jars, mainly the compile-time dependency
gwt-dev-windows.jar weighing in at 10MB.
The built WAR file is only 1.5 meg.
The WAR file bundled with Jetty (as a convenience) is 8 meg.
It may be possible to
I've got a web-based version of Luke I'm happy to commit to contrib now.
This version includes some tidy up for developers working on Luke.
Eclipse .project and .classpath files have build path variables defined
to cater for different install locations for GWT in development
environments.
Ful
I now think the main issue here is that a busy JVM gets into trouble
trying to find large free blocks of memory for large bitsets.
In my index of 64 million documents, ~8meg of contiguous free memory
must be found for each bitset allocated. The terms I was trying to cache
had 14 million entries
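A back-of-envelope check of the figure quoted above: a plain bitset over a 64 million document index needs one bit per document, which is a single contiguous allocation of about 8 MB per cached filter.

```java
// One bit per document, rounded up to whole bytes: 64M docs -> ~8 MB
// that must be found as one contiguous block per bitset.
public class BitsetMemory {
    public static long bytesForBits(long numDocs) {
        return (numDocs + 7) / 8;
    }

    public static void main(String[] args) {
        System.out.println(bytesForBits(64_000_000L));  // 8000000 bytes
    }
}
```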