Re: How can we know if 2 lucene indexes are same?

2008-09-05 Thread markharw00d
I think this could be a generally useful feature? +1. I could definitely use a "commitUserData" option for the same reasons. Thinking more on this, we may not need to modify the index format at all for this use-case. This is easily achieved in the current system by adding a dummy document

Re: Lucene vs. Database

2008-10-01 Thread markharw00d
Pros of keeping content only in the database * Need only one stored copy of data (saved disk space) Pros of storing copy of content in Lucene: * A match is more easily explained If you collapse multiple DB fields into a single searchable field e.g. customer first name and surname database fiel

Re: Unique tokens analyzer

2008-10-14 Thread markharw00d
Related: https://issues.apache.org/jira/browse/LUCENE-725 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]

Re: How to combine filter in Lucene 2.4?

2008-11-09 Thread markharw00d
>>this can't be nearly as fast as OpenBitSet.intersect() or union, respectively, can it? I had a similar concern but it doesn't seem that bad: https://issues.apache.org/jira/browse/LUCENE-1187?focusedCommentId=12596546#action_12596546 The above test showed a slight improvement using bitset

Re: [ot] a reverse lucene

2008-11-23 Thread markharw00d
If you index the queries consider also that they can potentially be indexed in an optimised form. For example, take a phrase query for "Alonso Smith". You need only index one of these terms - an incoming document must contain both terms to be considered a match. If you chose to index this quer

Re: Poor QPS with highlighting

2009-02-03 Thread markharw00d
Can you describe this in a little more detail; I'm not exactly sure what you mean. Break your large text documents into multiple Lucene documents. Rather than dividing them up into entirely discrete chunks of text consider storing/indexing *overlapping* sections of text with an overlap as
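A minimal sketch of the overlapping-sections idea, in plain Java rather than Lucene code. The post is cut off before it says how large the overlap should be, so the chunk size and overlap used here are arbitrary assumptions, and the class and method names are hypothetical. It splits on character positions only; a real indexer would more likely split on token or sentence boundaries.

```java
import java.util.ArrayList;
import java.util.List;

class OverlappingChunks {
    // Splits text into windows of chunkSize characters; each window repeats
    // the last `overlap` characters of the previous one.
    static List<String> split(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("need 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window already reaches the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Each chunk would become its own (hypothetical) Lucene document.
        for (String chunk : split("abcdefghij", 4, 2)) {
            System.out.println(chunk);
        }
    }
}
```

Because neighbouring chunks share text, a phrase that straddles a chunk boundary still appears whole in at least one chunk, provided the phrase is no longer than the overlap.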

Re: Lucene 2.9

2009-03-09 Thread markharw00d
>>(a "write once" schema) I like this idea. Enforcing consistent field-typing on instances of fields with the same name does not seem like an unreasonable restriction - especially given the upsides to this. It doesn't dispense with all the full schema logic in Solr but seems like a useful ba

Re: Lucene Highlighting and Dynamic Summaries

2009-03-11 Thread markharw00d
If you can supply a Junit test that recreates the problem I think we can start to make progress on this. Amin Mohammed-Coleman wrote: Hi Apologies for re sending this mail. Just wondering if anyone has experienced the below. I'm not sure if this could happen due nature of document. It does

Re: Memory during Indexing

2009-03-11 Thread markharw00d
Hi Niels, See the javadocs for IndexWriter.setRAMBufferSizeMB() Cheers Mark Niels Ott wrote: Hi Lucene professionals! This may sound like a dumb beginner's question, but anyways: Can Lucene run out of memory during indexing? Should I use IndexWriter.flush() or .commit(), and if so, how ofte

Re: Is there a FilterQueryParser?

2007-09-18 Thread markharw00d
Scott Tiger wrote: I want get BooleanFilter contains two RangeFilters from query string. The XMLQueryParser may be of interest. See BooleanFilter.xml and CachedFilter.xml examples in the XMLQueryParser Junit tests. I typically use QueryTemplateManager to transform user input provided in a

Re: Lucene Queries Over User-Editable Dynamic Categories of Documents

2007-10-24 Thread markharw00d
lucene user wrote: Thanks for all your help! We are using Lucene 2.1.0 and TermsFilter seems to be new in Lucene 2.2.0. I have not been able to find SortedVIntList in the javadocs at all. No, SortedVIntList is in the patch I provided a link to earlier. Because both SortedVIntList and a r

Re: Speeding up highlighting by storing a cached TokenStream

2007-10-25 Thread markharw00d
Anyone care to suggest an approach to making this faster? See TokenSources.java Cheers Mark

Re: Help on FuzzyLikeThisQuery

2007-11-24 Thread markharw00d
Cool Coder wrote: >> Is there anyway I can specify which terms are "MUST", I mean they have to appear in the result and some terms are optional, One "hands off" approach you could try with this is to rewrite the fuzzyQuery and then set the minimum number of terms you want a match on. e.g.

Re: Why exactly are fuzzy queries so slow?

2007-11-24 Thread markharw00d
The added IO is one factor. Another is the CPU load from doing many edit-distance comparisons between index terms and the provided search term. You can limit the number of edit distance comparisons conducted by setting the minimum prefix length. This is a property of the QueryParser if parsing
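To illustrate why a minimum prefix length reduces CPU load, here is a plain-Java sketch. It is not Lucene's FuzzyQuery implementation: the class name, the term list, and the use of raw Levenshtein distance are all assumptions made for the example. Only terms sharing the required prefix with the search term go on to the (comparatively expensive) edit-distance step.

```java
import java.util.ArrayList;
import java.util.List;

class PrefixFuzzySketch {
    // Classic Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the index terms that would still need an edit-distance comparison
    // once the first prefixLength characters are required to match exactly.
    static List<String> candidates(String[] indexTerms, String searchTerm, int prefixLength) {
        String prefix = searchTerm.substring(0, Math.min(prefixLength, searchTerm.length()));
        List<String> result = new ArrayList<>();
        for (String term : indexTerms) {
            if (term.startsWith(prefix)) {
                result.add(term);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] terms = {"files", "filed", "film", "philes", "tiles"};
        System.out.println(candidates(terms, "files", 0).size() + " comparisons without a prefix");
        System.out.println(candidates(terms, "files", 2).size() + " comparisons with a prefix of 2");
    }
}
```

The trade-off is that a variant differing within the prefix (such as "philes" for "files") is never compared at all, so a longer prefix buys speed at the cost of recall.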

Re: Why exactly are fuzzy queries so slow?

2007-11-25 Thread markharw00d
can use Soundex and then if you're lucky files==philes but there's no room for error and they either match or they don't - there is no measure of similarity. There's no free lunch here. Timo Nentwig wrote: On Saturday 24 November 2007 18:28:48 markharw00d wrote: term. You can l

Re: Searching user-private annotations associated with indexed documents

2007-11-27 Thread markharw00d
lucene user wrote: Am I being clear? Now you are. I don't know what you mean by "PERSON_ANNOTATION works for Google". I suppose I meant annotations in the sense GATE and UIMA refer to annotations. They are like a highlighter pen marking a particular section of a document and adding me

Re: Lucene highlighting

2007-11-28 Thread markharw00d
I need to highlight an entire document as it is displayed See NullFragmenter

Re: Using multiple filters

2008-01-02 Thread markharw00d
BooleanFilter in contrib is similar to ChainedFilter but just expresses the boolean logic using the same vocabulary as BooleanQuery ("should"s, "must"s and "not"s). Cheers Mark Erick Erickson wrote: I think you can just throw them all together in a ChainedFilter and use the ChainedFilter wher
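The "should"/"must"/"not" vocabulary can be pictured with java.util.BitSet, one bit per document id. This is only a sketch of the boolean logic under assumed semantics (any "should", every "must", no "not"); it is not the contrib BooleanFilter source, and the class and method names are hypothetical.

```java
import java.util.BitSet;

class BooleanBitSets {
    // Builds a bit set with the given document ids switched on.
    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int id : docIds) {
            set.set(id);
        }
        return set;
    }

    // Combines per-clause bit sets: OR the "should" sets, AND the "must" sets,
    // then remove every document present in a "not" set.
    static BitSet combine(BitSet[] should, BitSet[] must, BitSet[] not, int maxDoc) {
        BitSet result = new BitSet(maxDoc);
        if (should.length > 0) {
            for (BitSet s : should) {
                result.or(s);
            }
        } else {
            result.set(0, maxDoc); // assumption: with no "should" clauses, start from all documents
        }
        for (BitSet m : must) {
            result.and(m);
        }
        for (BitSet n : not) {
            result.andNot(n);
        }
        return result;
    }

    public static void main(String[] args) {
        BitSet result = combine(
                new BitSet[] {bits(0, 1, 2), bits(2, 3)},
                new BitSet[] {bits(1, 2, 3, 4)},
                new BitSet[] {bits(3)},
                6);
        System.out.println(result);
    }
}
```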

Re: Stemming and highlighting

2008-01-04 Thread markharw00d
Let's say for the query algorithm, the word algorith is also a match, how do the highlighter know that it should also highlight occurrences of the word algorith? (I am not sure it does this anyway) The highlighter knows to highlight stemmed words because both the query terms and the docume

Re: Inverted search / Search on profilenet

2008-01-17 Thread markharw00d
There is a trick to indexing queries in this way... you need only index the rarest term in queries which have one or more mandatory terms. As an example - for the phrase query "XYZ Group limited" you need only index the rarest term "XYZ" and thus avoid selecting the query for execution with
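A rough plain-Java sketch of the rarest-term trick, with no Lucene dependency. The document frequencies, query ids, and class name are made up for illustration, and terms are assumed to be lowercased already. Each stored query is filed under its rarest mandatory term only; an incoming document then selects just the queries filed under terms it actually contains, and only those candidates would be run in full.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RarestTermQueryIndex {
    private final Map<String, List<String>> queriesByRarestTerm = new HashMap<>();

    // Picks the mandatory term with the lowest document frequency.
    // A term missing from docFreq is treated as frequency 0, i.e. rarest.
    static String rarestTerm(List<String> mandatoryTerms, Map<String, Integer> docFreq) {
        if (mandatoryTerms.isEmpty()) {
            throw new IllegalArgumentException("query needs at least one mandatory term");
        }
        String best = mandatoryTerms.get(0);
        for (String term : mandatoryTerms) {
            if (docFreq.getOrDefault(term, 0) < docFreq.getOrDefault(best, 0)) {
                best = term;
            }
        }
        return best;
    }

    // Files the query under its single rarest mandatory term.
    void addQuery(String queryId, List<String> mandatoryTerms, Map<String, Integer> docFreq) {
        String key = rarestTerm(mandatoryTerms, docFreq);
        queriesByRarestTerm.computeIfAbsent(key, k -> new ArrayList<>()).add(queryId);
    }

    // Returns the ids of queries worth executing against a document with these terms.
    List<String> candidates(Set<String> documentTerms) {
        List<String> result = new ArrayList<>();
        for (String term : documentTerms) {
            List<String> queries = queriesByRarestTerm.get(term);
            if (queries != null) {
                result.addAll(queries);
            }
        }
        return result;
    }
}
```

A document mentioning only "group" and "limited" never selects the "XYZ Group limited" query, which is the saving the post describes.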

Re: Matching w/in X% ?

2008-01-21 Thread markharw00d
See BooleanQuery.setMinimumNumberShouldMatch. Add the addresses as "SHOULD" termQuery clauses and set minimumNumberShouldMatch to the required value. Cheers Mark - Original Message From: Michael Prichard <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Monday, January 21, 200
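The effect of BooleanQuery.setMinimumNumberShouldMatch can be pictured in plain Java. This is a sketch of the matching rule only, not Lucene code; the class, the example addresses, and the choice to round the percentage up are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class MinShouldMatchSketch {
    // Converts "at least X% of the clauses" into a clause count, rounding up.
    static int requiredCount(int clauseCount, int percent) {
        return (int) Math.ceil(percent * clauseCount / 100.0);
    }

    // A document matches when at least minimumShouldMatch optional terms are present.
    static boolean matches(Set<String> docTerms, String[] shouldTerms, int minimumShouldMatch) {
        int hits = 0;
        for (String term : shouldTerms) {
            if (docTerms.contains(term)) {
                hits++;
            }
        }
        return hits >= minimumShouldMatch;
    }

    public static void main(String[] args) {
        String[] addresses = {"a@example.org", "b@example.org", "c@example.org", "d@example.org", "e@example.org"};
        Set<String> doc = new HashSet<>(Arrays.asList("a@example.org", "b@example.org", "c@example.org", "d@example.org"));
        int required = requiredCount(addresses.length, 80);
        System.out.println("required=" + required + " matches=" + matches(doc, addresses, required));
    }
}
```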

Re: Luke for Lucene 2.3?

2008-01-30 Thread markharw00d
I also read something about web-based Luke, but can't find it in the contrib in 2.3, is it part of Lucene 2.3? How do I use it? See here: http://www.mail-archive.com/[EMAIL PROTECTED]/msg13287.html I think we decided to hold off until after the Lucene 2.3 release before adding to contrib

Re: How to index word-pairs and phrases

2008-02-19 Thread markharw00d
Further to Grant's useful background - there is an analyzer specifically for multi-word terms in "contrib". See Lucene\contrib\analyzers\src\java\org\apache\lucene\analysis\shingle Cheers Mark Hi Ghinwa, A Term is simply a unit of tokenization that has been indexed for a Field, produced by a

Re: Which file in the lucene package is used to manipulate results..

2008-02-20 Thread markharw00d
Where can i find the information regarding embedding lucene with database. Thanks http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html http://issues.apache.org/jira/browse/LUCENE-434 Cheers Mark

Re: Contrib Highlighter and Phrase search

2008-03-18 Thread markharw00d
See https://issues.apache.org/jira/browse/LUCENE-794 Spencer Tickner wrote: Hi List, Thanks in advance for any help. I'm working with the contrib highlighting class and am having issues when doing searches with a phrase. I've been able to duplicate this behaviour in the HighlighterTest class.

Re: Indexing/Querying Annotations and Fields for a document

2008-03-18 Thread markharw00d
lucene-seme1 s wrote: Can you please share the custom Analyzer you have ? Unfortunately it's not mine to share but see the Lucene Token and Analyzer classes - it's not particularly hard to code.

Re: Please help with Gradient Formatter

2008-04-30 Thread markharw00d
Here you go: Analyzer a=new StandardAnalyzer(); //open an index String textFieldName="contents"; IndexReader reader=IndexReader.open("E:/indexes/uksites"); IndexSearcher searcher=new IndexSearcher(reader); QueryParser qp=new QueryParser(textFieldNa

Re: Getting irrelevant results using fuzzy query

2008-06-18 Thread markharw00d
This looks like it is related to an issue I first raised here: http://markmail.org/message/37ywsemfudpos6uh At the time I identified 2 issues with FuzzyQuery - that the usual "coord" and "idf" scoring factors shouldn't be applied to fuzzy queries. The coord factor got fixed but idf remains a

Re: How to avoid duplicate records in lucene

2008-07-19 Thread markharw00d
Sebastin wrote: Hi All, Is there any possibility to avoid duplicate records in lucene 2.3.1? At index-time or query time? See DuplicateFilter in contrib/queries for a query-time filter Cheers Mark

Re: How to avoid duplicate records in lucene

2008-07-21 Thread markharw00d
>>could you define duplicate? That's your choice of field that you want to de-dup on. That could be a field such as "DatabasePrimaryKey" or perhaps a field containing an MD5 hash of document content. The DuplicateFilter ensures only one document can exist in results for each unique value for th
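As a plain-Java illustration of de-duplicating on a chosen field, the sketch below keeps one document id per distinct key value and shows one way to derive an MD5 key from content. It is not the contrib DuplicateFilter; the names are hypothetical, and keeping the first occurrence rather than the last is an arbitrary choice made here.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DedupSketch {
    // keyByDocId[i] is the de-dup field value of document i
    // (for example a database primary key or a content hash).
    static List<Integer> firstDocPerKey(String[] keyByDocId) {
        Map<String, Integer> firstDoc = new LinkedHashMap<>();
        for (int docId = 0; docId < keyByDocId.length; docId++) {
            firstDoc.putIfAbsent(keyByDocId[docId], docId);
        }
        return new ArrayList<>(firstDoc.values());
    }

    // Hex-encoded MD5 of the content, usable as a de-dup key for identical text.
    static String md5Hex(String content) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(content.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String[] keys = {md5Hex("same text"), md5Hex("other text"), md5Hex("same text")};
        System.out.println(firstDocPerKey(keys));
    }
}
```

Note that an exact hash only catches byte-for-byte duplicates; content that is merely similar needs a different key.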

Re: Error in QueryTermExtractor.getTermsFromBooleanQuery

2006-11-26 Thread markharw00d
Nope, not seen that one. Looks like the reference to no such field is in the Java instance data sense, not the Lucene document sense. Class versioning issues somewhere? That method takes a parameter called "prohibited" which is the name of the field reported in the error. Is the word "prohibite

Re: Clustering Lucene with 40 Servers

2006-12-28 Thread markharw00d
Not quite yet gone up to this scale but here are some points for consideration based on a smaller scale system I have in production that may be of interest: By clustering I presume you are only talking about replication. When we talk about scaling and using multiple machines we need to think

Re: Multiword Highlighting

2007-01-26 Thread markharw00d
This is a deficiency in the highlighter functionality that has been discussed several times before. The summary is - not a trivial fix. See here for background: http://marc2.theaimsgroup.com/?l=lucene-user&m=114631181214303&w=1 http://www.gossamer-threads.com/lists/engine?do=post_view_printa

Re: Multiword Highlighting

2007-01-27 Thread markharw00d
people seem to just want to highlight the source text. Any words of wisdom would be sorely appreciated. - Mark markharw00d wrote: This is a deficiency in the highlighter functionality that has been discussed several times before. The summary is - not a trivial fix. See here for back

Re: Multiword Highlighting

2007-01-28 Thread markharw00d
>>For what it's worth Mark (Miller), there *is* a need for "just highlight the query terms without trying to get excerpts" functionality >>- something a la Google cache (different colours...mmm, nice). FWIW, the existing highlighter doesn't *have* to fragment - just pass a NullFragmenter to the

Re: More Precise Highlighting

2007-02-13 Thread markharw00d
Not sure I fully understand the problem. The query is effectively "allContent:someTitleText" and you want to highlight the string "someTitleText" in the title field? If you pass null as a fieldname to the QueryTermExtractor it will use all term values, regardless of field, as string to highlight

Re: Running Lucene as a stateless session bean

2007-02-20 Thread markharw00d
Be careful with your use of GATE and multiple threads. I recently had some trouble with their Factory.delete.. methods which ended up requiring a change to the core and this was applied to the 4.0 trunk. A 3.1 patch has not been released so you'll need to be using the latest from SVN (now requi

Re: Find related question

2007-03-10 Thread markharw00d
>>most of the body text is the same, but I want to group them all under one result. I created this analyzer class to identify content that was "mostly similar" but not necessarily identical. http://issues.apache.org/jira/browse/LUCENE-725 If you feed a small set of documents through it (say y

Re: search timeout

2007-03-17 Thread markharw00d
Chris Hostetter wrote: this is something anyone using the Lucene API can do as long as they use a HitCollector ... the Nutch impl seems to actually spin up a separate thread I'm keen to understand the pros and cons of these two approaches. With the HitCollector approach is this just engineer

Re: Reverse search

2007-03-25 Thread markharw00d
On app startup: 1) parse all Queries and place in an array. 2) Create a RAMIndex containing a doc for each query with content consisting of the query's terms (see Query.extractTerms). For optimal performance only index the most rare term for queries with multiple mandatory criteria e.g. Phrase

Re: Reverse search

2007-03-27 Thread markharw00d
rectly. Thanks Mélanie -Original Message- From: markharw00d [mailto:[EMAIL PROTECTED] Sent: Monday, March 26, 2007 12:36 AM To: java-user@lucene.apache.org Subject: Re: Reverse search On app startup: 1) parse all Queries and place in an array. 2) Create a RAMIndex containing a doc fo

Re: highlighter highlights another term

2007-04-15 Thread markharw00d
See the Junit test example for field-sensitive highlighting. If you pass a fieldname to the QueryScorer constructor it only considers query terms for that field; without one, the default is all fields. Cheers Mark

Re: FuzzyLikeThisQuery what does maxNumTerms mean

2007-05-09 Thread markharw00d
The shortlisting isn't based on stop words - a score is produced to prioritise term selections. The score uses the IDF (inverse document frequency) of the original term and mixes in the "edit-distance" for each of the fuzzy variations of original terms. Care is taken to ensure that in the query

Re: FuzzyLikeThisQuery what does maxNumTerms mean

2007-05-09 Thread markharw00d
bhecht wrote: Thanks Mark, I have updated my previous post I guess, before you had a chance to read it. Did you edit your post on Nabble? That edit didn't come through as a message to java-user so I didn't see it. You shouldn't need to call rewrite on your FuzzyLikeThisQuery unless you wan

Re: query syntax question

2007-05-10 Thread markharw00d
Here's a way to do it using the XML query parser in contrib 1) Create this query.xsl file (note use of cached double negative filter) xmlns:xsl="http://www.w3.org/1999/XSL/Transform";> upperTerm="z"/>

Re: Indexing large corpus of wikipedia

2007-05-30 Thread markharw00d
- Is it feasible to do it on a single machine with 1 GB of Physical Memory and 1.3GHz processor. Can lucene handle it efficiently. Yes, I've indexed all of English Wikipedia using 1GB RAM (with a 3Ghz processor) - Secondly, I wanted to know that when doing search does lucene load the whol

Re: Highlighter that works with phrase and span queries

2007-06-21 Thread markharw00d
Hi Mark, Good summary. I was running some timings earlier and my results echo your findings. >>I am currently trying to think of some possible hybrid approach to highlighting... I was thinking along the lines of wrapping some core classes such as IndexReader to somehow observe the query mat

Re: Pagination

2007-07-04 Thread markharw00d
It looks that we may have different cases. I was hoping to answer the original question which was how to retrieve pages of matching documents from a Lucene index (no database mentioned). >>So far worked just fine. I have 5000 rows of items and I think will still work fine later when I'd have

Re: Stop-words comparison in MoreLikeThis class in Lucene's contrib/queries project

2007-07-09 Thread markharw00d
>>the case matters only for those words that should be included. Jong, just want to check we're on the same page - you do know MoreLikeThis has a kind of automatic Stop-Wording built in, yes? MoreLikeThis looks at the document frequency of all terms in the "this" text you provide and only sele

Re: Stop-words comparison in MoreLikeThis class in Lucene's contrib/queries project

2007-07-09 Thread markharw00d
>>So I'm afraid I can't use the technique you recommend. ah right - so the TermVector you use from the index will return mixed and lower case versions of the same text. One point to note - this would mean that of the 25 or so top terms selected by MoreLikeThis for querying there is a reasonable

Re: Large Numeric RangeQueries

2005-12-11 Thread markharw00d
On a related topic: yesterday I posted a round-up of all the possible filtering options in Lucene with timings and example code to the WIKI : http://wiki.apache.org/jakarta-lucene/FilteringOptions One of the options demonstrated is along the lines of Chris's suggestion. Cheers, Mark

Re: Query Scoring

2005-12-31 Thread markharw00d
Sorry to contradict, Erik, but the Highlighter's QueryScorer will make use of IDF, given a reader, in order to better prioritise which are the "best" bits of a document. However, in the particular example given, the criteria includes several non-text fields which are not useful for IDF and gener

Re: scalability recommendations for large performance-intensive indexes

2006-02-08 Thread markharw00d
Hi Vince, sounds like the same issue I highlighted recently on the java-dev list. See here: http://www.nabble.com/Preventing-%22killer%22-queries-t1077895.html The problem lies in the underlying cost of reading TermDocs for very common terms (a problem for both queries and filters) For your

Re: Highlighting text for queries with huge numbers of terms

2006-02-17 Thread markharw00d
Hi Daniel/Chris, Unfortunately, the contrib/highlighter code in source control fails to meet our needs in two ways: 1. We don't just want fragments, we want *all* of the text, with highlights in the appropriate places (although we do offer a means to display just the fragments as w

Re: Lucene Scoring

2006-03-08 Thread markharw00d
[EMAIL PROTECTED] wrote: Anyone have a doc or something that would allow me to explain this to execs? Roughly speaking: * Documents containing *all* the search terms are good * Matches on rare words are better than for common words * Long documents are not as good as short ones * Documents wh

Re: RangeQuery, FilterdQuery and HitCollector

2006-03-09 Thread markharw00d
FilteredQuery has the side effect of passing zero scoring docs to the hitcollector. This does break the contract for HitCollector.collect method because the JavaDocs state: "Called once for every non-zero scoring document, with the document number and its score." The quick fix is to simply add a t

Re: Joins between index and database

2006-03-23 Thread markharw00d
See RangeFilter.bits() for some example code that creates a filter from terms. Also see TermsFilter in the "queries" module in the contrib section.

Re: Highlighter and complex queries

2006-04-29 Thread markharw00d
Hi Marios. >>Isn't this wrong? Yes but this is an itch that no one has been sufficiently bothered by to fix yet. I still haven't had the time or a desperate need to implement this so it will probably remain that way until someone feels strongly enough about the problem to fix it. Highligh

Re: Lucene search optimization

2006-05-31 Thread markharw00d
I tried the cityName:city~0.8, and it is still not fast enough.. something around 2 seconds... to return only 2 results... OK, so we trimmed down the search terms we actually used in the query but I suspect what you are seeing is the effect of having to perform edit-distance comparisons on ALL

Re: Lucene search optimization

2006-05-31 Thread markharw00d
See QueryParser.setFuzzyPrefixLength() This will apply to all fields parsed by the parser and is probably generally advisable anyway to avoid server CPU overload. Many production apps disable fuzzy searching completely in the search syntax for this reason.

Re: Any existing query types that support equivalent of "-not interested" ?

2006-06-30 Thread markharw00d
Erik Hatcher wrote: wouldn't this work? +interested -"not interested" Hi Erik. Yes, sorry brain is disengaged with all the heat here - my example wasn't great and my scenario may be more complex than I originally outlined. I may have 20 different ways of saying "interested" and want to q

Re: Any existing query types that support equivalent of "-not interested" ?

2006-06-30 Thread markharw00d
Maybe this: SpanNotQuery(interested, SpanNearQuery(not,interested)) with a SpanTermQuery for each term? Thanks, Paul. This is working well for me and I can happily use multiple SpanTermQueries embedded in a SpanOrQuery in place of each of the single words in your example. SpanNotQuery

Re: [Lucene2.0]How to not highlight keywords in some fields?

2006-09-26 Thread markharw00d
Pass a field name to the QueryScorer constructor. See "testFieldSpecificHighlighting" method in the Junit test for the highlighter for an example. Cheers Mark zhu jiang wrote: Hi all, For example, if I have a document with two fields text and num like this: text:foo bar

Re: how to get results without getting total number of found documents?

2006-09-26 Thread markharw00d
>>- get the top 1000 results WITHOUT executing query across whole data set (Apologies if this is telling something you are already fully aware of ) - Counting matches doesn't involve scanning the text of all the docs so may be less expensive than you think for a single index. It very quickly l

Re: Fast access to a random page of the search results.

2005-03-07 Thread markharw00d
Did you mean this? http://marc.theaimsgroup.com/?l=lucene-user&m=108525376821114&w=2 Kelvin Tan wrote: This is a bump post... I'm wondering if there's any code (contributed, bugzilla, core or otherwise) that provides document lazy-loading functionality, i.e. only eager-initialize specific fields

Re: Document lazy-loading WAS [Re: Fast access to a random page of the search results.]

2005-03-08 Thread markharw00d
So this is just the old problem of avoiding reading large, less frequently accessed fields when you are trying to read just the smaller more frequently accessed fields eg titles. You can achieve this by: a) Modifying Lucene using something like the code I originally posted which stops reading

Re: highlighter and phrase search

2005-03-10 Thread markharw00d
The short answer is "no", there is not support for this currently. Implementing this support is possible but fiddly - there is a related discussion here which outlines some of the challenges: http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12435.html Cheers, Mark

Re: Plural Stemming

2005-04-02 Thread markharw00d
>>Stemming doesn't have to produce intelligible words True, yes this should be fine for general search requirements. However, the code presented does make some attempt to produce intelligible words eg parties=party unlike Porter stemmer's parties=parti Does this make it a "lemmatizer"? This is a f

Re: highlight problem

2005-05-05 Thread markharw00d
All looks OK with that bit. At the risk of sounding obvious - are you mistaking the results from multiple documents as the highlighted content from just one document? eg the end of your "for" loop looks like this: System.out.print(result); } and you assume the printed display is from just

Re: Highlight problem

2005-05-05 Thread markharw00d
Thanks for pointing out this issue. The bug was related to having a doc bigger than the maxNumDocsToAnalyze setting. In this situation, the last fragment created was always sized from maxNumDocsToAnalyze position to the remainder of the doc (in your case, quite large!) I have fixed this in SVN

Re: Using Highlighter to highlight entire HTML documents?

2005-05-24 Thread markharw00d
Fred Toth wrote: Hi, We have a need to present HTML documents with all search terms highlighted. Everything I've seen regarding the Highlighter code seems to point to the typical case of extracting relevant fragments from the text for presentation of hit lists. If you don't want to fragment yo

Re: Vedr. Re: Design question [too many fields?]

2005-06-29 Thread markharw00d
I suspect the most performant is as follows (but could require bags of RAM) : Here's the pseudocode. [on IndexReader open, initialize map] int []luceneDocIdsByDbKey=new int [largestDbKey]; //could be large array! for (int i=0;i;Should be super-quick but requires (int size* num db records) m

Re: Does highlighter highlight phrases only?

2005-06-30 Thread markharw00d
Hi Erik, Yes I was thinking that code could form the basis of a new highlighter. I've just attached a QuerySpansExtractor to the bugzilla entry for the new highlighter. This class produces Spans from queries other than SpanXxxxQueries eg phrase, term and booleans. I'm thinking you can throw the

Re: Vedr. Re: Design question [too many fields?]

2005-07-01 Thread markharw00d
about 4900 room units which I think is OK as far as Still we have optimization work to do. Assuming your availability is a year in advance and yours is a reputable chain of hotels that books rooms by the day, (not the hour!) You only need: 4900 * 365 bits of true/false info to cache all the ava
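Working the numbers through: 4900 rooms x 365 days = 1,788,500 bits, which is roughly 223,563 bytes (about 218 KB) if each room-day takes one bit. A minimal sketch of such a cache with java.util.BitSet follows; the class name, the room and day numbering, and the "bit set means booked" convention are assumptions for illustration only.

```java
import java.util.BitSet;

class AvailabilityCache {
    static final int ROOMS = 4900;
    static final int DAYS = 365;

    // One bit per (room, day); a set bit means the room is booked that day.
    private final BitSet booked = new BitSet(ROOMS * DAYS);

    private static int index(int room, int day) {
        if (room < 0 || room >= ROOMS || day < 0 || day >= DAYS) {
            throw new IllegalArgumentException("room or day out of range");
        }
        return room * DAYS + day;
    }

    void book(int room, int day) {
        booked.set(index(room, day));
    }

    boolean isAvailable(int room, int day) {
        return !booked.get(index(room, day));
    }

    // True when the room is free for `nights` consecutive days starting at firstDay.
    boolean isAvailable(int room, int firstDay, int nights) {
        if (nights <= 0 || firstDay + nights > DAYS) {
            throw new IllegalArgumentException("stay must fit inside the cached year");
        }
        int from = index(room, firstDay);
        int nextBooked = booked.nextSetBit(from);
        return nextBooked < 0 || nextBooked >= from + nights;
    }

    // Bytes needed to hold one bit per room-day, rounded up to whole bytes.
    static long bytesNeeded() {
        return ((long) ROOMS * DAYS + 7) / 8;
    }

    public static void main(String[] args) {
        System.out.println(bytesNeeded() + " bytes for a year of availability");
    }
}
```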

Re: How to get the un-stemed word

2005-07-11 Thread markharw00d
Would that show up in the TermVectors? Yes, but you would need a scheme for identifying "original, unstemmed" terms vs stems. For example, you could use another field and analyzer for the unstemmed forms. Andrew Boyd wrote: What about storing the unstemed word with the same position as the

Re: hit count within categories

2005-07-27 Thread markharw00d
I posted the code I use to do this (based on a single index) here: http://marc.theaimsgroup.com/?l=lucene-dev&m=111044178212335&w=2 Cheers Mark

Re: Derby + Lucene

2005-07-27 Thread markharw00d
Thanks for the reminder, Otis. I haven't done any more on this since this post: http://archives.devshed.com/a/ml/200501-114586/lucene-query-sql-kind The scalability concerns with the user-defined-functions I created prevented me from taking it any further. A proper solution would need a tight

Re: Lucene vs Derby (vs MySQL) for spatial indexing

2005-07-28 Thread markharw00d
MySQL has spatial extensions now too. Your queries lack any free-text criteria so are probably best handled by a database, not Lucene. >>In case anyone's interested, I'm writing a zoomable/pannable world map Save yourself some time. Just use the Google maps API. :-)

Re: Did you mean?

2005-08-30 Thread markharw00d
The "did you mean" implementation should ideally use all of the other words in a query as context to guide the selection of spelling alternatives. Google appear to do this - not sure if they use the doc content or user queries to suggest the alternatives. I've got some colocation finding code wh

Re: Highlighter apply to Japanese

2005-09-05 Thread markharw00d
I don't know the behaviour of the Japanese Analyzer you are using. Can you add to your example diagnosis the Token.getPositionIncrement, Token.startOffset and Token.endOffset for each of the tokens? The highlighter groups tokens with overlapping start and end offsets into a single TokenGroup f

Re: Hits document offset information? Span query or Surround?

2005-09-05 Thread markharw00d
>>I believe I have heard that Span queries provide some way to access document offset information for their hits somehow. See http://marc.theaimsgroup.com/?l=lucene-user&m=112496111224218&w=2 Faithfully selecting extracts based *exactly* on query criteria will be hard given complex queries eg

Re: Reducing number of poor results from large BooleanQueries

2005-09-09 Thread markharw00d
Isn't the trouble with introducing a scoring threshold based on raw scores that the Similarity scoring mechanism is considering each document in isolation? At this stage we don't know if the query is generally a good one or not (ie spelt correctly, and not a Googlewhack combination of rarely co

Lucene database bindings

2005-09-16 Thread markharw00d
I know there have been some posts discussing how to integrate Lucene with Derby recently. I've added an example project that works with both HSQLDB and Derby here: http://issues.apache.org/jira/browse/LUCENE-434 The bindings allow you to use SQL that mixes database and Lucene functionality i

Re: Lucene database bindings

2005-09-17 Thread markharw00d
Mag Gam wrote: Does your example store the index in the derby db or somewhere else? I was thinking of indexing a table in a separate column. The software is not an org.apache.lucene.store.Directory implementation ie an FSDirectory alternative for persisting Lucene data in a relational table

Re: Lucene database bindings

2005-09-17 Thread markharw00d
>>Basically your lucene_query function will return a true/false in one of the query predicates for each record. Almost, it returns a score - much more useful than just a boolean and the key difference between a search engine and a database (partial matching with relevance ranked scores). Thes

Re: Sort by relevance+distance

2005-09-19 Thread markharw00d
Here's an example I put together to illustrate the point. package distance; import java.io.IOException; import java.util.ArrayList; import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.WhitespaceAnalyzer; import org.apache.lucene.document.Document; import org.apache.lu

Re: Sort by relevance+distance

2005-09-20 Thread markharw00d
To avoid caching 10,025 docs when you only want to see 10,000 to 10,025 (and assuming the user was paging through results) you might have to remember the lowest score used in the previous page of results to avoid adding those 10,000 docs with score > lastLowScore to the HitQueue again.

Re: Storing HashMap as an UnIndexed Field

2005-09-20 Thread markharw00d
Or using XMLEncoder: HashMap map=new HashMap(); map.put("foo","bar"); ByteArrayOutputStream baos=new ByteArrayOutputStream(); XMLEncoder encoder =new XMLEncoder(baos); encoder.writeObject(map); encoder.flush(); System.out.println(baos.toString());
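For completeness, a hedged round-trip sketch built on the same java.beans classes: it encodes the map, then reads it back with XMLDecoder. The class and method names are hypothetical. One difference from the snippet above is that this version calls close() rather than flush(), since the XMLEncoder Javadoc says close() is what writes the closing postamble of the XML document.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class MapXmlRoundTrip {
    // Serialises the map to an XML string suitable for a stored, unindexed field.
    static String encode(HashMap<String, String> map) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        XMLEncoder encoder = new XMLEncoder(baos);
        encoder.writeObject(map);
        encoder.close(); // flushes and writes the closing tag
        return new String(baos.toByteArray(), StandardCharsets.UTF_8);
    }

    // Rebuilds the map from the XML produced by encode().
    @SuppressWarnings("unchecked")
    static Map<String, String> decode(String xml) {
        XMLDecoder decoder = new XMLDecoder(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        try {
            return (Map<String, String>) decoder.readObject();
        } finally {
            decoder.close();
        }
    }

    public static void main(String[] args) {
        HashMap<String, String> map = new HashMap<>();
        map.put("foo", "bar");
        String xml = encode(map);
        System.out.println(xml);
        System.out.println(decode(xml));
    }
}
```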

Re: How Fast is MemoryIndex? How Much Resource Does It Use?

2005-10-24 Thread markharw00d
If so, why not use it for the normal operation as well? Because MemoryIndex only allows you to store/query one document. It is fast, but I would not suggest running 1 queries against it. Why not try store the queries as documents in a special index and query them using the subject documen