Re: Nested Document support in Lucene

2011-03-22 Thread mark harwood
AFAIK this is still under heavy development and it doesn't seem to be ready in the near future. It's stable as far as I'm concerned. LUCENE-2454 includes the code and JUnit tests that work with the latest 3.0.3 release. I have versions of this running in production with 2.4 and 2.9-based

Re: revisit naming for grouping/join?

2011-07-01 Thread mark harwood
I think what would be best is a smallish but feature complete demo. For the nested stuff I had a reasonable demo on LUCENE-2454 that was based around resumes - that use case has the one-to-many characteristics that lend themselves to nested e.g. a person has many different qualifications and

Continuous stream indexing and time-based segment management

2012-06-19 Thread mark harwood
There are a number of scenarios where Lucene might be used to index a fixed time range on a continuous stream of data e.g. a news feed. In these scenarios I imagine the following facilities would be useful: a) A MergePolicy that organized content into segments on the basis of increasing time

Re: Continuous stream indexing and time-based segment management

2012-06-19 Thread mark harwood
you can do that by subclassing IW and call some package private APIs / To date I have used separate physical indexes with a MultiReader to combine them then dropping the outdated indexes. At least this has the benefit that a custom MergePolicy is not required to keep content from the

Re: Welcome Greg Bowyer

2012-06-21 Thread mark harwood
Good to have you aboard, Greg! - Original Message - From: Erick Erickson erickerick...@gmail.com To: dev@lucene.apache.org Cc: Sent: Thursday, 21 June 2012, 11:56 Subject: Welcome Greg Bowyer I'm pleased to announce that Greg Bowyer has been added as a Lucene/Solr committer. Greg:

Adding another dimension to Lucene searches

2010-05-07 Thread mark harwood
I have been working on a hierarchical search capability for a while now and wanted to see if there was general interest in adopting some of the thinking into Lucene. The idea needs a little explanation so I've put some slides up here to kick things off:

Re: Adding another dimension to Lucene searches

2010-05-10 Thread mark harwood
I've put up code, example data and tests for the Nested Document feature here: http://www.inperspective.com/lucene/LuceneNestedDocumentSupport.zip The data used in the unit tests is chosen to illustrate practical use of real-world content. The final unit tests will work on more abstract data

Re: Web-Based Luke

2010-07-09 Thread Mark Harwood
of Luke that Mark Harwood started ever get dumped to JIRA or anything? All I can find is a link to a war, but not the source. Mark? Anyone? - Mark

Re: Web-Based Luke

2010-07-11 Thread Mark Harwood
it under lucene contrib? Thanks -John On Fri, Jul 9, 2010 at 7:26 AM, Mark Harwood markharw...@yahoo.co.uk wrote: See http://search.lucidimagination.com/search/document/63cef9e98692a126/webluke_include_jetty_in_lucene_binary_distribution There's a link to a zip file with source

Re: Web-Based Luke

2010-07-12 Thread Mark Harwood
Agreed. I think Apache is a preferable home. The major change to Luke in providing a Luke core API is the need to be remotable i.e. use of an interface and serializable data objects used for args. GWT RPC should take care of the marshalling and I've used similar frameworks for applet clients.

ConjunctionScorer.doNext() overstays?

2012-03-01 Thread mark harwood
Due to the odd behaviour of a custom Scorer of mine I discovered ConjunctionScorer.doNext() could loop indefinitely. It does not bail out as soon as any scorer.advance() call it makes reports back NO_MORE_DOCS. Is there not a performance optimisation to be gained in exiting as soon as this
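
To make the early-exit question concrete, here is a minimal sketch of a leapfrog conjunction over sorted doc-id arrays. It is not Lucene's ConjunctionScorer: ArrayDocIterator and ConjunctionSketch are hypothetical stand-ins, and only the NO_MORE_DOCS sentinel and the advance() contract are borrowed from the thread. The loop returns as soon as any sub-iterator reports NO_MORE_DOCS, so the remaining clauses are never advanced again.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a doc-id iterator over a sorted int array. */
class ArrayDocIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docs;
    private int pos = 0;
    int advanceCalls = 0; // counted only so the early exit can be observed

    ArrayDocIterator(int... sortedDocs) {
        this.docs = sortedDocs;
    }

    /** Returns the first doc >= target, or NO_MORE_DOCS once exhausted. */
    int advance(int target) {
        advanceCalls++;
        while (pos < docs.length && docs[pos] < target) {
            pos++;
        }
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }
}

/** Sketch of a conjunction that bails out on the first exhausted clause. */
class ConjunctionSketch {
    static List<Integer> intersect(ArrayDocIterator... subs) {
        List<Integer> hits = new ArrayList<>();
        int candidate = subs[0].advance(0);
        while (candidate != ArrayDocIterator.NO_MORE_DOCS) {
            boolean allMatch = true;
            for (int i = 1; i < subs.length; i++) {
                int doc = subs[i].advance(candidate);
                if (doc == ArrayDocIterator.NO_MORE_DOCS) {
                    return hits; // early exit: no further match is possible
                }
                if (doc != candidate) {
                    // Clause i skipped past the candidate: restart from its doc.
                    candidate = subs[0].advance(doc);
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                hits.add(candidate);
                candidate = subs[0].advance(candidate + 1);
            }
        }
        return hits;
    }
}
```

The cost of this variant is one extra comparison per advance() call, which is the trade-off raised in the replies.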

Re: ConjunctionScorer.doNext() overstays?

2012-03-01 Thread mark harwood
. - Original Message - From: mark harwood markharw...@yahoo.co.uk To: dev@lucene.apache.org dev@lucene.apache.org Cc: Sent: Thursday, 1 March 2012, 9:39 Subject: ConjunctionScorer.doNext() overstays? Due to the odd behaviour of a custom Scorer of mine I discovered ConjunctionScorer.doNext

Re: ConjunctionScorer.doNext() overstays?

2012-03-01 Thread mark harwood
; mark harwood markharw...@yahoo.co.uk Cc: Sent: Thursday, 1 March 2012, 13:31 Subject: Re: ConjunctionScorer.doNext() overstays? Hmm, the tradeoff is an added per-hit check (doc != NO_MORE_DOCS), vs the one-time cost at the end of calling advance(NO_MORE_DOCS) for each sub-clause?  I think

Re: ConjunctionScorer.doNext() overstays?

2012-03-01 Thread mark harwood
class could avoid a docID() method invocation?  Anyhoo the profiler did not show that method up as any sort of hotspot so I don't think it's an issue. Thanks, Mike. - Original Message - From: Michael McCandless luc...@mikemccandless.com To: dev@lucene.apache.org; mark harwood

Re: ConjunctionScorer.doNext() overstays?

2012-03-01 Thread Mark Harwood
Ideally, consumers of DISI should hold onto the int docID returned from next/advance and use that... (ie, don't call docID() again, unless it's too hard to hold onto the returned doc). Yes, I remember raising that way back when:

Re: GSOC 2012?

2012-03-02 Thread mark harwood
Does anyone have any ideas? A framework for match metadata? Similar to the way tokenization was changed to allow tokenizers to enrich a stream of tokens with arbitrary attributes, Scorers could provide MatchAttributes to provide arbitrary metadata about the stream of matches they produce.

Re: Relevancy, Phrase Boosting, Shingles and Long Tail Curves

2010-09-10 Thread mark harwood
Hi Mark I've played with Shingles recently in some auto-categorisation work where my starting assumption was that multi-word terms will hold more information value than individual words and that phrase queries on separate terms will not give these term combos their true reward (in terms of IDF)

Re: Relevancy, Phrase Boosting, Shingles and Long Tail Curves

2010-09-11 Thread mark harwood
be content to just compare it to baseline random chance. Mark B -- Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513 On Fri, Sep 10, 2010 at 3:17 AM, mark harwood markharw...@yahoo.co.uk wrote: Hi Mark I've played

Document links

2010-09-20 Thread mark harwood
I've been looking at Graph Databases recently (neo4j, OrientDB, InfiniteGraph) as a faster alternative to relational stores. I notice they either embed Lucene for indexing node properties or (in the case of OrientDB) are talking about doing this. I think their fundamental performance

Re: Document links

2010-09-21 Thread Mark Harwood
It should be possible to randomly add and delete such relationships after indexWriter.addDocument(), is that the idea? Yes. A like action may, for example allow me to tag an existing document by connecting 2 documents - my personal like document and a document with content of interest.

Re: Document links

2010-09-24 Thread mark harwood
This slideshow has a first-cut on the Lucene file format extensions required to support fast linking between documents: http://www.slideshare.net/MarkHarwood/linking-lucene-documents Interested in any of your thoughts. Cheers, Mark

Re: Document links

2010-09-24 Thread mark harwood
/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384 [2] http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e On Sep 24, 2010, at 10:36 AM, mark harwood wrote: This slideshow has a first-cut on the Lucene file format

Re: Document links

2010-09-25 Thread Mark Harwood
path finding analysis is perhaps not a typical Lucene application but other forms of link analysis e.g. recommendation engines require similar performance. Cheers Mark On 25 Sep 2010, at 11:41, Paul Elschot wrote: Op vrijdag 24 september 2010 17:57:45 schreef mark harwood: While not exactly

Re: Polymorphic Index

2010-10-21 Thread Mark Harwood
Perhaps another way of thinking about the problem: Given a large range of IDs (eg your 300 million) you could constrain the number of unique terms using a double-hashing technique e.g. Pick a number n for the max number of unique terms you'll tolerate e.g. 1 million and store 2 terms for every
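
The snippet above is cut off, so the following is only one reading of the double-hashing idea: each ID is indexed under two hashed terms, and a lookup requires both terms to match. DoubleHashIndex, the "h1:"/"h2:" prefixes, the mixing function and the in-memory postings map are all assumptions made for illustration; none of them come from the thread or from Lucene. With n buckets per hash, the vocabulary is capped at 2n terms no matter how many distinct IDs are indexed, at the price of occasional false positives.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory sketch: two hashed terms per ID, lookups AND them. */
class DoubleHashIndex {
    private final int buckets; // n: buckets per hash function
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    DoubleHashIndex(int buckets) {
        this.buckets = buckets;
    }

    /** The two terms stored for an ID; the hash choices are arbitrary assumptions. */
    String[] termsFor(long id) {
        long h1 = Math.floorMod(id, (long) buckets);
        long h2 = Math.floorMod(mix(id), (long) buckets);
        return new String[] {"h1:" + h1, "h2:" + h2};
    }

    private static long mix(long x) {
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33;
        return x;
    }

    void add(int docId, long id) {
        for (String term : termsFor(id)) {
            postings.computeIfAbsent(term, t -> new HashSet<Integer>()).add(docId);
        }
    }

    /** Docs matching both hashed terms: never misses a real match, may over-match. */
    Set<Integer> lookup(long id) {
        String[] terms = termsFor(id);
        Set<Integer> result = new HashSet<Integer>(postings.getOrDefault(terms[0], new HashSet<Integer>()));
        result.retainAll(postings.getOrDefault(terms[1], new HashSet<Integer>()));
        return result;
    }
}
```

Two IDs collide only when both hashes agree, which is why adding more hash functions reduces false positives further, as the follow-up message discusses.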

Re: Polymorphic Index

2010-10-21 Thread Mark Harwood
Good point, Toke. Forgot about that. Of course doubling the number of hash algos used to 4 increases the space massively. On 21 Oct 2010, at 22:51, Toke Eskildsen t...@statsbiblioteket.dk wrote: Mark Harwood [markharw...@yahoo.co.uk]: Given a large range of IDs (eg your 300 million) you

Re: Using filters to speed up queries

2010-10-23 Thread Mark Harwood
Look at BooleanQuery with 2 must clauses - one for the query, one for a ConstantScoreQuery wrapping the filter. BooleanQuery should then automatically use skips when reading matching docs from the main query and skip to the next docs identified by the filter. Give it a try, otherwise you may

Re: How can I get started for investigating the source code of Lucene ?

2010-11-01 Thread mark harwood
Here's a rough overview I mapped out as a sequence diagram for the search side of things some time ago: http://goo.gl/lE6a - Original Message From: Jeff Zhang zjf...@gmail.com To: dev@lucene.apache.org Sent: Mon, 1 November, 2010 5:43:08 Subject: How can I get started for

Re: Document links

2010-11-08 Thread mark harwood
@lucene.apache.org Sent: Mon, 8 November, 2010 19:03:59 Subject: Re: Document links Any updates/progress with this? I'm looking at ways to implement an RTree with lucene -- and this discussion seems relevant thanks ryan On Sat, Sep 25, 2010 at 5:42 PM, mark harwood markharw...@yahoo.co.uk wrote

Re: Document links

2010-11-08 Thread Mark Harwood
, Ryan McKinley ryan...@gmail.com wrote: On Mon, Nov 8, 2010 at 2:52 PM, mark harwood markharw...@yahoo.co.uk wrote: I came to the conclusion that the transient meaning of document ids is too deeply ingrained in Lucene's design to use them to underpin any reliable linking. What about if we define

Re: Document links

2010-11-09 Thread Mark Harwood
I was using within-segment doc ids stored in link files named after both the source and target segments (a link after all is 2 endpoints). For a complete solution you ultimately have to deal with the fact that doc ids could be references to: * Stable, committed docs (the easy case) * Flushed but

BlockJoin concerns

2011-10-14 Thread mark harwood
I've been looking at the BlockJoin stuff in 3.4 in relation to children of multiple types and have a couple of concerns which are either issues, or my ignorance of the API: Concern #1 If I only retrieve children of type A all is well. If I only retrieve children of type B all is well.

Re: BlockJoin concerns

2011-10-14 Thread mark harwood
limited by the number of docs you can hold in RAM as part of the original IW.addDocuments call - i.e. not in the millions. Cheers, Mark - Original Message - From: Michael McCandless luc...@mikemccandless.com To: dev@lucene.apache.org; mark harwood markharw...@yahoo.co.uk Cc: Sent

Proposal - a high performance Key-Value store based on Lucene APIs/concepts

2012-03-22 Thread mark harwood
I've been spending quite a bit of time recently benchmarking various Key-Value stores for a demanding project and been largely disappointed with results. However, I have developed a promising implementation based on these concepts: http://www.slideshare.net/MarkHarwood/lucene-kvstore The code

Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts

2012-03-22 Thread Mark Harwood
. Did you try all the well known ones? http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis -- J On Thu, Mar 22, 2012 at 10:42 AM, mark harwood markharw...@yahoo.co.uk wrote: I've been spending quite a bit of time recently benchmarking various Key-Value stores

Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts

2012-03-22 Thread Mark Harwood
Random question: Do you basically end up with something very similar to LevelDB that many people were talking about a few weeks ago? Haven't looked at LevelDB because I was concentrating on Java implementations. Riak's Bitcask is the most similar in principle but I didn't like the

Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts

2012-03-24 Thread Mark Harwood
OK I have some code and benchmarks for this solution up on a Google Code project here: http://code.google.com/p/graphdb-load-tester/ The project exists to address the performance challenges I have encountered when dealing with large graphs. It uses all of the Wikipedia links as a test dataset

Re: New Lucene features and Solr indexes

2013-02-13 Thread mark harwood
Instead of making other APIs to accommodate BloomFilter's current brokenness: remove its custom per-field logic so it works with PerFieldPostingsFormat, like every other PF. Not looked at it in a while but I'm pretty certain, like every other PF, you can go ahead and use PerFieldPF with Bloom

Re: [VOTE] Solr to become a top-level Apache project (TLP)

2020-05-13 Thread Mark Harwood
+1 On 2020/05/12 07:36:57, Dawid Weiss wrote: > Dear Lucene and Solr developers! > > According to an earlier [DISCUSS] thread on the dev list [2], I am > calling for a vote on the proposal to make Solr a top-level Apache > project (TLP) and separate Lucene and Solr development into two >

QueryParser - proposed change may break existing queries.

2020-09-16 Thread Mark Harwood
In Lucene-9445 we'd like to add a case insensitive option to regex queries in the query parser of the form: /Foo/i However, today people can search for : /foo.com/index.html and not get an error. The searcher may think this is a query for a URL but it's actually parsed as a regex

Re: QueryParser - proposed change may break existing queries.

2020-09-17 Thread Mark Harwood
ys very skeptical of adding the regexes, as it breaks > many queries. Now it’s even more. > > > > Uwe > > > > - > > Uwe Schindler > > Achterdiek 19, D-28357 Bremen > > https://www.thetaphi.de > > eMail: u...@thetaphi.de > > > > *From:* Mark

Re: QueryParser - proposed change may break existing queries.

2020-09-18 Thread Mark Harwood
>You could avoid (some of?) these problems by supporting /(?i)foo/ instead of /foo/i That would avoid our parsing dilemma but brings some other concerns. This inline syntax can normally be used to selectively turn on case sensitivity for sections of a regex and then turn it off with (?-i). We

Re: QueryParser - proposed change may break existing queries.

2020-09-16 Thread Mark Harwood
n my opinion, the proposed syntax change should enforce to have whitespace > or any other separator char after the regex “i” parameter. > > Uwe > > - > Uwe Schindler > Achterdiek 19, D-28357 Bremen > https://www.thetaphi.de > eMail: u...@thetaphi.de >

2nd call - [Vote] Wolfgang Hoschek for committer

2005-07-11 Thread mark harwood
Responses were light last time around: I'd like to propose Wolfgang Hoschek should be given commit rights to maintain his MemoryIndex contribution.

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

2005-11-15 Thread mark harwood
That would use more memory, but still permit ranked searches. Worth it? Not sure. I expect FuzzyQuery results would suffer if the edit distance could no longer be factored in. At least there's a quality threshold to limit the more tenuous matches but all matches below the threshold would be

Re: Performance issues with ConjunctionScorer

2005-11-22 Thread mark harwood
The Highlighter in the lucene contrib section has a class called TokenSources which tries to find the best way of getting a TokenStream. It can build a TokenStream from either: a) an Analyzer b) TermPositionVector (if the field was created with one in the index) You may find that using

Re: How build something like smart tags

2005-11-25 Thread mark harwood
See IBM's UIMA project or GATE for Entity extraction tools. Cheers Mark ps this is a java-user question, not a java-dev topic. --- Mario Alejandro M. [EMAIL PROTECTED] wrote: I'm building a search engine coupled with database info... I want to detect things like phone-numbers, address,

Re: open source YourKit licence

2005-11-30 Thread mark harwood
a) do any other committers want a license, and I'd appreciate a license. b) would we be willing to put their logo somewhere in exchange? That seems a fair exchange provided that we 1) find the product useful and 2) that it doesn't contravene any apache directives about use of their

Re: Advanced query language

2005-12-16 Thread mark harwood
I don't think DOM and RAM is necessarily an issue. The object construction process accesses the content in the same order that a SAX based path takes so that just seems an appropriate approach. There is no need to leap around the structure in any other way from what I can see, which is where DOM

Re: Advanced query language

2005-12-20 Thread mark harwood
However the moment you are promoting INTEROPERABILITY with other search/retrieval systems by XMLizing the query input and the result output, like Mark is, then it makes sense to adhere to standards I think this is hijacking my original intentions to some extent. I may be accused of being

Re: Advanced query language

2005-12-22 Thread mark harwood
Hi Chris, Thanks for taking the time to review this. 1) I applaud the pluggable nature of your solution. That's definitely a worthwhile objective. 2) Digging into what was involved in writing an ObjectBuilder, I found... don't really feel like the API has a very clean separation from SAX.

Re: Advanced query language

2005-12-23 Thread mark harwood
I suspect it's a little too ambitious to provide a unifying common abstraction which wraps event based *and* pull parser approaches. I'm personally happier to stick with one approach, preferably with an existing, standardized interface which lets me switch implementations. I didn't really want

Re: Search agents

2006-01-04 Thread mark harwood
Yes, I've found MemoryIndex to be very fast for this kind of thing. This contribution can be used to further optimize and shortlist the queries to be run against the new document sat in MemoryIndex.

Re: Advanced query language

2006-01-04 Thread mark harwood
This example code looks interesting. If I understand correctly using this approach requires that builders like the q QueryObjectBuilder instance must be explicitly registered with each and every builder that consumes its type of output eg BQOB and FQOB. An alternative would be to register q just

Preventing killer queries

2006-02-07 Thread mark harwood
I've just been doing some benchmarking on a reasonably large-scale system (38 million docs) and ran into an issue where certain *very* common terms would dramatically slow query responses. Some terms were abnormally common because I had constructed the index by taking several copies and merging

Re: Preventing killer queries

2006-02-08 Thread mark harwood
Thanks for the comments, Chris/Doug. Chris, although I suggested it initially, I'm now a little uncomfortable in controlling this issue with a static variable in TermQuery because it doesn't let me have different settings for different queries, indexes or fields. Doug, I'd ideally like to optimize

XML Query Parser - next steps

2006-02-24 Thread mark harwood
Before I commit this stuff to contrib I wanted to sound out dev members on directions for this code. We currently have an extensible parser with composable builder modules. These builders currently only have a role in life which involves parsing particular XML chunks and instantiating the related

Re: Lazy Field Loading

2006-03-31 Thread mark harwood
I don't think option 3 is baked in at indexing time. Sorry, I misread it. Yes, that is another option. So if options 3 and 4 are about search-time selection (based on size and fieldname respectively) can they be generalized into a more wide-reaching retrieval API? You can imagine a high-level

Query.extractTerms - a poor introspection API?

2006-04-06 Thread mark harwood
Having switched the highlighter over from lots of Query-specific code to using the generic Query.extractTerms API I realize I have both gained something (support for all query types) and lost something (detailed boost info for each term in the tree eg Fuzzy spelling variants). The boost info was

Re: Query.extractTerms - a poor introspection API?

2006-04-06 Thread mark harwood
It's still the case that you often need to know what type of query the parent is. For highlighting purposes I typically don't need/want to concern myself too much with precisely interpreting the specifics of all Query logic: * For Boolean queries the mustNot terms typically don't appear in the

Re: SentenceHighlighter

2006-04-19 Thread mark harwood
If you are wanting to select highlights from a document where only whole sentences are the fragments selected you will need to implement a custom Fragmenter class. This will need to look for sentence boundaries eg a . followed by whitespace only, then a word with an uppercase first character. I
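
The boundary rule described above (a full stop, then whitespace only, then a word starting with an uppercase letter) can be sketched with a regular expression. SentenceBoundaries is a hypothetical helper, not the Highlighter's Fragmenter interface; a custom Fragmenter could consult offsets like these to decide where a new fragment starts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that finds sentence start offsets in plain text. */
class SentenceBoundaries {
    // A full stop, one or more whitespace characters, then an uppercase letter
    // (checked with a lookahead so the letter is not consumed by the match).
    private static final Pattern BOUNDARY = Pattern.compile("\\.\\s+(?=\\p{Lu})");

    /** Returns the character offsets at which a sentence begins, starting with 0. */
    static List<Integer> sentenceStarts(String text) {
        List<Integer> starts = new ArrayList<>();
        starts.add(0);
        Matcher matcher = BOUNDARY.matcher(text);
        while (matcher.find()) {
            starts.add(matcher.end()); // first character after the whitespace run
        }
        return starts;
    }
}
```

Abbreviations followed by a lowercase word (such as "e.g. lower") are not treated as boundaries, but an abbreviation followed by a capitalised word still would be, so this is a heuristic only.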

Re: trivial util to Visualize BitSets (Query results actually)

2006-05-31 Thread mark harwood
I added something similar to Luke but without the colour intensity - I may add your code in to do this. Another Luke plugin I have visualizes vocabulary growth for a field as a chart over time. This is useful to see if a field is matured or is still accumulating new terms. A Zipf term distribution

RE: Luke - in need of maintainer

2006-06-01 Thread mark harwood
I can pick this up, but I don't think I've got much more bandwidth than Andrzej to work on it. I certainly don't have the time now for a port to an Apache-friendly GUI framework but ultimately I think Luke should end up under the contrib section where it can be managed and benefit from the

Re: Edit-distance strategy

2006-06-08 Thread mark harwood
FWIW, I integrated sourceforge's SecondString algos (http://secondstring.sourceforge.net/javadoc ) and others using a callout interface which boiled down to: float getDifference(String a, String b) This seemed to be the cleanest lowest-common-denominator standard for plugging in string
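
As an illustration of such a callout interface, the sketch below wraps the quoted getDifference signature in an interface and plugs in a Levenshtein implementation. The names StringDifference and LevenshteinDifference are hypothetical, and normalising the result to the range 0 to 1 by the longer string's length is an assumption; the message does not say what scale the original used.

```java
/** Hypothetical name for the callout interface; only the method signature is from the post. */
interface StringDifference {
    float getDifference(String a, String b);
}

/** Example plug-in: Levenshtein edit distance, normalised by the longer length (an assumption). */
class LevenshteinDifference implements StringDifference {
    @Override
    public float getDifference(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 0f; // two empty strings are identical
        }
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return (float) prev[b.length()] / maxLen;
    }
}
```

Any other string-comparison algorithm can be dropped in behind the same single-method interface, which is the lowest-common-denominator appeal described above.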

RangeQuery - rewrite to a RangeFilter in a ConstantScoreQuery?

2006-09-25 Thread mark harwood
Given the trouble people routinely get themselves into using RangeQuery would it make sense to change the rewrite method to generate a ConstantScoreQuery wrapping a RangeFilter? The only disadvantages I can see would be: 1) Scoring would change - some users may find their apps produce

Re: MatchAllDocs in BooleanQuery.rewrite

2007-02-02 Thread mark harwood
are there any legitimate usecases for calling rewrite other then when a Searcher is about to execute the query? When using the highlighter it is recommended to use a rewritten query e.g. to get all the variations for a fuzzy query. However I don't think there should be a problem with the

Exposing a public Filter getFilter() method in ConstantScoreQuery

2007-02-13 Thread mark harwood
Any objections to me adding this read-only method to ConstantScoreQuery? I need to discover RangeFilters etc wrapped in ConstantScoreQuerys as part of a generic query optimiser/analyser. Cheers, Mark

Re: Lius into apache incubator

2007-02-28 Thread mark harwood
Hi Rida, I've been talking with Jukka Zitting (involved in Nutch) about parsing/Tika and we started to sketch out some project objectives on the Wiki over there which may be of interest: http://code.google.com/p/tika/w/list I recently did a round-up of the main open source projects which

Re: Is this correct: term.field() == fieldName ?

2007-03-21 Thread mark harwood
Is it correct to compare using '==' or equals should be used instead? In this context it is OK. Term fieldnames are deliberately interned using String.intern() so this equality test can be used. The intention is to make comparisons faster. Cheers, Mark - Original Message From: dmitri
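
A small stand-alone sketch of why the reference comparison is safe once both sides are interned. InternedFieldNames is a hypothetical helper and does not reproduce Lucene's Term class; it only demonstrates the String.intern() guarantee that the answer relies on.

```java
/** Hypothetical helper showing reference equality on interned field names. */
class InternedFieldNames {
    /** Returns the canonical instance of a field name, as Term is said to do. */
    static String field(String name) {
        return name.intern();
    }

    /** Reference comparison: only meaningful when both arguments were interned. */
    static boolean sameField(String internedA, String internedB) {
        return internedA == internedB;
    }
}
```

A string built at runtime is equal to the literal under equals() but is a different object, so == fails until the string is interned. String literals are interned by the JVM, which is why comparing against a literal field name works.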

Re: How to handle servlet-api.jar in build?

2007-06-12 Thread mark harwood
Thanks for the pointers Paul. I just don't think you can 'package' up a distribution that includes these jars in your distribution. Clearly the binary distribution need not bundle servlet-api.jar - a demo.war file is all that is needed. However, is the source distribution exempt from this

Re: The JDK 1.5 Can o' Worms

2007-07-25 Thread mark harwood
Mostly, though, I think it gives Lucene Java the feel that we are behind. Isn't 1.6 the actual official release at this point? I wouldn't say behind, just concerned about enabling Lucene for all - in the same way popular websites might choose broad accessibility over using the latest AJAX

Re: Fwd: Decouple Filter from BitSet: API change and xml query parser

2007-08-10 Thread mark harwood
Subject: Re: Fwd: Decouple Filter from BitSet: API change and xml query parser On Friday 10 August 2007 13:12, mark harwood wrote: Could someone give me a clue as to why the test case TestRemoteCachingWrapperFilter fails with the patch applied? Regardless of the reasons for this particular

Re: Fwd: Decouple Filter from BitSet: API change and xml query parser

2007-08-10 Thread mark harwood
and DocIdSetIterator currently part of Lucene? I don't know how to go about these. Regards, Paul Elschot -- Forwarded Message -- Subject: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet Date: Friday 10 August 2007 01:15 From: Mark Harwood (JIRA) [EMAIL PROTECTED

Re: Web-based Luke

2007-11-14 Thread mark harwood
This is neat, Mark! Thanks - GWT rocks. Then it became clear to me that it's actually the _remote_ filesystem one is looking at (the server's). Yes, that's a potentially worrying security issue that needs locking down carefully. I think one mode of operation should be that Luke Server is

Re: WebLuke - include Jetty in Lucene binary distribution?

2007-12-10 Thread mark harwood
I don't know that we have ever checked in IDE settings GWT development is much easier with the IDE and there is a fair amount of manual setup required without the settings to run the hosted development environment. Hosted development is the key productivity benefit and allows debugging in Java

Re: JBoss Cache as a store

2008-01-29 Thread mark harwood
Hi Manik, Is there a set of tests in the Lucene sources I could use to test the JBCDirectory, as I call it? You would probably need to adapt existing JUnit tests in contrib/benchmark and src/test for performance and functionality testing, respectively. They use the

Out of memory - CachingWrappperFilter and multiple threads

2008-02-18 Thread mark harwood
I'm chasing down a bug in my application where multiple threads were reading and caching the same filter (same very common term, big index) and caused an Out of Memory exception when I would expect there to be plenty of memory to spare. There's a number of layers to this app to investigate (I was

Re: Out of memory - CachingWrappperFilter and multiple threads

2008-02-18 Thread mark harwood
(reader). This is safe when the cache is private. Regards, Paul Elschot Op Monday 18 February 2008 13:50:16 schreef mark harwood: I'm chasing down a bug in my application where multiple threads were reading and caching the same filter (same very common term, big index) and caused an Out of Memory

Re: WebLuke - include Jetty in Lucene binary distribution?

2008-04-25 Thread mark harwood
Why don't use ivy or maven for that? That would resurrect the Ant vs Maven debate around build systems. Not having used Maven I don't feel qualified to comment. Stefan, the Winstone server appears to be LGPL not Apache which also adds some complexity. The GWT compiler is the main cause of the

Re: Moving SweetSpotSimilarity out of contrib

2008-09-03 Thread mark harwood
Not tried SweetSpot so can't comment on worthiness of moving to core but agree with the principle that we can't let the hassles of a company's due diligence testing dictate the shape of core vs contrib. For anyone concerned with the overhead of doing these checks a company/product of potential

Re: Can I filter the results returned by IndexReader.terms(term)?

2008-09-03 Thread mark harwood
One way is to read TermDocs for each candidate term and see if they are in your filter - but that sounds like a lot of disk IO to me when responding to individual user keystrokes. You can use skip to avoid reading all term docs when you know what is in the filter but it all seems a bit costly.

Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread mark harwood
Interesting discussion. I think we should seriously look at joining efforts with open-source Database engine projects I posted some initial dabblings here with a couple of the databases on your list: http://markmail.org/message/3bu5klzzc5i6uhl7 but this is not really a scalable solution

Re: Extending query parser with MinShouldMatch syntax

2008-09-13 Thread Mark Harwood
You might want to try the XML query parser in contrib. I deliberately created this to allow remote clients to have full control over lucene (filters, caching etc) without trying to bloat the standard query parser with special characters. On 13 Sep 2008, at 18:26, Shai Erera [EMAIL PROTECTED]

Re: RMI, Searchable and RemoteSearchable

2008-09-26 Thread mark harwood
since not many people, I think, even use the RMI stuff I certainly binned RMI in my distributed work. It just would not reliably stop/restart cleanly in my experience - despite following all the RMI guidelines for clean shutdowns. I'd happily see all RMI dependencies banished from core.

Re: [VOTE] Release Lucene 2.4.0

2008-10-03 Thread mark harwood
Hi Mike, Given the repackaging any chance you can sneak in 2 contrib fixes I added recently? Null pointer introduced to clients dropping in 2.4 upgrade - http://svn.apache.org/viewvc?view=rev&revision=700815 Bug in fuzzy matching -

Re: [VOTE] Release Lucene 2.4.0

2008-10-07 Thread mark harwood
/lucene2.4take3 Here's my vote: +1. Mike mark harwood wrote: Hi Mike, Given the repackaging any chance you can sneak in 2 contrib fixes I added recently? Null pointer introduced to clients dropping in 2.4 upgrade - http://svn.apache.org/viewvc?view=rev&revision=700815 Bug in fuzzy matching

Re: Adding dependency to servlet-api

2008-11-05 Thread mark harwood
Just checked Solr (forgot about that obvious precedent!) and they have it in trunk/lib and an entry in trunk/notice.txt which reads: Includes software from other Apache Software Foundation projects, including, but not limited to: - Apache Tomcat (lib/servlet-api-2.4.jar)

Re: [jira] Created: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2008-12-11 Thread mark harwood
I'm not sure I see an easy translation of copyright !mycompany into SpanQueries which is how all the other queries are being converted. SpanNotQuery isn't applicable here because that only tests spans don't overlap. Yonik's approach looks good. - Original Message From: Yonik Seeley

Re: LIA2 on l.a.o/java OK?

2009-02-20 Thread mark harwood
I'm OK with LIA2 on the front page - as Erik suggests it does help lend credibility to a project. I encounter organisations who are nervous about buying into an open-source solution and having books up there on the home page immediately helps establish the following: 1) The APIs are stable

Re: Welcome Uwe Schindler as Lucene committer!

2009-05-18 Thread mark harwood
Welcome, Uwe. Great work on the Trie piece - now if you could just settle the Tree vs Try pronunciation dilemma . :) - Original Message From: Mark Miller markrmil...@gmail.com To: java-dev@lucene.apache.org Sent: Monday, 18 May, 2009 17:46:51 Subject: Re: Welcome Uwe Schindler

Re: Lucene's default settings back compatibility

2009-05-19 Thread mark harwood
When you create IndexReader, IndexWriter and others, you must pass in a Settings instance. I think this would also help solve the steady growth of constructor variations (18 in 2.4's IndexWriter vs 3 in Lucene 1.9). - Original Message From: Otis Gospodnetic

Re: WebLuke - include Jetty in Lucene binary distribution?

2009-06-08 Thread mark harwood
Hi John/Grant. I haven't done any more in developing WebLuke - although still use it regularly. As Grant suggests there was an unease (mine) about bloating the Lucene distribution size with GWT dependencies so it wasn't rolled into contrib. However I guess I'm comfortable if no one else is

Re: [jira] Commented: (LUCENE-1685) Make the Highlighter use SpanScorer by default

2009-06-11 Thread Mark Harwood
+1 On 11 Jun 2009, at 21:32, Michael McCandless (JIRA) wrote: [ https://issues.apache.org/jira/browse/LUCENE-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718629#action_12718629 ] Michael McCandless commented on LUCENE-1685:

Re: Improving TimeLimitedCollector

2009-06-24 Thread Mark Harwood
I think the Collector approach makes the most sense to me, since it's the only object I fully control in the search process. I cannot control Query implementations, and I cannot control the decisions made by IndexSearcher. But I can always wrap someone else's Collector with TLC and pass it

Re: Improving TimeLimitedCollector

2009-06-26 Thread Mark Harwood
Going back to my post re TimeLimitedIndexReaders - here's an incomplete but functional prototype: http://www.inperspective.com/lucene/TimeLimitedIndexReader.java http://www.inperspective.com/lucene/TestTimeLimitedIndexReader.java The principle is that all reader accesses check a volatile
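
The snippet is cut off after "check a volatile", so the sketch below assumes it refers to a volatile timeout flag that every access tests before doing any work. TimeLimitedAccess, expire() and read() are hypothetical names chosen for illustration; this is not the linked prototype and does not wrap a real IndexReader.

```java
/** Hypothetical sketch: every access checks a volatile flag set by a timer thread. */
class TimeLimitedAccess {
    // volatile so that a write from the timer thread is visible to search threads
    private volatile boolean timedOut = false;

    /** Called by a watchdog or timer thread once the time budget is spent. */
    void expire() {
        timedOut = true;
    }

    /** Stand-in for one reader access: fails fast once the flag is set. */
    int read(int[] data, int index) {
        if (timedOut) {
            throw new IllegalStateException("time limit exceeded");
        }
        return data[index];
    }
}
```

Because the check sits on the access path rather than in a Collector, a long-running operation is interrupted even when it produces no hits to collect.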

Re: Improving TimeLimitedCollector

2009-06-27 Thread Mark Harwood
. That will save looping over the collection to find the next candidate. Just an implementation detail though. Shai On Sat, Jun 27, 2009 at 3:31 AM, Mark Harwood markharw...@yahoo.co.uk wrote: Going back to my post re TimeLimitedIndexReaders - here's an incomplete but functional prototype

Re: Improving TimeLimitedCollector

2009-06-27 Thread Mark Harwood
Odd. I see you're responding to a message from Shai I didn't get. Some mail being dropped somewhere along the line.. Why don't you use Thread.interrupt(), .isInterrupted() ? Not sure where exactly you mean for that? I'm not sure I understand that - how can a thread run 1 activity

Re: FuzzyLikeThis query and exact matches

2009-08-27 Thread Mark Harwood
Despite making IDF a constant the edit distance should remain a factor in the rankings so I would have thought this would give you what you need. Can you supply a more detailed example? Either print the rewritten query or use the explain function Cheers Mark On 27 Aug 2009, at 13:22,

Re: FuzzyLikeThis query and exact matches

2009-08-27 Thread Mark Harwood
I think those boosts shown are reflecting the edit distance. What we can't see from this is that the Similarity class used in execution is using the same IDF for all terms. The other factors at play will be the term frequency in the doc, its length and any doc boost. I don't have access to the

Re: [jira] Commented: (LUCENE-1781) Large distances in Spatial go beyond Prime MEridian

2009-09-11 Thread mark harwood
It seems like something higher up must accept two rects and OR them together during the searching? That's the way I've done it before. It's like the old Asteroids arcade game where as the ship drifts off-screen stage right it is simultaneously emerging back from stage-left. -
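
A minimal sketch of the "two rects ORed together" idea, reduced to the longitude axis. LongitudeRange and its methods are hypothetical and are not part of the Lucene spatial contrib; the sketch assumes a half-width under 180 degrees and leaves latitude out, since only longitude wraps around.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical longitude interval within [-180, 180]. */
class LongitudeRange {
    final double minLon;
    final double maxLon;

    LongitudeRange(double minLon, double maxLon) {
        this.minLon = minLon;
        this.maxLon = maxLon;
    }

    boolean contains(double lon) {
        return lon >= minLon && lon <= maxLon;
    }

    /** Builds one range, or two when the span wraps past +/-180 (assumes halfWidth < 180). */
    static List<LongitudeRange> around(double centerLon, double halfWidth) {
        double west = centerLon - halfWidth;
        double east = centerLon + halfWidth;
        List<LongitudeRange> parts = new ArrayList<>();
        if (west < -180) {
            parts.add(new LongitudeRange(-180, east));
            parts.add(new LongitudeRange(west + 360, 180));
        } else if (east > 180) {
            parts.add(new LongitudeRange(west, 180));
            parts.add(new LongitudeRange(-180, east - 360));
        } else {
            parts.add(new LongitudeRange(west, east));
        }
        return parts;
    }

    /** ORs the parts together: a point matches if any part contains it. */
    static boolean matches(List<LongitudeRange> parts, double lon) {
        for (LongitudeRange part : parts) {
            if (part.contains(lon)) {
                return true;
            }
        }
        return false;
    }
}
```

Like the ship in the Asteroids analogy, a search box centred at 175 degrees reappears on the other edge as a second box covering -180 to -175.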

Highlighting - catering for all query types

2009-10-19 Thread mark harwood
I've been putting together some code to support highlighting of opaque query clauses (cached filters, trie range, spatial etc etc) which shows some promise. This is not intended as a replacement for the existing highlighter(s) which deal with free-text but is instead concentrating on the
