Re: AND query in SHOULD

2007-11-22 Thread Shai Erera
Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Regards, Shai Erera

Re: help required urgent!!!!!!!!!!!

2007-11-22 Thread Shai Erera

Re: Force MultiFieldQueryParser always to use PrefixQuery

2007-11-22 Thread Shai Erera

Re: help required urgent!!!!!!!!!!!

2007-11-22 Thread Shai Erera
StandardAnalyzer I have got no option. Thanks a lot for your reply. Please suggest me how can I go ahead.

Re: AND query in SHOULD

2007-11-24 Thread Shai Erera
uncertain about one detail: How do I achieve a search for multiple keywords. Not just "green tree" but also "short road", "sky", "bird"? Is there a chance to add those keywords to the Query q = qp.parse("green tree"); command? Shai Erera wrote: How about using MultiFieldQueryParser. Here is a short
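One way to combine several keyword phrases, in the spirit of the MultiFieldQueryParser suggestion above, is a BooleanQuery of SHOULD clauses. This is a minimal sketch against the Lucene 2.x-era API; the field name "contents" and the analyzer choice are assumptions, not from the thread.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class MultiKeywordQuery {
  public static Query build(String[] phrases) throws ParseException {
    QueryParser qp = new QueryParser("contents", new StandardAnalyzer());
    BooleanQuery bq = new BooleanQuery();
    for (String phrase : phrases) {
      // SHOULD: a document matching any phrase is a hit; matching more of them scores higher
      bq.add(qp.parse("\"" + phrase + "\""), BooleanClause.Occur.SHOULD);
    }
    return bq;
  }
}
```

Passing {"green tree", "short road", "sky", "bird"} yields one query covering all four keywords.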

Re: Problem with Add method

2007-11-29 Thread Shai Erera

Re: Field weights

2007-12-14 Thread Shai Erera

Re: (~) operator query....

2007-12-14 Thread Shai Erera
that leveraged the SpanQuery family of queries to do something like this. -Hoss

Re: index and access to lines of a CSV file

2007-12-14 Thread Shai Erera

Re: (~) operator query....

2007-12-14 Thread Shai Erera
I just noticed MultiPhraseQuery has a setSlop method, so I think this Query is what you're looking for. On Dec 15, 2007 7:04 AM, Shai Erera [EMAIL PROTECTED] wrote: You can look at org.apache.lucene.search.MultiPhraseQuery which does something similar to what you ask. From its javadoc
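The MultiPhraseQuery-with-slop idea mentioned above can be sketched as follows (2.x-era API). The field and terms are illustrative only.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiPhraseQuery;
import org.apache.lucene.search.Query;

public class SloppyMultiPhrase {
  public static Query build() {
    MultiPhraseQuery mpq = new MultiPhraseQuery();
    mpq.add(new Term("body", "quick"));
    // at the second position, accept either of two terms
    mpq.add(new Term[] { new Term("body", "brown"), new Term("body", "red") });
    mpq.add(new Term("body", "fox"));
    mpq.setSlop(2); // like PhraseQuery's slop: allow small gaps/reordering
    return mpq;
  }
}
```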

Re: Reading field parameters from XML

2008-01-02 Thread Shai Erera
code and assign somehow Field.Store, Field.Index and etc... based on string value.

Re: PrefixQuery question

2008-01-08 Thread Shai Erera
like to do is for it to return mouse cat apple and mouse cat house and not cat house mouse it will only do this if the field is untokenized I believe. is there any way to get the desired behavior? best, -C.B.

Retrieve the number of deleted documents

2008-01-11 Thread Shai Erera
Hi I didn't find a proper API on IndexWriter or IndexReader to retrieve the total number of deleted documents. Will IndexReader.maxDoc() - IndexReader.numDocs() give the correct result? or is this just a heuristic? Thanks, Shai
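The computation asked about here is exact, not a heuristic: deletions are tracked per segment, and maxDoc() counts all document slots while numDocs() excludes the deleted ones. A tiny sketch:

```java
import org.apache.lucene.index.IndexReader;

public class DeletedDocCount {
  // number of documents flagged deleted but not yet reclaimed by a merge
  public static int numDeletedDocs(IndexReader reader) {
    return reader.maxDoc() - reader.numDocs();
  }
}
```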

Re: Retrieve the number of deleted documents

2008-01-11 Thread Shai Erera
Thanks I guess I should have looked in the code before asking those silly questions :-) I wonder why there isn't a specific API for that though ... On Jan 11, 2008 7:36 PM, Steven A Rowe [EMAIL PROTECTED] wrote: Hi Shai, On 01/11/2008 at 7:42 AM, Shai Erera wrote: Will IndexReader.maxDocs

Re: Using RangeFilter

2008-01-19 Thread Shai Erera

Re: matching products with suggest feature

2008-02-13 Thread Shai Erera
appreciated. Best Regards, C.B.

Re: matching products with suggest feature

2008-02-13 Thread Shai Erera
PM, Shai Erera [EMAIL PROTECTED] wrote: What is the default Operator of your QueryParser? Is it AND_OPERATOR or OR_OPERATOR. If it's OR ... then it's strange. If it's AND, then once you add more terms than what exists, it won't find anything. On Feb 13, 2008 6:54 PM, Cam Bazz [EMAIL

Re: matching products with suggest feature

2008-02-14 Thread Shai Erera
, BooleanClause.Occur.SHOULD)); } private static void add(BooleanQuery q, String name, String value) { q.add(new BooleanClause(new TermQuery(new Term(name, value)), BooleanClause.Occur.SHOULD)); } On Thu, Feb 14, 2008 at 8:44 AM, Shai Erera [EMAIL PROTECTED] wrote: Is this Speller

Re: How to construct a MultiReader?

2008-02-21 Thread Shai Erera

Re: IndexReader getFieldNames()

2008-03-19 Thread Shai Erera

Re: Preserving dots of an acronym while indexing in Lucene

2009-07-19 Thread Shai Erera
I think you should write your own Analyzer and use: * StandardTokenizer for tokenization and ACRONYM detection. * StopFilter for stopwords handling. The Analyzer you write should override tokenStream() and do something like:
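A sketch of the Analyzer described above, in the pre-2.9 style (no Version argument). The stop-word set is an assumption; note also that StandardFilter strips the dots from ACRONYM tokens, so for this thread's goal (preserving the dots) it would be omitted or replaced.

```java
import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class AcronymPreservingAnalyzer extends Analyzer {
  private final Set stopWords = StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);

  public TokenStream tokenStream(String fieldName, Reader reader) {
    // StandardTokenizer detects token types such as ACRONYM ("I.B.M.")
    TokenStream ts = new StandardTokenizer(reader);
    // StandardFilter deliberately left out here so acronym dots survive
    ts = new StopFilter(ts, stopWords);
    return ts;
  }
}
```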

Re: Sorting field contating NULL values consumes field cache memory

2009-07-21 Thread Shai Erera
FWIW, I had implemented a sort-by-payload feature which performs quite well. It has a very small memory footprint (actually close to 0), and reads values from a payload. Payloads, at least from my experience, perform better than stored fields. On a comparison I've once made, the sort-by-payload

Re: Exclusion search

2009-07-22 Thread Shai Erera
Maybe add to each doc a field numVolunteers and then constrain the query to vol:krish and vol:raj and numvol:2 (something like that)? On Wed, Jul 22, 2009 at 9:49 AM, ba3 sbadhrin...@gmail.com wrote: Hi, In the documents which contain the volunteer information : Doc1 : volunteer krish

Re: indexing 100GB of data

2009-07-22 Thread Shai Erera
From my experience, you shouldn't have any problems indexing that amount of content even into one index. I've successfully indexed 450 GB of data w/ Lucene, and I believe it can scale much higher if rich text documents are indexed. Though I haven't tried yet, I believe it can scale into the 1-5 TB

Re: indexing 100GB of data

2009-07-22 Thread Shai Erera
There shouldn't be a problem to search such index. It depends on the machine you use. If it's a strong enough machine, I don't think you should have any problems. But like I said, you can always try it out on your machine before you make a decision. Also, Lucene has a Benchmark package which

Re: Exclusion search

2009-07-22 Thread Shai Erera
Perhaps I misunderstood something, but how do you update a document? I mean, if a document contains vol:a, vol:b and vol:c and then you want to add vol:d to it, don't you remove the document and add it back? If that's what you do, then you can also update the numvols field, right? Or .. you

Re: Lucene - Search breadth approach

2009-07-22 Thread Shai Erera
Hi Robert, What you could do is use the Stemmer (as a TokenFilter I assume) and produce two tokens always - the stem and the original. Index both of them in the same position. Then tell your users that if they search for [testing], it will find results for 'testing', 'test' etc (the stems) and
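The "stem plus original in the same position" TokenFilter described above can be sketched with the 2.9 attribute API. The stem() helper here is a hypothetical placeholder, not a real stemmer.

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class StemAndOriginalFilter extends TokenFilter {
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final PositionIncrementAttribute posAtt =
      addAttribute(PositionIncrementAttribute.class);
  private String pendingStem;
  private State pendingState;

  public StemAndOriginalFilter(TokenStream input) { super(input); }

  public boolean incrementToken() throws IOException {
    if (pendingState != null) {
      restoreState(pendingState);         // re-emit the original token's attributes
      termAtt.setTermBuffer(pendingStem); // ... but with the stemmed text
      posAtt.setPositionIncrement(0);     // same position as the original
      pendingState = null;
      return true;
    }
    if (!input.incrementToken()) return false;
    String original = termAtt.term();
    String stem = stem(original);
    if (!stem.equals(original)) {         // queue the stem for the next call
      pendingStem = stem;
      pendingState = captureState();
    }
    return true;
  }

  // hypothetical placeholder; a real implementation would call an actual stemmer
  private String stem(String s) {
    return s.endsWith("ing") ? s.substring(0, s.length() - 3) : s;
  }
}
```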

Re: reranking Lucene TopDocs

2009-07-22 Thread Shai Erera
Can you be more specific? What do you mean by re-rank? Reverse the sort? give different weights? Shai On Wed, Jul 22, 2009 at 4:35 PM, henok sahilu henok_sah...@yahoo.comwrote: hello there i like to re-rank lucene TopDoc result set. where shall i start thanks

Re: Backing up large indexes

2009-07-22 Thread Shai Erera
Hi Alex, You can start with this article: http://www.manning.com/free/green_HotBackupsLucene.html (you'll need to register w/ your email). It describes how one can write Hot Backups w/ Lucene, and capture just the delta since the last backup. I'm about to try it myself, so if you get to do it

Re: reranking Lucene TopDocs

2009-07-22 Thread Shai Erera
sahilu henok_sah...@yahoo.comwrote: i like to write a code that re assign weight to documets so that they can be reranked

Re: Batch searching

2009-07-22 Thread Shai Erera
It's not accurate to say that Lucene scans the index for each search. Rather, every Query reads a set of posting lists, each are typically read from disk. If you pass Query[] which have nothing to do in common (for example no terms in common), then you won't gain anything, b/c each Query will

Re: Batch searching

2009-07-22 Thread Shai Erera
Queries cannot be ordered sequentially. Let's say that you run 3 Queries, w/ one term each a, b and c. On disk, the posting lists of the terms can look like this: post1(a), post1(c), post2(a), post1(b), post2(c), post2(b) etc. They are not guaranteed to be consecutive. The code makes sure the

Re: Lucene - Search breadth approach

2009-07-22 Thread Shai Erera
that the presence of the $ will short-circuit stemming, but you'll have to be sure that whatever analyzer you use doesn't strip it. Best Erick On Wed, Jul 22, 2009 at 9:16 AM, Shai Erera ser...@gmail.com wrote: Hi Robert, What you could do is use the Stemmer (as a TokenFilter I

Re: Lucene - Search breadth approach

2009-07-22 Thread Shai Erera
of Lucene for some time, so I could be way off. But you'd sure want to use a different token G Erick On Wed, Jul 22, 2009 at 4:12 PM, Shai Erera ser...@gmail.com wrote: Actually my stemming Analyzer adds a similar character to stems, to distinguish between original tokens (like

Re: Doc IDs via IndexReader?

2009-07-22 Thread Shai Erera
off the top of my head, if you have in hand all the doc IDs that were returned so far, you can do this: 1) Build a Filter which will return any doc ID that is not in that list. For example, pass it the list of doc IDs and every time next() or skipTo is called, it will skip over the given doc IDs.
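The Filter idea in step 1 above could look like this (2.9-era API): start from an all-set OpenBitSet and clear the doc IDs already returned. The 'alreadySeen' collection is illustrative; note that with 2.9's per-segment search, the IDs stored would need to be per-segment.

```java
import java.io.IOException;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.OpenBitSet;

public class ExcludeSeenDocsFilter extends Filter {
  private final Set<Integer> alreadySeen;

  public ExcludeSeenDocsFilter(Set<Integer> alreadySeen) {
    this.alreadySeen = alreadySeen;
  }

  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
    OpenBitSet bits = new OpenBitSet(reader.maxDoc());
    bits.set(0, reader.maxDoc());        // allow every document ...
    for (Integer doc : alreadySeen) {
      bits.clear(doc.longValue());       // ... except those already returned
    }
    return bits;                         // OpenBitSet is itself a DocIdSet
  }
}
```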

Re: indexing 100GB of data

2009-07-23 Thread Shai Erera
Generally you shouldn't hit OOM. But it may change depending on how you use the index. For example, if you have millions of documents spread across the 100 GB, and you use sorting for various fields, then it will consume lots of RAM. Also, if you run hundreds of queries in parallel, each with a

Re: Doc IDs via IndexReader?

2009-07-24 Thread Shai Erera
There are a couple of things I can think of: 1) From IndexReader's javadoc ( http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/index/IndexReader.html#deleteDocument%28int%29): An IndexReader can be opened on a directory for which an IndexWriter is opened already, but it cannot be used to

Re: Number of documents in each segment before a merge occurs

2009-07-26 Thread Shai Erera
: indexWriter.setMaxBufferedDocs(10); No difference - it continues to create one document in each RAM segment before the first merge. -venkat -Original Message- From: Shai Erera [mailto:ser...@gmail.com] Sent: Saturday, July 25, 2009 10:55 PM To: java-user@lucene.apache.org Subject: Re: Number

Re: Weird behaviour

2009-08-02 Thread Shai Erera
You write that you index the string under the url field. Do you also index it under title? If not, that can explain why title:Rahul Dravid does not work for you. Also, did you try to look at the index w/ Luke? It will show you what are the terms in the index. Another thing which is always good

Re: Weird behaviour

2009-08-02 Thread Shai Erera
, and have a given title. On Sun, Aug 2, 2009 at 4:07 PM, Shai Erera ser...@gmail.com wrote: You write that you index the string under the url field. Do you also index it under title? If not, that can explain why title:Rahul Dravid does not work for you. Also, did you try to look

Re: Weird behaviour

2009-08-02 Thread Shai Erera
You can always create your own Analyzer which creates a TokenStream just like StandardAnalyzer, but instead of using StandardFilter, write another TokenFilter which receives the HOST token type, and breaks it further to its components (e.g., extract en, wikipedia and org). You can also return the

Re: Searching doubt

2009-08-04 Thread Shai Erera
I can think of another approach - during indexing, capture the word aboutus and index it as about us and aboutus in the same position. That way both queries will work. You'd need to write your own TokenFilter, maybe a SynonymTokenFilter (since this reminds me of synonyms usage) that accept a list

Re: Searching doubt

2009-08-04 Thread Shai Erera
I don't see that you use the Analyzer anywhere (i.e. it's created by not used?). Also, the wildcard query you create may be very inefficient, as it will expand all the terms under the DEFAULT_FIELD. If the DEFAULT_FIELD is the field where all your default searchable terms are indexed, there could

Re: Searching doubt

2009-08-04 Thread Shai Erera
If you don't know which tokens you'll face, then it's really a much harder problem. If you know where the token is, e.g. it's always in http://some.example.site/a/b/here will be the token to break/index.html, then it eases the task a bit. Otherwise you'll need to search every single token

Re: Searching doubt

2009-08-04 Thread Shai Erera
Interesting ... I don't have access to a Japanese dictionary, so I just extract bi-grams. But I guess that in this case, if one can access an English dictionary (are you aware of an open-source one, or free one BTW?), one can use the method you mention. But still, doing this for every Token you

Re: Searching doubt

2009-08-04 Thread Shai Erera
Hi Darren, The question was, how given a string aboutus in a document, you can return that document as a result to the query about us (note the space). So

Re: Paging in a Lucene search

2009-08-06 Thread Shai Erera
If you pass reader.maxDoc(), it will create a heap (array) of size reader.maxDoc() and is not recommended. Instead, if you display the first page of results, you should pass 10 (assuming you display 10 results). You can call TopFieldDocs.totalHits to get the total number of matching results. Then
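The paging arithmetic described above can be written down independent of Lucene: ask the searcher for (page + 1) * pageSize top hits so the heap stays small, then display only the window that belongs to the requested page. A pure-Java sketch:

```java
public class PageWindow {
  // first hit index (inclusive) to display for a zero-based page
  public static int start(int page, int pageSize) {
    return page * pageSize;
  }

  // one past the last hit index to display, bounded by the number of matches
  public static int end(int page, int pageSize, int totalHits) {
    return Math.min(totalHits, (page + 1) * pageSize);
  }

  public static void main(String[] args) {
    // page 2 of 10-hit pages over 25 total hits shows hits 20..24
    System.out.println(start(2, 10) + ".." + (end(2, 10, 25) - 1)); // prints 20..24
  }
}
```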

Re: Language Detection for Analysis?

2009-08-06 Thread Shai Erera
Robert - can you elaborate on what you mean by just treat it at the script level? On Thu, Aug 6, 2009 at 10:55 PM, Robert Muir rcm...@gmail.com wrote: Bradford, there is an arabic analyzer in trunk. for farsi there is currently a patch available:

Re: Language Detection for Analysis?

2009-08-06 Thread Shai Erera
Thanks Robert for the explanation. I thought that you meant something different, like doing stemming in some sophisticated manner by somehow detecting the language. Doing these normalizations makes sense of course, especially if the letters look similar. Thanks again, Shai On Thu, Aug 6, 2009

Re: Different Analyzers

2009-08-11 Thread Shai Erera
you should also make sure the data is indexed twice, once w/ the original case and once w/o. It's like putting a TokenFilter after WhitespaceTokenizer which returns two tokens - lowercased and the original, both in the same position (set posIncr to 0). On Wed, Aug 12, 2009 at 6:20 AM, Max Lynch

Re: How to tune Analyzer for Text Extraction

2009-08-11 Thread Shai Erera
If this file has a predefined construct, e.g.: title: someting location: new york then you can write a simple parser that extracts that information. But I think otherwise this falls outside the scope of Lucene, unless I misunderstood you. If I had to give it a long shot though, I'd try to

Re: Indexer crashes with hit exception during merge

2009-08-13 Thread Shai Erera
Is that a local file system, or a network share? On Thu, Aug 13, 2009 at 1:07 PM, rishisinghal singhal.ri...@gmail.comwrote: Is there any chance that two writers are open on this directory? No, thats not true. something external to Lucene is removing files from the directory. No this also

Re: Indexer crashes with hit exception during merge

2009-08-13 Thread Shai Erera
experience that? Can you try to create the index somewhere else, or on another drive? Shai On Thu, Aug 13, 2009 at 3:00 PM, rishisinghal singhal.ri...@gmail.comwrote: It is a local file system. We are using lucene 2.4 and java 1.5 Regards, Rishi Shai Erera wrote: Is that a local file

Re: Is there a way to check for field uniqueness when indexing?

2009-08-13 Thread Shai Erera
this set of documents added in sync with the index reader on the index (before it has been written to). What I'd like is to have an access to the stuff the index writer has written but not yet commited. Is there something that can access that data? Daniel Shane Shai Erera wrote: How many

Re: Indexer crashes with hit exception during merge

2009-08-14 Thread Shai Erera
2.4] Checking only these segments: _61: No problems were detected with this index. Regards, Rishi Shai Erera wrote: I noticed the exception is Caused by: java.io.FileNotFoundException: /SYS$SYSDEVICE/RISHI/melon_1600/_61.cfs (i/o error (errno:5)) I searched for i/o error

Re: Problem doing backup using the SnapshotDeletionPolicy class

2009-08-14 Thread Shai Erera
I think you should also delete files that don't exist anymore in the index, from the backup? Shai On Fri, Aug 14, 2009 at 10:02 PM, Michael McCandless luc...@mikemccandless.com wrote: Could you boil this down to a small standalone program showing the problem? Optimizing in between backups

FSDirectory.setDisableLocks

2009-08-15 Thread Shai Erera
Hi If I can guarantee only one JVM will update an index (not at a time - truly just one JVM), can I disable locks, or is it really necessary only for read-only devices? If I disable locks, will I see any performance improvements? Thanks Shai

Re: FSDirectory.setDisableLocks

2009-08-15 Thread Shai Erera
Thanks Mike. Shai On Sat, Aug 15, 2009 at 12:49 PM, Michael McCandless luc...@mikemccandless.com wrote: You could also use NoLockFactory. Disabling locks just means Lucene stops checking if another writer has the index open (the write.lock file). It's extremely dangerous to do, unless

Extending Sort/FieldCache

2009-08-20 Thread Shai Erera
Hi I'd like to extend Lucene's FieldCache such that it will read native values from a different place (in my case, payloads). That is, instead of iterating on a field's terms and parsing each String to long (for example), I'd like to iterate over one term (sort:long, again - an example) and

Re: How to give a score for all documents?

2009-08-25 Thread Shai Erera
Can you please elaborate more on the use case? Why if a certain document is irrelevant to a certain query, you'd like to give it a score? Are you perhaps talking about certain documents which should always appear in search results, no matter what the query is? And instead of always showing them,

Re: Lucene gobbling file descriptors

2009-08-26 Thread Shai Erera
That's strange ... how do you execute your searches - each search opens up an IndexReader? Do you make sure to close them? Maybe those are file descriptors of files you index? Forgive the silly questions, but I've never seen Lucene run into out-of-files handles ... Shai On Wed, Aug 26, 2009 at

Re: Extending Sort/FieldCache

2009-08-26 Thread Shai Erera
Thanks a lot for the response ! I wanted to avoid two things: * Writing the logic that invokes cache-refresh upon IndexReader reload. * Write my own TopFieldCollector which uses this cache. I guess I don't have any other choice but to write both of them, or try to make TFC more customizable such

Re: Why perform optimization in 'off hours'?

2009-08-31 Thread Shai Erera
When you run optimize(), you consume CPU and do lots of IO operations which can really mess up the OS IO cache. Optimize is a very heavy process and therefore is recommended to run at off hours. Sometimes, when your index is large enough, it's recommended to run it during weekends, since the

Re: Question about IndexCommit

2009-09-01 Thread Shai Erera
If I'm not mistaken, IndexReader reads the .del file into memory, and therefore subsequent updates to it won't be visible to it. Shai On Tue, Sep 1, 2009 at 3:54 PM, Ted Stockwell emorn...@yahoo.com wrote: Hi All, I am interested in using Lucene to index RDF (Resource Description Format)

Re: First result in the group

2009-09-02 Thread Shai Erera
What do you mean by first result in the group? What is a group? On Wed, Sep 2, 2009 at 1:36 PM, Ganesh emailg...@yahoo.co.in wrote: Hello all, I want to retrieve the first result in the group. How to acheive this? Currently i am parsing all the results, using a hash and avoiding duplicate

Re: Extending Sort/FieldCache

2009-09-03 Thread Shai Erera
Thanks I plan to look into two things, and then probably create two separate issues: 1) Refactor the FieldCache API (and TopFieldCollector) such that one can provide its own Cache of native values. I'd hate to rewrite the FieldComparators logic just because the current API is not extendable.

Re: Extending Sort/FieldCache

2009-09-04 Thread Shai Erera
Thanks Mike. I did not phrase well my understanding of Cache reload. I didn't mean literally as part of the reopen, but *because* of the reopen. Because FieldCache is tied to an IndexReader instance, after reopen it gets refreshed. If I keep my own Cache, I'll need to code that logic, and I prefer

Re: Extending Sort/FieldCache

2009-09-08 Thread Shai Erera
I didn't say we won't need CSF, but that at least conceptually, CSF and my sort-by-payload are the same. If however it turns out that CSF performs better, then I'll definitely switch my sort-by-payload package to use it. I thought that CSF is going to be implemented using payloads, but perhaps I'm

Re: Is there way to get complete start end matches to be first in the list ?

2009-09-08 Thread Shai Erera
I can think of a way where you rely solely on scores and therefore there is still chance to get results not ordered the way you want, but you can try it - run the query [foo bar OR foo bar^10]. That way, your first result should be scored by [foo], [bar] and [foo bar]. Also, the phrase is added a
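The query suggested above, built programmatically rather than via the query parser: the bare terms OR the exact phrase with a large boost, so documents containing the full phrase rise to the top. Field name and boost value are illustrative.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class TermsOrBoostedPhrase {
  public static Query build(String field, String t1, String t2) {
    BooleanQuery bq = new BooleanQuery();
    bq.add(new TermQuery(new Term(field, t1)), BooleanClause.Occur.SHOULD);
    bq.add(new TermQuery(new Term(field, t2)), BooleanClause.Occur.SHOULD);
    PhraseQuery phrase = new PhraseQuery();
    phrase.add(new Term(field, t1));
    phrase.add(new Term(field, t2));
    phrase.setBoost(10f); // the ^10: phrase matches dominate the ranking
    bq.add(phrase, BooleanClause.Occur.SHOULD);
    return bq;
  }
}
```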

Re: TSDC, TopFieldCollector co

2009-09-30 Thread Shai Erera
I agree. If you need sort-by-score, it's better to use the fast search methods. IndexSearcher will create the appropriate TSDC instance for you, based on the Query that was passed. If you need to create multiple Collectors and pass a kind of Multi-Collector to IndexSearcher, then you should

Re: TSDC, TopFieldCollector co

2009-09-30 Thread Shai Erera
the right TSDC ... I like it, option 1 it is minimum user code. Cheers, eks - Original Message From: Shai Erera ser...@gmail.com To: java-user@lucene.apache.org Sent: Wednesday, 30 September, 2009 17:12:38 Subject: Re: TSDC, TopFieldCollector co I agree. If you need sort

Re: TSDC, TopFieldCollector co

2009-09-30 Thread Shai Erera
) are these objects mutable? - Original Message From: Shai Erera ser...@gmail.com To: java-user@lucene.apache.org Sent: Wednesday, 30 September, 2009 18:11:03 Subject: Re: TSDC, TopFieldCollector co BTW eks, you asked about reusing TSDC. PQ has a clear() method, so it can

Equality Numeric Query

2009-11-11 Thread Shai Erera
Hi I index documents with numeric fields using the new Numeric package. I execute two types of queries: range queries (for example, [1 TO 20}) and equality queries (for example 24.75). Don't mind the syntax. Currently, to execute the equality query, I create a NumericRangeQuery with the
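The equality-as-range workaround described here, with the Lucene 2.9 numeric support: a range whose lower and upper bounds are both the value, both ends inclusive. The field name is illustrative.

```java
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

public class NumericEquality {
  public static Query priceEquals(double value) {
    // lower == upper and both inclusive => matches exactly that value
    return NumericRangeQuery.newDoubleRange(
        "price", Double.valueOf(value), Double.valueOf(value), true, true);
  }
}
```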

Re: Equality Numeric Query

2009-11-11 Thread Shai Erera

Re: Equality Numeric Query

2009-11-11 Thread Shai Erera
Thanks a lot for the super fast response ! Shai On Wed, Nov 11, 2009 at 4:21 PM, Uwe Schindler u...@thetaphi.de wrote: No. - Uwe Schindler

How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
Hi I started to migrate my Analyzers, Tokenizer, TokenStreams and TokenFilters to the new API. Since the entire set of classes handled Token before, I decided to not change it for now, and was happy to discover that Token extends AttributeImpl, which makes the migration easier. So I started w/

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
(Token) -- hasA(Term) -- getA(Term) -- cast to Token ... I don't know if this is a bug or not, but it's strange. Shai On Sun, Nov 22, 2009 at 1:12 PM, Shai Erera ser...@gmail.com wrote: Hi I started to migrate my Analyzers, Tokenizer, TokenStreams and TokenFilters to the new API. Since

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
But I do use addAttribute(Token.class), so I don't understand why you say it's not possible. And I completely don't understand why the new API allows me to just work w/ interfaces and not impls ... A while ago I got the impression that we're trying to get rid of interfaces because they're not easy

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
ok so from what I understand, I should stop working w/ Token, and move to working w/ the Attributes. addAttribute indeed does not work. Even though it does not throw an exception, if I call in.addAttribute(Token.class), I get a new instance of Token and not the one that was added by in. So

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
with restoreState to the TokenStream. CachingTokenFilter does this. So the new API uses the State object to put away tokens for later reference. - Uwe Schindler

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
is that? Shai On Sun, Nov 22, 2009 at 3:57 PM, Shai Erera ser...@gmail.com wrote: Perhaps I misunderstand something. The current use case I'm trying to solve is - I have an abbreviations TokenFilter which reads a token and stores it. If the next token is end-of-sentence, it checks whether

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
(TermAttribute.class); By that you guarantee, that both are from the same implementation type. - Uwe Schindler

Re: How to deal with Token in the new TS API

2009-11-22 Thread Shai Erera
of clones, I'll create Token and populate it w/ what I need, just for convenience ... Thanks, Shai On Sun, Nov 22, 2009 at 9:23 PM, Shai Erera ser...@gmail.com wrote: I assume termAtt is the input's TermAttribute, right? Therefore it has no copyTo ... What I've done so far is create a TermAttribute

Re: Performance problems with Lucene 2.9

2009-11-30 Thread Shai Erera
Hi First you can use MatchAllDocsQuery, which matches all documents. It will save a HUGE posting list (TAG:TAG), and performs much faster. For example TAG:TAG computes a score for each doc, even though you don't need it. MatchAllDocsQuery doesn't. Second, move away from Hits ! :) Use Collectors
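The two suggestions above in one sketch (2.9-era API): MatchAllDocsQuery instead of a catch-all TAG:TAG term, collected with a Collector rather than Hits. The counting Collector here is illustrative.

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Scorer;

public class CountAllDocs {
  public static int count(IndexSearcher searcher) throws IOException {
    final int[] count = new int[1];
    searcher.search(new MatchAllDocsQuery(), new Collector() {
      public void setScorer(Scorer scorer) {}  // scores not needed
      public void collect(int doc) { count[0]++; } // doc is segment-relative
      public void setNextReader(IndexReader reader, int docBase) {}
      public boolean acceptsDocsOutOfOrder() { return true; }
    });
    return count[0];
  }
}
```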

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Shai Erera
Robert, what if I need to do additional filtering after CollationKeyFilter, like stopwords removal, abbreviations handling, stemming etc? Will that be possible if I use CollationKeyFilter? I also noticed CKF creates a String out of the char[]. If the code already does that, why not use

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Shai Erera
. An easy example, the lowercase of ß is ß itself, it is already lowercase. it will not match with 'SS' if you use lowercase filter. if you use case folding, these two will match. On Mon, Nov 30, 2009 at 2:53 PM, Shai Erera ser...@gmail.com wrote: Robert, what if I need to do additional

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Shai Erera
double normalization and folding for better performance. On Mon, Nov 30, 2009 at 3:41 PM, Shai Erera ser...@gmail.com wrote: Thanks Robert. In my Analyzer I do case folding according to Unicode tables. So ß is converted to SS. I do the same for diacritic removal and Hiragana/Katakan folding

Best Locking approach (Directory Lock)

2009-12-02 Thread Shai Erera
Hi We've run into problems w/ LockFactory usage on our system. The problem is that the system can be such that the index is configured on a local file system, or a remote, shared one. If remote/share, the protocol is sometimes SMB 1.0/2.0, Samba, NFS 3/4 etc. In short, we have no control on the

Re: Best Locking approach (Directory Lock)

2009-12-02 Thread Shai Erera
I have multiple JVMs on different machines accessing the shared file system. I don't really have multiple IndexWriters on the same JVM, I asked this just out of curiosity. So I don't understand from your reply if it's safe to use NoLockFactory, or I should use SimpleFSLockFactory and unlock if

Re: Best Locking approach (Directory Lock)

2009-12-02 Thread Shai Erera

Re: Converting HitCollector to Collector

2009-12-09 Thread Shai Erera
Hi Max, In 3.0.0 (actually in 2.9.0 already), Lucene moved to execute its searches one sub-reader at a time. As a consequence, absolute docIDs are not passed to the collect method anymore, but instead the relative docIDs of that reader. An example, suppose you have 2 segments, with 6 documents
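The doc-ID mapping described above, as plain arithmetic: with per-segment collection, collect(doc) receives a segment-relative ID and setNextReader supplies the segment's docBase; the absolute ID is their sum. The second segment's size below is an assumption for illustration (the message only names the first).

```java
public class DocBaseMath {
  // docBase of segment i = sum of maxDoc over all earlier segments
  public static int docBase(int[] segmentMaxDocs, int segment) {
    int base = 0;
    for (int i = 0; i < segment; i++) base += segmentMaxDocs[i];
    return base;
  }

  public static int absoluteDocId(int[] segmentMaxDocs, int segment, int relativeDoc) {
    return docBase(segmentMaxDocs, segment) + relativeDoc;
  }

  public static void main(String[] args) {
    int[] maxDocs = { 6, 4 }; // two segments: 6 docs, then (assumed) 4
    // relative doc 2 of the second segment is absolute doc 6 + 2 = 8
    System.out.println(absoluteDocId(maxDocs, 1, 2)); // prints 8
  }
}
```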

Using TermDocs.seek vs. IndexReader.termDocs()

2010-01-17 Thread Shai Erera
Hi I remember a while ago a discussion around the efficiency of TermDocs.seek and how it is inefficient and it's better to call IndexReader.termDocs instead (actually someone was proposing to remove seek entirely from the interface because of that). I've looked at FieldCacheImpl's

Re: Using TermDocs.seek vs. IndexReader.termDocs()

2010-01-17 Thread Shai Erera

Re: NFS, Stale File Handle Problem and my thoughts....

2010-01-20 Thread Shai Erera
We've worked around that problem by doing two things: 1) We notify all nodes in the cluster when the index has committed (we use JMS for that). 2) On each node there is a daemon which waits on this JMS queue, and once the index has committed it reopens an IR, w/o checking isCurrent(). I think that

Re: NFS, Stale File Handle Problem and my thoughts....

2010-01-20 Thread Shai Erera
a reopen (normal or NRT) and warming is taking place. (NOTE: I'm one of the authors on Lucene in Action 2nd edition!). But it doesn't do the communication part, to know when it's time to reopen. Mike On Wed, Jan 20, 2010 at 9:32 AM, Shai Erera ser...@gmail.com wrote: We've worked around

Re: IndexWriter memory leak?

2010-04-08 Thread Shai Erera
What Analyzer are you using? zzBuffer belongs to the tokenizer's automaton that is generated by JFlex. I've checked StandardTokenizerImpl and zzBuffer can grow, beyond the default 16KB, but yours look to be a lot bigger (33 MB !?). The only explanation I have to this is that you're trying to (or
