Re: modify existing non-indexed field

2006-07-08 Thread Doron Cohen
From what you said, I'm thinking of switching to IndexModifier. Yes, IndexModifier would synchronize add/delete. One should notice the performance comment in IndexModifier http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexModifier.html - While you can freely mix calls to
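A minimal sketch of the IndexModifier approach discussed above (the index path, field names, and values are illustrative assumptions, not from the thread). IndexModifier serializes add/delete calls internally, but each switch between adding and deleting forces it to close and reopen the underlying writer or reader, which is the performance caveat the javadoc warns about:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexModifier;
    import org.apache.lucene.index.Term;

    public class IndexModifierSketch {
      public static void main(String[] args) throws Exception {
        // true = create a new index, false = open an existing one
        IndexModifier modifier = new IndexModifier("/tmp/index", new StandardAnalyzer(), true);

        Document doc = new Document();
        doc.add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("body", "some text", Field.Store.YES, Field.Index.TOKENIZED));
        modifier.addDocument(doc);

        // safe to call from other threads, but interleaving adds and deletes
        // makes IndexModifier close and reopen its internal objects each time
        modifier.deleteDocuments(new Term("id", "42"));
        modifier.close();
      }
    }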

Re: modify existing non-indexed field

2006-07-09 Thread Doron Cohen
The problem I've had before was that I set my writer to null right after closing it. That's why I got a lock timeout exception when I tried to create the writer again. Guess I just need to close it, and re-opening it would avoid the locking problems then. It is valid to nullify the just closed

Re: modify existing non-indexed field

2006-07-10 Thread Doron Cohen
The lock time out exception is caused by trying to open multiple IndexWriter objects in parallel - each of the 5 threads is creating its own IndexWriter object in each invocation of addAndIndex(). This cannot work - I think that chapter 2.9 of Lucene in Action is essential reading for fixing this

Re: modify existing non-indexed field

2006-07-11 Thread Doron Cohen
I've tried changing to one indexing thread (instead of 5) but still get the same problem. Can't figure out why this happens. The program as listed seems to access an existing index - since 'create' is always false for both 'FSDirectory.getDirectory(,)' and 'new IndexWriter(,,)'. Perhaps an old

Re: modify existing non-indexed field

2006-07-12 Thread Doron Cohen
I did clean everything but I'm still getting the same problem. I'm using Lucene 2.0. Do you get the same problem on your machine? Please try with this code - http://cdoronc.20m.com/tmp/indexingThreads.zip Regards, Doron

Re: modify existing non-indexed field

2006-07-13 Thread Doron Cohen
can't access the file: http://cdoronc.20m.com/tmp/indexingThreads.zip Yes, this Web host sometimes behaves strangely when a link is clicked from a mail program. Please try copying cdoronc.20m.com/tmp into the Web browser (e.g. Firefox) and pressing Enter. This should show the content of that tmp folder,

Re: is it wrong with my code?

2006-07-14 Thread Doron Cohen
Hits hits = searcher.search(qp.Query(queryStr)); I think it should be qp.parse(String query) (rather than qp.Query(String field))
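For reference, a small sketch of the corrected call (the index path and the default field name are assumptions for illustration):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class ParseExample {
      public static Hits searchFor(String queryStr) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/tmp/index");
        QueryParser qp = new QueryParser("contents", new StandardAnalyzer());
        Query query = qp.parse(queryStr);   // parse(String), not qp.Query(...)
        return searcher.search(query);
      }
    }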

Re: Query does not work past 26 characters?!

2006-07-19 Thread Doron Cohen
doc.add(new Field(to, [EMAIL PROTECTED], ... PrefixQuery pq = new PrefixQuery(new Term(to, [EMAIL PROTECTED])); Perhaps a typo in the query text - Indexed text: [EMAIL PROTECTED] Searched text: [EMAIL PROTECTED] The searched text is not a prefix of the indexed one. Regards,

Re: Performance question

2006-07-20 Thread Doron Cohen
Does it matter what order I add the sub-queries to the BooleanQuery Q? That is, is the execution speed for the search faster (slower) if I do: Q.add(Q1, BooleanClause.Occur.MUST); Q.add(Q2, BooleanClause.Occur.MUST); Q.add(Q3, BooleanClause.Occur.MUST); As

Re: StandardAnalyzer question

2006-07-21 Thread Doron Cohen
\u002d would add -. The original request was for _ - \u005f Mark Miller [EMAIL PROTECTED] wrote on 21/07/2006 13:09:28: | #LETTER: // unicode letters [ \u0041-\u005a, \u0061-\u007a, \u00c0-\u00d6, \u00d8-\u00f6, \u00f8-\u00ff,

Re: MultiFieldQueryParser.parse deprecated. What can I use?

2006-07-25 Thread Doron Cohen
(Seems the 1.9 javadoc could be just a bit clearer on this.) The following should do the job: QueryParser qp = new MultiFieldQueryParser(fields, analyzer); Query q = qp.parse(qtext); Notice the difference in semantics as explained in the deprecation comment in 1.9. Also see the
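A small sketch of the non-deprecated usage (the field names and the analyzer here are placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.MultiFieldQueryParser;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class MultiFieldExample {
      public static Query parseOverFields(String qtext) throws Exception {
        String[] fields = { "title", "body" };   // search both fields by default
        QueryParser qp = new MultiFieldQueryParser(fields, new StandardAnalyzer());
        // e.g. "a foo" becomes (title:a body:a) (title:foo body:foo)
        return qp.parse(qtext);
      }
    }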

Re: Grouping over multiple fields

2006-07-25 Thread Doron Cohen
Just realized that the 'some text' part should also be grouped, so I checked that this variation also works: qtxt = some text AND ( AUTHOR_NAME:krish OR EMPLOYEE_NAME:krish ); --- field:some +field:text +(AUTHOR_NAME:krish EMPLOYEE_NAME:krish) qtxt = (some text) AND ( AUTHOR_NAME:krish OR

Re: Index Rows as Documents? Help me design a solution

2006-07-25 Thread Doron Cohen
A few comments - (from the first posting in this thread) The indexing was taking much more than minutes for a 1 MB log file. ... I would expect to be able to index at least a GB of logs within 1 or 2 minutes. 1-2 minutes per GB would be 30-60 GB/hour, which for a single machine/jvm is a lot -

Re: MultiFieldQueryParser.parse deprecated. What can I use?

2006-07-26 Thread Doron Cohen
Judging by the resulting queries' toString(), the boolean query would not work correctly: qtxt: a foo [1] Multi Field Query (OR) : (title:a body:a) (title:foo body:foo) [2] Multi Field Query (AND): +(title:a body:a) +(title:foo body:foo) [3] Boolean Query : (title:a title:foo) (body:a body:foo) --

Re: Index Rows as Documents? Help me design a solution

2006-07-26 Thread Doron Cohen
A document per row seems correct to me too. If search is by msisdn / messageid - and if, as it seems, these are keywords, not free text that needs to be analyzed - they both should have Index.UN_TOKENIZED. Also, since no search is to be done by the line content, the line should have
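A sketch of a document-per-row mapping along those lines (the field names follow the thread; the Store/Index choices are assumptions based on the advice above):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class RowToDocument {
      public static Document rowToDoc(String msisdn, String messageId, String line) {
        Document doc = new Document();
        // keyword-like columns: indexed as single terms, not analyzed
        doc.add(new Field("msisdn", msisdn, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("messageid", messageId, Field.Store.YES, Field.Index.UN_TOKENIZED));
        // the raw line is not searched, only retrieved, so store it without indexing
        doc.add(new Field("line", line, Field.Store.YES, Field.Index.NO));
        return doc;
      }
    }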

Re: Scoring a document (count?)

2006-07-28 Thread Doron Cohen
This task reminds me more of a count(*) sql query than a text search query. Assuming that using a text search engine is a prerequisite, I can think of two approaches - one based on Lucene scoring as suggested in the question, or a simpler approach (below). For the scoring approach - I don't

Re: Consult some information about adding index while searching

2006-07-28 Thread Doron Cohen
hu andy [EMAIL PROTECTED] wrote on 28/07/2006 01:28:14: This code is written in C#. There is a C# version of Lucene 1.9. I am not a C#'er so I might have misunderstood this code; still, here is my take. One general comment - the program sent is not self-contained so it's hard to debug

Re: Search Numerical Field

2006-07-28 Thread Doron Cohen
John john [EMAIL PROTECTED] wrote on 28/07/2006 06:36:19: Hello, I tried to add a field like this: field = new Field(number, 1, Field.Store.YES, Field.Index.UN_TOKENIZED); so it should be indexed and not analyzed? my writer is writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(),

Re: Scoring a document (count?)

2006-07-28 Thread Doron Cohen
Doron Cohen/Haifa/[EMAIL PROTECTED] wrote on 28/07/2006 00:18:47: For the scoring approach - I don't see an easy way to get the counts from the score of the results, although the TF (term frequency in candidate docs) is known+used during document scoring, and although it seems

RE: Scoring a document (count?)

2006-08-03 Thread Doron Cohen
Hi Russell, I am also interested in the internals of Lucene's ranking and how one can/should alter the scoring. For now I was just learning from the existing code of Lucene scorers and Weights. Your question seemed interesting, so I in fact implemented a quick scorer that would return the raw tf as a

Re: wildcards and spans

2006-08-04 Thread Doron Cohen
A thought - would you (or the project lead;-) consider limiting the 'wildcard expansion'? Assuming a query like: ( uni* near(5) science ) I.e. match docs with any word with prefix uni that spans no further than 5 from the word science. Assume current lexicon has M (say 1200) words

Re: is there a simple way to make a list of all words in an index?

2006-08-04 Thread Doron Cohen
See IndexReader methods - terms() and terms(Term) - and Lucene FAQ - http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#terms() http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#terms(org.apache.lucene.index.Term)
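A minimal example of walking the whole lexicon with terms() (the index path is a placeholder):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;

    public class ListAllTerms {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/tmp/index");
        TermEnum tenum = reader.terms();   // enumerates every term in every field
        while (tenum.next()) {
          System.out.println(tenum.term().field() + ":" + tenum.term().text()
              + " (in " + tenum.docFreq() + " docs)");
        }
        tenum.close();
        reader.close();
      }
    }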

Re: Classifieds rotation - weighting Lucene results by previous show frequency?

2006-08-07 Thread Doron Cohen
If the 'small classifieds index' is sufficiently small to be re-indexed every night, I think this would be a simple solution - just set the document boosts according to these statistics - i.e. give lower boosts to docs of classifieds that were shown more often yesterday -
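One possible way to express that at re-indexing time; the formula below is an illustrative assumption only - any monotonically decreasing function of yesterday's show count would do:

    import org.apache.lucene.document.Document;

    public class ShowCountBoost {
      // the more a classified was shown yesterday, the lower its document boost
      public static void applyBoost(Document doc, int timesShownYesterday) {
        doc.setBoost(1.0f / (1.0f + timesShownYesterday));
      }
    }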

Re: Poor performance race condition in FieldSortedHitQueue

2006-08-09 Thread Doron Cohen
Hi Otis, I think that synchronizing the entire method would be overkill - instead it would be sufficient to synchronize on a per-field object, so that only if two requests for the same cold/missing field are racing would one of them wait for the other to complete loading that field. I think

Re: Poor performance race condition in FieldSortedHitQueue

2006-08-09 Thread Doron Cohen
[EMAIL PROTECTED] wrote on 09/08/2006 11:22:12: Assuming field wasn't being used to synchronize on something else, this would still block *all* IndexReaders/Searchers trying to sort on that field. In Solr, it would make the situation worse. If I had my warmed-up IndexSearcher serving live

Re: Poor performance race condition in FieldSortedHitQueue

2006-08-09 Thread Doron Cohen
[EMAIL PROTECTED] wrote on 09/08/2006 20:32:20: Heh... interfaces strike again. Well then since we *know* that no one has their own implementation (because they would not have been able to register it), we should be able to safely upgrade the interface to a class (anyone want to supply a

Re: updating document

2006-08-10 Thread Doron Cohen
Hi Deepan, The steps below seem correct, given that all the fields of the original document are also stored - the javadoc for IndexReader.document(int n) (which I assume is what you are using) says: Returns the stored fields of the nth Document in this index. - so, only stored fields would exist

RE: Scoring a document (count?)

2006-08-10 Thread Doron Cohen
Hi Russell, my apologies for the delayed response. I'd rather have all correspondence on the mailing list, but to keep this mail thread readable I put the files at http://cdoronc.awardspace.com/TfTermQuery . I hope it helps you and would be interested in your comments. Regards, Doron Russell M.

Re: Poor performance race condition in FieldSortedHitQueue

2006-08-10 Thread Doron Cohen
On 8/10/06, Doron Cohen [EMAIL PROTECTED] wrote: Sorting was introduced to Lucene before my time, so I don't know the reasons behind it. Maybe it was seen as non-optimal or non-core and so was kept out of the IndexReader. I admit, it does feel like the level of abstraction that FieldCache

Re: Words Frequency Problem

2006-08-18 Thread Doron Cohen
See http://www.nabble.com/Accessing-%22term-frequency-information%22-for-documents-tf1964461.html#a5390696 - Doron aslam bari [EMAIL PROTECTED] wrote on 17/08/2006 23:13:27: Dear All, I am new to Lucene. I am searching for a word circle in my indexed document list. It gives me total

Re: Incemental Updating

2006-08-26 Thread Doron Cohen
I have two applications on a Windows machine. One is the search engine where the index can be searched. The second application runs once a day and updates (deletions/additions) the index. My question: the index is already opened (IndexReader) by the first application. Is there a

Re: java.io.IOException: Access is denied on java.io.WinNTFileSystem.createFileExclusively

2006-08-27 Thread Doron Cohen
Jason Polites [EMAIL PROTECTED] wrote on 27/08/2006 09:36:07: I would have thought that simultaneous cross-JVM access to an index was outside the scope of the core Lucene API (although it would be great), but maybe the file system basis allows for this (?). Lucene does protect you from

jvm crashes on FieldCache.DEFAULT.getStrings(reader, field);

2006-09-05 Thread Doron Cohen
[discussion moved here from dev-list] Could it be an out-of-mem error? Can you run it with a debugger, to see what really happens? JVMs usually create a javacore file, and in case of an out-of-mem also a heapdump file - these give more info on the problem. In case this file was not created in

Re: Keep hits in results

2006-09-06 Thread Doron Cohen
Hits is not really a simple container - it references a certain searcher - that same searcher that was used to find these hits. When a request for a result document is made, the Hits object delegates this request to the searcher. So in order to page through the results using an existing Hits

Re: Doc add limit, im experiencing it too

2006-09-06 Thread Doron Cohen
I believe this should go to the solr-user@lucene.apache.org ? Michael Imbeault [EMAIL PROTECTED] wrote on 05/09/2006 23:26:55: Old issue (see http://www.mail-archive.com/solr-user@lucene.apache.org/msg00651.html), but I'm experiencing the same exact thing on windows xp, latest tomcat. I

Re: Keep hits in results

2006-09-06 Thread Doron Cohen
? (Don't think about state of users in webapp for a while) Best Regards. jacky - Original Message - From: Doron Cohen [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Wednesday, September 06, 2006 2:06 PM Subject: Re: Keep hits in results Hits is not really a simple

Re: Indexer large file and hi performance indexing

2006-09-06 Thread Doron Cohen
HODAC, Olivier [EMAIL PROTECTED] wrote on 06/09/2006 03:04:15: hello, I am designing an application whose bottleneck is the indexing process. Object indexing blocks the user's action. Furthermore, I need to index a large amount of documents (3 per day) and save them on the file

Re: Update index

2006-09-06 Thread Doron Cohen
WATHELET Thomas [EMAIL PROTECTED] wrote on 23/08/2006 00:49:25: Is it possible to update fields in an existing index. If yes how to proceed. Unfortunately no. To update a (document's) field that document must be removed and re-added.
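A sketch of the delete-then-re-add sequence (assuming each document carries a unique, indexed id term; names and paths are placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class UpdateByReplace {
      public static void update(String indexDir, Term idTerm, Document newVersion) throws Exception {
        // 1. remove the old version by its unique id term
        IndexReader reader = IndexReader.open(indexDir);
        reader.deleteDocuments(idTerm);
        reader.close();
        // 2. re-add the complete document - all fields, not just the changed one
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        writer.addDocument(newVersion);
        writer.close();
      }
    }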

RE: group field selection of the form field:(a b c)

2006-09-12 Thread Doron Cohen
I think option B cannot work because, due to the MUST operator, it requires both databasemanagement and accountmanagement to be in the subtype field. Option A however should work, once the padding blank spaces are removed from the field name - notice that while the standard analyzer would trim

Re: SV: SV: Changing the Scoring api

2006-09-13 Thread Doron Cohen
I think it is not possible, by only modifying Similarity, to make the total score count only the document boosts (which is the original request in this discussion). This is because a higher level scorer always sums the scores of its sub-scorers - is this right...? If so, there are probably two

Re: Storing no. of occurances of a token

2006-09-13 Thread Doron Cohen
I found out how to determine the number of documents in which a term appeared by looking at the Luke code, but how does one determine the number of times it occurs in each document? Use TermDocs - http://lucene.apache.org/java/docs/api/org/apache/lucene/index/TermDocs.html Something like -
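A sketch of getting the per-document counts via TermDocs (the field and term text are placeholders):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    public class PerDocTermFreq {
      public static void print(IndexReader reader, String field, String text) throws Exception {
        TermDocs td = reader.termDocs(new Term(field, text));
        while (td.next()) {
          // freq() is the number of occurrences of the term in that document
          System.out.println("doc " + td.doc() + ": " + td.freq() + " occurrences");
        }
        td.close();
      }
    }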

Re: Using example Lucene 2.0 index class

2006-09-22 Thread Doron Cohen
I have been using the Lucene 2.0 distro Index to index my files; currently it indexes filepath and contents. I want to index lastModified() (returns the time that the file denoted by this abstract pathname was last modified) and file length, length(). Can someone please show me how to do
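A sketch of such a document builder, along the lines of the demo's FileDocument (the field names and the date resolution are assumptions):

    import java.io.File;
    import org.apache.lucene.document.DateTools;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class FileMetaDocument {
      public static Document fromFile(File f) {
        Document doc = new Document();
        doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
        // last-modified time, encoded so lexicographic order equals chronological order
        doc.add(new Field("modified",
            DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
            Field.Store.YES, Field.Index.UN_TOKENIZED));
        // file length in bytes (zero-pad it if you want range queries over length)
        doc.add(new Field("length", Long.toString(f.length()),
            Field.Store.YES, Field.Index.UN_TOKENIZED));
        return doc;
      }
    }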

Re: highlighting

2006-09-25 Thread Doron Cohen
Stelios Eliakis [EMAIL PROTECTED] wrote on 23/09/2006 02:39:27: I want to extract the Best Fragment (passage) from a text file. When I use the following code I take the first fragment that contains my query. Nevertheless, the JavaDoc says that the function getBestFragment returns the best

Re: highlighting

2006-09-25 Thread Doron Cohen
? Thanks in advance Stelios Eliakis On 9/26/06, Doron Cohen [EMAIL PROTECTED] wrote: Stelios Eliakis [EMAIL PROTECTED] wrote on 23/09/2006 02:39:27: I want to extract the Best Fragment (passage) from a text file. When I use the following code I take the first fragment

Re: StandardAnalyzer question

2006-09-29 Thread Doron Cohen
QueryParser can do that for you - something like: QueryParser qp = new QueryParser( CONTENTS , new StandardAnalyzer() ); qp.setDefaultOperator ( Operator.AND ); Query q = qp.parse ( TOOLS FOR TRAILER ); Result query should be: +content:tools +content:trailer Van Nguyen

Re: lucene newbie question

2006-10-02 Thread Doron Cohen
SSN actually is a common situation. Assume you have a (relational) database with a table of products with three columns : - SSN, which is also a primary key for that table, - DESCRIPTION, which has free text (i.e. unformatted text) describing the product. - OTHER - additional info. Also assume

Re: get terms by positions

2006-10-02 Thread Doron Cohen
You can store TermVectors with position info, but I don't think this would be enough for what you are asking, because it is not meant for direct access to a term by its position, and because TermVectors store tokens, i.e. the indexed form of the word, which I am not sure is what you need. It

Re: A question about query syntax, has it changed?

2006-10-02 Thread Doron Cohen
The problem stems from using the query parser for searching a non tokenized field (book). You can either create a term query for searching in that field, like this: new TermQuery(new Term(book,first title)); Or tokenize the field book and keep using QueryParser. Decision is based on how you
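A sketch of the first option (the field name and value follow the thread's example):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class ExactBookQuery {
      public static Query forTitle(String title) {
        // "book" was indexed UN_TOKENIZED, so match the exact stored term,
        // bypassing QueryParser and analysis altogether
        return new TermQuery(new Term("book", title));
      }
    }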

Re: Lucene scoring question (how to boost leading terms match)

2006-10-03 Thread Doron Cohen
If I understand the question, you do not want to boost in advance a certain doc, but rather score higher those documents containing the search term closer to the start of the document. There is more to define here - for instance, if doc1 has 5 words but doc2 has 1,000,000 words, would you still

Re: Spam filter for lucene project

2006-10-04 Thread Doron Cohen
I was wondering if anyone knows of an open source spam filter that I can add to my project to scan the posts (which are just plain text) for spam? I am not aware of any (which does not mean there is none), but just wanted to draw your attention to a related discussion

Re: discontinuous range query

2006-10-04 Thread Doron Cohen
: The query you want is : name:[A TO C] name:[G TO K] : (each clause being SHOULD, or put another way, an implicit OR in between. : : The problem may be how you analyze the name field... is it tokenized at all? : If so, you might be matching on first, last, and middle names, and the :

Re: Find if words are in the same phrase?

2006-10-05 Thread Doron Cohen
I am not sure I understand what you are asking. I assume you are aware of Lucene Proximity Search - e.g. jakarta apache~4 - see http://lucene.apache.org/java/docs/queryparsersyntax.html Are you asking if it is possible to search for docs in which the gap between the two words is exactly N, e.g.

Re: discontinuous range query

2006-10-05 Thread Doron Cohen
I sometimes find it helpful to think of the query parts as applying 'filtering' logic, helping to understand how query components play together in determining the acceptable set of results (mostly ignoring scoring here, which would usually sort the candidate results). Consider a set of 10

Re: Different boost values for different terms in a field.

2006-10-05 Thread Doron Cohen
Frode Bjerkholt [EMAIL PROTECTED] wrote on 05/10/2006 01:10:43: My intention is to give different terms in a field different boost values. The queries from a use perspective, will be one fulltext input field. The following code illustrates this: Field f1 = new Field(name, John,

Re: TermQuery and PhraseQuery..problem with word with space

2006-10-09 Thread Doron Cohen
I would guess that one of your assumptions is wrong... The assumptions to check are: At indexing: - lpf.getLuceneFieldName() == fav_stores - pa.getPersonProfileChoice().getChoice() == Banana Republic At search: - the query is created like this: new TermQuery(new Term(fav_stores,Banana

Re: wildcard and span queries

2006-10-09 Thread Doron Cohen
Erick Erickson [EMAIL PROTECTED] wrote on 09/10/2006 13:09:21: ... The kicker is that what we are indexing is OCR data, some of which is pretty trashy. So you wind up with interesting words in your index, things like rtyHrS. So the whole question of allowing very specific queries on detailed

Re: What is the advantage of setting using compund file to false

2006-10-10 Thread Doron Cohen
A bit of clarification: a Lucene index is made of multiple segments. Compound format: stores each segment in a single file - fewer files created/opened. Non-compound format: stores each segment in multiple files - more files created/opened. Non-compound is likely to be faster for indexing. Optimizing
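Switching format is a one-line call on the writer; a sketch (the path and analyzer are placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class NonCompoundWriter {
      public static IndexWriter open(String indexDir, boolean create) throws Exception {
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), create);
        // non-compound (multi-file) segments: more open files, somewhat faster indexing
        writer.setUseCompoundFile(false);
        return writer;
      }
    }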

Re: Lucene in Action examples complie problem

2006-10-10 Thread Doron Cohen
Field.Text() was deprecated in Lucene 1.9 and then removed in 2.0. The book examples were not updated for 2.0 yet. You should now use Field(String, String, Field.Store, Field.Index). To have the same behavior as old Field.Text use: Field(name, value, Field.Store.YES, Field.Index.TOKENIZED). For
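For example, a pre-2.0 call and its 2.0 equivalent side by side (the field name and value are illustrative):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class FieldTextMigration {
      public static void addContents(Document doc, String text) {
        // Lucene 1.4.x:  doc.add(Field.Text("contents", text));
        // Lucene 2.0 - same behavior: stored, indexed, and tokenized
        doc.add(new Field("contents", text, Field.Store.YES, Field.Index.TOKENIZED));
      }
    }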

Re: Lucene in Action examples complie problem

2006-10-10 Thread Doron Cohen
I wonder if this should be in the FAQ entry 'How do I get code written for Lucene 1.4.x to work with Lucene 2.x', or perhaps just adding there a link to your post here - http://www.nabble.com/Lucene-in-Action-examples-complie-problem-tf2418478.html#a6743189 Erik Hatcher [EMAIL PROTECTED] wrote on

Re: corrupt index: .fdx and stored norms

2006-10-10 Thread Doron Cohen
Nick, could you provide additional info: (1) Env info - Lucene version, Java version, OS, JVM args (e.g. -XmNNN), etc... (2) is this reproducible? By the file sizes there seem to be ~182 indexed docs when the problem occurs, so, if this is reproducible it would hopefully not take too long. If

Re: corrupt index: .fdx and stored norms

2006-10-10 Thread Doron Cohen
I meant ~182K files ... Nick, could you provide additional info: (1) Env info - Lucene version, Java version, OS, JVM args (e.g. -XmNNN), etc... (2) is this reproducible? By the file sizes there seem to be ~182 indexed docs when the problem occurs, so, if this is reproducible it would

Re: Strange Spellchecker behaviour

2006-10-10 Thread Doron Cohen
I believe this was fixed in http://issues.apache.org/jira/browse/LUCENE-593 - Doron Björn Ekengren [EMAIL PROTECTED] wrote on 10/10/2006 02:12:23: Hello, I have found that the spellchecker behaves a bit strange. My spell indexer class below doesn't work if I use the spellfield string set in

Re: Big problem with big indexes

2006-10-11 Thread Doron Cohen
These times really are not reasonable. But 60K does not seem like much for Lucene. I once indexed ~1M docs of ~20K each, that's a ~20GB input collection. The resulting index size was ~2.5GB, and the search time for a short 2-3 word free-text (OR) query was ~300ms for a hot query and ~900ms for a cold

Re: Large index question

2006-10-12 Thread Doron Cohen
Scott Smith [EMAIL PROTECTED] wrote on 12/10/2006 14:14:57: Suppose I want to index 500,000 documents (average document size is 4KB). Let's assume I create a single index and that the index is static (I'm not going to add any new documents to it). I would guess the index would be around

Re: Error while closing IndexWriter

2006-10-13 Thread Doron Cohen
I am far from expert in this pdf text extraction, however I noticed something in your code that you may want to check to clear up the reason for this failure, see below. Shivani Sawhney [EMAIL PROTECTED] wrote on 12/10/2006 22:54:07: Hi All, I am facing a peculiar problem. I am trying to

Re: advanced search

2006-10-13 Thread Doron Cohen
Terry Steichen [EMAIL PROTECTED] wrote on 13/10/2006 08:01:11: You can just add a field to your indexed docs that always evaluates to a fixed value. Then you can do queries like: +doc:1 -id:test Alternatively you can use MatchAllDocsQuery, e.g. BooleanQuery bq = new BooleanQuery();
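A sketch of the MatchAllDocsQuery alternative for an "everything except ..." query (the excluded term mirrors the -id:test example above):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class AllExcept {
      public static Query build(String field, String text) {
        BooleanQuery bq = new BooleanQuery();
        bq.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);                    // match every doc
        bq.add(new TermQuery(new Term(field, text)), BooleanClause.Occur.MUST_NOT);   // ...except these
        return bq;
      }
    }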

Re: highlighting with WildcardQuery

2006-10-14 Thread Doron Cohen
The IndexReader is needed for finding all wildcard matches (by the index lexicon). It seems you do not want to expand the wild card query by the index lexicon, but rather with that of the highlighted text (which may not be indexed at all). I think you have at least two ways to do that: (1) create

Re: java.io.IOException: read past EOF

2006-10-14 Thread Doron Cohen
John Gilbert [EMAIL PROTECTED] wrote on 14/10/2006 20:14:43: I am trying to write an Ejb3Directory. It seems to work for index writing but not for searching. I get the EOF exception. I assume this means that either my OutputStream or InputStream is doing something wrong. It fails because the

Re: problem deleting documents

2006-10-15 Thread Doron Cohen
now pk is the primary key which I am storing but not indexing.. doc.add(new Field(pk, message.getId().toString(), Field.Store.YES, Field.Index.NO)); You would need to index it for this to work. From the javadocs for IndexReader.deleteDocuments(Term): Deletes all documents
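A sketch of the two sides of that fix - index the primary key as an untokenized term, then delete by that term (the field name follows the thread, the rest is illustrative):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    public class DeleteByPk {
      // at indexing time: the key must be indexed (UN_TOKENIZED), not Index.NO
      public static void addPk(Document doc, String id) {
        doc.add(new Field("pk", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
      }

      // at deletion time: the term now exists in the index and matches the document
      public static int deleteByPk(String indexDir, String id) throws Exception {
        IndexReader reader = IndexReader.open(indexDir);
        int deleted = reader.deleteDocuments(new Term("pk", id));
        reader.close();
        return deleted;
      }
    }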

Re: Query not finding indexed data

2006-10-15 Thread Doron Cohen
Hi Antony, you cannot instruct the query parser to do that. Note that an application can add both tokenized and un_tokenized data under the same field name. This is an application logic to know that a certain query is not to be tokenized. In this case you could create your query with: query =

Re: Help with Custom Analyzer

2006-10-16 Thread Doron Cohen
Otis Gospodnetic [EMAIL PROTECTED] wrote on 16/10/2006 14:32:13: Hi Ryan, StandardAnalyzer should already be smart about keeping email addresses as a single token: // email addresses | EMAIL: ALPHANUM ((.|-|_) ALPHANUM)* @ ALPHANUM ((.|-) ALPHANUM)+ (this is from StandardAnalyzer.jj)

Re: PrefixFilter and WildcardQuery

2006-10-16 Thread Doron Cohen
hi Vasu, how about using ChainedFilter(yourPrefixFilters[], ChainedFilter.AND)? vasu shah [EMAIL PROTECTED] wrote on 16/10/2006 17:50:27: Hi, I have have multiple fields that I need to search on. All these fields need to support wildcard search. I am ANDing these search fields using

RE: BooleanQuery.TooManyClauses exception

2006-10-17 Thread Doron Cohen
See also the relevant FAQ entry and Wiki page: http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831 http://wiki.apache.org/jakarta-lucene/LargeScaleDateRangeProcessing Steven Parkes [EMAIL PROTECTED] wrote on 17/10/2006 09:12:55: Lucene takes your date

Re: index architectures

2006-10-18 Thread Doron Cohen
Not sure if this is the case, but you said searchers, so might be it - you can (and should) reuse searchers for multiple/concurrent queries. IndexSearcher is thread-safe, so no need to have a different searcher for each query. Keep using this searcher until you decide to open a new searcher -
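A minimal holder for such a shared searcher - a sketch only, with the synchronization and the timing of close() simplified:

    import org.apache.lucene.search.IndexSearcher;

    public class SearcherHolder {
      private static IndexSearcher searcher;   // shared by all query threads

      public static synchronized IndexSearcher get(String indexDir) throws Exception {
        if (searcher == null) {
          searcher = new IndexSearcher(indexDir);
        }
        return searcher;
      }

      // call after the index was updated; here the old searcher is closed at once,
      // a real application would wait for in-flight searches to finish first
      public static synchronized void reopen(String indexDir) throws Exception {
        IndexSearcher old = searcher;
        searcher = new IndexSearcher(indexDir);
        if (old != null) {
          old.close();
        }
      }
    }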

Re: boost at query time or index time

2006-10-23 Thread Doron Cohen
Thanks. What I was looking for is: if I do not need to boost docs, then what will be the difference a) in query results, b) in time for indexing, and c) in time to run the query and collect results? There is also some precision loss with index time boosting. Also see the Score Boosting

Re: Catalog backend for document stored fields?

2006-10-24 Thread Doron Cohen
I'm indexing logs from a transaction-based application. ... millions documents per month, the size of the indices is ~35 gigs per month (that's the lower bound). I have no choice but to 'store' each field values (as well as indexing/tokenizing them) because I'll need to retrieve them in

Re: Scalability Questions

2006-10-24 Thread Doron Cohen
4) Roughly how large is the index file in comparison to the size of the input files? It depends on whether you store fields or just index them, plus there is also a compression (gzip -9 equivalent) option. As an example - index size numbers I saw: when indexing 1M docs of ~20KB of very

Re: index architectures

2006-10-24 Thread Doron Cohen
Perhaps another comment on the same line - I think you would be able to get more from your system by bounding the number of open searchers to 2: - old, serving 'old' queries, would be soon closed; - new, being opened and warmed up, and then serving 'new' queries; Because... - if I understood

Re: number of term occurrences

2006-10-24 Thread Doron Cohen
I don't know why the termDocs option did not work for you. Perhaps you did not (re)open the searcher after the index was populated? Anyhow, here is a small code snippet that does just this, see if it works for you, then you can compare it to your code... void numberOfTermOcc() throws Exception
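The snippet is cut off in the archive; a sketch along the same lines (the index path, field, and term are placeholders, and the reader is opened after the index was populated):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    public class NumberOfTermOcc {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/tmp/index");   // (re)opened after indexing
        TermDocs td = reader.termDocs(new Term("content", "apache"));
        int docs = 0, occurrences = 0;
        while (td.next()) {
          docs++;
          occurrences += td.freq();   // occurrences of the term in this doc
        }
        System.out.println(docs + " docs, " + occurrences + " occurrences in total");
        td.close();
        reader.close();
      }
    }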

Re: java.io.WriteAbortedException: writing aborted; java.io.NotSerializableException: org.apache.lucene.queryParser.Token

2006-10-25 Thread Doron Cohen
Hi Eugene, If the query parser (from some reason) throws a ParseException, and the RMI layer attempts to marshal/serialize that exception, there would probably be an issue because although ParseException is serializable (as all throwables) it has a Token data member, which is not serializable.

Re: Lucene search priorities

2006-10-31 Thread Doron Cohen
Erick Erickson [EMAIL PROTECTED] wrote on 31/10/2006 05:03:18: I don't remember who wrote this, Chris or Yonik or Otis, but here's the word from somebody who actually knows... index time field boosts are a way to express things like this document title is worth twice as much as the title of

Re: My frirst problem using lucene

2006-10-31 Thread Doron Cohen
Might be related to an already resolved issue - see related discussion: http://www.nabble.com/lucene-web-demo-is-not-working--tf1736444.html#a4718639 Miren [EMAIL PROTECTED] wrote on 31/10/2006 03:57:50: parse(java.lang.String) in org.apache.lucene.queryParser.QueryParser cannot be applied to

Re: simple (?) question about scoring

2006-11-02 Thread Doron Cohen
[EMAIL PROTECTED] wrote on 02/11/2006 06:36:48: .. the following operation: given a Query and a Document, return the score .. I would like a method which returns the score directly. .. Btw, I do not have an index, I have 1 Document, and 1 Query. Lucene scoring -

Re: simple (?) question about scoring

2006-11-02 Thread Doron Cohen
michele.amoretti wrote: Ok I am trying the MemoryIndex, but when compiling I get the following error message: package org.apache.lucene.index.memory does not exist Is it not included in the lucene .jar? I currently have the latest lucene binaries. Yes, this is not part of core Lucene but
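MemoryIndex lives in contrib (the lucene-memory jar), so it has to be added to the classpath separately. A sketch of scoring a single document against a single query with it (the field name is a placeholder):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class OneDocScore {
      public static float score(String text, String queryText) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        MemoryIndex index = new MemoryIndex();
        index.addField("content", text, analyzer);
        Query query = new QueryParser("content", analyzer).parse(queryText);
        return index.search(query);   // 0.0f means the document does not match
      }
    }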

Re: search within search

2006-11-02 Thread Doron Cohen
This code adds the same query twice to a boolean query: Query query = parser.parse(searchString); bq1.add(query, BooleanClause.Occur.MUST); bq1.add(new BooleanClause(query,

Re: search within search

2006-11-03 Thread Doron Cohen
spinergywmy [EMAIL PROTECTED] wrote on 03/11/2006 00:40:42: I have another problem: I do not perform a real search-within-search, according to the way I have coded it, because for the second search I actually go back to the index directory and search the entire

Re: Filter query method

2006-11-08 Thread Doron Cohen
spinergywmy [EMAIL PROTECTED] wrote on 08/11/2006 01:56:00: within my first search result, there is only one record that contains both the words Java and Tomcat; therefore, there should be only one record returned for the 2nd search. And the highlight now moves from Java to Tomcat. To my

Re: Filter query method

2006-11-10 Thread Doron Cohen
You did not specify what's wrong - in what way is the code below not working as you expect? Two things to check: (1) search() and refindSearchResult() process the text of the first query differently. In search() the text is added to multiple fields (metaField). The way it is done btw would not

Re: Bugs in contrib/snowball/.../SnowballProgram.java - Kraaij-Pohlmann gives Index-OOB Exception

2006-12-11 Thread Doron Cohen
Andreas, I could generate the error as you describe. You can report this bug in http://issues.apache.org/jira/browse/LUCENE There seem to be a few updates in http://snowball.tartarus.org not reflected currently in Lucene - - SnowballProgram.java has this bug fix as you describe The

Re: SegmentReader using too much memory?

2006-12-11 Thread Doron Cohen
I do want to use document boosting... Is that independent from field boosting? The length normalization on the other hand may not be necessary. They go together - see Score Boosting in http://lucene.apache.org/java/docs/scoring.html

Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

2006-12-11 Thread Doron Cohen
Well it doesn't, since there is no justification of why it is the way it is. It's like saying, here is a car with 5 wheels... enjoy driving. - I think the explanations there would also answer at least some of your questions. I hoped it would answer *some* of the questions... (not all)

Re: How to delete partial index

2006-12-12 Thread Doron Cohen
spinergywmy [EMAIL PROTECTED] wrote: Hi, I have asked this question before but maybe the question wasn't clear. How can I delete the particular index content that I want to and keep the rest? For instance, I have indexed document Id, date, user Id and contents; my question is does that

Re: Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

2006-12-12 Thread Doron Cohen
Karl Koch [EMAIL PROTECTED] wrote: For the documents Lucene employs its norm_d_t which is explained as: norm_d_t : square root of number of tokens in d in the same field as t Actually (by default) it is: 1 / sqrt(#tokens in d with same field as t) basically just the square root of the

Re: lucene functionality

2006-12-13 Thread Doron Cohen
Lucene RangeQuery would do for the time and numeric reqs. Mark Mei [EMAIL PROTECTED] wrote: At the bottom of this email is the sample xml file that we are using today. We have about 10 million of these. We need to know whether Lucene can support the following functionalities. (1) Each field

Re: Lucene change field values to wrong ones when indexing

2006-12-14 Thread Doron Cohen
Two things I would check: 1) converting pubDate to String during indexing for later date-range-filtering of search results might not work well, because, e.g., string-wise, 9 > 100. You could use Lucene's DateTools - there's an example in TestDateFilter -
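A sketch of both sides using DateTools (the field name and Resolution are assumptions); the point is that the same encoding is used at index and search time, so string order equals date order:

    import java.util.Date;
    import org.apache.lucene.document.DateTools;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.RangeQuery;

    public class PubDateRange {
      public static void addPubDate(Document doc, Date pubDate) {
        doc.add(new Field("pubDate",
            DateTools.dateToString(pubDate, DateTools.Resolution.DAY),
            Field.Store.YES, Field.Index.UN_TOKENIZED));
      }

      public static Query between(Date from, Date to) {
        return new RangeQuery(
            new Term("pubDate", DateTools.dateToString(from, DateTools.Resolution.DAY)),
            new Term("pubDate", DateTools.dateToString(to, DateTools.Resolution.DAY)),
            true);   // inclusive on both ends
      }
    }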

Re: range query on dates

2006-12-14 Thread Doron Cohen
There is an example in TestDateFilter http://svn.apache.org/viewvc/lucene/java/trunk/src/test/org/apache/lucene/search/TestDateFilter.java?view=log Cam Bazz [EMAIL PROTECTED] wrote: Hello, how can I make a query to bring documents between timestamp begin and timestamp end, given that I have

Re: sorting by per doc hit count

2006-12-19 Thread Doron Cohen
Mark Miller [EMAIL PROTECTED] wrote on 19/12/2006 09:21:00: LIA mentioned something about needing to rebuild the index if you change Similarity's. That does not make sense to me yet. It would seem you could alternate them. What does scoring have to do with indexing? For this part of your

Re: Extracting data from Lucene index files

2006-12-20 Thread Doron Cohen
Using term vectors means passing over the terms too many times - i.e. - loop on terms - - loop on docs of a term - - - loop on terms of a doc. Would something like this be better: do { System.out.println(tenum.term()+ appears in +tenum.docFreq()+ docs!); TermDocs td =

Re: First search is slow after updating index .. subsequent searches very fast

2006-12-21 Thread Doron Cohen
Something like dd if=/path/to/index/foo.cfs of=/dev/null Be careful not to make a mistake with the 'of' argument of 'dd' - see http://en.wikipedia.org/wiki/Dd_(Unix)
