[jira] Created: (LUCENE-1197) IndexWriter can flush too early when flushing by RAM usage

2008-02-29 Thread Michael McCandless (JIRA)
IndexWriter can flush too early when flushing by RAM usage
--

             Key: LUCENE-1197
             URL: https://issues.apache.org/jira/browse/LUCENE-1197
         Project: Lucene - Java
      Issue Type: Bug
      Components: Index
Affects Versions: 2.3
        Reporter: Michael McCandless
        Assignee: Michael McCandless
        Priority: Minor
         Fix For: 2.4


There is a silly bug in how DocumentsWriter tracks its RAM usage:
whenever term vectors are enabled, it incorrectly counts the space
used by term vectors towards flushing, when in fact this space is
recycled per document.

This is not a functionality bug.  Its only effect is that flushes happen
too frequently, so IndexWriter uses less RAM than you asked it to.
As a workaround, you can simply give it a bigger RAM buffer.
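
To illustrate the effect, here is a minimal sketch of flush-by-RAM accounting. It is a toy model, not Lucene's DocumentsWriter: the class name, the method, and all byte counts are assumptions made up for illustration. It only shows why counting bytes that are recycled per document toward the flush threshold makes flushes more frequent, and why a larger buffer compensates.

```java
/** Toy model of flush-by-RAM accounting; hypothetical, not Lucene's actual code. */
final class RamFlushSketch {

    /**
     * Counts the flushes triggered while buffering numDocs documents.
     * If countRecycledBytes is true, the per-document term vector bytes are
     * (wrongly) added to the tracked total even though that space is reused
     * for the next document.
     */
    static int countFlushes(int numDocs, long postingsBytesPerDoc,
                            long termVectorBytesPerDoc, long ramBufferBytes,
                            boolean countRecycledBytes) {
        long tracked = 0;
        int flushes = 0;
        for (int i = 0; i < numDocs; i++) {
            tracked += postingsBytesPerDoc;       // memory that really accumulates
            if (countRecycledBytes) {
                tracked += termVectorBytesPerDoc; // the accounting mistake
            }
            if (tracked >= ramBufferBytes) {      // flush-by-RAM trigger
                flushes++;
                tracked = 0;
            }
        }
        return flushes;
    }

    public static void main(String[] args) {
        // Made-up sizes: 1000 docs, 1000 bytes of postings and 1000 bytes of
        // term vectors per doc, with a 100,000 byte buffer.
        System.out.println("buggy:      " + countFlushes(1000, 1000, 1000, 100_000, true));
        System.out.println("fixed:      " + countFlushes(1000, 1000, 1000, 100_000, false));
        // Workaround: double the buffer while the accounting is still wrong.
        System.out.println("workaround: " + countFlushes(1000, 1000, 1000, 200_000, true));
    }
}
```

With these made-up numbers the buggy accounting flushes twice as often as the corrected one, and doubling the buffer brings the flush count back to the corrected value. In Lucene 2.3 the buffer size itself is set with IndexWriter.setRAMBufferSizeMB.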

I will commit a fix shortly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Resolved: (LUCENE-1197) IndexWriter can flush too early when flushing by RAM usage

2008-02-29 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1197.


Resolution: Fixed





[jira] Updated: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

2008-02-29 Thread Mathieu Lecarme (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu Lecarme updated LUCENE-1190:


Attachment: aphone+lexicon.patch

> a lexicon object for merging spellchecker and synonyms from stemming
> 
>
> Key: LUCENE-1190
> URL: https://issues.apache.org/jira/browse/LUCENE-1190
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*, Search
>Affects Versions: 2.3
>Reporter: Mathieu Lecarme
> Attachments: aphone+lexicon.patch, aphone+lexicon.patch
>
>
> Some Lucene features need a reference list of words. Spellchecking is the
> basic example, but synonyms are another use. Other tools can also work
> more smoothly with a list of words, without disturbing the main index: stemming
> and other simplifications of words (anagram, phonetic ...).
> For that, I suggest a Lexicon object, which contains words (Term + frequency)
> and which can be built from a Lucene Directory or from plain text files.
> Classical TokenFilters can be used with Lexicon (LowerCaseFilter and
> ISOLatin1AccentFilter should be the most useful).
> Lexicon uses a Lucene Directory; each Word is a Document, and each piece of
> metadata is a Field (word, ngram, phonetic, fields, anagram, size ...).
> Above a minimum size, the number of different words used in an index can be
> considered stable. So a standard Lexicon (built from Wikipedia, for
> example) can be used.
> A similarTokenFilter is provided.
> A spellchecker will come soon.
> A fuzzySearch implementation and a neutral synonym TokenFilter can also be done.
> Unused words can be removed on demand (lazy delete?)
> Any criticism or suggestions?




[jira] Commented: (LUCENE-1190) a lexicon object for merging spellchecker and synonyms from stemming

2008-02-29 Thread Mathieu Lecarme (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12573907#action_12573907
 ] 

Mathieu Lecarme commented on LUCENE-1190:
-

New features:
A helper to extend a query with terms similar to each term:
+type:dog +name:rintint*
will become:
+type:(+dog (dogs doggy)^0.7) +name:rintint*
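
The rewriting above can be sketched as plain string construction. This is only an illustration of the output shape, not the code in the attached patch: the class name, the method, and the way similar terms are passed in are all assumptions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of expanding a required term clause with similar terms. */
final class SimilarityExpansionSketch {

    /**
     * Builds a required clause for field:term. When similar terms are given,
     * the original term stays required and the similar terms are added as an
     * optional, boosted group.
     */
    static String expandClause(String field, String term,
                               List<String> similarTerms, double boost) {
        if (similarTerms.isEmpty()) {
            return "+" + field + ":" + term;      // nothing to expand
        }
        String group = "(" + String.join(" ", similarTerms) + ")^" + boost;
        return "+" + field + ":(+" + term + " " + group + ")";
    }

    public static void main(String[] args) {
        String typeClause = expandClause("type", "dog", Arrays.asList("dogs", "doggy"), 0.7);
        // The wildcard clause is left untouched in this sketch.
        System.out.println(typeClause + " +name:rintint*");
        System.out.println(expandClause("type", "dog", Collections.<String>emptyList(), 0.7));
    }
}
```

Keeping the original term required and the similar terms optional means the expansion can only add lower-scored matches; it never drops a document that matched the original query.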

A "Do you mean" pattern packaged over IndexSearcher. If the number of search
results is under a threshold, a sorted suggestion list for each term is
provided, along with a rewritten query sentence:
truc:brawn
will become:
truc:brown 
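
A minimal sketch of that decision, assuming the suggestions arrive already sorted with the best one first. The class, method, and parameter names are hypothetical and do not come from the patch; how suggestions are ranked is out of scope here.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a "Do you mean" rewrite for a single term query. */
final class DidYouMeanSketch {

    /**
     * Returns the query unchanged when it found enough hits; otherwise
     * swaps the term for the top-ranked suggestion, if there is one.
     */
    static String rewrite(String field, String term, int hitCount,
                          int threshold, List<String> rankedSuggestions) {
        if (hitCount >= threshold || rankedSuggestions.isEmpty()) {
            return field + ":" + term;            // keep the original query
        }
        return field + ":" + rankedSuggestions.get(0);
    }

    public static void main(String[] args) {
        List<String> suggestions = Arrays.asList("brown", "brawl");
        // Too few hits: the term is replaced by the best suggestion.
        System.out.println(rewrite("truc", "brawn", 0, 5, suggestions));
        // Enough hits: the query is left alone.
        System.out.println(rewrite("truc", "brawn", 12, 5, suggestions));
    }
}
```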




