[jira] Commented: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

2010-03-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843539#action_12843539
 ] 

Cédrik LIME commented on LUCENE-2015:
-

Robert, any news on this patch? Can we get it applied for Lucene 3.1?

> ASCIIFoldingFilter: expose folding logic + small improvements to 
> ISOLatin1AccentFilter
> --
>
> Key: LUCENE-2015
> URL: https://issues.apache.org/jira/browse/LUCENE-2015
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Cédrik LIME
>Priority: Minor
> Attachments: ASCIIFoldingFilter-no_formatting.patch, 
> ASCIIFoldingFilter-no_formatting.patch, Filters.patch, 
> ISOLatin1AccentFilter.patch
>
>
> This patch adds a couple of non-ascii chars to ISOLatin1AccentFilter (namely: 
> left & right single quotation marks, en dash, em dash) which we very 
> frequently encounter in our projects. I know that this class is now 
> deprecated; this improvement is for legacy code that hasn't migrated yet.
> It also enables easy access to the ASCII folding technique used in 
> ASCIIFoldingFilter for potential re-use in non-Lucene-related code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

2010-03-10 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cédrik LIME updated LUCENE-2015:


Attachment: LUCENE-2015.patch

Robert: I liked the dual approach (fold a single {{char}} / a {{char[]}}) as it 
offered maximum flexibility (folding a String didn't incur the systematic copy of 
the input that {{toCharArray()}} makes; I could use {{charAt()}} in a loop).
Nevertheless, I will be happy with a single method if this is your preferred 
approach.

I have updated your patch slightly to model the API after 
{{System.arraycopy()}}, which makes it a bit more flexible and easier to use:
* added offset for output
* shuffled the arguments order to mimic {{System.arraycopy()}}
* updated JavaDoc
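The {{System.arraycopy()}}-modeled API described above might look like the following sketch. The class name and the tiny folding table here are purely illustrative (the actual patch carries the full Unicode mapping, and the real filter can expand one character to several, e.g. Æ to AE, which this 1:1 sketch ignores); only the arraycopy-style argument order is the point.

```java
public final class ASCIIFolder {
    // Illustrative, hypothetical subset of the folding table.
    private static char fold(char c) {
        switch (c) {
            case '\u00E9': case '\u00E8': case '\u00EA': return 'e';  // é è ê
            case '\u00E0': case '\u00E2':               return 'a';   // à â
            case '\u2018': case '\u2019':               return '\'';  // ‘ ’
            case '\u2013': case '\u2014':               return '-';   // – —
            default:                                    return c;
        }
    }

    /**
     * Folds {@code length} chars, reading from {@code input} at {@code inputPos}
     * and writing to {@code output} at {@code outputPos}, mirroring the
     * argument order of {@link System#arraycopy}.
     */
    public static void foldToASCII(char[] input, int inputPos,
                                   char[] output, int outputPos, int length) {
        for (int i = 0; i < length; i++) {
            output[outputPos + i] = fold(input[inputPos + i]);
        }
    }

    public static void main(String[] args) {
        char[] in = "d\u00E9j\u00E0 vu \u2013 caf\u00E9".toCharArray();
        char[] out = new char[in.length];
        foldToASCII(in, 0, out, 0, in.length);
        System.out.println(new String(out)); // deja vu - cafe
    }
}
```

As with arraycopy, the output offset lets a caller fold directly into the middle of a reusable buffer instead of allocating per call.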

> ASCIIFoldingFilter: expose folding logic + small improvements to 
> ISOLatin1AccentFilter
> --
>
> Key: LUCENE-2015
> URL: https://issues.apache.org/jira/browse/LUCENE-2015
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Cédrik LIME
>Assignee: Robert Muir
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ASCIIFoldingFilter-no_formatting.patch, 
> ASCIIFoldingFilter-no_formatting.patch, Filters.patch, 
> ISOLatin1AccentFilter.patch, LUCENE-2015.patch, LUCENE-2015.patch
>
>
> This patch adds a couple of non-ascii chars to ISOLatin1AccentFilter (namely: 
> left & right single quotation marks, en dash, em dash) which we very 
> frequently encounter in our projects. I know that this class is now 
> deprecated; this improvement is for legacy code that hasn't migrated yet.
> It also enables easy access to the ASCII folding technique used in 
> ASCIIFoldingFilter for potential re-use in non-Lucene-related code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Account password

2010-04-13 Thread jira

  You (or someone else) has reset your password.

-

Your password has been changed to: MCwqNr

You can change your password here:

   https://issues.apache.org/jira/secure/ViewProfile.jspa

Here are the details of your account:
-
Username: java-dev@lucene.apache.org
   Email: java-dev@lucene.apache.org
   Full Name: Lucene Developers
Password: MCwqNr
(You can always retrieve these via the "Forgot Password" link on the signup 
page)
-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-489) Wildcard Queries with leading "*"

2006-01-23 Thread JIRA
Wildcard Queries with leading "*"
-

 Key: LUCENE-489
 URL: http://issues.apache.org/jira/browse/LUCENE-489
 Project: Lucene - Java
Type: Wish
  Components: QueryParser  
Reporter: Peter Schäfer


It would be nice to have wildcard queries with a leading wildcard ("?" or "*").

I'm aware that this is a well-known issue, and I do understand the reasons 
behind it,
but try explaining that to our end-users ... :-(




-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-489) Wildcard Queries with leading "*"

2006-01-24 Thread JIRA
[ 
http://issues.apache.org/jira/browse/LUCENE-489?page=comments#action_12363790 ] 

Peter Schäfer commented on LUCENE-489:
--

Thanks, I know that those queries perform badly.

Do you have a hint on how to improve those kinds of queries?
Or is there a chance that we will see a more efficient implementation in the 
future?
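The standard workaround (later shipped in Lucene's contrib analyzers as ReverseStringFilter) is to also index each term reversed, which turns a leading-wildcard query into a cheap prefix scan. This sketch is independent of Lucene's APIs and just demonstrates the idea on a sorted term set:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public final class ReversedTermIndex {
    private final TreeSet<String> reversed = new TreeSet<>();

    public void add(String term) {
        // Store the term reversed; "*suffix" then becomes a prefix lookup.
        reversed.add(new StringBuilder(term).reverse().toString());
    }

    /** Finds terms ending with {@code suffix}, i.e. matching "*suffix". */
    public List<String> leadingWildcard(String suffix) {
        String prefix = new StringBuilder(suffix).reverse().toString();
        List<String> hits = new ArrayList<>();
        // A bounded prefix scan replaces a scan over every term in the index.
        for (String r : reversed.tailSet(prefix)) {
            if (!r.startsWith(prefix)) break;
            hits.add(new StringBuilder(r).reverse().toString());
        }
        return hits;
    }

    public static void main(String[] args) {
        ReversedTermIndex idx = new ReversedTermIndex();
        for (String t : List.of("handler", "scheduler", "compiler", "compile")) {
            idx.add(t);
        }
        System.out.println(idx.leadingWildcard("ler")); // all terms matching *ler
    }
}
```

The cost is roughly doubled term storage for the extra field, traded against avoiding a full term enumeration per query.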




> Wildcard Queries with leading "*"
> -
>
>  Key: LUCENE-489
>  URL: http://issues.apache.org/jira/browse/LUCENE-489
>  Project: Lucene - Java
> Type: Wish
>   Components: QueryParser
> Reporter: Peter Schäfer

>
> It would be nice to have wildcard queries with a leading wildcard ("?" or 
> "*").
> I'm aware that this is a well-known issue, and I do understand the reasons 
> behind it,
> but try explaining that to our end-users ... :-(

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-489) Wildcard Queries with leading "*"

2006-01-24 Thread JIRA
[ 
http://issues.apache.org/jira/browse/LUCENE-489?page=comments#action_12363796 ] 

Peter Schäfer commented on LUCENE-489:
--

Great idea, thanks!

But what about *xyz*? :-(
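For a double-ended wildcard like *xyz* the reversal trick no longer helps; a common answer (a hedged sketch of the general n-gram technique, not a Lucene API) is to index character trigrams, so the infix reduces to intersecting small posting sets plus a final verification:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public final class TrigramIndex {
    private final Map<String, Set<String>> grams = new HashMap<>();

    public void add(String term) {
        // Index every length-3 substring, pointing back at the term.
        for (int i = 0; i + 3 <= term.length(); i++) {
            grams.computeIfAbsent(term.substring(i, i + 3), k -> new HashSet<>())
                 .add(term);
        }
    }

    /** Candidates for "*needle*": intersect the posting sets of the needle's
     *  trigrams, then verify to drop false positives from shared grams. */
    public Set<String> infix(String needle) {
        Set<String> candidates = null;
        for (int i = 0; i + 3 <= needle.length(); i++) {
            Set<String> s = grams.getOrDefault(needle.substring(i, i + 3), Set.of());
            if (candidates == null) candidates = new HashSet<>(s);
            else candidates.retainAll(s);
        }
        if (candidates == null) return Set.of();        // needle shorter than a trigram
        candidates.removeIf(t -> !t.contains(needle));  // final verification pass
        return candidates;
    }

    public static void main(String[] args) {
        TrigramIndex idx = new TrigramIndex();
        for (String t : List.of("analyzer", "paralyze", "lucene")) idx.add(t);
        System.out.println(idx.infix("alyz")); // candidates for *alyz*
    }
}
```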

> Wildcard Queries with leading "*"
> -
>
>  Key: LUCENE-489
>  URL: http://issues.apache.org/jira/browse/LUCENE-489
>  Project: Lucene - Java
> Type: Wish
>   Components: QueryParser
> Reporter: Peter Schäfer

>
> It would be nice to have wildcard queries with a leading wildcard ("?" or 
> "*").
> I'm aware that this is a well-known issue, and I do understand the reasons 
> behind it,
> but try explaining that to our end-users ... :-(

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-552) NPE during mergeSegments

2006-04-21 Thread JIRA
NPE during mergeSegments


 Key: LUCENE-552
 URL: http://issues.apache.org/jira/browse/LUCENE-552
 Project: Lucene - Java
Type: Bug

  Components: Index  
Versions: 2.0
 Environment: 2.0-rc1-dev

Reporter: Ole Kværnø


The JVM stops with an NPE after running for about 6-8 hours, indexing about 
500,000 articles.
After a restart of the JVM, the problematic merge seems to complete OK.

Exception in thread "Thread-4" java.lang.NullPointerException
at org.apache.lucene.store.RAMInputStream.<init>(RAMInputStream.java:32)
at org.apache.lucene.store.RAMDirectory.openInput(RAMDirectory.java:171)
at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:155)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:129)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:110)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:700)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:684)
at org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:654)
at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:402)
at com.magentanews.index.IndexManager.insertDocuments(IndexManager.java:190)
at 
com.magentanews.index.SplitIndexManager.insertDocuments(SplitIndexManager.java:152)
at com.magentanews.index.LuceneFeeder.insertDocuments(LuceneFeeder.java:234)
at 
com.magentanews.index.IndexerApplication.insertDocuments(IndexerApplication.java:255)
at com.magentanews.index.IndexerApplication.run(IndexerApplication.java:160)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-552) NPE during mergeSegments

2006-04-26 Thread JIRA
[ 
http://issues.apache.org/jira/browse/LUCENE-552?page=comments#action_12376656 ] 

Ole Kværnø commented on LUCENE-552:
---

Occurred once more. Further log details (SRC = rev 396455):
 ...
 [java] [GC 537735K->471821K(769088K), 0.0765670 secs]
 [java] [GC 541005K->475102K(769088K), 0.0767420 secs]
 [java] [GC 544286K->478897K(769088K), 0.0757440 secs]
 [java] [GC 548081K->485157K(769088K), 0.1270930 secs]
 [java] merging segments _17q8 (1 docs) _17q9 (1 docs) _17qa (1 docs) _17qb 
(1 docs) _17qc (1 docs) _17qd (1 docs) _17qe (1 docs) _17qf (1 docs) _17qg (1 
docs) _17qh (1 docs) _17qi (1 docs) _17qj (1 docs) _17qk (1 docs) _17ql (1 
docs) _17qm (1 docs) _17qn (1 docs) _17qo (1 docs) _17qp (1 docs) _17qq (1 
docs) _17qr (1 docs) _17qs (1 docs) _17qt (1 docs) _17qu (1 docs) _17qv (1 
docs) _17qw (1 docs) _17qx (1 docs) _17qy (1 docs) _17qz (1 docs) _17r0 (1 
docs) _17r1 (1 docs) _17r2 (1 docs) _17r3 (1 docs) _17r4 (1 docs) _17r5 (1 
docs) _17r6 (1 docs) _17r7 (1 docs) _17r8 (1 docs) _17r9 (1 docs) _17ra (1 
docs) _17rb (1 docs) _17rc (1 docs) _17rd (1 docs) _17re (1 docs) _17rf (1 
docs) _17rg (1 docs) _17rh (1 docs) _17ri (1 docs) _17rj (1 docs) _17rk (1 
docs) _17rl (1 docs) _17rm (1 docs) _17rn (1 docs) _17ro (1 docs) _17rp (1 
docs) _17rq (1 docs) _17rr (1 docs) _17rs (1 docs) _17rt (1 docs) _17ru (1 
docs) _17rv (1 docs) _17rw (1 docs) _17rx (1 docs) _17ry (1 docs) _17rz (1 
docs) _17s0 (1 docs) _17s1 (1 docs) _17s2 (1 docs) _17s3 (1 docs) _17s4 (1 
docs) _17s5 (1 docs) _17s6 (1 docs) _17s7 (1 docs) _17s8 (1 docs) _17s9 (1 
docs) _17sa (1 docs) _17sb (1 docs) _17sc (1 docs) _17sd (1 docs) _17se (1 
docs) _17sf (1 docs) _17sg (1 docs) _17sh (1 docs) _17si (1 docs) _17sj (1 
docs) _17sk (1 docs) _17sl (1 docs) _17sm (1 docs) _17sn (1 docs) _17so (1 
docs) _17sp (1 docs) _17sq (1 docs) _17sr (1 docs) _17ss (1 docs) _17st (1 
docs) _17su (1 docs) _17sv (1 docs) _17sw (1 docs) _17sx (1 docs) _17sy (1 
docs) _17sz (1 docs) _17t0 (1 docs) _17t1 (1 docs) _17t2 (1 docs) _17t3 (1 
docs) _17t4 (1 docs) _17t5 (1 docs) _17t6 (1 docs) _17t7 (1 docs) _17t8 (1 
docs) _17t9 (1 docs) _17ta (1 docs) _17tb (1 docs) _17tc (1 docs) _17td (1 
docs) _17te (1 docs) _17tf (1 docs) _17tg (1 docs) _17th (1 docs) _17ti (1 
docs) _17tj (1 docs) _17tk (1 docs) _17tl (1 docs) _17tm (1 docs) _17tn (1 
docs) _17to (1 docs) _17tp (1 docs) _17tq (1 docs) _17tr (1 docs) _17ts (1 
docs) _17tt (1 docs) _17tu (1 docs) _17tv (1 docs) _17tw (1 docs) _17tx (1 
docs) _17ty (1 docs) _17tz (1 docs) _17u0 (1 docs) _17u1 (1 docs) _17u2 (1 
docs) _17u3 (1 docs) _17u4 (1 docs) _17u5 (1 docs) _17u6 (1 docs) _17u7 (1 
docs) _17u8 (1 docs) _17u9 (1 docs) _17ua (1 docs) _17ub (1 docs) _17uc (1 
docs) _17ud (1 docs) _17ue (1 docs) _17uf (1 docs) _17ug (1 docs) _17uh (1 
docs) _17ui (1 docs) _17uj (1 docs) _17uk (1 docs) _17ul (1 docs) _17um (1 
docs) _17un (1 docs) _17uo (1 docs) _17up (1 docs) _17uq (1 docs) _17ur (1 
docs) _17us (1 docs) _17ut (1 docs) _17uu (1 docs) _17uv (1 docs) _17uw (1 
docs) _17ux (1 docs) _17uy (1 docs) _17uz (1 docs) _17v0 (1 docs) _17v1 (1 
docs) _17v2 (1 docs) _17v3 (1 docs) _17v4 (1 docs) _17v5 (1 docs) _17v6 (1 
docs) _17v7 (1 docs) _17v8 (1 docs) _17v9 (1 docs) _17va (1 docs) _17vb (1 
docs) _17vc (1 docs) _17vd (1 docs) _17ve (1 docs) _17vf (1 docs) _17vg (1 
docs) _17vh (1 docs) _17vi (1 docs) _17vj (1 docs) _17vk (1 docs) _17vl (1 
docs) _17vm (1 docs) _17vn (1 docs) _17vo (1 docs) _17vp (1 docs) _17vq (1 
docs) _17vr (1 docs) _17vs (1 docs) _17vt (1 docs) _17vu (1 docs) _17vv (1 
docs) _17vw (1 docs) _17vx (1 docs) _17vy (1 docs) _17vz (1 docs) _17w0 (1 
docs) _17w1 (1 docs) _17w2 (1 docs) _17w3 (1 docs) _17w4 (1 docs) _17w5 (1 
docs) _17w6 (1 docs) _17w7 (1 docs) _17w8 (1 docs) _17w9 (1 docs) _17wa (1 
docs) _17wb (1 docs) _17wc (1 docs) _17wd (1 docs) _17we (1 docs) _17wf (1 
docs) _17wg (1 docs) _17wh (1 docs) _17wi (1 docs) _17wj (1 docs) _17wk (1 
docs) _17wl (1 docs) _17wm (1 docs) _17wn (1 docs) _17wo (1 docs) _17wp (1 
docs) _17wq (1 docs) _17wr (1 docs) _17ws (1 docs) _17wt (1 docs) _17wu (1 
docs) _17wv (1 docs) _17ww (1 docs) _17wx (1 docs) _17wy (1 docs) _17wz (1 
docs) _17x0 (1 docs) _17x1 (1 docs) _17x2 (1 docs) _17x3 (1 docs) _17x4 (1 
docs) _17x5 (1 docs) _17x6 (1 docs) _17x7 (1 docs) _17x8 (1 docs) _17x9 (1 
docs) _17xa (1 docs) _17xb (1 docs) _17xc (1 docs) _17xd (1 docs) _17xe (1 
docs) _17xf (1 docs) _17xg (1 docs) _17xh (1 docs) _17xi (1 docs) _17xj (1 
docs) _17xk (1 docs) _17xl (1 docs) _17xm (1 docs) _17xn (1 docs) _17xo (1 
docs) _17xp (1 docs) _17xq (1 docs) _17xr (1 docs) _17xs (1 docs) _17xt (1 
docs) _17xu (1 docs) _17xv (1 docs) _17xw (1 docs) _17xx (1 docs) _17xy (1 
docs) _17xz (1 docs) _17y0 (1 docs) _17y1 (1 docs) _17y2 (1 docs) _17y3 (1 
docs) _17y4 (1 docs) _17y5 (1 docs) _17y6 (1 docs) _17y7 (1 docs) _17y8 (1 
docs) _17y9 (1 docs) _17ya (1 docs) _17yb (1 docs) _17yc (1 docs) _17yd (1 
docs) _17ye (1 docs) _

[jira] Created: (LUCENE-584) Decouple Filter from BitSet

2006-05-31 Thread JIRA
Decouple Filter from BitSet
---

 Key: LUCENE-584
 URL: http://issues.apache.org/jira/browse/LUCENE-584
 Project: Lucene - Java
Type: Improvement

  Components: Search  
Versions: 2.0.1
Reporter: Peter Schäfer
Priority: Minor


{code}
package org.apache.lucene.search;

public abstract class Filter implements java.io.Serializable 
{
  public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
}

public interface AbstractBitSet 
{
  public boolean get(int index);
}

{code}

It would be useful if the method =Filter.bits()= returned an abstract 
interface, instead of =java.util.BitSet=.

Use case: there is a very large index, and, depending on the user's privileges, 
only a small portion of the index is actually visible.
Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
memory. It would be desirable to have an alternative BitSet implementation with 
smaller memory footprint.

Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
obviously not designed for that purpose.
That's why I propose to use an interface instead. The default implementation 
could still delegate to =java.util.BitSet=.
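A sparse implementation behind the proposed interface can be as simple as a sorted array of the set bit indices. This is only an illustrative sketch of why the interface helps (the class and method names besides get() are invented here):

```java
import java.util.Arrays;

public final class SparseBitSetDemo {
    /** The abstraction proposed in the issue: Filter.bits() would return this. */
    interface AbstractBitSet {
        boolean get(int index);
    }

    /** Sparse bits as a sorted int[]: memory is proportional to the number
     *  of visible documents, not to the total size of the index. */
    static final class SparseBitSet implements AbstractBitSet {
        private final int[] setBits;

        SparseBitSet(int... bits) {
            setBits = bits.clone();
            Arrays.sort(setBits);  // keep sorted so lookups can binary-search
        }

        public boolean get(int index) {
            return Arrays.binarySearch(setBits, index) >= 0;
        }
    }

    public static void main(String[] args) {
        // Three visible docs out of a hypothetical multi-million-doc index:
        // a java.util.BitSet would allocate bits for the whole doc range.
        AbstractBitSet bits = new SparseBitSet(42, 999_983, 7_000_000);
        System.out.println(bits.get(42) + " " + bits.get(43)); // true false
    }
}
```

The default implementation could keep delegating to java.util.BitSet for dense filters, with the sparse form chosen only when few bits are set.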



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2006-06-01 Thread JIRA
[ 
http://issues.apache.org/jira/browse/LUCENE-584?page=comments#action_12414224 ] 

Peter Schäfer commented on LUCENE-584:
--

Thanks, this looks interesting.

Regards,
Peter

> Decouple Filter from BitSet
> ---
>
>  Key: LUCENE-584
>  URL: http://issues.apache.org/jira/browse/LUCENE-584
>  Project: Lucene - Java
> Type: Improvement

>   Components: Search
> Versions: 2.0.1
> Reporter: Peter Schäfer
> Priority: Minor

>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possibly to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-593) Spellchecker's dictionary iterator misbehaves

2006-06-08 Thread JIRA
Spellchecker's dictionary iterator misbehaves
-

 Key: LUCENE-593
 URL: http://issues.apache.org/jira/browse/LUCENE-593
 Project: Lucene - Java
Type: Bug

  Components: Search  
Versions: 2.0.0
 Environment: Any (mine is Fedora Core 4 - Linux pc983 2.6.16-1.2111_FC4 #1 Sat 
May 20 19:59:40 EDT 2006 i686 i686 i386 GNU/Linux)
Reporter: Kåre Fiedler Christiansen


In LuceneDictionary, the LuceneIterator.hasNext() method has two issues that 
make it misbehave:

1) If hasNext is called more than once, items are skipped
2) Much more seriously, field names are compared with != rather than 
.equals(), with the potential result that nothing is indexed
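The fix for issue 1 is the standard lookahead-caching pattern: hasNext() advances the underlying cursor at most once per element actually returned, so repeated calls are harmless. A self-contained sketch of that contract (not the LuceneDictionary code itself):

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Iterator wrapper whose hasNext() is idempotent: the lookahead element is
 *  cached, so calling hasNext() repeatedly never skips items. */
public final class LookaheadIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private T next;
    private boolean hasCached;

    public LookaheadIterator(Iterator<T> delegate) {
        this.delegate = delegate;
    }

    public boolean hasNext() {
        if (!hasCached && delegate.hasNext()) {
            next = delegate.next();  // advance at most once per returned item
            hasCached = true;
        }
        return hasCached;
    }

    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        hasCached = false;           // consume the cached lookahead
        return next;
    }

    public static void main(String[] args) {
        Iterator<String> it = new LookaheadIterator<>(List.of("a", "b").iterator());
        it.hasNext(); it.hasNext(); it.hasNext();  // repeated calls skip nothing
        System.out.println(it.next() + it.next()); // ab
    }
}
```

Issue 2 is the usual String identity trap: == / != only works for interned strings, so the comparison must use .equals().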


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-593) Spellchecker's dictionary iterator misbehaves

2006-06-08 Thread JIRA
[ 
http://issues.apache.org/jira/browse/LUCENE-593?page=comments#action_12415345 ] 

Kåre Fiedler Christiansen commented on LUCENE-593:
--

Ad 1)
- I don't. I just noticed it when looking at the code. However, if the iterator 
is only supposed to be used internally by the spell-checking code, why is it 
public at all?
Ad 2)
- GREAT! That also provides the workaround until this patch can be applied: 
simply call the constructor with an interned string.

> Spellchecker's dictionary iterator misbehaves
> -
>
>  Key: LUCENE-593
>  URL: http://issues.apache.org/jira/browse/LUCENE-593
>  Project: Lucene - Java
> Type: Bug

>   Components: Search
> Versions: 2.0.0
>  Environment: Any (mine is Fedora Core 4 - Linux pc983 2.6.16-1.2111_FC4 #1 
> Sat May 20 19:59:40 EDT 2006 i686 i686 i386 GNU/Linux)
> Reporter: Kåre Fiedler Christiansen
>  Attachments: LuceneDictionary.java.diff
>
> In LuceneDictionary, the LuceneIterator.hasNext() method has two issues that 
> make it misbehave:
> 1) If hasNext is called more than once, items are skipped
> 2) Much more seriously, when comparing fieldnames it is done with != rather 
> than .equals() with the potential result that nothing is indexed

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-593) Spellchecker's dictionary iterator misbehaves

2006-06-30 Thread JIRA
 [ http://issues.apache.org/jira/browse/LUCENE-593?page=all ]

Kåre Fiedler Christiansen updated LUCENE-593:
-

Attachment: LuceneDictionary.java.diff

Patch to make LuceneDictionary's iterator conform to the Iterator contract

> Spellchecker's dictionary iterator misbehaves
> -
>
>  Key: LUCENE-593
>  URL: http://issues.apache.org/jira/browse/LUCENE-593
>  Project: Lucene - Java
> Type: Bug

>   Components: Search
> Versions: 2.0.0
>  Environment: Any (mine is Fedora Core 4 - Linux pc983 2.6.16-1.2111_FC4 #1 
> Sat May 20 19:59:40 EDT 2006 i686 i686 i386 GNU/Linux)
> Reporter: Kåre Fiedler Christiansen
>  Attachments: LuceneDictionary.java.diff, LuceneDictionary.java.diff
>
> In LuceneDictionary, the LuceneIterator.hasNext() method has two issues that 
> make it misbehave:
> 1) If hasNext is called more than once, items are skipped
> 2) Much more seriously, when comparing fieldnames it is done with != rather 
> than .equals() with the potential result that nothing is indexed

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-285) David Spencer Spell Checker improved

2006-06-30 Thread JIRA
[ 
http://issues.apache.org/jira/browse/LUCENE-285?page=comments#action_12418682 ] 

Cédrik LIME commented on LUCENE-285:


Note: not sure this is the right way of proceeding with JIRA. Should I open a 
new bug report instead of commenting?

This implementation (as of Lucene 1.9.1) uses an unoptimized implementation of 
the Levenshtein distance algorithm (it uses way too much memory). Please see Bug 
38911 (http://issues.apache.org/bugzilla/show_bug.cgi?id=38911) for more 
information and the new implementation.
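The standard memory optimization for this algorithm (whether or not it is exactly what Bug 38911 implements) is the two-row formulation of the dynamic program: memory drops from the full |a| x |b| matrix to two rows of length |b|+1. A sketch:

```java
public final class Levenshtein {
    /** Two-row dynamic program: only the previous and current rows are kept,
     *  so memory is O(|b|) instead of O(|a| * |b|). */
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;  // base row: j deletions
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;  // base column: i insertions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;  // swap row buffers
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```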

> David Spencer Spell Checker improved
> 
>
>  Key: LUCENE-285
>  URL: http://issues.apache.org/jira/browse/LUCENE-285
>  Project: Lucene - Java
> Type: Improvement

>   Components: Search
> Versions: unspecified
>  Environment: Operating System: other
> Platform: All
> Reporter: Nicolas Maisonneuve
> Priority: Minor
>  Attachments: spellchecker.zip
>
> Hi,
> I developed a SpellChecker based on the David Spencer code (DSc) but more 
> flexible.
> The structure of the index is inspired by the DSc (for a 3-4 gram):
> word:
> gram3:
> gram4:
>  
> 3start:
> 4start:
> ..
> 3end:
> 4end:
> ..
> transposition:
>  
> This index is a dictionary, so there isn't the "freq" field as in the DSc 
> version.
> It's independent of the user index, so we can add words belonging to several
> fields of several indexes, for example, or, why not, from a file with a list of 
> words.
> The suggestSimilar method returns a list of suggested words sorted by
> Levenshtein distance and optionally by the popularity of the word for a 
> specific
> field in a user index. Moreover, this list can be restricted to words
> present in a specific field of a user index.
>  
> See the test case.
>  
> I hope this code will be put in the Lucene sandbox. 
>  
> Nicolas Maisonneuve

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-639) [PATCH] Slight performance improvement for readVInt() of IndexInput

2006-07-28 Thread JIRA
[ 
http://issues.apache.org/jira/browse/LUCENE-639?page=comments#action_12424038 ] 

Nicolas Lalevée commented on LUCENE-639:


The loop you unrolled has no compile-time-known iteration count: a VInt is not a 
fixed-length type. Your patch only works if you don't use VInts larger than 
268,435,456.
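For context, a VInt stores an integer in 7-bit groups, lowest group first, with the high bit of each byte as a continuation flag, so a 32-bit value needs at most five bytes. This sketch of the decoding loop is illustrative, not Lucene's exact code:

```java
public final class VInt {
    /** Decodes a variable-length int starting at {@code pos}: each byte
     *  contributes 7 payload bits; the high bit means "more bytes follow". */
    public static int read(byte[] buf, int pos) {
        byte b = buf[pos++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) {
        // 300 = 0b1_0010_1100 -> bytes 0xAC (low 7 bits + continuation), 0x02
        System.out.println(read(new byte[] {(byte) 0xAC, 0x02}, 0)); // 300
    }
}
```

An unrolling fixed at four data bytes covers only 4 * 7 = 28 payload bits, which is where the cap mentioned above comes from.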

> [PATCH] Slight performance improvement for readVInt() of IndexInput
> ---
>
> Key: LUCENE-639
> URL: http://issues.apache.org/jira/browse/LUCENE-639
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.0.0
>Reporter: Johan Stuyts
>Priority: Minor
> Attachments: Lucene2ReadVIntPerformance.patch
>
>
> By unrolling the loop in readVInt() I was able to get a slight, about 1.8 %, 
> performance improvement for this method. The test program invoked the method 
> over 17 million times on each run.
> I ran the performance tests on:
> - Windows XP Pro SP2
> - Sun JDK 1.5.0_07
> - YourKit 5.5.4
> - Lucene trunk

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-639) [PATCH] Slight performance improvement for readVInt() of IndexInput

2006-07-28 Thread JIRA
[ 
http://issues.apache.org/jira/browse/LUCENE-639?page=comments#action_12424043 ] 

Nicolas Lalevée commented on LUCENE-639:


Oh yes, you're right. In fact I was reading the Lucene index format, which 
doesn't specify any limit on the integer's length. Indeed, since the method 
returns a Java int, it is useless to parse more bytes than an int can hold!

> [PATCH] Slight performance improvement for readVInt() of IndexInput
> ---
>
> Key: LUCENE-639
> URL: http://issues.apache.org/jira/browse/LUCENE-639
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.0.0
>Reporter: Johan Stuyts
>Priority: Minor
> Attachments: Lucene2ReadVIntPerformance.patch
>
>
> By unrolling the loop in readVInt() I was able to get a slight, about 1.8 %, 
> performance improvement for this method. The test program invoked the method 
> over 17 million times on each run.
> I ran the performance tests on:
> - Windows XP Pro SP2
> - Sun JDK 1.5.0_07
> - YourKit 5.5.4
> - Lucene trunk

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-487) Database as a lucene index target

2006-08-03 Thread JIRA
[ 
http://issues.apache.org/jira/browse/LUCENE-487?page=comments#action_12425482 ] 

Christophe Bégot commented on LUCENE-487:
-

I believe that I detected a bug in this extension: a file containing only one 
character is not indexed correctly. The correction seems to work.

DBDirectory.java 
Line 426

Original:

Blob blob=rs.getBlob("DATA");
byte[] buffer=null;
long pos=1;
int length=0;
while(pos<blob.length()) {

length=BUFFER_SIZE>blob.length()-pos?(int)(blob.length()-pos+1):BUFFER_SIZE;
buffer=blob.getBytes(pos,length);
file.addData(buffer);

pos+=BUFFER_SIZE>blob.length()-pos?(int)(blob.length()-pos+1):BUFFER_SIZE;
}


Corrected:

Blob blob=rs.getBlob("DATA");
byte[] buffer=null;
long pos=1;
int length=0;
while(pos<=blob.length()) {

length=BUFFER_SIZE>blob.length()-pos?(int)(blob.length()-pos+1):BUFFER_SIZE;
buffer=blob.getBytes(pos,length);
file.addData(buffer);

pos+=BUFFER_SIZE>blob.length()-pos?(int)(blob.length()-pos+1):BUFFER_SIZE;
}

Thank you, Amir, for this extension.
Christophe
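Under that fix, the chunked read can also be written with clearer arithmetic: Math.min replaces the ternary, and the 1-based Blob position makes the off-by-one visible. This standalone sketch simulates Blob.getBytes over a byte[] (BUFFER_SIZE and the chunk counter stand in for the surrounding DBDirectory code):

```java
import java.util.Arrays;

public final class ChunkedRead {
    static final int BUFFER_SIZE = 4;  // illustrative; DBDirectory defines its own

    /** Reads {@code data} in chunks the way Blob.getBytes(pos, len) is used
     *  above. The condition pos <= length (Blob positions are 1-based) is
     *  what makes a one-byte "file" produce exactly one iteration. */
    static int countChunks(byte[] data) {
        int chunks = 0;
        long pos = 1;
        while (pos <= data.length) {
            int length = (int) Math.min(BUFFER_SIZE, data.length - pos + 1);
            byte[] buffer = Arrays.copyOfRange(data, (int) pos - 1,
                                               (int) pos - 1 + length);
            chunks++;      // stands in for file.addData(buffer)
            pos += length; // advance by what was actually read, computed once
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(countChunks(new byte[1]) + " "
                         + countChunks(new byte[9])); // 1 3
    }
}
```

Computing the chunk length once and reusing it for the position update also removes the duplicated ternary from the original loop.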

> Database as a lucene index target
> -
>
> Key: LUCENE-487
> URL: http://issues.apache.org/jira/browse/LUCENE-487
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Affects Versions: 1.9
> Environment: MySQL (version 4.1 and up), Oracle (version 8.1.7 and up)
>Reporter: Amir Kibbar
>Priority: Minor
> Attachments: files.zip
>
>
> I've written an extension for the Directory object called DBDirectory, that 
> allows you to read and write a Lucene index to a database instead of a file 
> system.
> This is done using blobs. Each blob represents a "file". Also, each blob has 
> a name which is equivalent to the filename and a prefix, which is equivalent 
> to a directory on a file system. This allows you to create multiple Lucene 
> indexes in a single database schema.
> The solution uses two tables:
> LUCENE_INDEX - which holds the index files as blobs
> LUCENE_LOCK - holds the different locks
> Attached is my proposed solution. This solution is still very basic, but it 
> does the job.
> The solution supports Oracle and mysql
> To use this solution:
> 1. Place the files:
> - DBDirectory in src/java/org/apache/lucene/store
> - TestDBIndex in src/test/org/apache/lucene/index
> - objects-mysql.sql in src/db
> - objects-oracle.sql in src/db
> 2. Edit the parameters for the database connection in TestDBIndex
> 3. Create the database tables using the objects-mysql.sql script (assuming 
> you're using mysql)
> 4. Build Lucene
> 5. Run TestDBIndex with the database driver in the classpath
> I've tested the solution on mysql, but it *should* work on Oracle, I will 
> test that in a few days.
> Amir

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-487) Database as a lucene index target

2006-08-03 Thread JIRA
[ 
http://issues.apache.org/jira/browse/LUCENE-487?page=comments#action_12425483 ] 

Christophe Bégot commented on LUCENE-487:
-

I believe that I detected a bug in this extension: a file containing only one 
character is not indexed correctly. The correction seems to work.

DBDirectory.java 
Line 426

Original:

Blob blob=rs.getBlob("DATA");
byte[] buffer=null;
long pos=1;
int length=0;
while(pos<blob.length()) {

length=BUFFER_SIZE>blob.length()-pos?(int)(blob.length()-pos+1):BUFFER_SIZE;
buffer=blob.getBytes(pos,length);
file.addData(buffer);

pos+=BUFFER_SIZE>blob.length()-pos?(int)(blob.length()-pos+1):BUFFER_SIZE;
}

Correction:

Blob blob=rs.getBlob("DATA");
byte[] buffer=null;
long pos=1;
int length=0;
while(pos<=blob.length()) {

length=BUFFER_SIZE>blob.length()-pos?(int)(blob.length()-pos+1):BUFFER_SIZE;
buffer=blob.getBytes(pos,length);
file.addData(buffer);

pos+=BUFFER_SIZE>blob.length()-pos?(int)(blob.length()-pos+1):BUFFER_SIZE;
}

Christophe


> Database as a lucene index target
> -
>
> Key: LUCENE-487
> URL: http://issues.apache.org/jira/browse/LUCENE-487
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Affects Versions: 1.9
> Environment: MySQL (version 4.1 and up), Oracle (version 8.1.7 and up)
>Reporter: Amir Kibbar
>Priority: Minor
> Attachments: files.zip
>
>
> I've written an extension for the Directory object called DBDirectory, that 
> allows you to read and write a Lucene index to a database instead of a file 
> system.
> This is done using blobs. Each blob represents a "file". Also, each blob has 
> a name which is equivalent to the filename and a prefix, which is equivalent 
> to a directory on a file system. This allows you to create multiple Lucene 
> indexes in a single database schema.
> The solution uses two tables:
> LUCENE_INDEX - which holds the index files as blobs
> LUCENE_LOCK - holds the different locks
> Attached is my proposed solution. This solution is still very basic, but it 
> does the job.
> The solution supports Oracle and mysql
> To use this solution:
> 1. Place the files:
> - DBDirectory in src/java/org/apache/lucene/store
> - TestDBIndex in src/test/org/apache/lucene/index
> - objects-mysql.sql in src/db
> - objects-oracle.sql in src/db
> 2. Edit the parameters for the database connection in TestDBIndex
> 3. Create the database tables using the objects-mysql.sql script (assuming 
> you're using mysql)
> 4. Build Lucene
> 5. Run TestDBIndex with the database driver in the classpath
> I've tested the solution on mysql, but it *should* work on Oracle, I will 
> test that in a few days.
> Amir




[jira] Created: (LUCENE-662) Extendable writer and reader of field data

2006-08-22 Thread JIRA
Extendable writer and reader of field data
--

 Key: LUCENE-662
 URL: http://issues.apache.org/jira/browse/LUCENE-662
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Reporter: Nicolas Lalevée
Priority: Minor
 Attachments: generic-fieldIO.patch

As discussed on the dev mailing list, I have modified Lucene to allow defining 
how the data of a field is written and read in the index.

Basically, I have introduced the notion of an IndexFormat. It is in fact a factory 
of FieldsWriter and FieldsReader, so the IndexReader, the IndexWriter and the 
SegmentMerger use this factory instead of doing a "new 
FieldsReader/Writer()".

I have also introduced the notion of FieldData. It handles all the data of a 
field, as well as its writing and reading in a stream. I have done it this way 
because in the current design of Lucene, Fieldable is an interface, so methods 
with protected or package visibility cannot be defined.

A FieldsWriter just writes data into a stream via the FieldData of the field.
A FieldsReader instantiates a FieldData depending on the field name. Then it 
uses the field data to read the stream, and finally it instantiates a Field with 
the field data.

About compatibility, I think it is kept, as I have written a DefaultIndexFormat 
that provides a DefaultFieldsWriter and a DefaultFieldsReader. These 
implementations do exactly the job that is done today.
To achieve this modification, some classes and methods had to be moved from 
private and/or final to public or protected.

About the lazy fields, I have implemented them in a more general way in the 
implementation of the abstract class FieldData, so it will be totally 
transparent for the Lucene user who extends FieldData. The stream is kept 
in the FieldData and used as soon as stringValue() (or something else) is 
called. Implementing it this way allowed me to handle the recently introduced 
LOAD_FOR_MERGE; it is just a lazy field data, and when read() is called on this 
lazy field data, the saved input stream is directly copied to the output stream.

I have one last issue with this patch. The current design allows reading an index 
in an old format and just doing a writer.addIndexes() into a new format. With the 
new design, you cannot, because the writer will use the FieldData.write 
provided by the reader.

enjoy !
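The factory idea described above can be sketched in a few lines. All names here are illustrative of the pattern (a pluggable format consulted by reader, writer, and merger), not the actual patch API:

```java
// Hypothetical sketch of the IndexFormat idea: components receive a factory
// instead of hard-coding "new FieldsReader()" / "new FieldsWriter()".
interface SketchFieldsReader { String read(String fieldName); }
interface SketchFieldsWriter { void write(String fieldName, String data); }

interface SketchIndexFormat {
    SketchFieldsReader newFieldsReader();
    SketchFieldsWriter newFieldsWriter();
}

// The default format reproduces today's behavior; a user can supply another.
class SketchDefaultIndexFormat implements SketchIndexFormat {
    public SketchFieldsReader newFieldsReader() { return name -> "default:" + name; }
    public SketchFieldsWriter newFieldsWriter() { return (n, d) -> { /* default layout */ }; }
}

public class IndexFormatSketch {
    // A component like SegmentMerger asks the format for its reader.
    static String mergeOneField(SketchIndexFormat format, String field) {
        return format.newFieldsReader().read(field);
    }

    public static void main(String[] args) {
        String v = mergeOneField(new SketchDefaultIndexFormat(), "title");
        if (!v.equals("default:title")) throw new AssertionError();
        System.out.println(v);
    }
}
```

The design choice is plain dependency injection: since the Directory carries the format, every component that opens fields gets the user's reader/writer pair for free.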





[jira] Commented: (LUCENE-655) field queries does not work as expected

2006-09-23 Thread JIRA
[ 
http://issues.apache.org/jira/browse/LUCENE-655?page=comments#action_12437123 ] 

Nicolas Lalevée commented on LUCENE-655:


It may be an issue with the analyzer you are using. Which one do you use when 
storing your documents?
And if you query with the query parser, which analyzer do you use?
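A common cause of this symptom: Field.Keyword values are indexed untokenized (e.g. "PROJECT"), while QueryParser runs the query text through an analyzer that typically lowercases it, so "type:PROJECT" actually searches for "project" and misses. This is a plain-Java illustration of that mismatch, not Lucene code; the remedy in Lucene would be a TermQuery on the raw term (or a non-lowercasing analyzer for those fields):

```java
import java.util.Locale;

public class KeywordMismatch {
    // Keyword fields are indexed verbatim, with no analysis.
    static String indexKeyword(String value) { return value; }

    // A typical query analyzer lowercases terms before matching.
    static String analyzeQueryTerm(String term) { return term.toLowerCase(Locale.ROOT); }

    public static void main(String[] args) {
        String indexed = indexKeyword("PROJECT");
        String queried = analyzeQueryTerm("PROJECT");
        // The mismatch: "project" != "PROJECT", so the parsed query finds nothing.
        if (indexed.equals(queried)) throw new AssertionError("expected a mismatch");
        // Matching on the raw term (as a TermQuery would) succeeds.
        if (!indexed.equals("PROJECT")) throw new AssertionError();
        System.out.println("analyzed=" + queried + " indexed=" + indexed);
    }
}
```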

> field queries does not work as expected
> ---
>
> Key: LUCENE-655
> URL: http://issues.apache.org/jira/browse/LUCENE-655
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 1.9
> Environment: tomcat 5.5.17
> jdk 1.5.x
>Reporter: Sebastian Wiemer
>
> Hi folks, 
> i have some trouble using the field queries.
> I create documents similar to the following code example:
> (The deprecated usage is a result of having used 1.4.3 before; I just switched 
> to 1.9.1 to check whether the error still occurs.)
> Document document = new Document();
> document.add( Field.Keyword( "id", project.getId().toString() ) );
> document.add( Field.Keyword( "type", ComponentType.PROJECT.toString() ) );
> document.add( Field.Text( "name", project.getName() ) );
> document.add( Field.Text( "description", project.getDescription() ) );
> ...
> The indexing process works fine. Searching within 'name' and 'description' 
> returns the correct result.
> I have an xml converted version of the resulting document hit. (the xml is 
> generated using the Document.fields() enumeration)
> 
> <document>
>   <fields>
>     <field name="id" tokenized="false">3</field>
>     <field name="type" tokenized="false">PROJECT</field>
>     <field name="name" tokenized="true">project 1</field>
>     <field name="description" vector="false" tokenized="true">this is my first project.</field>
>   </fields>
> </document>
> The following query is the problematic one:
> id:3
> type:PROJECT
> +id:3 +type:PROJECT
> +(id:3 type:PROJECT)
> none of those return a result.
> I'm not really sure if this is a bug or a misuse of the Lucene API.
> I've tried versions 1.4.3 and 1.9.1 so far.
> Would be nice to hear from you guys,
> greets
> Sebastian




[jira] Commented: (LUCENE-531) RAMDirectory creation from existing FSDirectory throws IOException ("is a directory")

2006-09-23 Thread JIRA
[ 
http://issues.apache.org/jira/browse/LUCENE-531?page=comments#action_12437128 ] 

Nicolas Lalevée commented on LUCENE-531:


I think this is fixed with LUCENE-638

> RAMDirectory creation from existing FSDirectory throws IOException ("is a 
> directory")
> -
>
> Key: LUCENE-531
> URL: http://issues.apache.org/jira/browse/LUCENE-531
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Store
>Affects Versions: 1.9, 2.0.0
> Environment: OS: Fedora 5 
> 2.6.15-1.2054_FC5 #1 Tue Mar 14 15:48:33 EST 2006 i686 athlon i386 GNU/Linux
> java version "1.5.0_06"
> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05)
> Java HotSpot(TM) Client VM (build 1.5.0_06-b05, mixed mode, sharing)
>Reporter: Alexander Gutkin
>Priority: Minor
> Attachments: patch.txt, patch_1.9-branch.txt
>
>
> If you generate an index somewhere on the filesystem in location DIR and 
> later on
> add some other (not index-related) directories to DIR, then loading that 
> index using
> FSDirectory will succeed. However, if you then attempt to load that index into
> RAM using RAMDirectory API, RAMDirectory constructor will throw an exception
> because it assumes that FSDirectory will return a list of files residing in 
> DIR. The
> problem with the trunk is that FSDirectory.list() implementation does not 
> check
> for extraneous entities in the index directory, hence breaking RAMDirectory
> construction.
> I encountered this issue because I started storing some of my tiny indexes 
> under
> version control. Loading these indexes using RAMDirectory fails because of
> the CVS/subversion directories (.svn/.cvs) which are created within the index
> directories.
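The fix direction described in the attached patches is to filter the directory listing down to actual index files before handing it to RAMDirectory. The extension list below is illustrative, not Lucene's real IndexFileNameFilter:

```java
// Sketch: skip non-index entries (e.g. ".svn", "CVS") when copying an
// FSDirectory into a RAMDirectory. Real index extensions are more numerous.
public class IndexNameFilter {
    static boolean isIndexFile(String name) {
        return name.startsWith("segments")
                || name.endsWith(".fnm")   // field infos
                || name.endsWith(".fdt")   // stored fields data
                || name.endsWith(".cfs");  // compound file
    }

    public static void main(String[] args) {
        if (isIndexFile(".svn")) throw new AssertionError();
        if (isIndexFile("CVS")) throw new AssertionError();
        if (!isIndexFile("segments")) throw new AssertionError();
        if (!isIndexFile("_aki.fnm")) throw new AssertionError();
        System.out.println("ok");
    }
}
```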




[jira] Updated: (LUCENE-662) Extendable writer and reader of field data

2006-09-24 Thread JIRA
 [ http://issues.apache.org/jira/browse/LUCENE-662?page=all ]

Nicolas Lalevée updated LUCENE-662:
---

Attachment: generic-fieldIO-2.patch

I think I got it. What was disturbing in the last patch was the notion of 
FieldData I added, so I removed it. Let's summarize the diff between the 
trunk and my patch:

* The concepts :
** an IndexFormat defines which FieldsWriter and FieldsReader to use
** an IndexFormat defines the file extensions used, so the user can add their 
own files
** the format of an index is attached to the Directory
** the whole index format isn't customizable, just part of it. Some 
functions are private or "default", so the Lucene user won't have access to 
them: it's Lucene internal stuff. Others are public or protected: they can be 
redefined.
** Lucene now provides an API to add files which are tables of data, as 
FieldInfos is
** it is up to the FieldsWriter implementation to check whether the field to 
write is of the same format (basically checking with an instanceof).
** the user can add information at the document level, and provide their 
own implementation of Document
** the user can define how data for a field is stored and retrieved, and 
provide their own implementation of Fieldable
** the reading of field data is done in the Fieldable
** the writing of the field is done in the FieldsWriter

* API changes :
** there are new constructors for the Directory: constructors with a specified 
IndexFormat
** new Entry and EntryTable: a generic API for managing a table of data in a file
** FieldInfos now extends EntryTable

* Code changes :
** AbstractField becomes Fieldable (Fieldable is no longer an interface).
** FieldsWriter has been separated into the abstract class FieldsWriter and 
its default implementation DefaultFieldsWriter. Idem for FieldsReader and 
DefaultFieldsReader.
** the lazy loading has been moved from FieldsReader to Fieldable
** IndexOutput can now write directly from an input stream
** if a field was loaded lazily, the DefaultFieldsWriter directly copies the 
source input stream to the output stream
** the IndexFileNameFilter now takes its list of known file extensions from the 
index format
** each time a temporary RAM directory is created, the index format has to be 
passed: see the diff for CompoundFileReader or IndexWriter
** some private and/or final members have been moved to public

* Last worries :
** quite a big one in fact, but I don't know how to handle it: every RMI test 
fails because of:
{noformat}
error unmarshalling return; nested exception is:
[junit] java.io.InvalidClassException: 
org.apache.lucene.document.Field; no valid constructor
[junit] java.rmi.UnmarshalException: error unmarshalling return; nested 
exception is:
[junit] java.io.InvalidClassException: 
org.apache.lucene.document.Field; no valid constructor
[junit] at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:157)
{noformat}
** a function is public and it shouldn't be: see Fieldable.setLazyData()

I have added an example implementation in the patch that uses this feature: 
look at org.apache.lucene.index.rdf

I know this is a big patch, but I think the API has not been broken, and I would 
appreciate comments on it.

> Extendable writer and reader of field data
> --
>
> Key: LUCENE-662
> URL: http://issues.apache.org/jira/browse/LUCENE-662
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Nicolas Lalevée
>Priority: Minor
> Attachments: generic-fieldIO-2.patch, generic-fieldIO.patch
>
>

[jira] Commented: (LUCENE-662) Extendable writer and reader of field data

2006-09-25 Thread JIRA
[ 
http://issues.apache.org/jira/browse/LUCENE-662?page=comments#action_12437644 ] 

Nicolas Lalevée commented on LUCENE-662:


It is due to lazy loading. A lazy field, when its data is being retrieved, has to 
know how to read the stream. In the current trunk, a special implementation of 
Field does this. Here, we don't control which implementation of 
Fieldable it will be. As I wanted to keep the lazy loading mechanism controlled 
internally in Lucene, transparent to the user, I had to force every 
Fieldable implementation to know how to retrieve data lazily. So I 
switched the interface to an abstract class: in fact I have moved 
AbstractField to Fieldable.
But as I already raised, I still have an issue with it: the lazy loading 
mechanism isn't totally internal. The function Fieldable.setLazyData() 
should not be public but default.
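The interface-vs-abstract-class trade-off above can be sketched concretely: a Java interface (before Java 9) cannot declare a package-private method, so an internal hook like setLazyData() would have to be public; an abstract class can keep it package-visible. The names and field layout below are illustrative of the intent, not the patch's exact API:

```java
// Sketch: an abstract Fieldable-like class whose lazy-loading hook stays
// package-private, invisible to user code outside the index package.
abstract class LazyFieldable {
    private byte[] lazyData;  // raw stream bytes saved for later decoding
    private String value;     // decoded on first access

    // Internal hook: package-private; an interface could not declare this.
    void setLazyData(byte[] data) { this.lazyData = data; }

    // Users extend this to define how their field's bytes are decoded.
    protected abstract String decode(byte[] data);

    public final String stringValue() {
        if (value == null && lazyData != null) {
            value = decode(lazyData);  // decoded only when first requested
            lazyData = null;
        }
        return value;
    }
}

public class LazyFieldDemo {
    public static void main(String[] args) {
        LazyFieldable f = new LazyFieldable() {
            protected String decode(byte[] d) { return new String(d); }
        };
        f.setLazyData("hello".getBytes());  // legal here: same package
        if (!f.stringValue().equals("hello")) throw new AssertionError();
        System.out.println(f.stringValue());
    }
}
```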

> Extendable writer and reader of field data
> --
>
> Key: LUCENE-662
> URL: http://issues.apache.org/jira/browse/LUCENE-662
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Nicolas Lalevée
>Priority: Minor
> Attachments: generic-fieldIO-2.patch, generic-fieldIO.patch
>
>




[jira] Commented: (LUCENE-662) Extendable writer and reader of field data

2006-10-20 Thread JIRA
[ 
http://issues.apache.org/jira/browse/LUCENE-662?page=comments#action_12443766 ] 

Nicolas Lalevée commented on LUCENE-662:


I just realized, reading the recent discussion on lucene-dev ("LazyField use of 
IndexInput not thread safe"), that the implementation I have done isn't thread 
safe at all. The input is not cloned at all...
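The hazard is that a lazy field holding one shared stream has its file position mutated by every reader; the usual remedy in Lucene is to clone the input per consumer so each seeks independently. This is a simplified in-memory model of that pattern, not Lucene's IndexInput API:

```java
// Sketch: a stream whose clones share the underlying bytes but keep a
// private position, so concurrent readers cannot corrupt each other's seeks.
class SharedInput {
    private final byte[] data;
    private int pos;  // the mutable state that races when the stream is shared

    SharedInput(byte[] data) { this.data = data; }
    void seek(int p) { pos = p; }
    byte readByte() { return data[pos++]; }

    // Manual clone: shares the byte array, copies the position.
    SharedInput cloneInput() {
        SharedInput c = new SharedInput(data);
        c.pos = pos;
        return c;
    }
}

public class CloneDemo {
    public static void main(String[] args) {
        SharedInput base = new SharedInput(new byte[]{10, 20, 30});
        SharedInput a = base.cloneInput();  // each thread would get its own
        SharedInput b = base.cloneInput();
        a.seek(2);
        b.seek(0);
        if (a.readByte() != 30) throw new AssertionError();
        if (b.readByte() != 10) throw new AssertionError();  // unaffected by a
        System.out.println("ok");
    }
}
```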

> Extendable writer and reader of field data
> --
>
> Key: LUCENE-662
> URL: http://issues.apache.org/jira/browse/LUCENE-662
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Nicolas Lalevée
>Priority: Minor
> Attachments: generic-fieldIO-2.patch, generic-fieldIO.patch
>
>




[jira] Commented: (LUCENE-555) Index Corruption

2006-10-26 Thread JIRA
[ 
http://issues.apache.org/jira/browse/LUCENE-555?page=comments#action_12444827 ] 

Tero Hagström commented on LUCENE-555:
--

We've experienced Lucene index corruption a few times on a production system. 
What makes it tricky is that 1) the ability to search the Lucene index is 
critical in that system, and 2) recreating it takes rather a long time. Thus the 
index corruption renders the system unusable for a long period.

The latest index corruption appears to have resulted from a disk partition 
being full. I would expect Lucene to fail gracefully in that situation 
and not corrupt its index.

Any chance of reopening this issue?




> Index Corruption
> 
>
> Key: LUCENE-555
> URL: http://issues.apache.org/jira/browse/LUCENE-555
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 1.9
> Environment: Linux FC4, Java 1.4.9
>Reporter: dan
>Priority: Critical
>
> Index Corruption
> >>>>>>>>> output
> java.io.FileNotFoundException: ../_aki.fnm (No such file or directory)
> at java.io.RandomAccessFile.open(Native Method)
> at java.io.RandomAccessFile.<init>(RandomAccessFile.java:204)
> at 
> org.apache.lucene.store.FSIndexInput$Descriptor.<init>(FSDirectory.java:425)
> at org.apache.lucene.store.FSIndexInput.<init>(FSDirectory.java:434)
> at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:324)
> at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:56)
> at 
> org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:144)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:129)
> at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:110)
> at 
> org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:674)
> at 
> org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:658)
> at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:517)
> >>>>>>>>> input
> - I open an index, I read, I write, I optimize, and eventually the above 
> happens. The index is unusable.
> - This has happened to me somewhere between 20 and 30 times now - on indexes 
> of different shapes and sizes.
> - I don't know the reason. But, the following requirement applies regardless.
> >>>>>>>>> requirement
> - Like all modern database programs, there has to be a way to repair an 
> index. Period.




[jira] Commented: (LUCENE-555) Index Corruption

2006-10-30 Thread JIRA
[ 
http://issues.apache.org/jira/browse/LUCENE-555?page=comments#action_12445520 ] 

Tero Hagström commented on LUCENE-555:
--

We managed to identify the source of the index corruption. A system 
administrator manually removed a file from the index to free disk space after 
receiving an alert on low free disk space.

So, "appears to have resulted from a disk partition being full", while being 
true in a sort of indirect manner, is by no means a basis for reopening this or 
any other issue. Sorry for causing undue alarm. Mea culpa.

We still have one unidentified Lucene index corruption to account for. That one 
happened roughly at the same time that HW failure testing was done on the SAN 
used for storing the Lucene index: basically, disconnecting optical fibers on 
the fly.

That happened a while ago and I don't have enough details to file a decent bug 
report.

I think we'll settle for the fact that the Lucene index can get corrupted for one 
reason or another (some of which are not in the realm of the Lucene developers), 
and concentrate on having a good backup policy. 







> Index Corruption
> 
>
> Key: LUCENE-555
> URL: http://issues.apache.org/jira/browse/LUCENE-555
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 1.9
> Environment: Linux FC4, Java 1.4.9
>Reporter: dan
>Priority: Critical
>




[jira] Created: (LUCENE-712) Build with GCJ fail

2006-11-15 Thread JIRA
Build with GCJ fail
---

 Key: LUCENE-712
 URL: http://issues.apache.org/jira/browse/LUCENE-712
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Nicolas Lalevée


just needs a little fix in the jar name, plus an issue with an anonymous 
constructor




[jira] Updated: (LUCENE-712) Build with GCJ fail

2006-11-15 Thread JIRA
 [ http://issues.apache.org/jira/browse/LUCENE-712?page=all ]

Nicolas Lalevée updated LUCENE-712:
---

Attachment: patch

Here is how to fix it

> Build with GCJ fail
> ---
>
> Key: LUCENE-712
> URL: http://issues.apache.org/jira/browse/LUCENE-712
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Nicolas Lalevée
> Attachments: patch
>
>
> just needs a little fix in the jar name, plus an issue with an anonymous 
> constructor




[jira] Created: (LUCENE-714) Use a System.arraycopy more than a for

2006-11-17 Thread JIRA
Use a System.arraycopy more than a for
--

 Key: LUCENE-714
 URL: http://issues.apache.org/jira/browse/LUCENE-714
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Nicolas Lalevée
 Attachments: DocumentWriter.patch

In org.apache.lucene.index.DocumentWriter. The patch explains itself. I 
didn't run any performance tests, but I think it is obvious that it will be 
faster.
All tests passed.
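The change proposed here is the classic swap of an element-by-element loop for System.arraycopy, which the JVM implements as an intrinsic bulk copy. A minimal sketch of the two equivalent forms (names are illustrative, not from DocumentWriter):

```java
import java.util.Arrays;

public class ArrayCopyDemo {
    // The form being replaced: one assignment per element.
    static int[] copyWithLoop(int[] src) {
        int[] dst = new int[src.length];
        for (int i = 0; i < src.length; i++) dst[i] = src[i];
        return dst;
    }

    // The proposed form: a single bulk copy.
    static int[] copyWithArraycopy(int[] src) {
        int[] dst = new int[src.length];
        System.arraycopy(src, 0, dst, 0, src.length);
        return dst;
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4};
        // Both forms produce identical results; only the mechanism differs.
        if (!Arrays.equals(copyWithLoop(src), copyWithArraycopy(src)))
            throw new AssertionError();
        System.out.println("ok");
    }
}
```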




[jira] Updated: (LUCENE-714) Use a System.arraycopy more than a for

2006-11-17 Thread JIRA
 [ http://issues.apache.org/jira/browse/LUCENE-714?page=all ]

Nicolas Lalevée updated LUCENE-714:
---

Attachment: DocumentWriter.patch

> Use a System.arraycopy more than a for
> --
>
> Key: LUCENE-714
> URL: http://issues.apache.org/jira/browse/LUCENE-714
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Nicolas Lalevée
> Attachments: DocumentWriter.patch
>
>
> In org.apache.lucene.index.DocumentWriter. The patch explains itself. 
> I didn't run any performance tests, but I think it is obvious that it will be 
> faster.
> All tests passed.




[jira] Commented: (LUCENE-714) Use a System.arraycopy more than a for

2006-11-17 Thread JIRA
[ 
http://issues.apache.org/jira/browse/LUCENE-714?page=comments#action_12450680 ] 

Nicolas Lalevée commented on LUCENE-714:


About the priority of the issue: I didn't want to set it to "Major", I just 
forgot to set it correctly.

> Use a System.arraycopy more than a for
> --
>
> Key: LUCENE-714
> URL: http://issues.apache.org/jira/browse/LUCENE-714
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Nicolas Lalevée
> Attachments: DocumentWriter.patch
>
>
> In org.apache.lucene.index.DocumentWriter. The patch explains itself. 
> I didn't run any performance tests, but I think it is obvious that it will be 
> faster.
> All tests passed.




[jira] Updated: (LUCENE-662) Extendable writer and reader of field data

2006-11-21 Thread JIRA
 [ http://issues.apache.org/jira/browse/LUCENE-662?page=all ]

Nicolas Lalevée updated LUCENE-662:
---

Attachment: generic-fieldIO-3.patch

Here is an update of the patch:
- merged with the last commit in trunk
- I have fixed the issue with stream cloning (just reusing the same way of 
cloning as is done in the current trunk)
- the FieldData is back, so the Fieldable is back too, and the worry I had 
about exposing an internal function as public is gone
- every test passed
- I have moved the bunch of classes that implement the FieldReader/FieldWriter 
in an RDF way into the tests, so there are some tests of this extension 
mechanism

> Extendable writer and reader of field data
> --
>
> Key: LUCENE-662
> URL: http://issues.apache.org/jira/browse/LUCENE-662
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Nicolas Lalevée
>Priority: Minor
> Attachments: generic-fieldIO-2.patch, generic-fieldIO-3.patch, 
> generic-fieldIO.patch
>
>
> As discussed on the dev mailing list, I have modified Lucene to allow 
> defining how the data of a field is written and read in the index.
> Basically, I have introduced the notion of an IndexFormat. It is in fact a 
> factory of FieldsWriter and FieldsReader, so the IndexReader, the IndexWriter 
> and the SegmentMerger use this factory instead of doing a "new 
> FieldsReader/Writer()".
> I have also introduced the notion of FieldData. It handles all the data of a 
> field, as well as the writing and reading of a stream. I have done it this way 
> because in the current design of Lucene, Fieldable is an interface, so methods 
> with protected or package visibility cannot be defined.
> A FieldsWriter just writes data into a stream via the FieldData of the field.
> A FieldsReader instantiates a FieldData depending on the field name. Then it 
> uses the field data to read the stream, and finally it instantiates a Field 
> with the field data.
> About compatibility, I think it is kept, as I have written a 
> DefaultIndexFormat that provides a DefaultFieldsWriter and a 
> DefaultFieldsReader. These implementations do exactly the job that is done 
> today.
> To achieve this modification, some classes and methods had to be moved from 
> private and/or final to public or protected.
> About the lazy fields, I have implemented them in a more general way in the 
> implementation of the abstract class FieldData, so it will be totally 
> transparent for the Lucene user that extends FieldData. The stream is 
> kept in the FieldData and used as soon as stringValue() (or something else) 
> is called. Implementing it this way allowed me to handle the recently 
> introduced LOAD_FOR_MERGE; it is just a lazy field data, and when read() is 
> called on this lazy field data, the saved input stream is directly copied to 
> the output stream.
> I have one last issue with this patch. The current design allows reading an 
> index in an old format and just doing a writer.addIndexes() into a new format. 
> With the new design, you cannot, because the writer will use the 
> FieldData.write provided by the reader.
> enjoy!
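The factory idea described above can be sketched in miniature as follows. The names IndexFormat, FieldsWriter and FieldsReader come from the patch, but the interfaces and the map-backed default implementation below are hypothetical simplifications, not the patch's actual signatures:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch: an IndexFormat is a factory for the reader/writer
// pair, so index code asks the format for them instead of hard-coding
// "new FieldsWriter()" / "new FieldsReader()".
interface FieldsWriter { void write(String field, String value); }
interface FieldsReader { String read(String field); }

interface IndexFormat {
    FieldsWriter newFieldsWriter();
    FieldsReader newFieldsReader();
}

// Default format doing the "exact job that is done today": store and
// retrieve the value unchanged (here simulated with an in-memory map).
class DefaultIndexFormat implements IndexFormat {
    private final Map<String, String> store = new HashMap<>();

    public FieldsWriter newFieldsWriter() {
        return store::put;
    }

    public FieldsReader newFieldsReader() {
        return store::get;
    }
}

class IndexFormatDemo {
    public static void main(String[] args) {
        IndexFormat fmt = new DefaultIndexFormat();
        fmt.newFieldsWriter().write("title", "Lucene in Action");
        System.out.println(fmt.newFieldsReader().read("title"));
    }
}
```

A custom storage scheme would then be a matter of supplying a different IndexFormat implementation to the index code, which is the extension point the patch proposes.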




[jira] Commented: (LUCENE-662) Extendable writer and reader of field data

2006-12-12 Thread JIRA
[ 
http://issues.apache.org/jira/browse/LUCENE-662?page=comments#action_12457757 ] 

Nicolas Lalevée commented on LUCENE-662:


Not at all.

In fact we don't use a Lucene modified with my patch in our system. I only 
started working with Lucene this year, and our search engine is too critical a 
component to play with a patched trunk. So I haven't even tested it under real 
conditions.

> Extendable writer and reader of field data
> --
>
> Key: LUCENE-662
> URL: http://issues.apache.org/jira/browse/LUCENE-662
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Nicolas Lalevée
>Priority: Minor
> Attachments: generic-fieldIO-2.patch, generic-fieldIO-3.patch, 
> generic-fieldIO.patch
>
>




[jira] Updated: (LUCENE-662) Extendable writer and reader of field data

2006-12-22 Thread JIRA
 [ http://issues.apache.org/jira/browse/LUCENE-662?page=all ]

Nicolas Lalevée updated LUCENE-662:
---

Attachment: generic-fieldIO-4.patch

Patch synchronized with the trunk.
I also tried to minimize the diff, and in fact I just realized that there are 
two patches in one here: 
- the real object-oriented storage of field data
- some refactoring of the field-info storage, to reuse the indexed binary 
storage of a table of Strings

I will try to separate them.

> Extendable writer and reader of field data
> --
>
> Key: LUCENE-662
> URL: http://issues.apache.org/jira/browse/LUCENE-662
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Nicolas Lalevée
>Priority: Minor
> Attachments: generic-fieldIO-2.patch, generic-fieldIO-3.patch, 
> generic-fieldIO-4.patch, generic-fieldIO.patch
>
>




[jira] Updated: (LUCENE-662) Extendable writer and reader of field data

2007-01-06 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Lalevée updated LUCENE-662:
---

Attachment: generic-fieldIO-5.patch

> Extendable writer and reader of field data
> --
>
> Key: LUCENE-662
> URL: https://issues.apache.org/jira/browse/LUCENE-662
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Nicolas Lalevée
>Priority: Minor
> Attachments: entrytable.patch, generic-fieldIO-2.patch, 
> generic-fieldIO-3.patch, generic-fieldIO-4.patch, generic-fieldIO-5.patch, 
> generic-fieldIO.patch
>
>




[jira] Updated: (LUCENE-662) Extendable writer and reader of field data

2007-01-06 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Lalevée updated LUCENE-662:
---

Attachment: entrytable.patch

> Extendable writer and reader of field data
> --
>
> Key: LUCENE-662
> URL: https://issues.apache.org/jira/browse/LUCENE-662
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Nicolas Lalevée
>Priority: Minor
> Attachments: entrytable.patch, generic-fieldIO-2.patch, 
> generic-fieldIO-3.patch, generic-fieldIO-4.patch, generic-fieldIO-5.patch, 
> generic-fieldIO.patch
>
>




[jira] Commented: (LUCENE-662) Extendable writer and reader of field data

2007-01-06 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462712
 ] 

Nicolas Lalevée commented on LUCENE-662:


Here it is: I have synchronized with the current trunk, and I have split the 
patch into two parts.

> Extendable writer and reader of field data
> --
>
> Key: LUCENE-662
> URL: https://issues.apache.org/jira/browse/LUCENE-662
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Nicolas Lalevée
>Priority: Minor
> Attachments: entrytable.patch, generic-fieldIO-2.patch, 
> generic-fieldIO-3.patch, generic-fieldIO-4.patch, generic-fieldIO-5.patch, 
> generic-fieldIO.patch
>
>




[jira] Created: (LUCENE-766) Two same new field with and without Term vector make an IllegalStateException

2007-01-08 Thread JIRA
Two same new field with and without Term vector make an IllegalStateException
-

 Key: LUCENE-766
 URL: https://issues.apache.org/jira/browse/LUCENE-766
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.1
Reporter: Nicolas Lalevée


On an empty index, adding a document with two fields with the same name but 
with different term vector options fails. The field with 
TermVector.WITH_POSITIONS_OFFSETS is correctly indexed, as the offsets are 
correctly extracted. The field with TermVector.NO is not. The TermVectorsWriter 
tries to add the offset info according to the fieldinfo data from the "fnm" 
file, but the DocumentWriter didn't prepare offset data, as it gets its info 
from the field itself, not from the fieldinfo.

Here is the stack trace:
java.lang.IllegalStateException: Trying to write offsets that are null!
at 
org.apache.lucene.index.TermVectorsWriter.writeField(TermVectorsWriter.java:311)
at 
org.apache.lucene.index.TermVectorsWriter.closeField(TermVectorsWriter.java:142)
at 
org.apache.lucene.index.TermVectorsWriter.closeDocument(TermVectorsWriter.java:100)
at 
org.apache.lucene.index.TermVectorsWriter.close(TermVectorsWriter.java:240)
at 
org.apache.lucene.index.DocumentWriter.writePostings(DocumentWriter.java:365)
at 
org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:114)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:618)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:601)
at 
org.apache.lucene.index.TestDocumentWriter.testTermVector(TestDocumentWriter.java:147)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at junit.framework.TestCase.runTest(TestCase.java:164)
at junit.framework.TestCase.runBare(TestCase.java:130)
at junit.framework.TestResult$1.protect(TestResult.java:106)
at junit.framework.TestResult.runProtected(TestResult.java:124)
at junit.framework.TestResult.run(TestResult.java:109)
at junit.framework.TestCase.run(TestCase.java:120)
at junit.framework.TestSuite.runTest(TestSuite.java:230)
at junit.framework.TestSuite.run(TestSuite.java:225)
at 
org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:128)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)

Attaching a patch with a test.
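The failure mode described above, a field-level flag taken from the FieldInfos while the per-instance data was never prepared, can be simulated without Lucene. The class and method names below are hypothetical stand-ins, not the real Lucene classes:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified simulation of the bug: the field-level flag says "offsets
// enabled" (because ONE instance of the field requested them), but a
// particular field instance never produced offset data.
class FieldInfoSim {
    // Field-level flags, as recorded in the "fnm" file: set to true once
    // any instance of the field asks for offsets.
    static final Map<String, Boolean> storeOffsets = new HashMap<>();

    static void addFieldInstance(String name, int[] offsets) {
        storeOffsets.merge(name, offsets != null, Boolean::logicalOr);
    }

    // The writer consults the field-level flag, not the instance data --
    // exactly the mismatch that triggers the IllegalStateException.
    static String writeField(String name, int[] offsets) {
        if (storeOffsets.getOrDefault(name, false) && offsets == null) {
            throw new IllegalStateException("Trying to write offsets that are null!");
        }
        return offsets == null ? "no offsets" : "offsets written";
    }

    public static void main(String[] args) {
        addFieldInstance("content", new int[]{0, 6}); // WITH_POSITIONS_OFFSETS analogue
        addFieldInstance("content", null);            // TermVector.NO analogue
        try {
            writeField("content", null);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The fix is to make the writer consistent with what was actually prepared per field instance rather than trusting the merged field-level flag alone.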




[jira] Updated: (LUCENE-766) Two same new field with and without Term vector make an IllegalStateException

2007-01-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Lalevée updated LUCENE-766:
---

  Description: 
On an empty index, adding a document with two fields with the same name but 
with different term vector options fails. The field with 
TermVector.WITH_POSITIONS_OFFSETS is correctly indexed, as the offsets are 
correctly extracted. The field with TermVector.NO is not. The TermVectorsWriter 
tries to add the offset info according to the fieldinfo data from the "fnm" 
file, but the DocumentWriter didn't prepare offset data, as it gets its info 
from the field itself, not from the fieldinfo.

Attaching a patch with a test. Without the fix, the test produces this stack trace:

java.lang.IllegalStateException: Trying to write offsets that are null!
at 
org.apache.lucene.index.TermVectorsWriter.writeField(TermVectorsWriter.java:311)
at 
org.apache.lucene.index.TermVectorsWriter.closeField(TermVectorsWriter.java:142)
at 
org.apache.lucene.index.TermVectorsWriter.closeDocument(TermVectorsWriter.java:100)
at 
org.apache.lucene.index.TermVectorsWriter.close(TermVectorsWriter.java:240)
at 
org.apache.lucene.index.DocumentWriter.writePostings(DocumentWriter.java:365)
at 
org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:114)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:618)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:601)
at 
org.apache.lucene.index.TestDocumentWriter.testTermVector(TestDocumentWriter.java:147)



Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

> Two same new field with and without Term vector make an IllegalStateException
> -
>
> Key: LUCENE-766
>     URL: https://issues.apache.org/jira/browse/LUCENE-766
> Project: Lucene - Java
>   

[jira] Updated: (LUCENE-766) Two same new field with and without Term vector make an IllegalStateException

2007-01-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Lalevée updated LUCENE-766:
---

Attachment: bugfix.patch

> Two same new field with and without Term vector make an IllegalStateException
> -
>
> Key: LUCENE-766
> URL: https://issues.apache.org/jira/browse/LUCENE-766
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.1
>Reporter: Nicolas Lalevée
> Attachments: bugfix.patch
>
>




[jira] Updated: (LUCENE-755) Payloads

2007-01-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Lalevée updated LUCENE-755:
---

Attachment: payload.patch

> Payloads
> 
>
> Key: LUCENE-755
> URL: https://issues.apache.org/jira/browse/LUCENE-755
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Michael Busch
> Assigned To: Michael Busch
> Attachments: payload.patch, payloads.patch
>
>
> This patch adds the possibility to store arbitrary metadata (payloads) 
> together with each position of a term in its posting lists. A while ago this 
> was discussed on the dev mailing list, where I proposed an initial design. 
> This patch has a much improved design with modifications, that make this new 
> feature easier to use and more efficient.
> A payload is an array of bytes that can be stored inline in the ProxFile 
> (.prx). Therefore this patch provides low-level APIs to simply store and 
> retrieve byte arrays in the posting lists in an efficient way. 
> API and Usage
> --   
> The new class index.Payload is basically just a wrapper around a byte[] array 
> together with int variables for offset and length. So a user does not have to 
> create a byte array for every payload, but can rather allocate one array for 
> all payloads of a document and provide offset and length information. This 
> reduces object allocations on the application side.
> In order to store payloads in the posting lists one has to provide a 
> TokenStream or TokenFilter that produces Tokens with payloads. I added the 
> following two methods to the Token class:
>   /** Sets this Token's payload. */
>   public void setPayload(Payload payload);
>   
>   /** Returns this Token's payload. */
>   public Payload getPayload();
> In order to retrieve the data from the index the interface TermPositions now 
> offers two new methods:
>   /** Returns the payload length of the current term position.
>*  This is invalid until {@link #nextPosition()} is called for
>*  the first time.
>* 
>* @return length of the current payload in number of bytes
>*/
>   int getPayloadLength();
>   
>   /** Returns the payload data of the current term position.
> * This is invalid until {@link #nextPosition()} is called for
>* the first time.
>* This method must not be called more than once after each call
>* of [EMAIL PROTECTED] #nextPosition()}. However, payloads are loaded 
> lazily,
>* so if the payload data for the current position is not needed,
>* this method may not be called at all for performance reasons.
>* 
>* @param data the array into which the data of this payload is to be
>* stored, if it is big enough; otherwise, a new byte[] array
>* is allocated for this purpose. 
>* @param offset the offset in the array into which the data of this payload
>*   is to be stored.
>* @return a byte[] array containing the data of this payload
>* @throws IOException
>*/
>   byte[] getPayload(byte[] data, int offset) throws IOException;
> Furthermore, this patch introduces the new method 
> IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was 
> only a writeBytes() method without an offset argument. 
> Implementation details
> --
> - One field bit in FieldInfos is used to indicate if payloads are enabled for 
> a field. The user does not have to enable payloads for a field, this is done 
> automatically:
>* The DocumentWriter enables payloads for a field, if one or more Tokens 
> carry payloads.
>* The SegmentMerger enables payloads for a field during a merge, if 
> payloads are enabled for that field in one or more segments.
> - Backwards compatible: If payloads are not used, then the formats of the 
> ProxFile and FreqFile don't change
> - Payloads are stored inline in the posting list of a term in the ProxFile. A 
> payload of a term occurrence is stored right after its PositionDelta.
> - Same-length compression: If payloads are enabled for a field, then the 
> PositionDelta is shifted one bit. The lowest bit is used to indicate whether 
> the length of the following payload is stored explicitly. If not, i. e. the 
> bit is false, then the payload has the same length as the payload of the 
> previous term occurrence.
> - In order to support skipping on the ProxFile the length of the payload at 
> every skip point has to be known. Therefore the payload length is also stored 
>

[jira] Commented: (LUCENE-755) Payloads

2007-01-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463414
 ] 

Nicolas Lalevée commented on LUCENE-755:


The patch I have just uploaded (payload.patch) is Michael's one (payloads.patch) 
with the customization of how payloads are written and read, exactly as I did 
for LUCENE-662. An IndexFormat is in fact a factory of PayloadWriter and 
PayloadReader, this index format being stored in the Directory instance.

Note that I haven't changed the javadoc nor the comments included in 
Michael's patch; it needs some cleanup if somebody is interested in committing 
it.
And sorry for the name of the patch I have uploaded; it is a little bit 
confusing now, and I can't change its name. I will be more careful next time 
when naming my patch files.

> Payloads
> 
>
> Key: LUCENE-755
> URL: https://issues.apache.org/jira/browse/LUCENE-755
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Michael Busch
> Assigned To: Michael Busch
> Attachments: payload.patch, payloads.patch
>
>
> This patch adds the possibility to store arbitrary metadata (payloads) 
> together with each position of a term in its posting lists. A while ago this 
> was discussed on the dev mailing list, where I proposed an initial design. 
> This patch has a much improved design, with modifications that make this new 
> feature easier to use and more efficient.
> A payload is an array of bytes that can be stored inline in the ProxFile 
> (.prx). Therefore this patch provides low-level APIs to simply store and 
> retrieve byte arrays in the posting lists in an efficient way. 
> API and Usage
> --   
> The new class index.Payload is basically just a wrapper around a byte[] array 
> together with int variables for offset and length. So a user does not have to 
> create a byte array for every payload, but can rather allocate one array for 
> all payloads of a document and provide offset and length information. This 
> reduces object allocations on the application side.
> In order to store payloads in the posting lists one has to provide a 
> TokenStream or TokenFilter that produces Tokens with payloads. I added the 
> following two methods to the Token class:
>   /** Sets this Token's payload. */
>   public void setPayload(Payload payload);
>   
>   /** Returns this Token's payload. */
>   public Payload getPayload();
> In order to retrieve the data from the index the interface TermPositions now 
> offers two new methods:
>   /** Returns the payload length of the current term position.
>*  This is invalid until [EMAIL PROTECTED] #nextPosition()} is called for
>*  the first time.
>* 
>* @return length of the current payload in number of bytes
>*/
>   int getPayloadLength();
>   
>   /** Returns the payload data of the current term position.
>* This is invalid until [EMAIL PROTECTED] #nextPosition()} is called for
>* the first time.
>* This method must not be called more than once after each call
>* of [EMAIL PROTECTED] #nextPosition()}. However, payloads are loaded 
> lazily,
>* so if the payload data for the current position is not needed,
>* this method may not be called at all for performance reasons.
>* 
>* @param data the array into which the data of this payload is to be
>* stored, if it is big enough; otherwise, a new byte[] array
>* is allocated for this purpose. 
>* @param offset the offset in the array into which the data of this payload
>*   is to be stored.
>* @return a byte[] array containing the data of this payload
>* @throws IOException
>*/
>   byte[] getPayload(byte[] data, int offset) throws IOException;
> Furthermore, this patch introduces the new method 
> IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was 
> only a writeBytes() method without an offset argument. 
> Implementation details
> --
> - One field bit in FieldInfos is used to indicate if payloads are enabled for 
> a field. The user does not have to enable payloads for a field, this is done 
> automatically:
>* The DocumentWriter enables payloads for a field, if one or more Tokens 
> carry payloads.
>* The SegmentMerger enables payloads for a field during a merge, if 
> payloads are enabled for that field in one or more segments.
> - Backwards compatible: If payloads are not used, then the formats of the 
> ProxFile and FreqFile don't change
> - P

[jira] Created: (LUCENE-776) Use WeakHashMap instead of Hashtable in FSDirectory

2007-01-13 Thread JIRA
Use WeakHashMap instead of Hashtable in FSDirectory
---

 Key: LUCENE-776
 URL: https://issues.apache.org/jira/browse/LUCENE-776
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Nicolas Lalevée
Priority: Trivial


I was just reading the FSDirectory Java code when I found this:

  /** This cache of directories ensures that there is a unique Directory
   * instance per path, so that synchronization on the Directory can be used to
   * synchronize access between readers and writers.
   *
   * This should be a WeakHashMap, so that entries can be GC'd, but that would
   * require Java 1.2.  Instead we use refcounts...
   */
  private static final Hashtable DIRECTORIES = new Hashtable();

Since Lucene now requires at least Java 1.2 (for ThreadLocal, for instance, 
which itself uses a WeakHashMap), maybe it is time to change?


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-773) Deprecate "create" method in FSDirectory.getDirectory in favor of IndexWriter's "create"

2007-01-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464586
 ] 

Nicolas Lalevée commented on LUCENE-773:


I was working on the IndexFormat mechanism, LUCENE-622 being the first draft of 
it. I have tried to use some Java 1.5 parameterized types to see if it is 
possible to make index readers/writers typed by the index format, and I am 
facing one issue: the directory has to know the index format because of the 
IndexNameFilter and the create feature. And I don't think that is a good idea, 
because of how they are instantiated.

I have not finished the design of this Java-1.5 way of typing, as I have other 
issues to look at, but I vote +1 for removing any index structure specificity 
from the store package.

> Deprecate "create" method in FSDirectory.getDirectory in favor of 
> IndexWriter's "create"
> 
>
> Key: LUCENE-773
> URL: https://issues.apache.org/jira/browse/LUCENE-773
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 1.9, 2.0.0, 2.0.1, 2.1
>Reporter: Michael McCandless
> Assigned To: Michael McCandless
>Priority: Minor
>
> It's confusing that there is a create=true|false at the FSDirectory
> level and then also another create=true|false at the IndexWriter
> level.  Which one should you use when creating an index?
> Our users have been confused by this in the past:
>   http://www.gossamer-threads.com/lists/lucene/java-user/4792
> I think in general we should try to have one obvious way to achieve
> something (like Python: http://en.wikipedia.org/wiki/Python_philosophy).
> And the fact that there are now two code paths that are supposed to do
> the same (similar?) thing, can more easily lead to sneaky bugs.  One
> case of LUCENE-140 (already fixed in trunk but not past releases),
> which inspired this issue, can happen if you send create=false to the
> FSDirectory and create=true to the IndexWriter.
> Finally, as of lockless commits, it is now possible to open an
> existing index for "create" while readers are still using the old
> "point in time" index, on Windows.  (At least one user had tried this
> previously and failed).  To do this, we use the IndexFileDeleter class
> (which retries on failure) and we also look at the segments file to
> determine the next segments_N file to write to.
> With future issues like LUCENE-710 even more "smarts" may be required
> to know what it takes to "create" a new index into an existing
> directory.  Given that we have have quite a few Directory
> implemenations, I think these "smarts" logically should live in
> IndexWriter (not replicated in each Directory implementation), and we
> should leave the Directory as an interface that knows how to make
> changes to some backing store but does not itself try to make any
> changes.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-773) Deprecate "create" method in FSDirectory.getDirectory in favor of IndexWriter's "create"

2007-01-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464588
 ] 

Nicolas Lalevée commented on LUCENE-773:


Forget what I said about "removing any index structure specificity in the 
store package". Actually, the directory is the only central instance that can 
hold an IndexFormat instance.

Anyway, I am still +1 for not duplicating code! :)

> Deprecate "create" method in FSDirectory.getDirectory in favor of 
> IndexWriter's "create"
> 
>
> Key: LUCENE-773
> URL: https://issues.apache.org/jira/browse/LUCENE-773
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 1.9, 2.0.0, 2.0.1, 2.1
>Reporter: Michael McCandless
> Assigned To: Michael McCandless
>Priority: Minor
>
> It's confusing that there is a create=true|false at the FSDirectory
> level and then also another create=true|false at the IndexWriter
> level.  Which one should you use when creating an index?
> Our users have been confused by this in the past:
>   http://www.gossamer-threads.com/lists/lucene/java-user/4792
> I think in general we should try to have one obvious way to achieve
> something (like Python: http://en.wikipedia.org/wiki/Python_philosophy).
> And the fact that there are now two code paths that are supposed to do
> the same (similar?) thing, can more easily lead to sneaky bugs.  One
> case of LUCENE-140 (already fixed in trunk but not past releases),
> which inspired this issue, can happen if you send create=false to the
> FSDirectory and create=true to the IndexWriter.
> Finally, as of lockless commits, it is now possible to open an
> existing index for "create" while readers are still using the old
> "point in time" index, on Windows.  (At least one user had tried this
> previously and failed).  To do this, we use the IndexFileDeleter class
> (which retries on failure) and we also look at the segments file to
> determine the next segments_N file to write to.
> With future issues like LUCENE-710 even more "smarts" may be required
> to know what it takes to "create" a new index into an existing
> directory.  Given that we have have quite a few Directory
> implemenations, I think these "smarts" logically should live in
> IndexWriter (not replicated in each Directory implementation), and we
> should leave the Directory as an interface that knows how to make
> changes to some backing store but does not itself try to make any
> changes.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-773) Deprecate "create" method in FSDirectory.getDirectory in favor of IndexWriter's "create"

2007-01-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464608
 ] 

Nicolas Lalevée commented on LUCENE-773:


Oh yeah, it's LUCENE-662 of course; my eyes must have squinted. :) (And it 
can't be LUCENE-622, as I am definitely not a Maven user ^^)

> Deprecate "create" method in FSDirectory.getDirectory in favor of 
> IndexWriter's "create"
> 
>
> Key: LUCENE-773
> URL: https://issues.apache.org/jira/browse/LUCENE-773
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 1.9, 2.0.0, 2.0.1, 2.1
>Reporter: Michael McCandless
> Assigned To: Michael McCandless
>Priority: Minor
>
> It's confusing that there is a create=true|false at the FSDirectory
> level and then also another create=true|false at the IndexWriter
> level.  Which one should you use when creating an index?
> Our users have been confused by this in the past:
>   http://www.gossamer-threads.com/lists/lucene/java-user/4792
> I think in general we should try to have one obvious way to achieve
> something (like Python: http://en.wikipedia.org/wiki/Python_philosophy).
> And the fact that there are now two code paths that are supposed to do
> the same (similar?) thing, can more easily lead to sneaky bugs.  One
> case of LUCENE-140 (already fixed in trunk but not past releases),
> which inspired this issue, can happen if you send create=false to the
> FSDirectory and create=true to the IndexWriter.
> Finally, as of lockless commits, it is now possible to open an
> existing index for "create" while readers are still using the old
> "point in time" index, on Windows.  (At least one user had tried this
> previously and failed).  To do this, we use the IndexFileDeleter class
> (which retries on failure) and we also look at the segments file to
> determine the next segments_N file to write to.
> With future issues like LUCENE-710 even more "smarts" may be required
> to know what it takes to "create" a new index into an existing
> directory.  Given that we have have quite a few Directory
> implemenations, I think these "smarts" logically should live in
> IndexWriter (not replicated in each Directory implementation), and we
> should leave the Directory as an interface that knows how to make
> changes to some backing store but does not itself try to make any
> changes.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-776) Use WeakHashMap instead of Hashtable in FSDirectory

2007-01-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464610
 ] 

Nicolas Lalevée commented on LUCENE-776:


I think you've described the problem completely, Michael. When submitting this 
issue, I thought that the weak object in a WeakHashMap was the value of the 
map; so it appears it is not suited for that. Your last thought is 
accurate, because I think that most of the time, Lucene-based applications 
open their directories at the same place.
My train of thought: we might have an issue if the table holds some references 
that have not yet been GCed. A directory is closed, "manually" cleaned up, and 
reopened with a different lock factory: this will fail with the IOException 
because of the still-cached directory, which conflicts because of its different 
lock factory. So the current design might be the best one in fact.
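The point about which side of the map is weak can be seen in a short, self-contained sketch (illustrative only, not Lucene code): WeakHashMap weakly references its keys, not its values, so a path-keyed cache of Directory instances would keep the Directory values strongly reachable anyway.

```java
import java.util.Map;
import java.util.WeakHashMap;

public class WeakKeyDemo {

    /** Deterministic part: while a strong reference to the key exists,
     *  the entry cannot be collected. */
    static boolean entryPresent(Map<Object, String> cache, Object key) {
        return cache.containsKey(key);
    }

    public static void main(String[] args) {
        Map<Object, String> cache = new WeakHashMap<>();
        Object key = new Object();
        cache.put(key, "a Directory-like value");
        System.out.println(entryPresent(cache, key)); // true: key strongly held

        key = null;  // drop the only strong reference to the KEY
        System.gc(); // the entry is now eligible for removal (VM-dependent timing)
        // cache.size() will typically drop to 0 here, because WeakHashMap
        // weakly references its KEYS. A Directory cache would need weakly
        // referenced VALUES, which is why refcounting is used instead.
    }
}
```

Note the collection after `System.gc()` is not guaranteed at any particular moment; only the strongly-held case is deterministic.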

> Use WeakHashMap instead of Hashtable in FSDirectory
> ---
>
> Key: LUCENE-776
> URL: https://issues.apache.org/jira/browse/LUCENE-776
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Nicolas Lalevée
> Assigned To: Michael McCandless
>Priority: Trivial
>
> I was just reading the FSDirectory Java code when I found this:
>   /** This cache of directories ensures that there is a unique Directory
>* instance per path, so that synchronization on the Directory can be used 
> to
>* synchronize access between readers and writers.
>*
>* This should be a WeakHashMap, so that entries can be GC'd, but that would
>* require Java 1.2.  Instead we use refcounts...
>*/
>   private static final Hashtable DIRECTORIES = new Hashtable();
> Since Lucene now requires at least Java 1.2 (for ThreadLocal, for instance, 
> which itself uses a WeakHashMap), maybe it is time to change?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-778) Allow overriding a Document

2007-01-17 Thread JIRA
Allow overriding a Document
---

 Key: LUCENE-778
 URL: https://issues.apache.org/jira/browse/LUCENE-778
 Project: Lucene - Java
  Issue Type: New Feature
Affects Versions: 2.0.0
Reporter: Nicolas Lalevée
Priority: Trivial


In our application, we have a kind of generic API that handles how we use 
Lucene. The various other applications use this API with different semantics, 
and use the Lucene fields quite differently. We wrote some useful functions to 
do this mapping. Today, as the Document class cannot be overridden, we are 
obliged to write a document wrapper per application, i.e. some MyAppDocument 
and MyOtherAppDocument which have a property holding a real Lucene Document. 
Then, when MyApp or MyOtherApp want to use our generic Lucene API, we have to 
"get out" the Lucene document, i.e. do some 
genericLuceneAPI.writeDoc(myAppDoc.getLuceneDocument()). This works fine, but 
it becomes quite tricky to use the other function of our generic API, which is 
genericLuceneAPI.writeDocs(Collection docs).

I don't know the rationale behind making Document final, but removing the 
final modifier would allow more object-oriented code.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-778) Allow overriding a Document

2007-01-17 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465408
 ] 

Nicolas Lalevée commented on LUCENE-778:


Just after creating the jira issue, I figured out that I hadn't searched for 
topics about it. Sorry.
BTW, thanks for the pointers.

But my request here is not about making Lucene provide customized documents, 
as in LUCENE-662; it is about allowing writer.addDocument() to accept a 
document that extends Document, nothing more.

For instance, I would like to do something like this:

public class RDFDocument extends Document {
  public RDFDocument(String uri) {
    add(new Field("uri", uri));
  }

  public void addStatement(String prop, String value) {
    add(new Field(prop, value));
  }
}

Should we move this discussion to lucene-dev?


> Allow overriding a Document
> ---
>
> Key: LUCENE-778
> URL: https://issues.apache.org/jira/browse/LUCENE-778
> Project: Lucene - Java
>  Issue Type: New Feature
>Affects Versions: 2.0.0
>Reporter: Nicolas Lalevée
>Priority: Trivial
>
> In our application, we have a kind of generic API that handles how we use 
> Lucene. The various other applications use this API with different semantics, 
> and use the Lucene fields quite differently. We wrote some useful functions 
> to do this mapping. Today, as the Document class cannot be overridden, we are 
> obliged to write a document wrapper per application, i.e. some MyAppDocument 
> and MyOtherAppDocument which have a property holding a real Lucene Document. 
> Then, when MyApp or MyOtherApp want to use our generic Lucene API, we have to 
> "get out" the Lucene document, i.e. do some 
> genericLuceneAPI.writeDoc(myAppDoc.getLuceneDocument()). This works fine, 
> but it becomes quite tricky to use the other function of our generic API, 
> which is genericLuceneAPI.writeDocs(Collection docs).
> I don't know the rationale behind making Document final, but removing the 
> final modifier would allow more object-oriented code.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-816) Manage dependencies in the build with ivy

2007-02-26 Thread JIRA
Manage dependencies in the build with ivy
-

 Key: LUCENE-816
 URL: https://issues.apache.org/jira/browse/LUCENE-816
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Analysis
Affects Versions: 2.1
Reporter: Nicolas Lalevée
 Attachments: common-build.tar.gz

There were issues about making the 2.1 release : 
http://www.nabble.com/-VOTE--release-Lucene-2.1-tf3228536.html#a8994721
Then the discussion started to talk about maven, and also about ivy.
I propose here a draft, a proof of concept of an ant + ivy build. I made this 
build parallel to the actual one, so people can evaluate it.
Note that I have only ivy-ified the core, the demo and the contrib/benchmark. 
The other contrib projects can be ivy-ified quite easily.

The build system is in the common-build directory. In this directory we have:
* common-build.xml : the main common build, which handles dependencies with ivy
* common-build-project.xml : builds a java project: core, demo, or a contrib one
* common-build-webapp.xml : extends common-build-project and has some tasks 
about building a war
* common-build-modules.xml : allows building several projects, just using some 
subant tasks
* common-build-gcj.xml : builds with gcj. It worked once; it needs to be fixed
* ivyconf.xml, ivyconf.properties : ivy configuration
* build.xml : a little task to generate the ivyconf.xml to use with the eclipse 
ivy plugin
* eclipse directory : contains some XSL/XML to generate .classpath and .project

To test it and see how ivy is cool:
cd contrib/benchmark
ant -f build-ivy.xml buildeep

and look at the new local-libs directory at the root of the lucene directory !


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-816) Manage dependencies in the build with ivy

2007-02-26 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Lalevée updated LUCENE-816:
---

Attachment: common-build.tar.gz

> Manage dependencies in the build with ivy
> -
>
> Key: LUCENE-816
> URL: https://issues.apache.org/jira/browse/LUCENE-816
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Analysis
>Affects Versions: 2.1
>Reporter: Nicolas Lalevée
> Attachments: common-build.tar.gz
>
>
> There were issues about making the 2.1 release : 
> http://www.nabble.com/-VOTE--release-Lucene-2.1-tf3228536.html#a8994721
> Then the discussion started to talk about maven, and also about ivy.
> I propose here a draft, a proof of concept of an ant + ivy build. I made this 
> build parallel to the actual one, so people can evaluate it.
> Note that I have only ivy-ified the core, the demo and the contrib/benchmark. 
> The other contrib projects can be ivy-ified quite easily.
> The build system is in the common-build directory. In this directory we have:
> * common-build.xml : the main common build, which handles dependencies with ivy
> * common-build-project.xml : builds a java project: core, demo, or a contrib 
> one
> * common-build-webapp.xml : extends common-build-project and has some tasks 
> about building a war
> * common-build-modules.xml : allows building several projects, just using 
> some subant tasks
> * common-build-gcj.xml : builds with gcj. It worked once; it needs to be fixed
> * ivyconf.xml, ivyconf.properties : ivy configuration
> * build.xml : a little task to generate the ivyconf.xml to use with the 
> eclipse ivy plugin
> * eclipse directory : contains some XSL/XML to generate .classpath and 
> .project
> To test it and see how ivy is cool:
> cd contrib/benchmark
> ant -f build-ivy.xml buildeep
> and look at the new local-libs directory at the root of the lucene directory !

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-816) Manage dependencies in the build with ivy

2007-02-26 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Lalevée updated LUCENE-816:
---

Attachment: ivy-build.patch

> Manage dependencies in the build with ivy
> -
>
> Key: LUCENE-816
> URL: https://issues.apache.org/jira/browse/LUCENE-816
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Analysis
>Affects Versions: 2.1
>Reporter: Nicolas Lalevée
> Attachments: common-build.tar.gz, ivy-build.patch
>
>
> There were issues about making the 2.1 release : 
> http://www.nabble.com/-VOTE--release-Lucene-2.1-tf3228536.html#a8994721
> Then the discussion started to talk about maven, and also about ivy.
> I propose here a draft, a proof of concept of an ant + ivy build. I made this 
> build parallel to the actual one, so people can evaluate it.
> Note that I have only ivy-ified the core, the demo and the contrib/benchmark. 
> The other contrib projects can be ivy-ified quite easily.
> The build system is in the common-build directory. In this directory we have:
> * common-build.xml : the main common build, which handles dependencies with ivy
> * common-build-project.xml : builds a java project: core, demo, or a contrib 
> one
> * common-build-webapp.xml : extends common-build-project and has some tasks 
> about building a war
> * common-build-modules.xml : allows building several projects, just using 
> some subant tasks
> * common-build-gcj.xml : builds with gcj. It worked once; it needs to be fixed
> * ivyconf.xml, ivyconf.properties : ivy configuration
> * build.xml : a little task to generate the ivyconf.xml to use with the 
> eclipse ivy plugin
> * eclipse directory : contains some XSL/XML to generate .classpath and 
> .project
> To test it and see how ivy is cool:
> cd contrib/benchmark
> ant -f build-ivy.xml buildeep
> and look at the new local-libs directory at the root of the lucene directory !

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-817) Manage dependencies in the build with ivy

2007-02-26 Thread JIRA
Manage dependencies in the build with ivy
-

 Key: LUCENE-817
 URL: https://issues.apache.org/jira/browse/LUCENE-817
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Build
Affects Versions: 2.1
Reporter: Nicolas Lalevée


There were issues about making the 2.1 release : 
http://www.nabble.com/-VOTE--release-Lucene-2.1-tf3228536.html#a8994721
Then the discussion started to talk about maven, and also about ivy.
I propose here a draft, a proof of concept of an ant + ivy build. I made this 
build parallel to the actual one, so people can evaluate it.
Note that I have only ivy-ified the core, the demo and the contrib/benchmark. 
The other contrib projects can be ivy-ified quite easily.

The build system is in the common-build directory. In this directory we have:
* common-build.xml : the main common build, which handles dependencies with ivy
* common-build-project.xml : builds a java project: core, demo, or a contrib one
* common-build-webapp.xml : extends common-build-project and has some tasks 
about building a war
* common-build-modules.xml : allows building several projects, just using some 
subant tasks
* common-build-gcj.xml : builds with gcj. It worked once; it needs to be fixed
* ivyconf.xml, ivyconf.properties : ivy configuration
* build.xml : a little task to generate the ivyconf.xml to use with the eclipse 
ivy plugin
* eclipse directory : contains some XSL/XML to generate .classpath and .project

To test it and see how ivy is cool :) :
cd contrib/benchmark
ant -f build-ivy.xml buildeep

and look at the new local-libs directory at the root of the lucene directory !


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Resolved: (LUCENE-817) Manage dependencies in the build with ivy

2007-02-26 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Lalevée resolved LUCENE-817.


   Resolution: Duplicate
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

JIRA was not responding, I retried, and then got a duplicate issue...

> Manage dependencies in the build with ivy
> -
>
> Key: LUCENE-817
> URL: https://issues.apache.org/jira/browse/LUCENE-817
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 2.1
>Reporter: Nicolas Lalevée
>
> There were issues with making the 2.1 release: 
> http://www.nabble.com/-VOTE--release-Lucene-2.1-tf3228536.html#a8994721
> The discussion then turned to Maven, and also to Ivy.
> I propose here a draft, a proof of concept of an Ant + Ivy build. I made this 
> build parallel to the current one, so people can evaluate it.
> Note that I have only ivy-ified the core, the demo and contrib/benchmark. 
> The other contrib projects can be ivy-ified quite easily.
> The build system is in the common-build directory. In this directory we have:
> * common-build.xml : the main common build, which handles dependencies with Ivy
> * common-build-project.xml : builds a Java project (core, demo, or a contrib 
> one)
> * common-build-webapp.xml : extends common-build-project and adds some tasks 
> for building a war
> * common-build-modules.xml : allows building several projects, just using a 
> subant task
> * common-build-gcj.xml : builds with gcj. It worked once; it needs to be fixed
> * ivyconf.xml, ivyconf.properties : Ivy configuration
> * build.xml : a little task to generate the ivyconf.xml for use with the 
> Eclipse Ivy plugin
> * eclipse directory : contains some XSL/XML to generate .classpath and 
> .project
> To test it and see how cool Ivy is :) :
> cd contrib/benchmark
> ant -f build-ivy.xml buildeep
> and look at the new local-libs directory at the root of the lucene directory!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-816) Manage dependencies in the build with ivy

2007-02-26 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Lalevée updated LUCENE-816:
---

Attachment: external-libs.tar.gz

> Manage dependencies in the build with ivy
> -
>
> Key: LUCENE-816
> URL: https://issues.apache.org/jira/browse/LUCENE-816
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Analysis
>Affects Versions: 2.1
>Reporter: Nicolas Lalevée
> Attachments: common-build.tar.gz, external-libs.tar.gz, 
> ivy-build.patch
>
>
> There were issues with making the 2.1 release: 
> http://www.nabble.com/-VOTE--release-Lucene-2.1-tf3228536.html#a8994721
> The discussion then turned to Maven, and also to Ivy.
> I propose here a draft, a proof of concept of an Ant + Ivy build. I made this 
> build parallel to the current one, so people can evaluate it.
> Note that I have only ivy-ified the core, the demo and contrib/benchmark. 
> The other contrib projects can be ivy-ified quite easily.
> The build system is in the common-build directory. In this directory we have:
> * common-build.xml : the main common build, which handles dependencies with Ivy
> * common-build-project.xml : builds a Java project (core, demo, or a contrib 
> one)
> * common-build-webapp.xml : extends common-build-project and adds some tasks 
> for building a war
> * common-build-modules.xml : allows building several projects, just using a 
> subant task
> * common-build-gcj.xml : builds with gcj. It worked once; it needs to be fixed
> * ivyconf.xml, ivyconf.properties : Ivy configuration
> * build.xml : a little task to generate the ivyconf.xml for use with the 
> Eclipse Ivy plugin
> * eclipse directory : contains some XSL/XML to generate .classpath and 
> .project
> To test it and see how cool Ivy is :
> cd contrib/benchmark
> ant -f build-ivy.xml buildeep
> and look at the new local-libs directory at the root of the lucene directory!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-662) Extendable writer and reader of field data

2007-03-03 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Lalevée updated LUCENE-662:
---

Attachment: indexFormat.patch

Patch update: synchronized with the trunk, plus new features.

* The index format now has an ID which is serialized into a new file in the 
directory. This new file is managed by the SegmentInfos class. It has been 
put in a new file to keep me from breaking things, but it could be pushed 
into the segments file. This new feature helps avoid opening an index with 
the wrong code. As with the index version, if the index format is not 
compatible, opening fails. It also fails when trying to use 
IndexWriter#addIndexes(). These compatibility issues are managed by the 
implementations of the index format: an implementation has to implement the 
function canRead(String indexFmtID). But I think something is still missing 
in this design. Saying that one format is compatible with another is OK, but 
I still have to figure out whether it is really possible to make a reader 
which handles two different formats.
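As a rough illustration, the canRead(String indexFmtID) contract described 
above might behave like this self-contained sketch (the IndexFormat interface 
and the format IDs are assumptions modeled on this comment, not the actual 
patch code):

```java
// Hypothetical sketch of the format-compatibility check described above.
// The interface name, method, and IDs mirror this comment, not the real patch.
interface IndexFormat {
  String getID();                      // serialized into the new file in the directory
  boolean canRead(String indexFmtID);  // true if this code can open that format
}

class DefaultIndexFormat implements IndexFormat {
  public String getID() { return "default-2"; }
  public boolean canRead(String indexFmtID) {
    // a newer format may declare itself backward compatible with an older ID
    return "default-2".equals(indexFmtID) || "default-1".equals(indexFmtID);
  }
}

public class FormatCheckSketch {
  // stand-in for the check done when opening an index or calling addIndexes()
  static void open(IndexFormat runtime, String onDiskID) {
    if (!runtime.canRead(onDiskID)) {
      // like the index-version check: refuse to open an incompatible index
      throw new IllegalStateException("incompatible index format: " + onDiskID);
    }
  }

  public static void main(String[] args) {
    IndexFormat fmt = new DefaultIndexFormat();
    open(fmt, "default-1"); // ok: declared compatible
    boolean refused = false;
    try { open(fmt, "exotic-9"); } catch (IllegalStateException e) { refused = true; }
    System.out.println(refused); // prints "true"
  }
}
```

Whether one reader can really serve two formats (the open question above) 
then becomes a question of how much the canRead implementation is willing to 
promise.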

* When synchronizing with the trunk, I had trouble with the new 
FieldSelectorResult SIZE. This new feature expects the FieldsReader to know 
the size of the content of the field. With the generic FieldsReader, the 
data is only a sequence of bytes, so it cannot compute the size of the 
decoded data. I did a dumb implementation: it returns the size of the data 
in bytes. I know this is wrong, and the associated tests fail (I left them 
failing in the patch). It has to be fixed, and this may require some changes 
in the API I have designed.

* There was a discussion on java-dev about changing the order of the 
postings. Today, in the .frq file, the postings are ordered by document 
number. The proposal was to order them by frequency. So I worked a little on 
the mechanism I had built to generify field storing, and applied it to 
posting storing. This part of the patch is not well documented (nearly not 
at all) and is a draft. But it works (at least with the current 
implementation); the mechanism allows implementing a custom PostingReader 
and PostingWriter:

public interface PostingWriter {
  public void close() throws IOException;
  public long[] getPointers();
  public int getNbPointer();
  public long writeSkip(RAMOutputStream skipBuffer) throws IOException;
  public void write(int doc, int lastDoc, int nbPos, int[] positions) throws IOException;
}

public interface PostingReader {
  public void close() throws IOException;
  public TermDocs termDocs(BitVector deletedDocs, TermInfosReader tis, FieldInfos fieldInfos) throws IOException;
  public TermPositions termPositions(BitVector deletedDocs, TermInfosReader tis, FieldInfos fieldInfos) throws IOException;
}

Furthermore, this "generification" also allows implementations along the 
lines of the flexible indexing that has been discussed many times: 
http://wiki.apache.org/jakarta-lucene/FlexibleIndexing
Note that it does not break the current format. The .tis file is still 
managed internally by Lucene, and it holds pointers to some external files 
(managed by the IndexFormat). The implementation of the 
PostingReader/PostingWriter specifies how many pointers there are. The 
default is 2: .frq and .prx. The FlexibleIndexing one would be 1.
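The "how many pointers" idea can be illustrated with a stripped-down, 
self-contained stand-in (the class names and offsets below are made up; only 
the two-pointer vs. one-pointer distinction mirrors the .frq/.prx default 
described above):

```java
// Toy illustration of "the implementation specifies how many pointers there are".
// Two pointers mimic the default .frq/.prx layout; one mimics a flexible-indexing
// implementation that keeps everything in a single file. All values are fake.
interface PointerProvider {
  long[] getPointers(); // file offsets that would be recorded in the .tis entry
  default int getNbPointer() { return getPointers().length; }
}

class DefaultPostings implements PointerProvider {
  public long[] getPointers() { return new long[] { 128L, 256L }; } // .frq, .prx offsets
}

class SingleFilePostings implements PointerProvider {
  public long[] getPointers() { return new long[] { 512L }; } // one combined file
}

public class PointerSketch {
  public static void main(String[] args) {
    System.out.println(new DefaultPostings().getNbPointer());    // prints "2"
    System.out.println(new SingleFilePostings().getNbPointer()); // prints "1"
  }
}
```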

* To show that the default implementation of the index format can be 
changed, I have created a new package org.apache.lucene.index.impl which 
holds the current index format:
- DefaultFieldData : the data part of Field
- DefaultFieldsReader : the non-generified part of the FieldsReader
- DefaultFieldsWriter : the non-generified part of the FieldsWriter
- DefaultIndexFormat : the factory of readers and writers
- DefaultPostingReader : just instantiates SegmentTermDocs and 
SegmentTermPositions
- DefaultPostingWriter : the posting-writing part of DocumentWriter
- SegmentTermDocs : just moved
- SegmentTermPositions : just moved

* Where I want to go next: I am mainly interested in the generic field 
storage, so I will continue to maintain it; I will try to fix the SIZE issue 
and will work on allowing readers to be compatible with each other. I am 
also interested in some generic index storage for faceted search. But I 
figured out that the indexed data would have to be stored at the document 
level, and this cannot be done with postings. So I don't think I will go 
further in playing with postings. I would rather look at LUCENE-584.


> Extendable writer and reader of field data
> --
>
> Key: LUCENE-662
> URL: https://issues.apache.org/jira/browse/LUCENE-662
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Nicolas Lalevée
>Priority: Minor
> Attachments: entrytable.patch, generic-fieldIO-2.patch, 
> generic-fieldIO-3.patch, generic-f

[jira] Commented: (LUCENE-626) Extended spell checker with phrase support and adaptive user session analysis.

2007-03-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477685
 ] 

Nicolas Lalevée commented on LUCENE-626:


This feature looks interesting, but why should it depend on LUCENE-550?

> Extended spell checker with phrase support and adaptive user session analysis.
> --
>
> Key: LUCENE-626
> URL: https://issues.apache.org/jira/browse/LUCENE-626
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Karl Wettin
> Assigned To: Karl Wettin
>Priority: Minor
> Attachments: didyoumean.patch.bz2, spellchecker.diff
>
>
> Extensive java docs available in patch, but I try to keep it compiled here: 
> http://ginandtonique.org/~kalle/javadocs/didyoumean/org/apache/lucene/search/didyoumean/package-summary.html#package_description
> The patch spellcheck.diff should not depend on anything but Lucene trunk. It 
> has basic support for phrase suggestions  and query goal detection, but is 
> pretty buggy and lacks features available in didyoumean.diff.bz2. The latter 
> depends on LUCENE-550.
> Example:
> {code:java}
> public void testImportData() throws Exception {
> // load 200 000 user queries with session data and time stamps. no goals 
> // specified.
> System.out.println("Processing http://ginandtonique.org/~kalle/data/pirate.data.gz");
> importFile(new InputStreamReader(new GZIPInputStream(new 
> URL("http://ginandtonique.org/~kalle/data/pirate.data.gz").openStream())));
> System.out.println("Processing http://ginandtonique.org/~kalle/data/hero.data.gz");
> importFile(new InputStreamReader(new GZIPInputStream(new 
> URL("http://ginandtonique.org/~kalle/data/hero.data.gz").openStream())));
> System.out.println("Done.");
> // run some tests without the second level suggestions,
> // i.e. user behavioral data only. no ngrams or so.
> 
> assertEquals("pirates of the caribbean", facade.didYouMean("pirates ofthe 
> caribbean"));
> assertEquals("pirates of the caribbean", facade.didYouMean("pirates of 
> the carribbean"));
> assertEquals("pirates caribbean", facade.didYouMean("pirates carricean"));
> assertEquals("pirates of the caribbean", facade.didYouMean("pirates of 
> the carriben"));
> assertEquals("pirates of the caribbean", facade.didYouMean("pirates of 
> the carabien"));
> assertEquals("pirates of the caribbean", facade.didYouMean("pirates of 
> the carabbean"));
> assertEquals("pirates of the caribbean", facade.didYouMean("pirates og 
> carribean"));
> assertEquals("pirates of the caribbean soundtrack", 
> facade.didYouMean("pirates of the caribbean music"));
> assertEquals("pirates of the caribbean score", facade.didYouMean("pirates 
> of the caribbean soundtrack"));
> assertEquals("pirate of caribbean", facade.didYouMean("pirate of 
> carabian"));
> assertEquals("pirates of caribbean", facade.didYouMean("pirate of 
> caribbean"));
> assertEquals("pirates of caribbean", facade.didYouMean("pirates of 
> caribbean"));
> // depending on how many hits and goals are noted with these two queries
> // perhaps the delta should be added to a synonym dictionary? 
> assertEquals("homm iv", facade.didYouMean("homm 4"));
> // not yet known.. and we have no second level yet.
> assertNull(facade.didYouMean("the pilates"));
> // use the dictionary built from user queries to build the token phrase 
> and ngram suggester.  
> 
> facade.getDictionary().getPrioritesBySecondLevelSuggester().put(Factory.ngramTokenPhraseSuggesterFactory(facade.getDictionary()),
>  1d);
> // now it's learned
> assertEquals("the pirates", facade.didYouMean("the pilates"));
> // typos
> assertEquals("heroes of might and magic", facade.didYouMean("heroes of 
> fight and magic"));
> assertEquals("heroes of might and magic", facade.didYouMean("heroes of 
> right and magic"));
> assertEquals("heroes of might and magic", facade.didYouMean("heroes of 
> magic and light"));
> // composite dictionary key not learned yet..
> assertEquals(null, facade

[jira] Commented: (LUCENE-778) Allow overriding a Document

2007-03-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_1243
 ] 

Nicolas Lalevée commented on LUCENE-778:


A marker interface is a nice idea, but I think it will make Document 
handling more painful. In my use case it would not be optimal.
In our application, we have a kind of workflow of Documents. We have a 
big/master index holding all the data in the system, and then we have 
specialized indexes, each a subset of the big one. The big one is for 
running big global queries on the whole data set. The specialized ones are 
tailored to the end application. So the workflow is:
* from the raw data, build Documents and index them in the master index
* for each specialized index:
** run the specific query on the master index
** re-decorate each retrieved document with specialized indexed fields
** index the decorated documents in the specialized index

Here I just have to decorate the Documents retrieved from the master index. 
With incompatible interfaces, this won't be possible anymore: I would have 
to re-instantiate a Document each time and repopulate it.
So why not keep IndexWriter#addDocument(Document), and just change 
IndexReader#doc(int) to return a kind of DocumentWithOnlyStoredData, with 
DocumentWithOnlyStoredData extends Document? (The proposed name is horrible, 
I know!)
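The master/specialized workflow described above can be sketched with plain 
collections standing in for the Lucene indexes (the field names and the 
filter query are made up for illustration; no Lucene API is used):

```java
import java.util.*;

// Toy model of the workflow above: a "document" is a field map and an
// "index" is a list of documents. Lucene itself is not involved.
public class WorkflowSketch {

  // One pass of the workflow for a single specialized index:
  // query the master, decorate each hit, and "index" it separately.
  static List<Map<String, String>> specialize(List<Map<String, String>> master) {
    List<Map<String, String>> specialized = new ArrayList<>();
    for (Map<String, String> doc : master) {
      if (!"book".equals(doc.get("type"))) continue; // the specific query (made up)
      Map<String, String> decorated = new HashMap<>(doc); // decorate, don't rebuild
      decorated.put("shelf", "fiction"); // hypothetical specialized field
      specialized.add(decorated);
    }
    return specialized;
  }

  public static void main(String[] args) {
    // 1) index the raw data into the master index
    List<Map<String, String>> master = new ArrayList<>();
    master.add(new HashMap<>(Map.of("id", "1", "type", "book")));
    master.add(new HashMap<>(Map.of("id", "2", "type", "film")));

    // 2) build one specialized index from it
    List<Map<String, String>> specialized = specialize(master);
    System.out.println(specialized.size());              // prints "1"
    System.out.println(specialized.get(0).get("shelf")); // prints "fiction"
  }
}
```

The point of the comment is the decoration step: the retrieved document is 
reused as-is, which is exactly what incompatible reader/writer interfaces 
would forbid.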


> Allow overriding a Document
> ---
>
> Key: LUCENE-778
> URL: https://issues.apache.org/jira/browse/LUCENE-778
> Project: Lucene - Java
>  Issue Type: New Feature
>Affects Versions: 2.0.0
>Reporter: Nicolas Lalevée
>Priority: Trivial
>
> In our application, we have some kind of generic API that handles how we 
> use Lucene. The various other applications use this API with different 
> semantics, and use the Lucene fields quite differently. We wrote some 
> useful functions to do this mapping. Today, as the Document class cannot 
> be overridden, we are obliged to make a document wrapper per application, 
> i.e. some MyAppDocument and MyOtherAppDocument which have a property 
> holding a real Lucene Document. Then, when MyApp or MyOtherApp wants to 
> use our generic Lucene API, we have to "get out" the Lucene document, 
> i.e. do some genericLuceneAPI.writeDoc(myAppDoc.getLuceneDocument()). 
> This works fine, but it becomes quite tricky to use the other function of 
> our generic API, which is genericLuceneAPI.writeDocs(Collection docs).
> I don't know the rationale behind making Document final, but removing it 
> would allow more object-oriented code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-662) Extendable writer and reader of field data

2007-03-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_1244
 ] 

Nicolas Lalevée commented on LUCENE-662:


Thanks Michael!
I would appreciate a review and feedback, as this will open up the API a 
lot; it goes even further than just making Document public (LUCENE-778).

> Extendable writer and reader of field data
> --
>
> Key: LUCENE-662
> URL: https://issues.apache.org/jira/browse/LUCENE-662
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Nicolas Lalevée
>Priority: Minor
> Attachments: entrytable.patch, generic-fieldIO-2.patch, 
> generic-fieldIO-3.patch, generic-fieldIO-4.patch, generic-fieldIO-5.patch, 
> generic-fieldIO.patch, indexFormat.patch
>
>
> As discussed on the dev mailing list, I have modified Lucene to allow 
> defining how the data of a field is written and read in the index.
> Basically, I have introduced the notion of an IndexFormat. It is in fact a 
> factory of FieldsWriter and FieldsReader. So the IndexReader, the IndexWriter 
> and the SegmentMerger use this factory instead of doing a "new 
> FieldsReader/Writer()".
> I have also introduced the notion of FieldData. It handles all the data of a 
> field, and also the writing and the reading in a stream. I did it this way 
> because in the current design of Lucene, Fieldable is an interface, so methods 
> with protected or package visibility cannot be defined.
> A FieldsWriter just writes data into a stream via the FieldData of the field.
> A FieldsReader instantiates a FieldData depending on the field name. Then it 
> uses the field data to read the stream. And finally it instantiates a Field 
> with the field data.
> About compatibility, I think it is kept, as I have written a 
> DefaultIndexFormat that provides a DefaultFieldsWriter and a 
> DefaultFieldsReader. These implementations do the exact job that is done 
> today.
> To achieve this modification, some classes and methods had to be moved from 
> private and/or final to public or protected.
> About the lazy fields, I have implemented them in a more general way in the 
> implementation of the abstract class FieldData, so it will be totally 
> transparent for the Lucene user who extends FieldData. The stream is 
> kept in the FieldData and used as soon as stringValue (or something else) 
> is called. Implementing it this way allowed me to handle the recently 
> introduced LOAD_FOR_MERGE; it is just a lazy field data, and when read() is 
> called on this lazy field data, the saved input stream is directly copied to 
> the output stream.
> I have one last issue with this patch. The current design allows reading an 
> index in an old format and just doing a writer.addIndexes() into a new 
> format. With the new design, you cannot, because the writer will use the 
> FieldData.write provided by the reader.
> enjoy !

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-778) Allow overriding a Document

2007-03-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477962
 ] 

Nicolas Lalevée commented on LUCENE-778:


Note that I was talking about the future API, with the deprecated functions 
removed. So the API would look like:
class IndexReader {
  ReturnableDocument doc(int n);
}
class IndexWriter {
  void addDocument(IndexableDocument doc);
}

1) It does matter from the user's point of view: we can no longer do 
writer.addDocument(reader.doc(10)).

2) Effectively, I can implement a DecoratedDocument. But I cannot make Lucene 
instantiate my own document; the reader will still return some 
ReturnableDocument. Unless you want to allow the user to customize the 
instantiation of Documents in Lucene by providing a Document factory?
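Such a Document factory might look like the following self-contained sketch 
(ReturnableDocument, DocumentFactory, and the reader hook are hypothetical 
names modeled on this discussion, not real Lucene API):

```java
// Hypothetical sketch of a pluggable Document factory, modeled on the
// discussion above; none of these types exist in Lucene itself.
class ReturnableDocument {
  final java.util.Map<String, String> stored = new java.util.HashMap<>();
}

interface DocumentFactory<D extends ReturnableDocument> {
  D newDocument(); // the reader would call this instead of "new ReturnableDocument()"
}

// an application subclass: the reader can now instantiate *this* type
class MyAppDocument extends ReturnableDocument {
  String title() { return stored.get("title"); } // app-specific accessor
}

public class FactorySketch {
  // stand-in for IndexReader#doc(int): fills whatever the factory creates
  static <D extends ReturnableDocument> D doc(int n, DocumentFactory<D> factory) {
    D d = factory.newDocument();
    d.stored.put("title", "doc-" + n); // pretend these are the stored fields
    return d;
  }

  public static void main(String[] args) {
    MyAppDocument d = doc(10, MyAppDocument::new);
    System.out.println(d.title()); // prints "doc-10"
  }
}
```

With a hook like this, the application gets its own Document subtype back 
from the reader instead of decorating after the fact.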

> Allow overriding a Document
> ---
>
> Key: LUCENE-778
> URL: https://issues.apache.org/jira/browse/LUCENE-778
> Project: Lucene - Java
>  Issue Type: New Feature
>Affects Versions: 2.0.0
>Reporter: Nicolas Lalevée
>Priority: Trivial
>
> In our application, we have some kind of generic API that handles how we 
> use Lucene. The various other applications use this API with different 
> semantics, and use the Lucene fields quite differently. We wrote some 
> useful functions to do this mapping. Today, as the Document class cannot 
> be overridden, we are obliged to make a document wrapper per application, 
> i.e. some MyAppDocument and MyOtherAppDocument which have a property 
> holding a real Lucene Document. Then, when MyApp or MyOtherApp wants to 
> use our generic Lucene API, we have to "get out" the Lucene document, 
> i.e. do some genericLuceneAPI.writeDoc(myAppDoc.getLuceneDocument()). 
> This works fine, but it becomes quite tricky to use the other function of 
> our generic API, which is genericLuceneAPI.writeDocs(Collection docs).
> I don't know the rationale behind making Document final, but removing it 
> would allow more object-oriented code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-778) Allow overriding a Document

2007-03-06 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478316
 ] 

Nicolas Lalevée commented on LUCENE-778:


Hoss> OK, I got it.
In fact my concern was about semantics. I agree with you that the API makes 
the Lucene user think that all indexed data is retrieved when doing 
reader.doc(int i) (as I thought in my first days using Lucene). But here you 
are proposing to completely separate indexing from retrieving, saying that 
they are not compatible with each other. I think this is wrong, because you 
can basically retrieve a document and push the same one back into the index, 
even if that makes no sense. But these were purely semantic concerns.
Now, looking at the implementation you are proposing of a 
"YourDocumentWrapper", we can make it work correctly without any performance 
issue. So I won't start a war if such a design is implemented ;)

Grant> In fact the discussion drifted from the original issue. BTW, this 
would be nice!

> Allow overriding a Document
> ---
>
> Key: LUCENE-778
>     URL: https://issues.apache.org/jira/browse/LUCENE-778
> Project: Lucene - Java
>  Issue Type: New Feature
>Affects Versions: 2.0.0
>Reporter: Nicolas Lalevée
>Priority: Trivial
>
> In our application, we have some kind of generic API that handles how we 
> use Lucene. The various other applications use this API with different 
> semantics, and use the Lucene fields quite differently. We wrote some 
> useful functions to do this mapping. Today, as the Document class cannot 
> be overridden, we are obliged to make a document wrapper per application, 
> i.e. some MyAppDocument and MyOtherAppDocument which have a property 
> holding a real Lucene Document. Then, when MyApp or MyOtherApp wants to 
> use our generic Lucene API, we have to "get out" the Lucene document, 
> i.e. do some genericLuceneAPI.writeDoc(myAppDoc.getLuceneDocument()). 
> This works fine, but it becomes quite tricky to use the other function of 
> our generic API, which is genericLuceneAPI.writeDocs(Collection docs).
> I don't know the rationale behind making Document final, but removing it 
> would allow more object-oriented code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-755) Payloads

2007-03-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479841
 ] 

Nicolas Lalevée commented on LUCENE-755:


Grant>
The patch I have proposed here has no dependency on LUCENE-662; I just 
"imported" some ideas from it and put them there. Since LUCENE-662 has 
evolved, the patches will probably conflict. The best one to use here is 
Michael's. I think it won't conflict with LUCENE-662. And if both are 
intended to be committed, then it is best to commit them separately and 
redo the work I have done with the provided patch (I remember that it was 
quite easy).


> Payloads
> 
>
> Key: LUCENE-755
>     URL: https://issues.apache.org/jira/browse/LUCENE-755
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Michael Busch
> Assigned To: Michael Busch
> Attachments: payload.patch, payloads.patch
>
>
> This patch adds the possibility to store arbitrary metadata (payloads) 
> together with each position of a term in its posting lists. A while ago this 
> was discussed on the dev mailing list, where I proposed an initial design. 
> This patch has a much improved design with modifications, that make this new 
> feature easier to use and more efficient.
> A payload is an array of bytes that can be stored inline in the ProxFile 
> (.prx). Therefore this patch provides low-level APIs to simply store and 
> retrieve byte arrays in the posting lists in an efficient way. 
> API and Usage
> --   
> The new class index.Payload is basically just a wrapper around a byte[] array 
> together with int variables for offset and length. So a user does not have to 
> create a byte array for every payload, but can rather allocate one array for 
> all payloads of a document and provide offset and length information. This 
> reduces object allocations on the application side.
> In order to store payloads in the posting lists one has to provide a 
> TokenStream or TokenFilter that produces Tokens with payloads. I added the 
> following two methods to the Token class:
>   /** Sets this Token's payload. */
>   public void setPayload(Payload payload);
>   
>   /** Returns this Token's payload. */
>   public Payload getPayload();
> In order to retrieve the data from the index the interface TermPositions now 
> offers two new methods:
>   /** Returns the payload length of the current term position.
>*  This is invalid until {@link #nextPosition()} is called for
>*  the first time.
>* 
>* @return length of the current payload in number of bytes
>*/
>   int getPayloadLength();
>   
>   /** Returns the payload data of the current term position.
>* This is invalid until {@link #nextPosition()} is called for
>* the first time.
>* This method must not be called more than once after each call
>* of {@link #nextPosition()}. However, payloads are loaded 
> lazily,
>* so if the payload data for the current position is not needed,
>* this method may not be called at all for performance reasons.
>* 
>* @param data the array into which the data of this payload is to be
>* stored, if it is big enough; otherwise, a new byte[] array
>* is allocated for this purpose. 
>* @param offset the offset in the array into which the data of this payload
>*   is to be stored.
>* @return a byte[] array containing the data of this payload
>* @throws IOException
>*/
>   byte[] getPayload(byte[] data, int offset) throws IOException;
> Furthermore, this patch introduces the new method 
> IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was 
> only a writeBytes()-method without an offset argument. 
> Implementation details
> --
> - One field bit in FieldInfos is used to indicate if payloads are enabled for 
> a field. The user does not have to enable payloads for a field, this is done 
> automatically:
>* The DocumentWriter enables payloads for a field if one or more Tokens 
> carry payloads.
>* The SegmentMerger enables payloads for a field during a merge, if 
> payloads are enabled for that field in one or more segments.
> - Backwards compatible: If payloads are not used, then the formats of the 
> ProxFile and FreqFile don't change
> - Payloads are stored inline in the posting list of a term in the ProxFile. A 
> payload of a term occurrence is stored right after its PositionDelta.
> - Same-length compress

[jira] Commented: (LUCENE-662) Extendable writer and reader of field data

2007-03-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479843
 ] 

Nicolas Lalevée commented on LUCENE-662:


Hum... same here. This is due to some svn mv, and I created the patch with 
svn diff.
I can provide a patch with the complete diff, but you will lose the svn mv 
info, so the svn history of the files will be lost.
Any advice is welcome. I will also ask my colleagues on Monday how they 
usually work with svn mv and patches.

> Extendable writer and reader of field data
> --
>
> Key: LUCENE-662
> URL: https://issues.apache.org/jira/browse/LUCENE-662
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Nicolas Lalevée
>Priority: Minor
> Attachments: entrytable.patch, generic-fieldIO-2.patch, 
> generic-fieldIO-3.patch, generic-fieldIO-4.patch, generic-fieldIO-5.patch, 
> generic-fieldIO.patch, indexFormat.patch
>
>
> As discussed on the dev mailing list, I have modified Lucene to allow 
> defining how the data of a field is written and read in the index.
> Basically, I have introduced the notion of an IndexFormat. It is in fact a 
> factory of FieldsWriter and FieldsReader. So the IndexReader, the IndexWriter 
> and the SegmentMerger use this factory instead of doing a "new 
> FieldsReader/Writer()".
> I have also introduced the notion of FieldData. It handles all the data of a 
> field, and also the writing and the reading in a stream. I did it this way 
> because in the current design of Lucene, Fieldable is an interface, so methods 
> with protected or package visibility cannot be defined.
> A FieldsWriter just writes data into a stream via the FieldData of the field.
> A FieldsReader instantiates a FieldData depending on the field name. Then it 
> uses the field data to read the stream. And finally it instantiates a Field 
> with the field data.
> About compatibility, I think it is kept, as I have written a 
> DefaultIndexFormat that provides a DefaultFieldsWriter and a 
> DefaultFieldsReader. These implementations do the exact job that is done 
> today.
> To achieve this modification, some classes and methods had to be moved from 
> private and/or final to public or protected.
> About the lazy fields, I have implemented them in a more general way in the 
> implementation of the abstract class FieldData, so it will be totally 
> transparent for the Lucene user who extends FieldData. The stream is 
> kept in the FieldData and used as soon as stringValue (or something else) 
> is called. Implementing it this way allowed me to handle the recently 
> introduced LOAD_FOR_MERGE; it is just a lazy field data, and when read() is 
> called on this lazy field data, the saved input stream is directly copied to 
> the output stream.
> I have one last issue with this patch. The current design allows reading an 
> index in an old format and just doing a writer.addIndexes() into a new 
> format. With the new design, you cannot, because the writer will use the 
> FieldData.write provided by the reader.
> enjoy !

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-778) Allow overriding a Document

2007-03-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479861
 ] 

Nicolas Lalevée commented on LUCENE-778:


Rethinking the function "public void document(int n, Document doc)": in fact it 
would completely break the work I have done for LUCENE-662.

So finally, I agree with you, Hoss: two different interfaces, and let the user 
implement the document he wants. As a first step, the user will decorate his 
document, and in a second step, Lucene could give the user the possibility of 
having his own DocumentFactory.
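The decorator step mentioned above can be sketched as follows. This is a hypothetical illustration: the Document class below is a stand-in for the real (final) org.apache.lucene.document.Document, so the example runs without Lucene on the classpath, and MyAppDocument mirrors the per-application wrapper named in the issue description.

```java
import java.util.HashMap;
import java.util.Map;

// Self-contained stand-in for org.apache.lucene.document.Document.
final class Document {
    private final Map<String, String> fields = new HashMap<>();
    void add(String name, String value) { fields.put(name, value); }
    String get(String name) { return fields.get(name); }
}

// Application-specific decorator holding the real Lucene Document.
class MyAppDocument {
    private final Document doc = new Document();

    // App-level semantics mapped onto generic Lucene fields.
    void setTitle(String title) { doc.add("title", title); }

    // The generic API still has to "get out" the underlying Document.
    Document getLuceneDocument() { return doc; }
}

public class DecoratorDemo {
    public static void main(String[] args) {
        MyAppDocument appDoc = new MyAppDocument();
        appDoc.setTitle("hello");
        System.out.println(appDoc.getLuceneDocument().get("title")); // prints "hello"
    }
}
```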


> Allow overriding a Document
> ---
>
> Key: LUCENE-778
> URL: https://issues.apache.org/jira/browse/LUCENE-778
> Project: Lucene - Java
>  Issue Type: New Feature
>Affects Versions: 2.0.0
>Reporter: Nicolas Lalevée
>Priority: Trivial
>
> In our application, we have a kind of generic API that handles how we use 
> Lucene. The various other applications use this API with different semantics, 
> and use the Lucene fields quite differently. We wrote some useful functions to 
> do this mapping. Today, as the Document class cannot be overridden, we are 
> obliged to make one document wrapper per application, i.e. some MyAppDocument 
> and MyOtherAppDocument which have a property holding a real Lucene Document. 
> Then, when MyApp or MyOtherApp wants to use our generic Lucene API, we have to 
> "get out" the Lucene document, i.e. do 
> genericLuceneAPI.writeDoc(myAppDoc.getLuceneDocument()). This works fine, but 
> it becomes quite tricky to use the other function of our generic API, which is 
> genericLuceneAPI.writeDocs(Collection docs).
> I don't know the rationale behind making Document final, but removing it would 
> allow more object-oriented code.




[jira] Updated: (LUCENE-662) Extendable writer and reader of field data

2007-03-12 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Lalevée updated LUCENE-662:
---

Attachment: indexFormat.patch

Patch updated and synchronized with the trunk at r517330.
I have removed the "svn mv" I had done, so the patch now applies cleanly on a 
fresh trunk. The svn mv was only about creating the impl package, so everything 
is back in o.a.l.index.

A note about the last trunk commit I merged: lazy loading of the "proxstream". 
That feature is lost in this patch. I didn't take the time to merge it 
properly. I think it is highly feasible, just not done yet. So: a new item on 
the TODO list.

> Extendable writer and reader of field data
> --
>
> Key: LUCENE-662
> URL: https://issues.apache.org/jira/browse/LUCENE-662
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Store
>Reporter: Nicolas Lalevée
>Priority: Minor
> Attachments: entrytable.patch, generic-fieldIO-2.patch, 
> generic-fieldIO-3.patch, generic-fieldIO-4.patch, generic-fieldIO-5.patch, 
> generic-fieldIO.patch, indexFormat.patch, indexFormat.patch
>
>
> As discussed on the dev mailing list, I have modified Lucene to allow 
> defining how the data of a field is written and read in the index.
> Basically, I have introduced the notion of an IndexFormat. It is in fact a 
> factory of FieldsWriter and FieldsReader, so the IndexReader, the IndexWriter 
> and the SegmentMerger use this factory instead of doing a "new 
> FieldsReader/Writer()".
> I have also introduced the notion of FieldData. It handles all the data of a 
> field, as well as writing to and reading from a stream. I did it this way 
> because in the current design of Lucene, Fieldable is an interface, so methods 
> with protected or package visibility cannot be defined on it.
> A FieldsWriter just writes data into a stream via the FieldData of the field.
> A FieldsReader instantiates a FieldData depending on the field name, then uses 
> the field data to read the stream, and finally instantiates a Field with the 
> field data.
> About compatibility, I think it is preserved, as I have written a 
> DefaultIndexFormat that provides a DefaultFieldsWriter and a 
> DefaultFieldsReader. These implementations do exactly the job that is done 
> today.
> To achieve this modification, some classes and methods had to be changed from 
> private and/or final to public or protected.
> About lazy fields, I have implemented them in a more general way in the 
> abstract class FieldData, so they will be totally transparent to the Lucene 
> user who extends FieldData. The stream is kept in the FieldData and used as 
> soon as stringValue() (or something else) is called. Implementing it this way 
> allowed me to handle the recently introduced LOAD_FOR_MERGE; it is just a lazy 
> FieldData, and when read() is called on it, the saved input stream is copied 
> directly into the output stream.
> I have one last issue with this patch. The current design allows reading an 
> index in an old format and simply doing a writer.addIndexes() into a new 
> format. With the new design, you cannot, because the writer will use the 
> FieldData.write provided by the reader.
> enjoy !
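A minimal sketch of the factory arrangement described above. The interfaces here are hypothetical stand-ins for the patch's actual classes, kept empty so the example is self-contained:

```java
// Illustrative, not actual Lucene APIs.
interface FieldsWriter { /* writes FieldData to the stored-fields stream */ }
interface FieldsReader { /* instantiates FieldData from the stream */ }

// The factory: IndexReader, IndexWriter and SegmentMerger would ask it
// for writers/readers instead of calling "new FieldsReader/Writer()".
interface IndexFormat {
    FieldsWriter newFieldsWriter();
    FieldsReader newFieldsReader();
}

// Mirrors the DefaultIndexFormat mentioned in the patch: it reproduces
// today's behavior so existing indexes keep working unchanged.
class DefaultIndexFormat implements IndexFormat {
    public FieldsWriter newFieldsWriter() { return new FieldsWriter() {}; }
    public FieldsReader newFieldsReader() { return new FieldsReader() {}; }
}

public class IndexFormatSketch {
    public static void main(String[] args) {
        IndexFormat format = new DefaultIndexFormat();
        System.out.println(format.newFieldsWriter() != null
                && format.newFieldsReader() != null); // prints "true"
    }
}
```

The design choice is plain dependency inversion: index components depend on the IndexFormat abstraction, and a custom format is swapped in by supplying a different factory.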




[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www

2009-02-21 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675585#action_12675585
 ] 

Felipe Sánchez Martínez commented on LUCENE-1284:
-

Kind reminder
-

Otis,

could you check whether everything is OK with the last attachment (from May 2008)?

Thanks a lot
--
Felipe.

> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org)
> --
>
> Key: LUCENE-1284
> URL: https://issues.apache.org/jira/browse/LUCENE-1284
> Project: Lucene - Java
>  Issue Type: New Feature
> Environment: New feature developed under GNU/Linux, but it should 
> work in any other Java-compliance platform
>Reporter: Felipe Sánchez Martínez
>Assignee: Otis Gospodnetic
> Attachments: apertium-morph.2008-05-19.tgz
>
>
> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org). Morphological information is used to 
> index new documents and to process smarter queries in which morphological 
> attributes can be used to specify query terms.
> The tool makes use of morphological analyzers and dictionaries developed for 
> the open-source machine translation platform Apertium (http://apertium.org) 
> and, optionally, the part-of-speech taggers developed for it. Currently there 
> are morphological dictionaries available for Spanish, Catalan, Galician, 
> Portuguese, Aranese, Romanian, French and English. In addition, new 
> dictionaries are being developed for Esperanto, Occitan, Basque, Swedish, 
> Danish, Welsh, Polish and Italian, among others; we hope more language pairs 
> will be added to the Apertium machine translation platform in the near future.




[jira] Updated: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.a

2009-02-21 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felipe Sánchez Martínez updated LUCENE-1284:


Attachment: apertium-morph.0.9.0.tgz

> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org)
> --
>
> Key: LUCENE-1284
> URL: https://issues.apache.org/jira/browse/LUCENE-1284
> Project: Lucene - Java
>  Issue Type: New Feature
> Environment: New feature developed under GNU/Linux, but it should 
> work in any other Java-compliance platform
>Reporter: Felipe Sánchez Martínez
>Assignee: Otis Gospodnetic
> Attachments: apertium-morph.0.9.0.tgz
>
>




[jira] Updated: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.a

2009-02-21 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felipe Sánchez Martínez updated LUCENE-1284:


Attachment: (was: apertium-morph.2008-05-19.tgz)

> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org)
> --
>
> Key: LUCENE-1284
> URL: https://issues.apache.org/jira/browse/LUCENE-1284
> Project: Lucene - Java
>  Issue Type: New Feature
> Environment: New feature developed under GNU/Linux, but it should 
> work in any other Java-compliance platform
>Reporter: Felipe Sánchez Martínez
>Assignee: Otis Gospodnetic
> Attachments: apertium-morph.0.9.0.tgz
>
>




[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www

2009-02-21 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675591#action_12675591
 ] 

Felipe Sánchez Martínez commented on LUCENE-1284:
-

I have uploaded the package as it was released as part of the Apertium project 
(http://www.apertium.org).  It contains a brief README file and an example of 
use in  the "example" folder. 

To benefit from this package, the texts to be indexed need to be preprocessed 
using some Apertium tools. These tools can be downloaded from the Apertium page 
at SourceForge (http://sourceforge.net/projects/apertium/). You need to install 
the following packages: lttoolbox, apertium, and the linguistic package you are 
interested in (named apertium-xx-yy).

Mark, could you point me to the discussion about the @author tag?

--
Felipe.

> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org)
> --
>
> Key: LUCENE-1284
> URL: https://issues.apache.org/jira/browse/LUCENE-1284
> Project: Lucene - Java
>  Issue Type: New Feature
> Environment: New feature developed under GNU/Linux, but it should 
> work in any other Java-compliance platform
>Reporter: Felipe Sánchez Martínez
>Assignee: Otis Gospodnetic
> Attachments: apertium-morph.0.9.0.tgz
>
>




[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www

2009-04-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697434#action_12697434
 ] 

Felipe Sánchez Martínez commented on LUCENE-1284:
-

Hi Otis,

The package I submitted to Lucene has a dual license, so it is both GPL v2.0 
and ASL at the same time. Is this a problem? Apertium is GPL v2.

There is a huge community around Apertium developing language pairs for it. In 
fact, Apertium is in the Google Summer of Code this year. The language pairs 
mentioned in http://wiki.apertium.org/wiki/List_of_language_pairs are those 
under development; the language pairs you can download from SourceForge 
(http://sourceforge.net/projects/apertium/; packages named apertium-xx-yy) are 
the ones that have been released; in any case, they are updated from time to 
time with further improvements. Their version numbers will help you form an 
idea of the state of development and the translation quality you can expect.

Hope this helps
--
Felipe.

> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org)
> --
>
> Key: LUCENE-1284
> URL: https://issues.apache.org/jira/browse/LUCENE-1284
> Project: Lucene - Java
>  Issue Type: New Feature
> Environment: New feature developed under GNU/Linux, but it should 
> work in any other Java-compliance platform
>Reporter: Felipe Sánchez Martínez
>Assignee: Otis Gospodnetic
> Attachments: apertium-morph.0.9.0.tgz
>
>




[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www

2009-04-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698732#action_12698732
 ] 

Felipe Sánchez Martínez commented on LUCENE-1284:
-

Hi Otis,

The Java code I contributed is ASL and GPLv2  (dual license). Apertium tools 
and data are GPL v2.


>  Why are they in pairs? Is that simply for the translation part of Apertium, 
> and  something that's ignored when you use the pair for Lucene and 
> morphological analysis?

Yes, they are language pairs because of the translation. If you are not 
interested in translation (as is our case), you can use whichever language pair 
contains the language you are interested in; choose the pair with the highest 
number of lemmata, probably the one with the highest version number.

> Do you mind replacing the deprecated Hits object in the Searcher class?

Which new class should I use?

> Could you explain why the removal of multiword expressions is needed?

Multiword units need to be removed from the dictionary mainly because they are 
there to facilitate the correct translation of some expressions to the target 
language. This is not Spanish specific and should be done in all cases.


> So these are a few command-line tools that end up marking up the input text 
> with POS? 

Yes. 

> I seem to be missing some libraries and can't compile Apertium locally to 
> check what this marked-up file looks like.

You need to install lttoolbox,  you can download it from the Apertium web page.

> But my main question here is whether there are Java equivalents of these 
> command-line tools,

Unfortunately, no :(

Regards.
--
Felipe

> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org)
> --
>
> Key: LUCENE-1284
> URL: https://issues.apache.org/jira/browse/LUCENE-1284
> Project: Lucene - Java
>  Issue Type: New Feature
> Environment: New feature developed under GNU/Linux, but it should 
> work in any other Java-compliance platform
>Reporter: Felipe Sánchez Martínez
>Assignee: Otis Gospodnetic
> Attachments: apertium-morph.0.9.0.tgz
>
>




[jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www

2009-04-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703670#action_12703670
 ] 

Felipe Sánchez Martínez commented on LUCENE-1284:
-

Hi, 

I think the fact that the tool relies on an external free/open-source package 
to pre-process the files to be indexed should not prevent the community from 
benefiting from it; the world is pretty heterogeneous ;). Furthermore, those 
tools are not required at search time.

> Felipe, although Java equivalents of those command-line tools don't exist 
> currently, do you think one could implement them in Java (and release them 
> under ASL)? 

This year the Apertium project is in the Google Summer of Code. A student will 
port the lttoolbox package to Java. Note that the tool I contributed also uses 
the Apertium tagger, and that tool will not be ported; fortunately, use of the 
tagger is optional. The Java version of lttoolbox will be released under the 
GPL license; I am not sure whether they will agree to give it a dual license.

--
Felipe

> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org)
> --
>
> Key: LUCENE-1284
> URL: https://issues.apache.org/jira/browse/LUCENE-1284
> Project: Lucene - Java
>  Issue Type: New Feature
> Environment: New feature developed under GNU/Linux, but it should 
> work in any other Java-compliance platform
>Reporter: Felipe Sánchez Martínez
>Assignee: Otis Gospodnetic
> Attachments: apertium-morph.0.9.0.tgz
>
>




[jira] Commented: (LUCENE-1342) 64bit JVM crashes on Linux

2009-06-26 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724422#action_12724422
 ] 

Michael Böckling commented on LUCENE-1342:
--

We've just run into this bug with Lucene 2.1.0 and JDK 1.6.0_07-b06.

Is there any news on this issue? Sun can't ignore a HotSpot compiler bug, can 
they? I can contribute a crash log if desired.

> 64bit JVM crashes on Linux
> --
>
> Key: LUCENE-1342
> URL: https://issues.apache.org/jira/browse/LUCENE-1342
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: 2.6.18-53.el5 x86_64  GNU/Linux
> Java(TM) SE Runtime Environment (build 1.6.0_04-b12)
>Reporter: Kevin Richards
> Attachments: hs_err_pid10565.log, hs_err_pid21301.log, 
> hs_err_pid27882.log
>
>
> Whilst running Lucene in our QA environment, we received the following 
> exception. This problem was also reported here: 
> http://confluence.atlassian.com/display/KB/JSP-20240+-+POSSIBLE+64+bit+JDK+1.6+update+4+may+have+HotSpot+problems.
> Is this a JVM problem or a problem in Lucene?
> #
> # An unexpected error has been detected by Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x2adb9e3f, pid=2275, tid=1085356352
> #
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (10.0-b19 mixed mode linux-amd64)
> # Problematic frame:
> # V  [libjvm.so+0x1fce3f]
> #
> # If you would like to submit a bug report, please visit:
> #   http://java.sun.com/webapps/bugreport/crash.jsp
> #
> ---  T H R E A D  ---
> Current thread (0x2aab0007f000):  JavaThread "CompilerThread0" daemon 
> [_thread_in_vm, id=2301, stack(0x40a13000,0x40b14000)]
> siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), 
> si_addr=0x
> Registers:
> RAX=0x, RBX=0x2aab0007f000, RCX=0x, 
> RDX=0x2aab00309aa0
> RSP=0x40b10f60, RBP=0x40b10fb0, RSI=0x2aaab37d1ce8, 
> RDI=0x2aaad000
> R8 =0x2b40cd88, R9 =0x0ffc, R10=0x2b40cd90, 
> R11=0x2b410810
> R12=0x2aab00ae60b0, R13=0x2aab0a19cc30, R14=0x40b112f0, 
> R15=0x2aab00ae60b0
> RIP=0x2adb9e3f, EFL=0x00010246, CSGSFS=0x0033, 
> ERR=0x0004
>   TRAPNO=0x000e
> Top of Stack: (sp=0x40b10f60)
> 0x40b10f60:   2aab0007f000 
> 0x40b10f70:   2aab0a19cc30 0001
> 0x40b10f80:   2aab0007f000 
> 0x40b10f90:   40b10fe0 2aab0a19cc30
> 0x40b10fa0:   2aab0a19cc30 2aab00ae60b0
> 0x40b10fb0:   40b10fe0 2ae9c2e4
> 0x40b10fc0:   2b413210 2b413350
> 0x40b10fd0:   40b112f0 2aab09796260
> 0x40b10fe0:   40b110e0 2ae9d7d8
> 0x40b10ff0:   2b40f3d0 2aab08c2a4c8
> 0x40b11000:   40b11940 2aab09796260
> 0x40b11010:   2aab09795b28 
> 0x40b11020:   2aab08c2a4c8 2aab009b9750
> 0x40b11030:   2aab09796260 40b11940
> 0x40b11040:   2b40f3d0 2023
> 0x40b11050:   40b11940 2aab09796260
> 0x40b11060:   40b11090 2b0f199e
> 0x40b11070:   40b11978 2aab08c2a458
> 0x40b11080:   2b413210 2023
> 0x40b11090:   40b110e0 2b0f1fcf
> 0x40b110a0:   2023 2aab09796260
> 0x40b110b0:   2aab08c2a3c8 40b123b0
> 0x40b110c0:   2aab08c2a458 40b112f0
> 0x40b110d0:   2b40f3d0 2aab00043670
> 0x40b110e0:   40b11160 2b0e808d
> 0x40b110f0:   2aab000417c0 2aab009b66a8
> 0x40b11100:    2aab009b9750
> 0x40b0:   40b112f0 2aab009bb360
> 0x40b11120:   0003 40b113d0
> 0x40b11130:   01002aab0052d0c0 40b113d0
> 0x40b11140:   00b3 40b112f0
> 0x40b11150:   40b113d0 2aab08c2a108 
> Instructions: (pc=0x2adb9e3f)
> 0x2adb9e2f:   48 89 5d b0 49 8b 55 08 49 8b 4c 24 08 48 8b 32
> 0x2adb9e3f:   4c 8b 21 8b 4e 1c 49 8d 7c 24 10 89 cb 4a 39 34 
> Stack: [0x40a13000,0x40b14000],  sp=0x40b10f60,  free 
> space=1015k
> Native frames: (J=compiled Java cod

[jira] Created: (LUCENE-1918) Adding empty ParallelReader indexes to an IndexWriter may cause ArrayIndexOutOfBoundsException or NoSuchElementException

2009-09-17 Thread JIRA
Adding empty ParallelReader indexes to an IndexWriter may cause 
ArrayIndexOutOfBoundsException or NoSuchElementException


 Key: LUCENE-1918
 URL: https://issues.apache.org/jira/browse/LUCENE-1918
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4.1, 2.4.2, 2.9
 Environment: any
Reporter: Christian Kohlschütter
 Fix For: 2.9, 2.4.1


Hi,
I recently stumbled upon this:

It is possible (and perfectly legal) to add empty indexes (IndexReaders) to an 
IndexWriter. However, when using ParallelReaders in this context, in two 
situations RuntimeExceptions may occur for no good reason.

Condition 1:
The indexes within the ParallelReader are just empty.

When adding them to the IndexWriter, we get a java.util.NoSuchElementException 
triggered by ParallelTermEnum's constructor. The reason is the 
TreeMap#firstKey() method, which was assumed to return null if there is no 
entry (which is apparently not true -- it only returns null if the first key in 
the Map is null).


Condition 2 (Assuming the aforementioned bug is fixed):
The indexes within the ParallelReader originally contained one or more fields 
with TermVectors, but all documents have been marked as deleted.

When adding the indexes to the IndexWriter, we get a 
java.lang.ArrayIndexOutOfBoundsException triggered by 
TermVectorsWriter#addAllDocVectors. The reason here is that TermVectorsWriter 
assumes that if the index is marked to have TermVectors, at least one field 
actually exists for that. This unfortunately is not true, either.

Patches and a testcase demonstrating the two bugs are provided.

Cheers,
Christian
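Condition 1 boils down to a TreeMap property that is easy to verify directly: firstKey() does not return null on an empty map, it throws. The sketch below is illustrative (the variable name fieldToReader is hypothetical, not the actual ParallelReader field):

```java
import java.util.NoSuchElementException;
import java.util.TreeMap;

public class EmptyTreeMapDemo {
    public static void main(String[] args) {
        // Effectively what ParallelTermEnum's constructor did for an empty
        // ParallelReader: ask a TreeMap with no entries for its first key.
        TreeMap<String, Object> fieldToReader = new TreeMap<>();
        boolean threw = false;
        try {
            fieldToReader.firstKey(); // never returns null on an empty map
        } catch (NoSuchElementException e) {
            threw = true; // it throws instead
        }
        System.out.println(threw); // prints "true"
    }
}
```

So any caller that treats a null return as the "empty" signal needs an explicit isEmpty() check instead.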




[jira] Updated: (LUCENE-1918) Adding empty ParallelReader indexes to an IndexWriter may cause ArrayIndexOutOfBoundsException or NoSuchElementException

2009-09-17 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Kohlschütter updated LUCENE-1918:
---

Attachment: ParallelReaderWithEmptyIndex.patch
ParallelReaderWithEmptyIndex-testcase.patch

Testcase and bugfixes for trunk (should also be applicable to 2.4.1)


> Adding empty ParallelReader indexes to an IndexWriter may cause 
> ArrayIndexOutOfBoundsException or NoSuchElementException
> 
>
> Key: LUCENE-1918
> URL: https://issues.apache.org/jira/browse/LUCENE-1918
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.4.1, 2.4.2, 2.9
> Environment: any
>Reporter: Christian Kohlschütter
> Fix For: 2.4.1, 2.9
>
> Attachments: ParallelReaderWithEmptyIndex-testcase.patch, 
> ParallelReaderWithEmptyIndex.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1186) [PATCH] Clear ThreadLocal instances in close()

2008-02-25 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12572092#action_12572092
 ] 

Christian Kohlschütter commented on LUCENE-1186:


This issue is rather a prophylactic one -- until now, I have not encountered an 
OutOfMemoryError or slowdown, etc.

However, I think it is good practice to release all resources as soon as an 
object is no longer used. For SegmentReader, this is the case when #close() is 
called. Moreover, as noted in LUCENE-436, some VMs (including recent ones) 
indeed seem to have problems when ThreadLocal values are not released, so I 
think it is not just a cosmetic issue.
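A minimal sketch of the cleanup pattern being proposed (a hypothetical class, not the actual SegmentReader code):

```java
class PerThreadBuffers {
    // Each thread lazily gets its own scratch buffer.
    private final ThreadLocal<int[]> buffer = new ThreadLocal<int[]>() {
        @Override protected int[] initialValue() { return new int[1024]; }
    };

    int[] buffer() { return buffer.get(); }

    // On close(), drop the reference to the per-thread value so the GC can
    // reclaim it. On Java >= 5.0 this is ThreadLocal#remove(); the Java < 5.0
    // fallback mentioned in the issue is buffer.set(null).
    void close() { buffer.remove(); }
}
```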


> [PATCH] Clear ThreadLocal instances in close()
> --
>
> Key: LUCENE-1186
> URL: https://issues.apache.org/jira/browse/LUCENE-1186
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.3, 2.3.1, 2.4
> Environment: any
>Reporter: Christian Kohlschütter
>Priority: Minor
> Attachments: LUCENE-1186-SegmentReader.patch
>
>
> As already found out in LUCENE-436, there seems to be a garbage collection 
> problem with ThreadLocals at certain constellations, resulting in an 
> OutOfMemoryError.
> The resolution there was to remove the reference to the ThreadLocal value 
> when calling the close() method of the affected classes (see FieldsReader and 
> TermInfosReader).
> For Java < 5.0, this can effectively be done by calling 
> threadLocal.set(null); for Java >= 5.0, we would call threadLocal.remove()
> Analogously, this should be done in *any* class which creates ThreadLocal 
> values
> Right now, two classes of the core API make use of ThreadLocals, but do not 
> properly remove their references to the ThreadLocal value
> 1. org.apache.lucene.index.SegmentReader
> 2. org.apache.lucene.analysis.Analyzer
> For SegmentReader, I have attached a simple patch.
> For Analyzer, there currently is no patch because Analyzer does not provide a 
> close() method (future to-do?)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1203) [PATCH] Allow setting IndexReader to IndexSearcher

2008-03-06 Thread JIRA
[PATCH] Allow setting IndexReader to IndexSearcher
--

 Key: LUCENE-1203
 URL: https://issues.apache.org/jira/browse/LUCENE-1203
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.1
 Environment: Linux/2.6
Reporter: Mindaugas Žakšauskas
 Attachments: IndexReaderSetter4IndexSearcher.patch

As I've received no counter-arguments on the Lucene Java-User mailing list (see 
http://mail-archives.apache.org/mod_mbox/lucene-java-user/200803.mbox/[EMAIL 
PROTECTED]), I would like to propose adding a setter that sets a new instance of 
IndexReader on an IndexSearcher. 

Why is this needed?

The FAQ 
(http://wiki.apache.org/lucene-java/LuceneFAQ#head-48921635adf2c968f7936dc07d51dfb40d638b82)
 says:
bq. ??"Make sure you only open one IndexSearcher, and share it among all of the 
threads that are doing searches -- this is safe, and it will minimize the 
number of files that are open concurently."??
So does the JavaDoc 
(http://lucene.apache.org/java/2_3_1/api/core/org/apache/lucene/search/IndexSearcher.html).

In my application, I don't want to expose anything about IndexReader; all that 
clients need to know about is the Searcher - see my post to the mailing list for 
how I would do this. However, if the index is updated, the reopened reader 
cannot be set back on the IndexSearcher; a new instance of IndexSearcher needs 
to be created (*which contradicts the FAQ and Javadoc*).

At the moment, the only way around this is to create a surrogate subclass of 
IndexSearcher and set the new instance of IndexReader there. A simple setter 
would just do the job.
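The surrogate-subclass workaround versus the proposed setter can be illustrated with hypothetical stand-in classes (these are NOT the real Lucene types, just a sketch of the pattern):

```java
// Stand-in for an index view at some point in time.
class StubReader {
    final int version;
    StubReader(int version) { this.version = version; }
}

// Stand-in for IndexSearcher: holds a reader but offers no setter.
class StubSearcher {
    protected StubReader reader;
    StubSearcher(StubReader reader) { this.reader = reader; }
    int searchedVersion() { return reader.version; }
}

// Today's workaround: a subclass whose only job is to expose a setter --
// the kind of setter this patch proposes adding to IndexSearcher itself.
class ReopenableSearcher extends StubSearcher {
    ReopenableSearcher(StubReader reader) { super(reader); }
    void setReader(StubReader reader) { this.reader = reader; }
}
```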


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1203) [PATCH] Allow setting IndexReader to IndexSearcher

2008-03-06 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-1203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mindaugas Žakšauskas updated LUCENE-1203:
-

Attachment: IndexReaderSetter4IndexSearcher.patch

> [PATCH] Allow setting IndexReader to IndexSearcher
> --
>
> Key: LUCENE-1203
> URL: https://issues.apache.org/jira/browse/LUCENE-1203
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.3.1
> Environment: Linux/2.6
>Reporter: Mindaugas Žakšauskas
> Attachments: IndexReaderSetter4IndexSearcher.patch
>
>
> As I've received no counter-arguments for my Lucene Java-User mailing list 
> (see 
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200803.mbox/[EMAIL 
> PROTECTED]), I would like to propose adding a setter to set new instance of 
> IndexReader to IndexSearcher. 
> Why is this needed?
> The FAQ 
> (http://wiki.apache.org/lucene-java/LuceneFAQ#head-48921635adf2c968f7936dc07d51dfb40d638b82)
>  says:
> bq. ??"Make sure you only open one IndexSearcher, and share it among all of 
> the threads that are doing searches -- this is safe, and it will minimize the 
> number of files that are open concurently."??
> So does the JavaDoc 
> (http://lucene.apache.org/java/2_3_1/api/core/org/apache/lucene/search/IndexSearcher.html).
> In my application, I don't want to expose anything about IndexReader; all 
> they need to know is Searcher - see my post to the mailing list how would I 
> do this. However, if the index is updated, reopened reader cannot be set back 
> to IndexSearcher, a new instance of IndexSearcher needs to be created (*which 
> contradicts FAQ and Javadoc*).
> At the moment, the only way to go around this is to create a surrogate 
> subclass of IndexSearcher and set new instance of IndexReader. A simple 
> setter would just do the job.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1203) [PATCH] Allow setting IndexReader to IndexSearcher

2008-03-06 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12575757#action_12575757
 ] 

Mindaugas Žakšauskas commented on LUCENE-1203:
--

In that case, the FAQ and the IndexSearcher Javadoc need updating, as they are 
clearly misleading on this point.

What would be your recommendation for minimizing the number of file descriptors 
used? We experience this problem and it's a real show-stopper for us (see my 
post to the users mailing list).

Also, could you elaborate on why it is harmful to add the setter? I was taught 
to avoid object creation where I can, to reduce the cost of garbage collection 
(regardless of whether the object is lightweight or not). Say, if I add 1000 new 
documents to the index, I potentially need to create 1000 searcher instances. I 
can't think of any reason why that could be good.

Thanks!


> [PATCH] Allow setting IndexReader to IndexSearcher
> --
>
> Key: LUCENE-1203
>     URL: https://issues.apache.org/jira/browse/LUCENE-1203
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.3.1
> Environment: Linux/2.6
>Reporter: Mindaugas Žakšauskas
> Attachments: IndexReaderSetter4IndexSearcher.patch
>
>
> As I've received no counter-arguments for my Lucene Java-User mailing list 
> (see 
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200803.mbox/[EMAIL 
> PROTECTED]), I would like to propose adding a setter to set new instance of 
> IndexReader to IndexSearcher. 
> Why is this needed?
> The FAQ 
> (http://wiki.apache.org/lucene-java/LuceneFAQ#head-48921635adf2c968f7936dc07d51dfb40d638b82)
>  says:
> bq. ??"Make sure you only open one IndexSearcher, and share it among all of 
> the threads that are doing searches -- this is safe, and it will minimize the 
> number of files that are open concurently."??
> So does the JavaDoc 
> (http://lucene.apache.org/java/2_3_1/api/core/org/apache/lucene/search/IndexSearcher.html).
> In my application, I don't want to expose anything about IndexReader; all 
> they need to know is Searcher - see my post to the mailing list how would I 
> do this. However, if the index is updated, reopened reader cannot be set back 
> to IndexSearcher, a new instance of IndexSearcher needs to be created (*which 
> contradicts FAQ and Javadoc*).
> At the moment, the only way to go around this is to create a surrogate 
> subclass of IndexSearcher and set new instance of IndexReader. A simple 
> setter would just do the job.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1203) [PATCH] Allow setting IndexReader to IndexSearcher

2008-03-06 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12575780#action_12575780
 ] 

Mindaugas Žakšauskas commented on LUCENE-1203:
--

bq. ... is correct advice. It means for a particular view of the index.

Exactly! An IndexSearcher is only 100% correct until something new is added to 
the index. The second part of your comment (about creating a new IndexSearcher) 
should definitely be added to the FAQ/Javadoc.

We tend to use Lucene much the way databases are used: items are added very 
frequently and an up-to-date Searcher is needed immediately, while the processes 
that add new data and poll the index run asynchronously.
Such a situation generates plenty of Searchers, which is what I wanted to avoid.


> [PATCH] Allow setting IndexReader to IndexSearcher
> --
>
> Key: LUCENE-1203
> URL: https://issues.apache.org/jira/browse/LUCENE-1203
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.3.1
> Environment: Linux/2.6
>Reporter: Mindaugas Žakšauskas
> Attachments: IndexReaderSetter4IndexSearcher.patch
>
>
> As I've received no counter-arguments for my Lucene Java-User mailing list 
> (see 
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200803.mbox/[EMAIL 
> PROTECTED]), I would like to propose adding a setter to set new instance of 
> IndexReader to IndexSearcher. 
> Why is this needed?
> The FAQ 
> (http://wiki.apache.org/lucene-java/LuceneFAQ#head-48921635adf2c968f7936dc07d51dfb40d638b82)
>  says:
> bq. ??"Make sure you only open one IndexSearcher, and share it among all of 
> the threads that are doing searches -- this is safe, and it will minimize the 
> number of files that are open concurently."??
> So does the JavaDoc 
> (http://lucene.apache.org/java/2_3_1/api/core/org/apache/lucene/search/IndexSearcher.html).
> In my application, I don't want to expose anything about IndexReader; all 
> they need to know is Searcher - see my post to the mailing list how would I 
> do this. However, if the index is updated, reopened reader cannot be set back 
> to IndexSearcher, a new instance of IndexSearcher needs to be created (*which 
> contradicts FAQ and Javadoc*).
> At the moment, the only way to go around this is to create a surrogate 
> subclass of IndexSearcher and set new instance of IndexReader. A simple 
> setter would just do the job.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-954) Toggle score normalization in Hits

2008-03-17 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579367#action_12579367
 ] 

Christian Kohlschütter commented on LUCENE-954:
---

I agree with Hoss. Please file a new issue if you want to see Hits (and 
consequently also Hit/HitIterator) being deprecated. I do not see any reason 
for this, though.

This patch is meant to help Lucene users who currently use the Hits class and 
have problems with the built-in score normalization in particular, not with its 
performance.
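The normalization in question can be sketched in plain Java (a simplified model of the behaviour described in the issue, not the actual Hits code):

```java
class HitsScoreNormalization {
    // When the top score exceeds 1.0, Hits scales every score so the top
    // becomes 1.0; otherwise scores pass through unchanged. The patch makes
    // this step optional and defers it until score() is actually called,
    // saving one multiplication per retrieved hit when scores are unused.
    static float[] normalize(float[] scoresInRankOrder) {
        if (scoresInRankOrder.length == 0 || scoresInRankOrder[0] <= 1.0f) {
            return scoresInRankOrder.clone();
        }
        float norm = 1.0f / scoresInRankOrder[0];
        float[] out = new float[scoresInRankOrder.length];
        for (int i = 0; i < scoresInRankOrder.length; i++) {
            out[i] = scoresInRankOrder[i] * norm;
        }
        return out;
    }
}
```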



> Toggle score normalization in Hits
> --
>
> Key: LUCENE-954
> URL: https://issues.apache.org/jira/browse/LUCENE-954
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.2, 2.3, 2.3.1, 2.4
> Environment: any
>Reporter: Christian Kohlschütter
> Fix For: 2.4
>
> Attachments: hits-scoreNorm.patch, LUCENE-954.patch
>
>
> The current implementation of the "Hits" class sometimes performs score 
> normalization.
> In particular, whenever the top-ranked score is bigger than 1.0, it is 
> normalized to a maximum of 1.0.
> In this case, Hits may return different score results than TopDocs-based 
> methods.
> In my scenario (a federated search system), Hits delivered just plain wrong 
> results.
> I was merging results from several sources, all having homogeneous statistics 
> (similar to MultiSearcher, but over the Internet using HTTP/XML-based 
> protocols).
> Sometimes, some of the sources had a top-score greater than 1, so I ended up 
> with garbled results.
> I suggest to add a switch to enable/disable this score-normalization at 
> runtime.
> My patch (attached) has an additional performance benefit, since score 
> normalization now occurs only when Hits#score() is called, not when creating 
> the Hits result list. Whenever scores are not required, you save one 
> multiplication per retrieved hit (i.e., at least 100 multiplications with the 
> current implementation of Hits).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1257) Port to Java5

2008-04-02 Thread JIRA
Port to Java5
-

 Key: LUCENE-1257
 URL: https://issues.apache.org/jira/browse/LUCENE-1257
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis, Examples, Index, Other, Query/Scoring, 
QueryParser, Search, Store, Term Vectors
Affects Versions: 2.3.1
Reporter: Cédric Champeau
 Attachments: java5.patch

For my needs I've updated Lucene so that it uses Java 5 constructs. I know Java 
5 migration had been planned for 2.1 at some point in the past, but I don't know 
when it is planned now. This patch against the trunk includes:

- the most obvious generics usage (there are tons of usages of sets, ... those 
which are commonly used have been generified)
- PriorityQueue generification
- replacement of indexed for loops with for-each constructs
- removal of unnecessary unboxing

The code is, in my opinion, much more readable with these features (you actually 
*know* what is stored in collections when reading the code, without having to 
look up field definitions every time), and it simplifies many algorithms.
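The readability and for-each points can be sketched with a small before/after pair (illustrative only, not code from the patch):

```java
import java.util.List;

class GenericsBeforeAfter {
    // Pre-Java-5 style: raw types hide the element type and force casts.
    static int totalLengthRaw(List words) {
        int total = 0;
        for (int i = 0; i < words.size(); i++) {
            total += ((String) words.get(i)).length(); // unchecked cast
        }
        return total;
    }

    // Java 5 style: the element type is visible at the call site and the
    // indexed loop becomes a for-each construct.
    static int totalLength(List<String> words) {
        int total = 0;
        for (String w : words) {
            total += w.length();
        }
        return total;
    }
}
```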

Note that this patch also includes an interface for the Query class. This has 
been done for my company's needs, for building custom Query classes which add 
some behaviour to the base Lucene queries. It prevents multiple unnecessary 
casts. I know this introduction is not wanted by the team, but it really makes 
our developments easier to maintain. If you don't want to use this, replace all 
/Queriable/ calls with standard /Query/.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1257) Port to Java5

2008-04-02 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cédric Champeau updated LUCENE-1257:


Attachment: java5.patch

Patch against the trunk

> Port to Java5
> -
>
> Key: LUCENE-1257
> URL: https://issues.apache.org/jira/browse/LUCENE-1257
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis, Examples, Index, Other, Query/Scoring, 
> QueryParser, Search, Store, Term Vectors
>Affects Versions: 2.3.1
>Reporter: Cédric Champeau
> Attachments: java5.patch
>
>
> For my needs I've updated Lucene so that it uses Java 5 constructs. I know 
> Java 5 migration had been planned for 2.1 someday in the past, but don't know 
> when it is planned now. This patch against the trunk includes :
> - most obvious generics usage (there are tons of usages of sets, ... Those 
> which are commonly used have been generified)
> - PriorityQueue generification
> - replacement of indexed for loops with for each constructs
> - removal of unnecessary unboxing
> The code is to my opinion much more readable with those features (you 
> actually *know* what is stored in collections reading the code, without the 
> need to lookup for field definitions everytime) and it simplifies many 
> algorithms.
> Note that this patch also includes an interface for the Query class. This has 
> been done for my company's needs for building custom Query classes which add 
> some behaviour to the base Lucene queries. It prevents multiple unnecessary 
> casts. I know this introduction is not wanted by the team, but it really 
> makes our developments easier to maintain. If you don't want to use this, 
> replace all /Queriable/ calls with standard /Query/.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1269) Analysers and Filters should not be final

2008-04-22 Thread JIRA
Analysers and Filters should not be final
-

 Key: LUCENE-1269
 URL: https://issues.apache.org/jira/browse/LUCENE-1269
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.3.1
Reporter: Cédrik LIME


I am trying to extend some Lucene Analysers to further improve their behaviour. 
However, some Analysers and Filters are final classes that I cannot extend 
(thus resorting to copying the class, which is less than optimal).

Any reason we would want to inhibit people from extending a class like 
FrenchAnalyzer?

Could we make all Analysers and Filters in the contrib-analysis package 
non-final?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1166) A tokenfilter to decompose compound words

2008-05-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594838#action_12594838
 ] 

François Terrier commented on LUCENE-1166:
--

Is there any plan to integrate this patch into the official Lucene libraries in 
the short term?

> A tokenfilter to decompose compound words
> -
>
> Key: LUCENE-1166
> URL: https://issues.apache.org/jira/browse/LUCENE-1166
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Analysis
>Reporter: Thomas Peuss
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: CompoundTokenFilter.patch, CompoundTokenFilter.patch, 
> CompoundTokenFilter.patch, CompoundTokenFilter.patch, 
> CompoundTokenFilter.patch, CompoundTokenFilter.patch, 
> CompoundTokenFilter.patch, CompoundTokenFilter.patch, 
> CompoundTokenFilter.patch, CompoundTokenFilter.patch, de.xml, hyphenation.dtd
>
>
> A tokenfilter to decompose compound words you find in many germanic languages 
> (like German, Swedish, ...) into single tokens.
> An example: Donaudampfschiff would be decomposed to Donau, dampf, schiff so 
> that you can find the word even when you only enter "Schiff".
> I use the hyphenation code from the Apache XML project FOP 
> (http://xmlgraphics.apache.org/fop/) to do the first step of decomposition. 
> Currently I use the FOP jars directly. I only use a handful of classes from 
> the FOP project.
> My question now:
> Would it be OK to copy this classes over to the Lucene project (renaming the 
> packages of course) or should I stick with the dependency to the FOP jars? 
> The FOP code uses the ASF V2 license as well.
> What do you think?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.a

2008-05-12 Thread JIRA
Set of Java classes that allow the Lucene search engine to use morphological 
information developed for the Apertium open-source machine translation platform 
(http://www.apertium.org)
--

 Key: LUCENE-1284
 URL: https://issues.apache.org/jira/browse/LUCENE-1284
 Project: Lucene - Java
  Issue Type: New Feature
 Environment: New feature developed under GNU/Linux, but it should work 
in any other Java-compliant platform
Reporter: Felipe Sánchez Martínez


Set of Java classes that allow the Lucene search engine to use morphological 
information developed for the Apertium open-source machine translation platform 
(http://www.apertium.org). Morphological information is used to index new 
documents and to process smarter queries in which morphological attributes can 
be used to specify query terms.

The tool makes use of morphological analyzers and dictionaries developed for 
the open-source machine translation platform Apertium (http://apertium.org) 
and, optionally, the part-of-speech taggers developed for it. Currently there 
are morphological dictionaries available for Spanish, Catalan, Galician, 
Portuguese, 
Aranese, Romanian, French and English. In addition, new dictionaries are being 
developed for Esperanto, Occitan, Basque, Swedish, Danish, 
Welsh, Polish and Italian, among others; we hope more language pairs will be 
added to the Apertium machine translation platform in the near future.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.a

2008-05-12 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felipe Sánchez Martínez updated LUCENE-1284:


Attachment: apertium-morph.2008-05-12.patch

Patch file containing all the new classes created. The patch will create a new 
folder in contrib. No existing code is modified.

> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org)
> --
>
> Key: LUCENE-1284
> URL: https://issues.apache.org/jira/browse/LUCENE-1284
> Project: Lucene - Java
>  Issue Type: New Feature
> Environment: New feature developed under GNU/Linux, but it should 
> work in any other Java-compliance platform
>Reporter: Felipe Sánchez Martínez
> Attachments: apertium-morph.2008-05-12.patch
>
>
> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org). Morphological information is used to 
> index new documents and to process smarter queries in which morphological 
> attributes can be used to specify query terms.
> The tool makes use of morphological analyzers and dictionaries developed for 
> the open-source machine translation platform Apertium (http://apertium.org) 
> and, optionally, the part-of-speech taggers developed for it. Currently there 
> are morphological dictionaries available for Spanish, Catalan, Galician, 
> Portuguese, 
> Aranese, Romanian, French and English. In addition new dictionaries are being 
> developed for Esperanto, Occitan, Basque, Swedish, Danish, 
> Welsh, Polish and Italian, among others; we hope more language pairs to be 
> added to the Apertium machine translation platform in the near future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.a

2008-05-12 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felipe Sánchez Martínez updated LUCENE-1284:


Attachment: apertium-morph.2008-05-12.tgz

All the files compressed together. Decompress it in the Lucene trunk folder.

> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org)
> --
>
> Key: LUCENE-1284
> URL: https://issues.apache.org/jira/browse/LUCENE-1284
> Project: Lucene - Java
>  Issue Type: New Feature
> Environment: New feature developed under GNU/Linux, but it should 
> work in any other Java-compliance platform
>Reporter: Felipe Sánchez Martínez
> Attachments: apertium-morph.2008-05-12.patch, 
> apertium-morph.2008-05-12.tgz
>
>
> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org). Morphological information is used to 
> index new documents and to process smarter queries in which morphological 
> attributes can be used to specify query terms.
> The tool makes use of morphological analyzers and dictionaries developed for 
> the open-source machine translation platform Apertium (http://apertium.org) 
> and, optionally, the part-of-speech taggers developed for it. Currently there 
> are morphological dictionaries available for Spanish, Catalan, Galician, 
> Portuguese, 
> Aranese, Romanian, French and English. In addition new dictionaries are being 
> developed for Esperanto, Occitan, Basque, Swedish, Danish, 
> Welsh, Polish and Italian, among others; we hope more language pairs to be 
> added to the Apertium machine translation platform in the near future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.a

2008-05-19 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felipe Sánchez Martínez updated LUCENE-1284:


Attachment: (was: apertium-morph.2008-05-12.patch)

> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org)
> --
>
> Key: LUCENE-1284
> URL: https://issues.apache.org/jira/browse/LUCENE-1284
> Project: Lucene - Java
>  Issue Type: New Feature
> Environment: New feature developed under GNU/Linux, but it should 
> work in any other Java-compliance platform
>Reporter: Felipe Sánchez Martínez
>Assignee: Otis Gospodnetic
> Attachments: apertium-morph.2008-05-12.tgz, 
> apertium-morph.2008-05-19.tgz
>
>
> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org). Morphological information is used to 
> index new documents and to process smarter queries in which morphological 
> attributes can be used to specify query terms.
> The tool makes use of morphological analyzers and dictionaries developed for 
> the open-source machine translation platform Apertium (http://apertium.org) 
> and, optionally, the part-of-speech taggers developed for it. Currently there 
> are morphological dictionaries available for Spanish, Catalan, Galician, 
> Portuguese, 
> Aranese, Romanian, French and English. In addition new dictionaries are being 
> developed for Esperanto, Occitan, Basque, Swedish, Danish, 
> Welsh, Polish and Italian, among others; we hope more language pairs to be 
> added to the Apertium machine translation platform in the near future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.a

2008-05-19 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felipe Sánchez Martínez updated LUCENE-1284:


Attachment: apertium-morph.2008-05-19.tgz

Fixed a typo in a package name: src/java/org/apache/lucene/benckmark/ (should be 
benchmark).

build.xml fixed. I have tried it on a clean SVN checkout and it compiles without 
errors, using sun-java-6.

Please disregard the previous attachments.

--
Felipe. 

> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org)
> --
>
> Key: LUCENE-1284
> URL: https://issues.apache.org/jira/browse/LUCENE-1284
> Project: Lucene - Java
>  Issue Type: New Feature
> Environment: New feature developed under GNU/Linux, but it should 
> work on any other Java-compliant platform
>Reporter: Felipe Sánchez Martínez
>Assignee: Otis Gospodnetic
> Attachments: apertium-morph.2008-05-12.tgz, 
> apertium-morph.2008-05-19.tgz
>
>
> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org). Morphological information is used to 
> index new documents and to process smarter queries in which morphological 
> attributes can be used to specify query terms.
> The tool makes use of morphological analyzers and dictionaries developed for 
> the open-source machine translation platform Apertium (http://apertium.org) 
> and, optionally, the part-of-speech taggers developed for it. Currently there 
> are morphological dictionaries available for Spanish, Catalan, Galician, 
> Portuguese, 
> Aranese, Romanian, French and English. In addition, new dictionaries are being 
> developed for Esperanto, Occitan, Basque, Swedish, Danish, 
> Welsh, Polish and Italian, among others; we hope more language pairs will be 
> added to the Apertium machine translation platform in the near future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.a

2008-05-19 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felipe Sánchez Martínez updated LUCENE-1284:


Attachment: (was: apertium-morph.2008-05-12.tgz)

> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org)
> --
>
> Key: LUCENE-1284
> URL: https://issues.apache.org/jira/browse/LUCENE-1284
> Project: Lucene - Java
>  Issue Type: New Feature
> Environment: New feature developed under GNU/Linux, but it should 
> work on any other Java-compliant platform
>Reporter: Felipe Sánchez Martínez
>Assignee: Otis Gospodnetic
> Attachments: apertium-morph.2008-05-19.tgz
>
>
> Set of Java classes that allow the Lucene search engine to use morphological 
> information developed for the Apertium open-source machine translation 
> platform (http://www.apertium.org). Morphological information is used to 
> index new documents and to process smarter queries in which morphological 
> attributes can be used to specify query terms.
> The tool makes use of morphological analyzers and dictionaries developed for 
> the open-source machine translation platform Apertium (http://apertium.org) 
> and, optionally, the part-of-speech taggers developed for it. Currently there 
> are morphological dictionaries available for Spanish, Catalan, Galician, 
> Portuguese, 
> Aranese, Romanian, French and English. In addition, new dictionaries are being 
> developed for Esperanto, Occitan, Basque, Swedish, Danish, 
> Welsh, Polish and Italian, among others; we hope more language pairs will be 
> added to the Apertium machine translation platform in the near future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1290) Deprecate Hits

2008-05-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598252#action_12598252
 ] 

Christian Kohlschütter commented on LUCENE-1290:


-1 from me for the current solution.

Deprecating Hits necessarily means deprecating HitIterator. With 
Hits/HitIterator we have two really simple ways to iterate over a long list of 
search results. The TopDocs/HitCollector-based approach is basically one level 
below Hits, so Hits can clearly be regarded as a convenience class. It is 
not as flexible as HitCollector, but it serves its purpose very well. 

What could make sense is to deprecate the Searcher#search() methods which 
return a Hits instance, to reduce API clutter. Hits could have a public 
constructor that takes a Searcher instance instead.

> Deprecate Hits
> --
>
> Key: LUCENE-1290
> URL: https://issues.apache.org/jira/browse/LUCENE-1290
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Search
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.4
>
> Attachments: lucene-1290.patch, lucene-1290.patch
>
>
> The Hits class has several drawbacks as pointed out in LUCENE-954.
> The other search APIs that use TopDocCollector and TopDocs should be used 
> instead.
> This patch:
> - deprecates org/apache/lucene/search/Hits, Hit, and HitIterator, as well as
>   the Searcher.search( * ) methods which return a Hits Object.
> - removes all references to Hits from the core and uses TopDocs and ScoreDoc
>   instead
> - Changes the demo SearchFiles: adds two modes, 'paging search' and 
> 'streaming search',
>   each of which demonstrates a different way of using the search APIs. The 
> former
>   uses TopDocs and a TopDocCollector, the latter a custom HitCollector 
> implementation.
> - Updates the online tutorial that describes the demo.
> All tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1290) Deprecate Hits

2008-05-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598471#action_12598471
 ] 

Christian Kohlschütter commented on LUCENE-1290:


Michael,

the current implementation of Hits certainly has its deficiencies, but it 
represents a very simple way to retrieve documents from Lucene. As long as 
there is no real replacement, I simply do not see a reason to deprecate it.

A replacement could be an API which allows something like:

for(Iterator it = searcher.iterator(query); it.hasNext(); ) {
  (...)
  if (...) break;
}




> Deprecate Hits
> --
>
> Key: LUCENE-1290
> URL: https://issues.apache.org/jira/browse/LUCENE-1290
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Search
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.4
>
> Attachments: lucene-1290.patch, lucene-1290.patch
>
>
> The Hits class has several drawbacks as pointed out in LUCENE-954.
> The other search APIs that use TopDocCollector and TopDocs should be used 
> instead.
> This patch:
> - deprecates org/apache/lucene/search/Hits, Hit, and HitIterator, as well as
>   the Searcher.search( * ) methods which return a Hits Object.
> - removes all references to Hits from the core and uses TopDocs and ScoreDoc
>   instead
> - Changes the demo SearchFiles: adds two modes, 'paging search' and 
> 'streaming search',
>   each of which demonstrates a different way of using the search APIs. The 
> former
>   uses TopDocs and a TopDocCollector, the latter a custom HitCollector 
> implementation.
> - Updates the online tutorial that describes the demo.
> All tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1290) Deprecate Hits

2008-05-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598565#action_12598565
 ] 

Christian Kohlschütter commented on LUCENE-1290:


Michael:
The HitCollector callback is called in index order (or in some other 
non-deterministic order), whereas the results in Hits are sorted (by relevance 
or any given Sort order). 

Uwe:
Good idea, this would be even better than the plain iterator class.


> Deprecate Hits
> --
>
> Key: LUCENE-1290
> URL: https://issues.apache.org/jira/browse/LUCENE-1290
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Search
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.4
>
> Attachments: lucene-1290.patch, lucene-1290.patch
>
>
> The Hits class has several drawbacks as pointed out in LUCENE-954.
> The other search APIs that use TopDocCollector and TopDocs should be used 
> instead.
> This patch:
> - deprecates org/apache/lucene/search/Hits, Hit, and HitIterator, as well as
>   the Searcher.search( * ) methods which return a Hits Object.
> - removes all references to Hits from the core and uses TopDocs and ScoreDoc
>   instead
> - Changes the demo SearchFiles: adds two modes, 'paging search' and 
> 'streaming search',
>   each of which demonstrates a different way of using the search APIs. The 
> former
>   uses TopDocs and a TopDocCollector, the latter a custom HitCollector 
> implementation.
> - Updates the online tutorial that describes the demo.
> All tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1183) TRStringDistance uses way too much memory (with patch)

2008-05-26 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599799#action_12599799
 ] 

Cédrik LIME commented on LUCENE-1183:
-

All of Bob's FuzzyTermEnum patch is in my patch. I only left out some smallish 
optimizations that didn't bring much but hurt code readability. In other 
words, should you commit my patch, you will have most (99.9%) of LUCENE-691.
I think this is an important patch for Lucene 2.4, as it brings vast 
performance improvements to fuzzy search (no hard numbers, sorry).
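For context, the memory optimization at stake replaces the full (n+1) x (m+1) Levenshtein cost matrix with two reusable rows. The following is a minimal self-contained sketch of that technique, not the actual FuzzyTermEnum or commons-lang code:

```java
// Two-row Levenshtein distance: O(min(n, m)) extra memory instead of
// allocating an (n+1) x (m+1) matrix. Illustrative sketch only.
public final class TwoRowLevenshtein {
    public static int distance(CharSequence s, CharSequence t) {
        int n = s.length(), m = t.length();
        if (n == 0) return m;
        if (m == 0) return n;

        int[] prev = new int[n + 1]; // costs for the previous character of t
        int[] curr = new int[n + 1]; // costs for the current character of t

        // Distance from s[0..i] to the empty string is i deletions.
        for (int i = 0; i <= n; i++) prev[i] = i;

        for (int j = 1; j <= m; j++) {
            curr[0] = j;
            char tj = t.charAt(j - 1);
            for (int i = 1; i <= n; i++) {
                int cost = (s.charAt(i - 1) == tj) ? 0 : 1;
                curr[i] = Math.min(Math.min(curr[i - 1] + 1,  // insertion
                                            prev[i] + 1),     // deletion
                                   prev[i - 1] + cost);       // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse rows
        }
        return prev[n]; // after the final swap, prev holds the last row
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
        System.out.println(distance("lucene", "lucene"));  // prints 0
    }
}
```

The real FuzzyTermEnum can additionally abort a row early once the distance exceeds the allowed edit threshold, which is where much of the fuzzy-search speed-up comes from.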

> TRStringDistance uses way too much memory (with patch)
> --
>
> Key: LUCENE-1183
> URL: https://issues.apache.org/jira/browse/LUCENE-1183
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3
>Reporter: Cédrik LIME
>Assignee: Otis Gospodnetic
>Priority: Minor
> Attachments: FuzzyTermEnum.patch, TRStringDistance.java, 
> TRStringDistance.patch
>
>   Original Estimate: 0.17h
>  Remaining Estimate: 0.17h
>
> The implementation of TRStringDistance is based on version 2.1 of 
> org.apache.commons.lang.StringUtils#getLevenshteinDistance(String, String), 
> which uses an un-optimized implementation of the Levenshtein Distance 
> algorithm (it uses way too much memory). Please see Bug 38911 
> (http://issues.apache.org/bugzilla/show_bug.cgi?id=38911) for more 
> information.
> The commons-lang implementation has been heavily optimized as of version 2.2 
> (3x speed-up). I have ported the new implementation to TRStringDistance.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1203) [PATCH] Allow setting IndexReader to IndexSearcher

2008-07-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613329#action_12613329
 ] 

Mindaugas Žakšauskas commented on LUCENE-1203:
--

I know I'm posting this after a rather insane amount of time, but I just wanted 
to get an opinion about another approach.

After what has been done for LUCENE-743, would it not make sense to add a 
refresh() method to the searcher which would reopen() the reader?
My understanding is that even if some code were still referencing or relying on 
the old reader, it could continue to use it, since the Javadoc says the old one 
should remain unclosed, while new searches etc. would carry on with the updated 
reader. Am I wrong?

> [PATCH] Allow setting IndexReader to IndexSearcher
> --
>
> Key: LUCENE-1203
> URL: https://issues.apache.org/jira/browse/LUCENE-1203
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.3.1
> Environment: Linux/2.6
>Reporter: Mindaugas Žakšauskas
> Attachments: IndexReaderSetter4IndexSearcher.patch
>
>
> As I've received no counter-arguments to my Lucene Java-User mailing list post 
> (see 
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200803.mbox/[EMAIL 
> PROTECTED]), I would like to propose adding a setter to set new instance of 
> IndexReader to IndexSearcher. 
> Why is this needed?
> The FAQ 
> (http://wiki.apache.org/lucene-java/LuceneFAQ#head-48921635adf2c968f7936dc07d51dfb40d638b82)
>  says:
> bq. ??"Make sure you only open one IndexSearcher, and share it among all of 
> the threads that are doing searches -- this is safe, and it will minimize the 
> number of files that are open concurrently."??
> So does the JavaDoc 
> (http://lucene.apache.org/java/2_3_1/api/core/org/apache/lucene/search/IndexSearcher.html).
> In my application, I don't want to expose anything about IndexReader; all 
> they need to know is Searcher - see my post to the mailing list how would I 
> do this. However, if the index is updated, reopened reader cannot be set back 
> to IndexSearcher, a new instance of IndexSearcher needs to be created (*which 
> contradicts FAQ and Javadoc*).
> At the moment, the only way to get around this is to create a surrogate 
> subclass of IndexSearcher and set a new instance of IndexReader. A simple 
> setter would just do the job.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1203) [PATCH] Allow setting IndexReader to IndexSearcher

2008-07-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-1203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613338#action_12613338
 ] 

Mindaugas Žakšauskas commented on LUCENE-1203:
--

Correct me if I'm wrong, but I thought ??reopen()?? returns a new instance of 
the refreshed reader (in case the index was modified), while the current 
instance remains unchanged. If this is true, how would I set the refreshed 
instance of ??IndexReader?? on an existing ??IndexSearcher???

It would be nice if you could confirm this is actually the case (and possibly 
add a small bit of clarification to the IndexReader Javadoc if my assumptions 
were wrong).

Thanks a lot!


> [PATCH] Allow setting IndexReader to IndexSearcher
> --
>
> Key: LUCENE-1203
> URL: https://issues.apache.org/jira/browse/LUCENE-1203
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.3.1
> Environment: Linux/2.6
>Reporter: Mindaugas Žakšauskas
> Attachments: IndexReaderSetter4IndexSearcher.patch
>
>
> As I've received no counter-arguments to my Lucene Java-User mailing list post 
> (see 
> http://mail-archives.apache.org/mod_mbox/lucene-java-user/200803.mbox/[EMAIL 
> PROTECTED]), I would like to propose adding a setter to set new instance of 
> IndexReader to IndexSearcher. 
> Why is this needed?
> The FAQ 
> (http://wiki.apache.org/lucene-java/LuceneFAQ#head-48921635adf2c968f7936dc07d51dfb40d638b82)
>  says:
> bq. ??"Make sure you only open one IndexSearcher, and share it among all of 
> the threads that are doing searches -- this is safe, and it will minimize the 
> number of files that are open concurrently."??
> So does the JavaDoc 
> (http://lucene.apache.org/java/2_3_1/api/core/org/apache/lucene/search/IndexSearcher.html).
> In my application, I don't want to expose anything about IndexReader; all 
> they need to know is Searcher - see my post to the mailing list how would I 
> do this. However, if the index is updated, reopened reader cannot be set back 
> to IndexSearcher, a new instance of IndexSearcher needs to be created (*which 
> contradicts FAQ and Javadoc*).
> At the moment, the only way to get around this is to create a surrogate 
> subclass of IndexSearcher and set a new instance of IndexReader. A simple 
> setter would just do the job.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1344) Make the Lucene jar an OSGi bundle

2008-07-23 Thread JIRA
Make the Lucene jar an OSGi bundle
--

 Key: LUCENE-1344
 URL: https://issues.apache.org/jira/browse/LUCENE-1344
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Build
Reporter: Nicolas Lalevée


In order to use Lucene in an OSGi environment, some additional headers are 
needed in the manifest of the jar. As Lucene has no dependencies, this is pretty 
straightforward and it will be easy to maintain, I think.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1344) Make the Lucene jar an OSGi bundle

2008-07-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Lalevée updated LUCENE-1344:


Attachment: LUCENE-1344-r679133.patch

Here is a patch against trunk.

The patch on common-build.xml allows the core or any contrib to package its 
jar as an OSGi bundle. When building, it just needs a property 
"bundle.manifest.file" pointing to the template MANIFEST.MF to use. 
The patch on build.xml, together with the MANIFEST.MF, makes the Lucene core 
jar an OSGi bundle.

Also, in order not to have to maintain a third version scheme, I have added a 
/release target to compute the versions. So this doc should be updated:
http://wiki.apache.org/lucene-java/ReleaseTodo
{code}ant -Dversion=2.3.0-rc1 -Dspec.version=2.3.0 clean dist dist-src 
generate-maven-artifacts{code}
should be replaced by:
{code}ant /release clean dist dist-src generate-maven-artifacts{code}

Then, about maintenance: the version in the MANIFEST.MF file is only useful for 
people who have the Lucene source in Eclipse and use it as an OSGi bundle. The 
version is actually overridden while building the jar. And every new Java 
package that is part of the Lucene API has to be added to the Export-Package 
header.
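For readers unfamiliar with OSGi, the template MANIFEST.MF discussed here contains headers roughly like the following. This is an illustrative sketch: the version number and the exact Export-Package list are placeholders, not the contents of the attached patch.

{code}
Manifest-Version: 1.0
Bundle-ManifestVersion: 2
Bundle-SymbolicName: org.apache.lucene
Bundle-Name: Apache Lucene
Bundle-Version: 2.4.0
Export-Package: org.apache.lucene.analysis,
 org.apache.lucene.document,
 org.apache.lucene.index,
 org.apache.lucene.search,
 org.apache.lucene.store,
 org.apache.lucene.util
{code}

OSGi containers resolve a bundle's imports only against packages listed in Export-Package, which is why every new public package has to be added to that header.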



> Make the Lucene jar an OSGi bundle
> --
>
> Key: LUCENE-1344
> URL: https://issues.apache.org/jira/browse/LUCENE-1344
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Build
>Reporter: Nicolas Lalevée
> Attachments: LUCENE-1344-r679133.patch
>
>
> In order to use Lucene in an OSGi environment, some additional headers are 
> needed in the manifest of the jar. As Lucene has no dependencies, this is 
> pretty straightforward and it will be easy to maintain, I think.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1351) Add some ligatures (ff, fi, fl, ft, st) to ISOLatin1AccentFilter

2008-08-05 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cédrik LIME updated LUCENE-1351:


Attachment: ISOLatin1AccentFilter.patch

> Add some ligatures (ff, fi, fl, ft, st) to ISOLatin1AccentFilter
> 
>
> Key: LUCENE-1351
> URL: https://issues.apache.org/jira/browse/LUCENE-1351
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 2.3.2
>Reporter: Cédrik LIME
>Priority: Minor
> Attachments: ISOLatin1AccentFilter.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> ISOLatin1AccentFilter removes common diacritics and some ligatures. This patch 
> adds support for additional common ligatures: ff, fi, fl, ft, st.
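The folding in question expands Unicode ligature code points into their constituent ASCII letters. Below is a minimal sketch of that kind of mapping; it is illustrative and not the patch's exact table (note that U+FB05 is formally the long-s t ligature, commonly folded as "ft"):

```java
// Sketch: expand Alphabetic Presentation Forms ligatures (U+FB00..U+FB06)
// into plain ASCII letter sequences, leaving all other characters untouched.
public final class LigatureFolder {
    public static String fold(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            switch (c) {
                case '\uFB00': out.append("ff");  break;
                case '\uFB01': out.append("fi");  break;
                case '\uFB02': out.append("fl");  break;
                case '\uFB03': out.append("ffi"); break;
                case '\uFB04': out.append("ffl"); break;
                case '\uFB05': out.append("ft");  break; // long-s t ligature
                case '\uFB06': out.append("st");  break;
                default:       out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("\uFB01le"));   // prints "file"
        System.out.println(fold("o\uFB00er"));  // prints "offer"
    }
}
```

The real filter works a token at a time inside the analysis chain rather than on whole strings, but the character mapping is the essence of the change.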

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


