Re: Custom filters & document numbers

2005-03-01 Thread tomsdepot-lucene
I'm also interested in knowing what can change the doc numbers.

Does this happen frequently?  Like Stanislav has been asking... what sort of
operations on the index cause the document number to change for any given
document?  If the document numbers change frequently, is there a
straightforward way to modify Lucene to keep the document numbers the same for
the life of the document?  I'd like to have mappings in my sql database that
point to the document numbers that Lucene search returns in its Hits objects.

Thanks,

-Tom-
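
A toy model may help frame the question. The sketch below is not Lucene code: DocNumberDemo and its methods are hypothetical names, and it only assumes that a document number is a position that gets compacted when deleted documents are squeezed out. Under that assumption, a key stored with each document stays valid while the number does not.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for an index (hypothetical, NOT the Lucene API).
// A document's number is simply its current position in the list.
class DocNumberDemo {
    private final List<String> keys = new ArrayList<>();
    private final List<Boolean> deleted = new ArrayList<>();

    // Adds a document carrying an application-level key (e.g. a SQL primary key).
    int add(String key) {
        keys.add(key);
        deleted.add(Boolean.FALSE);
        return keys.size() - 1;
    }

    // Marks a document as deleted; numbers do not move yet.
    void delete(int docNumber) {
        deleted.set(docNumber, Boolean.TRUE);
    }

    // Models a merge/optimize: deleted slots are dropped, so every document
    // stored after a deleted one ends up with a smaller number.
    void optimize() {
        List<String> liveKeys = new ArrayList<>();
        for (int i = 0; i < keys.size(); i++) {
            if (!deleted.get(i)) liveKeys.add(keys.get(i));
        }
        keys.clear();
        keys.addAll(liveKeys);
        deleted.clear();
        for (int i = 0; i < keys.size(); i++) deleted.add(Boolean.FALSE);
    }

    // Recomputes key -> document number for the current state, which is the
    // kind of data that has to be rebuilt after reopening a reader.
    Map<String, Integer> numbersByKey() {
        Map<String, Integer> result = new HashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            if (!deleted.get(i)) result.put(keys.get(i), i);
        }
        return result;
    }
}
```

In this model the database would hold the key ("row-3"), never the position, and the position would be looked up again whenever the index has changed.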

--- Stanislav Jordanov [EMAIL PROTECTED] wrote:

 The first statement is clear to me:
 I know that an IndexReader sees a 'snapshot' of the document set that was
 taken in the moment of the Reader's creation.
 
 What I don't know is whether this 'snapshot' has also its doc numbers fixed
 or they may change asynchronously.
 And another thing I don't know is what are the index operations that may
 cause the (doc - doc number) mapping to change.
 Is it only after delete or there are other occasions, or I'd better not count
 on this at all.
 
 StJ
 
 - Original Message - 
 From: Vanlerberghe, Luc [EMAIL PROTECTED]
 To: Lucene Users List lucene-user@jakarta.apache.org
 Sent: Thursday, February 24, 2005 4:07 PM
 Subject: RE: Custom filters & document numbers
 
 
  An IndexReader will always see the same set of documents.
  Even if another process deletes some documents, adds new ones or
  optimizes the complete index, your IndexReader instance will not see
  those changes.
 
  If you detect that the Lucene index changed (e.g. by calling
  IndexReader.getCurrentVersion(...) once in a while), you should close
  and reopen your 'current' IndexReader and recalculate any data that
  relies on the Lucene document numbers.
 
  Regards, Luc.
 
  -Original Message-
  From: Stanislav Jordanov [mailto:[EMAIL PROTECTED]
  Sent: donderdag 24 februari 2005 14:18
  To: Lucene Users List
  Subject: Custom filters & document numbers
 
  Given an IndexReader, a custom filter is supposed to create a bit set
  that maps each document number to {'visible', 'invisible'}. On the other
  hand, it is stated that Lucene is allowed to change document numbers.
  Is it guaranteed that this BitSet's view of document numbers won't
  change while the BitSet is still in use (or perhaps the corresponding
  IndexReader is still opened) ?
 
  And another (more low-level) question.
  When may Lucene change document numbers?
  Is it only when the index is optimized after there has been a delete
  operation?
 
  Regards: StJ
 
 
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
 
 
 
 
 





Re: search question

2004-12-23 Thread roy-lucene-user
Erik,

They both use the StandardAnalyzer... however looking at the toString() makes
everything clearer.  In 1.2, when a string contains an email address such as
[EMAIL PROTECTED], it gets split like so: first.last domain.com

In 1.4, however, it does not get split.

So now we just check to see if an index was built using 1.2 or 1.4 and have
some checks thrown in.

Thanks for the guidance.

Roy.

On Wed, 22 Dec 2004 18:41:44 -0500, Erik Hatcher wrote
 What does toString() return for each of those queries?  Are you 
 using the same analyzer in both cases?
 
   Erik





search question

2004-12-22 Thread roy-lucene-user
Hi guys,

We have an index with some fields containing email addresses.  Doing a search 
for an email address with this format: [EMAIL PROTECTED], does not bring up any 
results with lucene 1.4.

The query: Field1:[EMAIL PROTECTED]

However it returns results with 1.2.  Any ideas?

Roy.




lock file paths

2004-11-15 Thread roy-lucene-user
Hey guys,

Quick question... is there a way to get the file paths to the lock files?  Or 
do I have to modify the src?  Currently I can't find any methods that will 
return a lock's file path.

Roy.




Re: Lucene : avoiding locking

2004-11-11 Thread yahootintin-lucene
I'm working on a similar project...
Make sure that only one call to the index method is occurring at
a time.  Synchronizing that method should do it.
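
A minimal sketch of that suggestion, assuming all indexing goes through one static method in a single JVM. SerializedIndexer and its counters are hypothetical names, and the sleep stands in for the real indexing work.

```java
import java.util.concurrent.atomic.AtomicInteger;

class SerializedIndexer {
    // Bookkeeping only: tracks how many threads are inside index() at once.
    private static final AtomicInteger active = new AtomicInteger();
    private static final AtomicInteger maxActive = new AtomicInteger();
    private static int runs = 0; // guarded by the class lock

    // 'static synchronized' locks on SerializedIndexer.class, so a second
    // caller blocks until the first one has finished.
    static synchronized void index() {
        int now = active.incrementAndGet();
        maxActive.accumulateAndGet(now, Math::max);
        try {
            Thread.sleep(5); // placeholder for opening the writer, indexing, closing
            runs++;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            active.decrementAndGet();
        }
    }

    static int maxConcurrent() { return maxActive.get(); }

    static synchronized int runs() { return runs; }
}
```

This only serializes callers inside one JVM. If another process or web application writes to the same index directory, a lock timeout can still occur.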

--- Luke Shannon [EMAIL PROTECTED] wrote:

 Hi All;
 
 I have hit a snag in my Lucene integration and don't know what to do.
 
  My company has a content management product. Each time someone changes the
  directory structure or a file within it, that portion of the site needs to
  be re-indexed so the changes are reflected in future searches (indexing
  must happen during run time).
 
  I have written an Indexer class with a static Index() method. The idea is
  to call the method every time something changes and the index needs to be
  re-examined. I am hoping the logic put in by Doug Cutting surrounding the
  UID will make indexing efficient enough to be called so frequently.
 
  This class worked great when I tested it on my own little site (I have about
  2000 files). But when I drop the functionality into the QA environment I get
  a locking error.
 
  I can't access the stack trace; all I can get at is a log file the
  application writes to. Here is the section my class wrote. It was right in
  the middle of indexing and bang, lock issue.
 
  I don't know if the problem is in my code or something in the existing
  application.
 
  Error Message:
  ENTER|SearchEventProcessor.visit(ContentNodeDeleteEvent)
  |INFO|INDEXING INFO: Start Indexing new content.
  |INFO|INDEXING INFO: Index Folder Did Not Exist. Start Creation Of New Index
  |INFO|INDEXING INFO: Beginnging Incremental update comparisions
  |INFO|INDEXING INFO: Beginnging Incremental update comparisions
  |INFO|INDEXING INFO: Beginnging Incremental update comparisions
  |INFO|INDEXING INFO: Beginnging Incremental update comparisions
  |INFO|INDEXING INFO: Beginnging Incremental update comparisions
  |INFO|INDEXING INFO: Beginnging Incremental update comparisions
  |INFO|INDEXING INFO: Beginnging Incremental update comparisions
  |INFO|INDEXING INFO: Beginnging Incremental update comparisions
  |INFO|INDEXING INFO: Beginnging Incremental update comparisions
  |INFO|INDEXING INFO: Beginnging Incremental update comparisions
  |INFO|INDEXING INFO: Beginnging Incremental update comparisions
  |INFO|INDEXING INFO: Beginnging Incremental update comparisions
  |INFO|INDEXING INFO: Beginnging Incremental update comparisions
  |INFO|INDEXING INFO: Beginnging Incremental update comparisions
  |INFO|INDEXING INFO: Beginnging Incremental update comparisions
  |INFO|INDEXING INFO: Beginnging Incremental update comparisions
  |INFO|INDEXING INFO: Beginnging Incremental update comparisions
  |INFO|INDEXING ERROR: Unable to index new content Lock obtain timed out: Lock@/usr/tomcat/jakarta-tomcat-5.0.19/temp/lucene-398fbd170a5457d05e2f4d43210f7fe8-write.lock
  |ENTER|UpdateCacheEventProcessor.visit(ContentNodeDeleteEvent)
 
  Here is my code. You will recognize it pretty much as the IndexHTML class
  from the Lucene demo written by Doug Cutting. I have put a ton of comments
  in an attempt to understand what is going on.
 
  Any help would be appreciated.
 
  Luke
 
  package com.fbhm.bolt.search;
 
  /*
   * Created on Nov 11, 2004
   *
   * This class will create a single index file for the Content
   * Management System (CMS). It contains logic to ensure
   * indexing is done intelligently. Based on IndexHTML.java
   * from the demo folder that ships with Lucene
   */
 
  import java.io.File;
  import java.io.IOException;
  import java.util.Arrays;
  import java.util.Date;
 
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermEnum;
  import org.pdfbox.searchengine.lucene.LucenePDFDocument;
  import org.apache.lucene.demo.HTMLDocument;
 
  import com.alaia.common.debug.Trace;
  import com.alaia.common.util.AppProperties;
 
  /**
   * @author lshannon Description: <br>
   *   This class is used to index a content folder. It contains logic to
   *   ensure only new documents or documents that have been modified since
   *   the last search are indexed. <br>
   *   Based on code written by Doug Cutting in the IndexHTML class found in
   *   the Lucene demo
   */
  public class Indexer {
      // true during deletion pass, this is when the index already exists
      private static boolean deleting = false;
 
      // object to read existing indexes
      private static IndexReader reader;
 
      // object to write to the index folder
      private static IndexWriter writer;
 
      // this will be used to write the index file
      private static TermEnum uidIter;
 
      /*
       * This static method does all the work, the end result is an
       * up-to-date index folder
       */
      public static void Index() {
          // we will assume to start the index has been created
          boolean create = true;
          //set

Re: lucene file locking question

2004-11-11 Thread yahootintin-lucene
Disabling locking is only recommended for read-only indexes that
aren't being modified.  I think there is a comment in the code
about a good example of this being an index you read off of a
CD-ROM.
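
For reference, the external-lock pattern described in the quoted message below can be sketched with java.nio as follows. GuardedUpdate and withFileLock are hypothetical names, and the Runnable stands in for the indexing work. The sketch shows only the application-level lock; it makes no claim about what Lucene's own lock files protect.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class GuardedUpdate {
    // Runs 'work' while holding an exclusive lock on lockFile.
    // try-with-resources releases the lock first, then closes the channel,
    // even if 'work' throws.
    static void withFileLock(Path lockFile, Runnable work) {
        try (FileChannel channel = FileChannel.open(lockFile,
                 StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = channel.lock()) {
            work.run(); // e.g. index documents, then close the writer
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A FileLock is held on behalf of the whole JVM and is released when the process ends, so it leaves no stale lock to clean up. It is not a tool for coordinating threads inside one JVM: a second thread locking the same file gets an OverlappingFileLockException rather than blocking.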

--- John Wang [EMAIL PROTECTED] wrote:

 Hi folks:
 
 My application builds a super-index around the lucene index,
 e.g. stores some additional information outside of lucene.
 
 I am using my own locking outside of the lucene index via the
 FileLock object in the jdk1.4 nio package.
 
 My code does the following:
 
 FileLock lock = null;
 try {
     lock = myLockFileChannel.lock();
 
     // indexing into lucene
 
     // indexing additional information
 }
 finally {
     try {
         // commit lucene index by closing the IndexWriter instance
     }
     finally {
         if (lock != null) {
             lock.release();
         }
     }
 }
 
 Now here is the weird thing: say I terminate the process in the middle
 of indexing and run the program again. I get a Lock obtain timed out
 exception, but as long as I delete the stale lock file, the
 index remains uncorrupted.
 
 However, if I turn lucene file locking off, since I have a lock outside it
 anyway (by doing:
 static {
     System.setProperty("disableLuceneLocks", "true");
 }
 )
 and do the same thing, I instead get an unrecoverable corrupted index.
 
 Does the lucene lock really guarantee index integrity under this kind of
 abuse, or am I just getting lucky?
 If so, can someone shine some light on how?
 
 Thanks in advance
 
 -John
 

 
 





Re: Locking issue

2004-11-10 Thread yahootintin-lucene
Whoops!  Looks like my attachment didn't make it through.  I'm
re-attaching my simple test app.

Thanks.

--- Erik Hatcher [EMAIL PROTECTED] wrote:

 On Nov 10, 2004, at 5:48 PM, [EMAIL PROTECTED] wrote:
  Hi,
 
  With the information provided, I have no idea what the issue may be.
 
  Is there some information that I should post that will help determine
  why Lucene is giving me this error?
 
 You mentioned posting code - though I don't recall getting an
 attachment.  If you could post it as a Bugzilla issue with your code
 attached, it would be preserved outside of our mailboxes.  If the code
 is self-contained enough for me to try it, I will at some point in the
 near future.
 
   Erik
 
 

 
 

Re: Locking issue

2004-11-10 Thread yahootintin-lucene
Yes, I tried that too and it worked.  The issue is that our
Operations folks plan to install this on a pretty busy box and I
was hoping that Lucene wouldn't cause issues if it only had a
small slice of the CPU.

Guess I'll tell them to buy a bigger box!  Unless you have any
other ideas.  I'm running some tests with a larger timeout to
see if that helps.

--- Erik Hatcher [EMAIL PROTECTED] wrote:

 I just added a Thread.sleep(1000) in the writer thread and it has run
 for quite some time, and is still running as I send this.
 
   Erik
 
 On Nov 10, 2004, at 8:02 PM, [EMAIL PROTECTED] wrote:
 
  I added it to Bugzilla like you suggested:
  http://issues.apache.org/bugzilla/show_bug.cgi?id=32171
 
  Let me know if you see any way to get around this issue.
 



Re: Search scalability

2004-11-10 Thread yahootintin-lucene
Does it take 800MB of RAM to load that index into a
RAMDirectory?  Or are only some of the files loaded into RAM?

--- Otis Gospodnetic [EMAIL PROTECTED] wrote:

 Hello,
 
 100 parallel searches going against a single index on a single disk
 means a lot of disk seeks all happening at once.  One simple way of
 working around this is to load your FSDirectory into RAMDirectory.
 This should be faster (could you report your
 observations/comparisons?).  You can also try using ramfs if you are
 using Linux.
 
 Otis
 
 --- Ravi [EMAIL PROTECTED] wrote:
 
  We have one large index for a document repository of 800,000 documents.
  The size of the index is 800MB. When we do searches against the index,
  it takes 300-500ms for a single search. We wanted to test the
  scalability and tried 100 parallel searches against the index with the
  same query and the average response time was 13 seconds. We used a
  simple IndexSearcher. The same searcher object was shared by all the
  searches. I'm sure people have had success in configuring lucene for
  better scalability. Can somebody share their approach?
  
  Thanks 
  Ravi. 
  
 

  
  
 
 

 
 





Re: Lucene1.4.1 + OutOf Memory

2004-11-09 Thread yahootintin-lucene
There is a memory leak in the sorting code of Lucene 1.4.1. 
1.4.2 has the fix!

--- Karthik N S [EMAIL PROTECTED] wrote:

 
 Hi Guys,
 
 Apologies..
 
 History
 
 1st type:  4 subindexes + MultiSearcher + search on Content field
 only, for 2000 hits
 
   = Exception  [ Too many Files Open ]
 
 2nd type:  40 merged indexes [1000 subindexes each] +
 MultiSearcher/ParallelSearcher + search on Content field only, for 2
 hits
 
   = Exception  [ OutOfMemory ]
 
 System config  [same for both types]
 
 AMD processor [high end, single]
 RAM  1GB
 O/S Linux  ( jantoo type )
 Appserver Tomcat 5.05
 JDK [ IBM  Blackdown-1.4.1-01  ( == JDK 1.4.1) ]
 
 Index contains 15 fields
 Search done only on 1 field
 Retrieve 11 corresponding fields
 3 fields are for debug details
 
 Switched from 1st type to 2nd type
 
 Can somebody suggest why this is happening?
 
 Thx in advance
 
 
 
 
   WITH WARM REGARDS
   HAVE A NICE DAY
   [ N.S.KARTHIK]
 
 
 
 

 
 





Re: demo HTML parser question

2004-09-23 Thread roy-lucene-user
Hi Fred,

We were originally attempting to use the demo html parser (Lucene 1.2), but as
you know, it's for a demo.  I think it's threaded to optimize on time, to allow
the calling thread to grab the title or top message even though it's not done
parsing the entire html document.  That's just a guess, I would love to hear
from others about this.  Anyway, since it is a separate thread, a token error
could kill it and there is no way for the calling thread to know about it.

We had to create our own html parser since we only cared about grabbing the
entire text from the html document and also we wanted to avoid the extra
thread.  We also do a lot of SKIPping for minimal EOF errors (html documents
in email almost never follow standards).  For your html needs, you might want
to check out other JavaCC HTML parsers from the JavaCC web site.

Roy.

On Wed, 22 Sep 2004 22:42:55 -0400, Fred Toth wrote
 Hi,
 
 I've been working with the HTML parser demo that comes with
 Lucene and I'm trying to understand why it's multi-threaded,
 and, more importantly, how to exit gracefully on errors.
 
 I've discovered if I throw an exception in the front-end static
 code (main(), etc.), the JVM hangs instead of exiting. Presumably
 this is because there are threads hanging around doing something.
 But I'm not sure what!
 
 Any pointers? I just want to exit gracefully on an error such as
 a required meta tag is missing or similar.
 
 Thanks,
 
 Fred
 








compiling 1.4 source

2004-09-23 Thread roy-lucene-user
Hi guys,

So we started upgrading to 1.4 and we need to add some of our own custom code.
 After compiling with ant, I noticed that the 1.4 ant script builds a jar
called lucene-1.5-rc1-dev.jar, not lucene-1.4-final.jar.  I'm pretty sure I
did not download the wrong source.  Is this just a wrong name in the
properties or does the source code actually contain lucene 1.5 rc1 code?

Roy.




Hits.doc(x) and range queries

2004-09-14 Thread roy-lucene-user
Hi guys!

I've posted previously that Hits.doc(x) was taking a long time.  Turns out it
has to do with a date range in our query.  We usually do date ranges like this:
Date:[(lucene date field) - (lucene date field)]

Sometimes the begin date is 0 which is what we get from
DateField.dateToString( new Date( 0 ) ).

This is when getting our search results from the Hits object takes an absurd
amount of time.  It's usually each time the Hits object attempts to get more
results from an IndexSearcher ( aka, every 100? ).

It also takes up more memory...

I was wondering why it affects the search so much even though we're only
returning 350 or so results.  Does the QueryParser do something similar to the
DateFilter on range queries?  Would it be better to use a DateFilter?

We're using Lucene 1.2 (with plans to upgrade).  Do newer versions of Lucene
have this problem?

Roy.




Re: Custom filter

2004-08-24 Thread roy-lucene-user
On Fri, 20 Aug 2004 20:01:36 -0400, Erik Hatcher wrote
 
 On Aug 20, 2004, at 6:48 PM, [EMAIL PROTECTED] wrote:
  We're currently in lucene 1.2... haven't moved to 1.3 yet.
 
 Skip 1.3 and go straight to 1.4.1 :)
 
 Upgrade - why not?

Well we have some MASSIVE indexes so updating needs to be planned out.  In the
meantime we continue with 1.2.  So, just for curiosity's sake... any clue on
the filter?  Or perhaps someone could clue me in on what kind of terms the
query parser creates ( and what the searcher class does with them ) when it
has something like (From:(blah OR blah2) OR To:(blah OR blah2)).  Tried to
look at the QueryParser.jj file but javacc makes my head hurt...

Roy.




Custom filter

2004-08-20 Thread roy-lucene-user
Hi guys!

I was hoping someone here could help me out with a custom filter.

We have an index of emails and do some searches on the text of an email message and 
also searches based on the email addresses in a To, From or CC.

Since we also do searches on a bunch of emails, we created a custom filter for 
searches on an array of fields for an array of values.  [code included below]

The problem we're having is that creating a query string like so:
Message:viagra AND (From:(email1 OR email2) OR To:(email1 OR email2) OR CC:(email1 OR 
email2))
would return results, but our filter combined with a query string of Message:viagra 
sometimes wouldn't.

One thing I noticed is that when the results do return with the filter, the email has 
the format of [EMAIL PROTECTED], but the one that doesn't has something like
[EMAIL PROTECTED]

Also it might have something to do with the storage of the From or To or CC.  We don't 
parse out the email addresses before storing them.  So sometimes the value of a 
From/To/CC field might be [EMAIL PROTECTED] or local [EMAIL PROTECTED] or even 
[EMAIL PROTECTED].  Could the carets be throwing off my filter?

I also wouldn't mind any suggestions to doing this filter better.

Here is the bits method from our custom filter:
-
final public BitSet bits( IndexReader reader ) throws IOException {
    BitSet bits = new BitSet( reader.maxDoc() );

    for ( int x = 0; x < fields.length; x++ ) {
        for ( int y = 0; y < values.length; y++ ) {
            TermDocs termDocs = reader.termDocs( new Term( fields[x], values[y] ) );
            try {
                while ( termDocs.next() ) {
                    bits.set( termDocs.doc() );
                }
            }
            finally {
                termDocs.close();
            }
        }
    }
    return bits;
}
-
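
One thing that may be worth checking: a Term lookup is an exact match against whatever terms the analyzer produced at index time, and the filter's values are not analyzed the way a query string is. The sketch below mimics the filter loop over a plain map instead of an IndexReader. TermFilterSketch, the postings map and the sample addresses are all hypothetical stand-ins.

```java
import java.util.BitSet;
import java.util.Map;

class TermFilterSketch {
    // postings maps "field:term" to the document numbers containing that term.
    // It is a hypothetical stand-in for reader.termDocs(new Term(field, value)).
    static BitSet bits(Map<String, int[]> postings, int maxDoc,
                       String[] fields, String[] values) {
        BitSet bits = new BitSet(maxDoc);
        for (String field : fields) {
            for (String value : values) {
                int[] docs = postings.get(field + ":" + value); // exact match only
                if (docs == null) continue; // no such term: nothing gets set
                for (int doc : docs) bits.set(doc);
            }
        }
        return bits;
    }
}
```

If the indexed term is "a@example.com" but the filter is handed "<a@example.com>" (or the reverse), the lookup finds nothing, even though a parsed query on the same text could still match after analysis.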

Thanks in advance,

Roy.




Proximity searching and phrase

2004-07-30 Thread Lucene
Hi,

I was wondering if there is a way to do proximity searches with phrases,
e.g. "very good" NEAR sometimes.

Any help on this would be welcome.

Many thanks,


Roy



Re: addIndexes vs addDocument

2004-07-07 Thread roy-lucene-user
Otis,

Okay, got it... however we weren't creating new document objects... just
grabbing a document through an IndexReader and calling addDocument on another
index.  Would that still work with unstored fields (well, it's working for us
since we don't have any unstored fields)?

Thanks a lot!

Roy.

On Tue, 6 Jul 2004 19:46:30 -0700 (PDT), Otis Gospodnetic wrote
 Quick example.
 Index A has fields 'title' and 'contents'.
 Field 'contents' is stored in A as Field.UnStored.
 This means that you cannot retrieve the original content of the
 'contents' field, since that value was not stored verbatim in the
 index.
 Therefore, you cannot create a new Document instance, pull out String
 value of the 'contents' field from A, use it to create another field,
 add it to the new Document instance, and add that Document to a new
 index B using addDocument method.
 
 addIndexes method does not need to pull out the original String field
 values from Documents, so it will work.
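
The point about unstored fields can be illustrated with a toy index. This is not Lucene code: StoredFieldDemo and its methods are hypothetical, and tokenizing is reduced to lower-casing and splitting on whitespace. Every field is searchable, but only the stored ones can be read back, so a copy made by reading documents back loses the rest.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class StoredFieldDemo {
    // "field:token" -> documents containing it (what searches run against)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // per document: only the field values that were kept verbatim
    private final List<Map<String, String>> storedFields = new ArrayList<>();

    // Indexes every field; keeps the original text only for names in 'stored'.
    int addDocument(Map<String, String> fields, Set<String> stored) {
        int doc = storedFields.size();
        Map<String, String> kept = new HashMap<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            for (String token : field.getValue().toLowerCase().split("\\s+")) {
                postings.computeIfAbsent(field.getKey() + ":" + token,
                                         k -> new HashSet<>()).add(doc);
            }
            if (stored.contains(field.getKey())) {
                kept.put(field.getKey(), field.getValue());
            }
        }
        storedFields.add(kept);
        return doc;
    }

    // What a reader can hand back: stored fields only.
    Map<String, String> document(int doc) {
        return storedFields.get(doc);
    }

    boolean matches(String field, String token) {
        return postings.containsKey(field + ":" + token);
    }
}
```

Copying a document from one such index to another through document() carries over 'title' but silently drops 'contents', which is the situation the quoted example describes. With no unstored fields, nothing is lost.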





addIndexes and optimize

2004-07-07 Thread roy-lucene-user
Hey y'all again,

Just wondering why the IndexWriter.addIndexes method calls optimize before and after 
it starts merging segments together.

We would like to create an addIndexes method that doesn't optimize and call optimize 
on the IndexWriter later.

Roy.




moving 1.2 index to 1.4

2004-07-02 Thread roy-lucene-user
Hey guys,

We have a couple of giant indexes that were done in lucene 1.2.  We would like to move 
to lucene 1.4 at some point.

We have heard that we would probably need to re-index our indexes to take advantage of 
certain new features/optimizations of lucene 1.3/1.4.

We were wondering if it was possible to open our old 1.2 index with an IndexReader, 
get each Document object, and add it to a new 1.4 index?  Would it be the same as 
re-building an index from scratch?

Thanks!

Roy.




stop words in index

2004-06-19 Thread lucene
Hi!

How come stop words show up in the index (HighFreqTerms)? Yes, I do use the 
same analyzer for indexing and searching.

class SearchFacade
{
    private final static String[] GERMAN_STOP_WORDS = new String[] { "foo", "bar" };
    private final static Analyzer GERMAN_ANALYZER =
        new SnowballAnalyzer( "German2", GERMAN_STOP_WORDS );

    public void index()
    {
        writer = new IndexWriter( Configuration.Lucene.INDEX, GERMAN_ANALYZER, true );
        ...
    }

    public void search(String query)
    {
        final Query q = MultiFieldQueryParser.parse( query,
            new String[] { "blah", "foo", "bar" }, GERMAN_ANALYZER );
        ...
    }
}




Author of SearchBean

2004-06-04 Thread lucene
Hi!

Where can I get the mail address of the author of SearchBean (sandbox) from?

Timo




Re: pagable results

2004-05-15 Thread lucene
On Tuesday 11 May 2004 15:58, Ryan Sonnek wrote:
 When performing a search with lucene, is it possible to only return a
 subset of the results?  I need to be able to page through results, and it

Yes, http://www.nitwit.de/vlh2/ :-)
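
Independent of the library linked above, the paging arithmetic itself is small. The helper below is a generic sketch over any java.util.List. Pager and page are hypothetical names, and the list stands in for whatever result container is used.

```java
import java.util.ArrayList;
import java.util.List;

class Pager {
    // Returns the items of the given 0-based page, at most pageSize of them.
    // A page past the end yields an empty list instead of throwing.
    static <T> List<T> page(List<T> results, int page, int pageSize) {
        if (page < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("page must be >= 0 and pageSize > 0");
        }
        long start = (long) page * pageSize; // long avoids int overflow
        if (start >= results.size()) {
            return new ArrayList<>();
        }
        int from = (int) start;
        int to = (int) Math.min(start + pageSize, (long) results.size());
        return new ArrayList<>(results.subList(from, to)); // copy, not a view
    }
}
```

With a lazily loaded result object, the same from/to bounds would be used to fetch only the documents of the requested page.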




Re: ValueListHandler pattern with Lucene

2004-04-26 Thread lucene
On Monday 12 April 2004 20:54, [EMAIL PROTECTED] wrote:
 On Sunday 11 April 2004 17:46, Erik Hatcher wrote:
  In other words, you need to invent your own pattern here?!  :)

 I just experimented a bit and came up with the ValueListSupplier which
 replaces the ValueList in the VLH. Seems to work so far... :-) Comments are
 greatly appreciated!

FYI http://www.nitwit.de/vlh2/




Searcher not aware of index changes

2004-04-21 Thread lucene
Hi!

My Searcher instance is not aware of changes to the index. I even create a 
new instance but it seems only a complete restart does help(?):

indexSearcher = new IndexSearcher(IndexReader.open(index));

Timo




Re: Searcher not aware of index changes

2004-04-21 Thread lucene
On Wednesday 21 April 2004 19:20, Stephane James Vaucher wrote:
 This is not normal behaviour. Normally using a new IndexSearcher should
 reflect the modified state of your index. Could you post a more
 informative bit of code?

BTW Why can't Lucene care for it itself?


Well, according to my logging it does create a new instance. I use only one 
instance of SearchFacade:

public class SearchFacade extends Observable
{
    protected class IndexObserver implements Observer
    {
        private final Log log = LogFactory.getLog(getClass());

        public Searcher indexSearcher;

        public IndexObserver()
        {
            newSearcher();  // init
        }

        public void update(Observable o, Object arg)
        {
            log.debug("Index has changed, creating new Searcher");
            newSearcher();
        }

        private void newSearcher()
        {
            try
            {
                indexSearcher = new IndexSearcher(IndexReader.open(Configuration.LuceneIndex.MAIN));
            }
            catch (IOException e)
            {
                log.error("Could not instantiate searcher: " + e);
            }
        }

        public Searcher getIndexSearcher()
        {
            return indexSearcher;
        }
    }

    private IndexObserver indexObserver;

    public SearchFacade()
    {
        addObserver(indexObserver = new IndexObserver());
    }

    public void createIndex()
    {
        ...
        setChanged();   // index has changed
        notifyObservers();
    }

    public Hits search(String query)
    {
        Searcher searcher = indexObserver.getIndexSearcher();
    }

}
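
One thing that might be worth ruling out is visibility between threads: the searcher field above is a plain field, so a thread handling searches is not guaranteed to see the instance another thread just assigned. A minimal sketch of publishing the new instance safely, assuming a hypothetical generic holder named Reloadable (not part of Lucene):

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

class Reloadable<T> {
    private final Supplier<T> factory;
    private final AtomicReference<T> current = new AtomicReference<>();

    // 'factory' builds a fresh instance, e.g. opens a new searcher.
    Reloadable(Supplier<T> factory) {
        this.factory = factory;
        this.current.set(factory.get());
    }

    // Always returns the most recently published instance.
    T get() {
        return current.get();
    }

    // Builds a new instance, publishes it atomically, and hands back the
    // old one so the caller can close it once no search is using it.
    T reload() {
        return current.getAndSet(factory.get());
    }
}
```

Whether this is the cause here is only a guess. It would not explain stale results if the new reader is opened before the writer has been closed.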




Re: ValueListHandler pattern with Lucene

2004-04-12 Thread lucene
On Sunday 11 April 2004 17:46, Erik Hatcher wrote:
 In other words, you need to invent your own pattern here?!  :)

I just experimented a bit and came up with the ValueListSupplier which 
replaces the ValueList in the VLH. Seems to work so far... :-) Comments are 
greatly appreciated!

Timo

public class ValueListSupplier implements IValueListIterator
{
    private final Log log = LogFactory.getLog(this.getClass());

    // TODO junit test case
    private Hits hits;
    protected BitSet fetched;
    protected List list;
    protected int index;

    public ValueListSupplier(Hits hits)
    {
        int size = hits.length();
        this.list = new ArrayList(size);
        // stupid idiots at SUN
        for (int i = 0; i < size; i++) list.add(null);
        this.fetched = new BitSet();
        this.hits = hits;
        this.index = 0;
    }

    public List getList()
    {
        return list;
    }

    public int size()
    {
        return list.size();
    }

    public boolean hasPrevious()
    {
        return index > 0;
    }

    public boolean hasNext()
    {
        return index < size();
    }

    /**
     * @param index
     */
    public synchronized void move(int index)
    {
        this.index = index;
    }

    public void reset()
    {
        move(0);
    }

    public Object current()
    {
        validate(index, index + 1);
        return list.get(index);
    }

    public List previous(int count)
    {
        int from = Math.max(0, index - count);
        int to = index;

        validate(from, to);
        move(from);
        return list.subList(from, to);
    }

    public List next(int count)
    {
        int from = index;
        // 'to' is exclusive, so it may reach size() to include the last element
        int to = Math.min(size(), index + count);

        validate(from, to);
        move(to);
        return list.subList(from, to);
    }

    /**
     * @param from
     * starting index (inclusive)
     * @param to
     * ending index (exclusive)
     */
    private void validate(int from, int to)
    {
        while ((from = fetched.nextClearBit(from)) < to)
        {
            log.debug("fetching #" + from);

            try
            {
                list.set(from, SearchResultAdapter.wrap(hits.doc(from)));
                fetched.set(from);
            }
            catch (IOException e)
            {
                // TODO potentially bug
                e.printStackTrace();
            }
        }
    }

}




Re: ValueListHandler pattern with Lucene

2004-04-11 Thread lucene
On Saturday 10 April 2004 20:40, Erik Hatcher wrote:
 That's the beauty: it is up to you to load the doc iff you want it.

As I want all of them I don't see why this should be faster at all...




Re: ValueListHandler pattern with Lucene

2004-04-11 Thread lucene
On Sunday 11 April 2004 13:40, Erik Hatcher wrote:
 using a HitCollector you are bypassing those mechanisms.  Whether it is
 measurably faster would depend on several other factors.

Well, it is hardly faster, so this is no real solution :-\




Re: ValueListHandler pattern with Lucene

2004-04-11 Thread lucene
On Saturday 10 April 2004 20:40, Erik Hatcher wrote:
 Thats the beauty it is up to you to load the doc iff you want it.

Well, there's another problem with HitCollector: the list I build is not 
sorted by score :-(
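
A HitCollector hands over (doc, score) pairs in whatever order the scorer visits them, so a list built that way needs an explicit sort. A minimal sketch in plain Java follows; ScoredDoc and ScoreSort are made-up names for illustration and no Lucene classes are involved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical holder for what collect(int doc, float score) receives.
final class ScoredDoc {
    final int doc;      // document number
    final float score;  // raw score for that document

    ScoredDoc(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

final class ScoreSort {
    /** Returns a copy of the collected pairs, best score first; ties are ordered by doc number. */
    static List<ScoredDoc> byScoreDescending(List<ScoredDoc> collected) {
        List<ScoredDoc> sorted = new ArrayList<ScoredDoc>(collected);
        Collections.sort(sorted, new Comparator<ScoredDoc>() {
            public int compare(ScoredDoc a, ScoredDoc b) {
                int byScore = Float.compare(b.score, a.score);
                return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
            }
        });
        return sorted;
    }
}
```

Sorting a copy keeps the collected list untouched, which matters if the collector is still filling it.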




Re: ValueListHandler pattern with Lucene

2004-04-11 Thread lucene
On Sunday 11 April 2004 17:16, Erik Hatcher wrote:
 Well, yes the one we already discussed.  Let your presentation tier
 talk directly to Hits, so you are as efficient as possible with access
 to documents, and only fetch what you need.

 Again, don't let patterns get in your way.

Well, the point of tiers and (BTW: language-independent) patterns is to 
modularize software and make things exchangeable. This way
neither the presentation tier nor the search engine is exchangeable.

The actual problem is that VLH is designed to have a static list of VOs. VLH 
needs to evolve to support something like a data provider that may add data 
dynamically. The problem so far is that an Iterator must throw a 
ConcurrentModificationException if the backing data is modified, but as data 
in a VLH is never removed, only added, this should be possible to implement.

Timo




Re: ValueListHandler pattern with Lucene

2004-04-10 Thread lucene
On Friday 09 April 2004 23:59, Ype Kingma wrote:
 When you need 3000 hits and their stored fields, you might
 consider using the lower level search API with your own HitCollector.

I apologize for the stupid question but ... where's the actual result in 
HitCollector? :-) 

  collect(int doc, float score) 

Where doc is the index and score is its score - and where's the Document?

Timo




ValueListHandler pattern with Lucene

2004-04-09 Thread lucene
Hi!

I implemented a VLH pattern for Lucene's search hits but noticed that hits.doc() 
is quite slow (3000+ hits took about 500ms).

So, I want to ask people here for a solution. I thought about something like a 
wrapper for the VO (value/transfer object), i.e. that the VO does not 
actually contain the value but a reference to Lucene's Hits instance. But 
this is somewhat of a hack...

Any ideas?

Timo




Re: Zero hits for queries ending with a number

2004-04-03 Thread lucene
On Friday 02 April 2004 23:48, Erik Hatcher wrote:
 On Apr 2, 2004, at 10:00 AM, [EMAIL PROTECTED] wrote:
  On Saturday 13 March 2004 11:06, Otis Gospodnetic wrote:
  Field.Keyword is suitable for storing data like Url.  Give that a try.
 
  I just tried this a minute ago and found that I cannot use wildcards
  with
  Keywords: url:www.yahoo.*

 You *can* use wildcards with keywords (in fact, a keyword really has no
 meaning once indexed - everything is a term at that point).

Well, I just tried. I was also surprised, actually - but it just didn't work.

I can use wildcards for

  doc.add(Field.Text("url", row.getString("url")));

but I cannot for

  doc.add(Field.Keyword("url", row.getString("url")));

   - create a utility (I've posted one on the list in the past) that
 shows what your analyzer is doing graphically.

Interesting. Can you give me subject/date of that posting?

Timo




Re: Simple date/range question

2004-04-03 Thread lucene
On Friday 02 April 2004 17:03, [EMAIL PROTECTED] wrote:
 date:[20030101 TO 20030202]

 [java] 11:05:53,735 ERROR [view.SearchAction] 
org.apache.lucene.queryParser.ParseException: Encountered 20030202 at line 
1, column 18.
 [java] Was expecting:
 [java] ] ...

Why is this?




Re: Simple date/range question

2004-04-03 Thread lucene
On Saturday 03 April 2004 11:53, Erik Hatcher wrote:
 I didn't catch in your first message that it was throwing a
 ParseException this is odd.  Are you certain that date:[20030101
 TO 20030202] is the complete string your passing to QueryParser?  Did

Yes.

 you subclass QueryParser?  If so, what is that code?  (what is the

No.

I use a MultiFieldQueryParser:

Query qQuery = MultiFieldQueryParser.parse(query, new String[] { "id", 
"title", "summary", "contents", "date" }, GERMAN_ANALYZER); 
Hits hits = searcher.search(qQuery);

 complete stack trace?)

 [java] 12:38:03,109 ERROR [view.SearchAction] 
org.apache.lucene.queryParser.ParseException: Encountered 20030404 at line 
1, column 18.
 [java] Was expecting:
 [java] ] ...
 [java] org.apache.lucene.queryParser.ParseException: Encountered 
20030404 at line 1, column 18.
 [java] Was expecting:
 [java] ] ...
 [java] at 
org.apache.lucene.queryParser.QueryParser.generateParseException(QueryParser.java:994)
 [java] at 
org.apache.lucene.queryParser.QueryParser.jj_consume_token(QueryParser.java:874)
 [java] at 
org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:657)
 [java] at 
org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:521)
 [java] at 
org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:464)
 [java] at 
org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:108)
 [java] at 
org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:87)
 [java] at 
org.apache.lucene.queryParser.MultiFieldQueryParser.parse(MultiFieldQueryParser.java:115)




Re: Zero hits for queries ending with a number

2004-04-03 Thread lucene
On Saturday 03 April 2004 11:48, Erik Hatcher wrote:
 Provide us the results of running your url through that, using the same

SnowballAnalyzer(German2):

Analyzing "http://www.yahoo.com/foo/bar.html"
org.apache.lucene.analysis.WhitespaceAnalyzer:
[http://www.yahoo.com/foo/bar.html] 

org.apache.lucene.analysis.SimpleAnalyzer:
[http] [www] [yahoo] [com] [foo] [bar] [html] 

org.apache.lucene.analysis.StopAnalyzer:
[http] [www] [yahoo] [com] [foo] [bar] [html] 

org.apache.lucene.analysis.standard.StandardAnalyzer:
[http] [www.yahoo.com] [foo] [bar.html] 

org.apache.lucene.analysis.snowball.SnowballAnalyzer:
[http] [www.yahoo.com] [foo] [bar.html] 

 analyzer you are using, and also do the same on .toString of the query
 you parsed.  Those two pieces of info will tell all.

url:www.yahoo* url:www.yahoo* url:www.yahoo* url:www.yahoo* url:www.yahoo* 
url:www.yahoo*

Well, I actually use a MultiFieldQueryParser, that's probably why the term 
appears so often. Strange parser, it should be clear that an explicit 
url:xyz should only look in the url field, shouldn't it?

Timo




Re: Zero hits for queries ending with a number

2004-04-03 Thread lucene
On Saturday 03 April 2004 15:19, Erik Hatcher wrote:
 date:[20030101 TO 20030202]

I found the/my bug. 

Since Lucene is case-sensitive, I lower-case all queries for the user's 
convenience. The ParseException is thrown because the "TO" becomes "to".

Well, I really think Lucene needs to clear such stumbling blocks out of the way...
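
One way to keep the lower-casing convenience without breaking range queries is to restore the range keyword after lower-casing. A minimal sketch, assuming the query string is pre-processed by hand before it reaches the parser; QueryCase is a made-up helper, and the boolean operators AND, OR and NOT are deliberately not handled here.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class QueryCase {
    // Matches a lower-cased range such as "[20030101 to 20030202]" or "{a to b}"
    // and captures everything around the keyword so it can be put back unchanged.
    private static final Pattern RANGE_TO =
        Pattern.compile("([\\[{]\\s*\\S+\\s+)to(\\s+\\S+\\s*[\\]}])");

    /** Lower-cases the whole query, then restores the upper-case TO inside ranges. */
    static String lowerCaseKeepingRangeKeyword(String query) {
        String lowered = query.toLowerCase(Locale.ROOT);
        return RANGE_TO.matcher(lowered).replaceAll("$1TO$2");
    }
}
```

A plain word "to" outside of brackets is left lower-cased, since only the bracketed range form is matched.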




Re: Zero hits for queries ending with a number

2004-04-03 Thread lucene
On Saturday 03 April 2004 17:11, Erik Hatcher wrote:
 No objections that error messages and such could be made clearer.
 Patches welcome!  Care to submit better error message handling in this
 case?  Or perhaps allow lower-case to?

I think the best would be if Lucene would simply have a 
setCaseSensitive(boolean).

IMHO it's in any case a bad idea to make searches case-sensitive (per 
default).

 But, also, folks need to really step back and practice basic
 troubleshooting skills.  I asked you if that string was what you passed
 to the QueryParser and you said yes, when in fact it was not.  And you

I forgot that I did lower-case it. In fact I even output it in its original 
state but lower-case it just before I pass it to Lucene. That lower-casing is 
what I would call a hack and hence it's no surprise that I forgot it :-)

Timo




Re: Zero hits for queries ending with a number

2004-04-02 Thread lucene
On Saturday 13 March 2004 11:06, Otis Gospodnetic wrote:
 Field.Keyword is suitable for storing data like Url.  Give that a try.

I just tried this a minute ago and found that I cannot use wildcards with 
Keywords: url:www.yahoo.*




Simple date/range question

2004-04-02 Thread lucene
Hi!

I do have some problems with date and the QueryParser range syntax.

code:

java.sql.Timestamp time = row.getTimestamp("timestamp");
if (time != null) doc.add(Field.Keyword("date", new Date(time.getTime())));

query:
date:[20030101 TO 20030202]
date:20030101

The first query throws a ParseException, the second doesn't return any 
hits.

Hmm...there must be something simple I misunderstood :) BTW what about custom 
date format in QueryParser (...and are the last two digits actually the day 
or month)?

TIA
Timo




Re: Simple date/range question

2004-04-02 Thread lucene
On Friday 02 April 2004 18:59, Otis Gospodnetic wrote:
 Your Timestamp contains HH, mm, and ss, that's likely why your second

My timestamp contains date and time.

 query doesn't match anything.
 Drop everything other than YYYYMMDD from the index, and things should
 work.

What's wrong with new Date(timestamp)?
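
The quoted advice amounts to indexing only the day part of the timestamp as a plain string, so that a term like 20030101 can match exactly and ranges compare as text. A minimal sketch of producing such a key in plain Java; DayKey is a made-up helper, and the choice of time zone is an assumption that must be the same at indexing and at query time.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

final class DayKey {
    /** Formats a timestamp (milliseconds since the epoch) as yyyyMMdd in the given time zone. */
    static String of(long millis, TimeZone zone) {
        // SimpleDateFormat is not thread-safe, so a fresh instance is created per call.
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd");
        format.setTimeZone(zone);
        return format.format(new Date(millis));
    }
}
```

The resulting string would then be indexed as an ordinary keyword value, e.g. via DayKey.of(time.getTime(), TimeZone.getTimeZone("UTC")), instead of passing the Date itself.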




Re: Storing numbers

2004-03-10 Thread lucene
On Tuesday 09 March 2004 20:51, Timothy Stone wrote:
 Michael Giles wrote:
  Tim,
 
  Looks like you can only access it with a subscription.  :(  Sounds good,
  though.
 
 Really? I don't have a subscription. Got to it via the archives actually
 now that I think about it:

 Try Volume 7, Issue 12.

I also need a subscription for: 
http://www.sys-con.com/story/search.cfm?pub=1ss=lucene




Re: Storing numbers

2004-03-07 Thread lucene
On Fri, 5 Mar 2004 19:18:04 -0500, Erik Hatcher [EMAIL PROTECTED] 
wrote:

 Thanks for the idea for a good example for the upcoming Lucene in Action  
 book... it's been added!

Thanks for mentioning me in the book ;)

What about boolean fields? It's certainly not a good idea to use "true" or 
"false" strings...

BTW, isn't it slow to treat everything as strings?

Timo




Storing numbers

2004-03-05 Thread lucene
Hi!

I want to store numbers (id) in my index:

long id = 1069421083284L;
doc.add(Field.UnStored("in", String.valueOf(id)));  

But searching for id:1069421083284 doesn't return any hits.

Well, did I misunderstand something? UnStored means the number is stored but not 
indexed (analyzed), doesn't it? Anyway, Field.Text doesn't work either.

TIA
Timo




Re: Did you mean for multiple terms

2004-03-05 Thread lucene
On Thursday 04 March 2004 17:55, [EMAIL PROTECTED] wrote:
 Consider the query +michael +jackson not to return any hits because
 there's no michael in index, but there is jackson (e.g. janet...). Is
 there any reasonable approach how to determine whether one or multiple
 terms of a query - and which - do let the query fail?

To illustrate: google for "george buhs" - it will suggest "george 
bush".
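
A suggestion of this kind can be approximated by picking, for each query word, the closest word by edit distance from a list of words known to be in the index. The following is only a sketch under that assumption: DidYouMean is a made-up class, the word list has to come from somewhere else, and this is not claimed to be how Google or FuzzyQuery do it.

```java
import java.util.Collection;

final class DidYouMean {
    /** Classic Levenshtein distance: insert, delete and substitute each cost 1. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the closest indexed word within maxDistance, or null if none qualifies. */
    static String suggest(String word, Collection<String> indexedWords, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String candidate : indexedWords) {
            int d = distance(word, candidate);
            if (d < bestDistance) {
                best = candidate;
                bestDistance = d;
            }
        }
        return best;
    }
}
```

Because the candidates are taken from the index's own word list, any suggestion returned is guaranteed to exist in the index, which a stand-alone spelling dictionary cannot promise.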




Re: Storing numbers

2004-03-05 Thread lucene
On Friday 05 March 2004 12:27, Morus Walter wrote:
  doc.add(Field.UnStored("in", String.valueOf(id)));
 
  But searching for id:1069421083284 doesn't return any hits.

 If your field is named 'in' you shouldn't search in 'id'. Right?

 Well, indexing and analyzing are different things.
 UnStored means, the number is not stored (as the name says) but indexed.
 And IIRC it's analyzed before indexing. Shouldn't make a difference for
 a single number.

 What I'd use in this case is an unstored keyword (given that you really
 don't want to have the id returned from lucene, which is the consequence of
 not storing).

Sorry, typo :-)

I have several docs in the index and each doc has an id. And I just want 
to find a particular doc by its id. 

doc.add(Field.UnIndexed("id", String.valueOf(id)));

doesn't work either. And as I mentioned, not even Field.Text works.




Re: Storing numbers

2004-03-05 Thread lucene
On Friday 05 March 2004 18:01, Erik Hatcher wrote:
 0001 for example.  Be sure all numbers have the same width
 and zero padded.

And what about a range like 100 TO 1000?
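
One reading of the zero-padding advice: once every number has the same width, text order equals numeric order, so the range 100 TO 1000 simply becomes a range over the padded forms. A minimal sketch in plain Java; PaddedNumbers is a made-up helper, the width of 13 is an assumption (chosen to fit an id like 1069421083284), and negative numbers are not covered.

```java
final class PaddedNumbers {
    // Assumed fixed width; it must fit the largest value that is ever indexed.
    static final int WIDTH = 13;

    /** Left-pads a non-negative number with zeros to WIDTH digits. */
    static String pad(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        String digits = Long.toString(value);
        if (digits.length() > WIDTH) {
            throw new IllegalArgumentException("value too wide: " + value);
        }
        StringBuilder sb = new StringBuilder(WIDTH);
        for (int i = digits.length(); i < WIDTH; i++) sb.append('0');
        return sb.append(digits).toString();
    }

    /** Builds a range expression in QueryParser syntax over the padded values. */
    static String range(String field, long from, long to) {
        return field + ":[" + pad(from) + " TO " + pad(to) + "]";
    }
}
```

The same pad() has to be applied when the field is indexed, otherwise the padded range bounds will not line up with the indexed terms.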




Did you mean for multiple terms

2004-03-04 Thread lucene
Hi!

Consider the query "+michael +jackson", which does not return any hits because there's 
no "michael" in the index, but there is "jackson" (e.g. janet...). Is there any 
reasonable approach to determine whether one or multiple terms of a query 
- and which ones - cause the query to fail?
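
A simple first check is to look at each required term on its own: a term that occurs in no document can never match. The sketch below uses a plain Map as a stand-in for a per-term document frequency lookup (in Lucene that role would be played by something like IndexReader.docFreq); MissingTerms is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class MissingTerms {
    /** Returns the required terms that occur in no document at all. */
    static List<String> find(List<String> requiredTerms, Map<String, Integer> docFreq) {
        List<String> missing = new ArrayList<String>();
        for (String term : requiredTerms) {
            Integer freq = docFreq.get(term);
            if (freq == null || freq.intValue() == 0) {
                missing.add(term);
            }
        }
        return missing;
    }
}
```

This only catches terms that are absent from the index entirely; it says nothing about terms that each exist but never occur in the same document.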

Kind Regards
Timo




Re: Lucene scalability/clustering

2004-02-22 Thread lucene
On Saturday 21 February 2004 20:24, Otis Gospodnetic wrote:
 http://jakarta.apache.org/lucene/docs/benchmarks.html

BTW, where can I get Peter Halacsy's IndexSearcherCache?




Lucene scalability/clustering

2004-02-21 Thread lucene
Hi!

How well does Lucene scale? Is it able to handle 100,000 (more or less 
complex) queries a day (i.e. 9 to 5) on an index with half a million docs?

What hardware is recommended for that demand? What to do if it cannot handle 
it quickly enough?

Regards,
Timo




Multiple equal Fields?

2004-02-17 Thread lucene
Hi!
What happens if I do this:

doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "blah"));

Is there a field "foo" with value "blah", or are there two "foo"s (actually not 
possible), or is there one "foo" with the values "bar" and "blah"?

And what happens in this case:

doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "bar"));

Does Lucene store this only once?

Timo




Re: Did you mean...

2004-02-17 Thread lucene
On Monday 16 February 2004 20:56, Erik Hatcher wrote:
 On Feb 16, 2004, at 9:50 AM, [EMAIL PROTECTED] wrote:
  TokenStream in = new WhitespaceAnalyzer().tokenStream("contents", new
  StringReader(doc.getField("contents").stringValue()));

 The field is the field name.  No built-in analyzers use it, but custom
 analyzers could key off of it to do field-specific analysis.  Look at

If I want to tokenize all Fields, would I have to get a tokenStream of each 
Field separately and process them separately? Or can I get one "master" 
stream that combines all Fields?




Re: Did you mean...

2004-02-17 Thread lucene
On Tuesday 17 February 2004 15:18, Erik Hatcher wrote:
 You would do them separately.  I'm not clear on what you are trying to
 do.  The Analyzer does all this during indexing automatically for you,
 but it sounds like you are just trying to emulate what an Analyzer
 already does to extract words from text?

I am still doing this:

TokenStream in = analyzer.tokenStream("contents", new 
StringReader(reader.document(i).getField("contents").stringValue()));

And I want to extract all words from all Fields.

Timo




Re: Did you mean...

2004-02-17 Thread lucene
On Tuesday 17 February 2004 16:13, Erik Hatcher wrote:
 The words (or terms) are already in the index ready to be read very
 rapidly and accurately.  IndexReader is what you want to investigate if
 your fields are indexed.

 Look into IndexReader and pull the terms directly rather than
 re-analyzing the text.  Provided contents was an indexed field, you

Well, but my index was created using a GermanAnalyzer. I have to re-analyze it 
with WhitespaceAnalyzer if I don't want the words to be truncated...

What you do is what I did at the beginning of the thread :-)

Timo




Re: Did you mean...

2004-02-17 Thread lucene
On Tuesday 17 February 2004 18:05, Erik Hatcher wrote:
 *arg* I feel like we are going in circles here.

Me, too :-)

 Why use the GermanAnalyzer at all if it is not what you want?  Re-index!

I want to use the GermanAnalyzer. But not for the "did you mean" 
functionality...

That's what this thread is all about :)

Timo




Re: Incrementally updating and monitoring the index

2004-02-16 Thread lucene
On Friday 13 February 2004 19:10, Stephane James Vaucher wrote:
 Very possible, before adding a document, you can check (with the judicious
 use of an id) if it has already been added. If it hasn't, do your
 notification, but this requires programming.

So you mean adding the new documents to a temporary index first, running all 
queries against it and then write the temp index to the final index?

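RAMDirectory ram = new RAMDirectory();
// documents are added through an IndexWriter, not through the directory itself
IndexWriter ramWriter = new IndexWriter(ram, analyzer, true);
for (docs...)
ramWriter.addDocument(doc);
ramWriter.close();

IndexSearcher searcher = new IndexSearcher(ram);
for (queries...)
if (searcher.search(query).length() > 0)
notify();

finalIndex.addIndexes(new Directory[] { ram });
finalIndex.optimize();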

?




Re: Did you mean...

2004-02-16 Thread lucene
On Thursday 12 February 2004 18:35, Viparthi, Kiran (AFIS) wrote:
 As mentioned the only way I can see is to get the output of the analyzer
 directly as a TokenStream
 iterate through it and insert it into a Map.

Could you provide or point me to some example code on how to get and use 
TokenStream. The API docs are somewhat unclear to me...




Re: Did you mean...

2004-02-16 Thread lucene
On Monday 16 February 2004 12:02, Viparthi, Kiran (AFIS) wrote:
 As mentioned I didn't use any information from index so I didn't uses any
 TokenStream but let me check it out.

deprecated:

String description = doc.getField("contents").stringValue();
final java.io.Reader r = new StringReader(description);
final TokenStream in = analyzer.tokenStream(r);
for (Token token; (token = in.next()) != null; )
{
System.out.println(token.termText());
}

But the result is the same, the words are actually truncated (instead of 
has, had, have, etc. only ha) :-(




Re: Did you mean...

2004-02-16 Thread lucene
On Monday 16 February 2004 12:40, Erik Hatcher wrote:
 On Feb 16, 2004, at 6:12 AM, [EMAIL PROTECTED] wrote:
  String description = doc.getField("contents").stringValue();

 What is the value of description here?

? The value of the field contents :-) Long, plain text..

  final java.io.Reader r = new StringReader(description);
  final TokenStream in = analyzer.tokenStream(r);

 And what analyzer are you using here?

GermanAnalyzer (yes, "has, had, etc." below is fictional but most people 
here probably don't speak German... e.g. "automobile" may become "automob" or 
something like this).

  But the result is the same, the words are actually truncated (instead
  of
  has, had, have, etc. only ha) :-(




Re: Did you mean...

2004-02-16 Thread lucene
On Monday 16 February 2004 15:16, Erik Hatcher wrote:
 And thus the nature of the problem.  Try using the WhitespaceAnalyzer
 instead to see what you get.

Much better! :-) But sometimes it still returns multiple words as a single 
term...:-\

And it does not care for punctuation, but that's probably something I'll have 
to do on my own...
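
Doing the punctuation handling by hand could look roughly like this: split on whitespace, strip non-alphanumeric characters from both ends of each token, and drop stop words. This is a plain Java sketch, not a Lucene Analyzer; WordList and its three-word stop list are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

final class WordList {
    // Tiny illustrative stop list; a real one would be language-specific.
    static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("the", "a", "and"));

    /** Returns the unique words of the text, untruncated and in their original case. */
    static Set<String> uniqueWords(String text) {
        Set<String> words = new TreeSet<String>();
        for (String raw : text.split("\\s+")) {
            // Strip leading and trailing characters that are neither letters nor digits.
            String word = raw.replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
            if (word.isEmpty()) continue;
            if (STOP_WORDS.contains(word.toLowerCase(Locale.ROOT))) continue;
            words.add(word);
        }
        return words;
    }
}
```

Punctuation inside a token, such as the dots in www.yahoo.com, is kept; only the edges are cleaned.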




Re: Did you mean...

2004-02-16 Thread lucene
On Monday 16 February 2004 15:27, [EMAIL PROTECTED] wrote:
 But sometimes it still returns multiple words as a single term...:-\

Sorry, silly mistake of mine.




Re: Did you mean...

2004-02-16 Thread lucene
On Monday 16 February 2004 12:12, [EMAIL PROTECTED] wrote:
 deprecated:

 String description = doc.getField("contents").stringValue();
 final java.io.Reader r = new StringReader(description);
 final TokenStream in = analyzer.tokenStream(r);
 for (Token token; (token = in.next()) != null; )
 {
   System.out.println(token.termText());
 }

Can somebody explain tokenStream() to me?

This is not deprecated:

TokenStream in = new WhitespaceAnalyzer().tokenStream("contents", new 
StringReader(doc.getField("contents").stringValue()));

But what is the first argument (field) of tokenStream() good for? Actually I 
can type whatever I want...? I don't understand the short description in the 
API docs...

Timo




Re: Did you mean...

2004-02-16 Thread lucene
On Monday 16 February 2004 15:16, Erik Hatcher wrote:
 And thus the nature of the problem.  Try using the WhitespaceAnalyzer
 instead to see what you get.

Can I chain multiple analyzer in order to filter common stop words?

Timo




Word not in index

2004-02-16 Thread lucene
Hi!

I build a list of all unique words in all my docs from 
WhitespaceAnalyzer.tokenStream(). I also index all my docs using a 
GermanAnalyzer in another index. There are plenty of words in the word list 
that don't return any hits when searching the doc index built using the 
GermanAnalyzer - and these are not stop words.

Why is this?

Thanks a lot!
Timo




Re: Word not in index

2004-02-16 Thread lucene
On Monday 16 February 2004 19:20, [EMAIL PROTECTED] wrote:
 Why is this?

Another curiosity is that apparently case does matter: 
"albert" (Einstein :) returns hits, but "Albert" does not - even though the 
docs contain "Albert" and not "albert".

Can somebody explain?

Thanks!
Timo




Re: Word not in index

2004-02-16 Thread lucene
On Monday 16 February 2004 19:57, Otis Gospodnetic wrote:
 Searches ARE case sensitive, it is just that some Analyzers lowercase
 all tokens.  If you are using WhitespaceAnalyzer, then tokens will not

GermanAnalyzer apparently is one of them. Too bad :-( Is there a 
case-sensitive alternative out there?
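
The effect can be reproduced with a toy index that lower-cases tokens on the way in: a term that skips the same normalization at search time simply is not there. A plain Java sketch, with ToyIndex as a made-up stand-in for an analyzer plus index:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();

    /** The one normalization shared by indexing and searching. */
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    /** Splits the text on whitespace and indexes the normalized tokens. */
    void add(int docId, String text) {
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) continue;
            String term = normalize(token);
            Set<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new HashSet<Integer>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
    }

    /** Exact term lookup without normalization, like an un-analyzed query term. */
    boolean containsRaw(String term) {
        return postings.containsKey(term);
    }

    /** Lookup through the same normalization that was used at index time. */
    boolean containsAnalyzed(String word) {
        return postings.containsKey(normalize(word));
    }
}
```

With "Albert Einstein" indexed, containsRaw("Albert") fails while containsRaw("albert") and containsAnalyzed("Albert") succeed, which is the pattern described above: the index is not at fault, the query term just has to go through the same analysis.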




Re: Word not in index

2004-02-16 Thread lucene
On Monday 16 February 2004 19:45, Markus Spath wrote:
 Analyzers preprocess the text to be indexed; different Analyzers will
 generate different text-tokens that are indexed. only you can know which
 Analyzer fits your needs, but you need to apply this one consistently for
 indexing, searching and generating lists of unique words, if you want to
 get expectable results.

Well, not sure whether I understood.

GermanAnalyzer - just like any other analyzer - indexes all words except stop 
words, right? What's the point of a search engine if I cannot search 
for words in the text? :-)




Re: Word not in index

2004-02-16 Thread lucene
On Monday 16 February 2004 20:56, Otis Gospodnetic wrote:
 Timo, by the nature of your questions it seems like you didn't see the
 Articles section of Lucene's site.  There are links to several articles

 --- [EMAIL PROTECTED] wrote:
  Well, not sure whether I understood.

Well, was actually a case problem, too... :)




Re: Limiting hit count

2004-02-13 Thread lucene
On Friday 13 February 2004 15:02, Erik Hatcher wrote:
 Use a HitCollector and grab the first one that comes in, then bail out.
   That should do the trick for getting the first hit only.

According to the API docs I ought to use HitCollector only if I need all 
hits :-) And there's certainly a reason for it - I don't think that this will 
speed up the search ;)




Incrementally updating and monitoring the index

2004-02-13 Thread lucene
Hi!

Can Lucene incrementally update its index (i.e. reconcile it with a list of docs 
and remove those that are no longer found)?

I'd like to monitor the index for certain queries/terms, i.e. I want to be 
notified if there are (new) hits for a list of terms each time after I add a 
document to the index - continuously. 

Is this possible? The index will contain several hundred thousand 
documents and will be frequently accessed concurrently.

TIA
Timo




Re: Did you mean...

2004-02-12 Thread lucene
Hi Ronnie!

On Thursday 12 February 2004 09:50, [EMAIL PROTECTED] wrote:
 There is no built-in way in Lucene to achieve this. I have done a simple
 implementation with a patched FuzzyQuery for each term. A new method
 (bestOrderRewrite) returns a ordered list of all fuzzy terms that indeed
 exist in index. There is no guarantee that the suggested term is spelled

Could you please post your FuzzyQuery (did you patch the class or extend it?) 
or send it via email?

Thanks a lot
Timo




Re: Did you mean...

2004-02-12 Thread lucene
On Thursday 12 February 2004 09:43, Viparthi, Kiran (AFIS) wrote:
 We archived this by creating a separate index words extracting the
 complete list of words.

How were you extracting the words?

Timo




Re: Did you mean...

2004-02-12 Thread lucene
On Thursday 12 February 2004 18:03, [EMAIL PROTECTED] wrote:
 On Thursday 12 February 2004 17:53, [EMAIL PROTECTED] wrote:
  How were you extracting the words?

 Oops, sorry for this stupid question :) Got it.

Hm, seems the question wasn't so stupid anyway:

IndexReader reader = IndexReader.open(ram);
TermEnum te = reader.terms();
while(te.next())
{
...

But this includes obviously parts of words, too :-\

Timo






Re: ANNOUNCE: Plucene

2004-02-11 Thread lucene
Hi!

Somewhat off-topic: is there a PHP port of Lucene?

Warm regards
Timo




Re: Did you mean...

2004-02-11 Thread lucene
On Thursday 12 February 2004 00:15, Matt Tucker wrote:
 We implemented that type of system using a spelling engine by Wintertree:

 http://www.wintertree-software.com

 There are some free Java spelling packages out there too that you could
 likely use.

But this does not ensure that the word really exists in the index. The word 
Google proposes, however, does exist.

Regards
Timo




Re: HTMLDocument

2004-02-04 Thread lucene
On Monday 02 February 2004 10:41, John Moylan wrote:
 Another easy HTML parser is HTMLparser.sf.net

This one doesn't seem to be a SAX parser...:-\




[newbie] Hit quality rating

2004-02-04 Thread lucene
Hi!

Is there a hit quality rating in Lucene or are there only hits and non-hits?

Timo




Re: [newbie] Hit quality rating

2004-02-04 Thread lucene
On Wednesday 04 February 2004 14:48, Otis Gospodnetic wrote:
 There is score.

Oops, you are right: Hits.score(). But it seems I have to implement a sorting 
iterator on my own :-\




Re: SQLDirectory

2004-02-02 Thread lucene
On Monday 02 February 2004 21:08, Jochen wrote:
 RE: Lucene Optimized Query Broken?

Thanks for the hint. Alas, I also didn't find it there :-( Anyway, I need 
something that does work on any (Postgres) SQL db.





Re: HTMLDocument

2004-02-02 Thread lucene
On Sunday 01 February 2004 15:27, Felix Huber wrote:
 Of course it's there: http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/

Thanks. But didn't find that contribution/ant directory there anyway...:-(





Re: SQLDirectory

2004-02-02 Thread lucene
On Monday 02 February 2004 22:00, Philippe Laflamme wrote:
 I'll look into making the implementation available if you're interested.

I'd be very interested!

Please :)
Timo





SQLDirectory

2004-02-01 Thread lucene
Hi!

There was some third-party SQLDirectory for Lucene 1.2 which was abandoned 
because of performance problems. Well, why not load the index into RAM? Is there 
some (official) SQLDirectory for 1.3?

searcher = new IndexSearcher(IndexReader.open(new RAMDirectory(new 
SQLDirectory())));

I'd really like to have the index where I do have all the data - in the 
database.





HTMLDocument

2004-02-01 Thread lucene
Hi!

Is there any HTMLDocument out there? The one in the demo package of lucene 
does not handle non-wellformed HTML files (what about nekohtml?) and seems to 
have some other inabilities and bugs as well (and why isn't it part of the 
distro but in a demo package?!)?





Re: HTMLDocument

2004-02-01 Thread lucene
On Sunday 01 February 2004 13:21, Erik Hatcher wrote:
 On Feb 1, 2004, at 6:19 AM, [EMAIL PROTECTED] wrote:

 Nutch uses NekoHTML, so you can browse around that codebase and borrow

Nutch(.org)? No code there...





storing index in database

2003-10-03 Thread lucene
Hi!

Somebody wrote a SQLDirectory for Lucene 1.2 (only) but discontinued it 
because of performance issues.

Well, I would really like to store the index in the same place as the data 
itself: in the database, not somewhere in the filesystem. I don't quite 
understand the performance problem, but in any event, if an index is only a 
few MBytes in size, why not select the whole index out of the DB once and 
keep it in memory?

So I'd like to ask people here whether there is a reasonable way to store the 
index in a database, and if so, which way is best.

Timo





Re: storing index in database

2003-10-03 Thread lucene
On Friday 03 October 2003 16:29, Guilherme Barile wrote:
 Why not just use a RAMDirectory ?

Yes, that was my idea: store the index in the database and load it into 
memory. I'm just asking people here whether this is a good idea or whether 
there are better (more standard) ways (where I have to do less on my own).

...since I'm a lucene newbie :-)





Re: Redefine the wildcards

2003-09-26 Thread lucene
On Friday 26 September 2003 15:47, [EMAIL PROTECTED] wrote:
 because of too many hits. So I wonder if it is possible to redefine the
 wildcards in Lucene to make them replace only numbers and not characters.

What about regular expressions?
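
One way to read the regular-expression suggestion is sketched below: translate the wildcard pattern into a java.util.regex.Pattern in which `?` stands for exactly one digit and `*` for any run of digits, then test candidate term strings against it. This is only a plain-Java sketch. The class and method names are made up for illustration, and it does not show how such a filter would be hooked into Lucene's query classes.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: wildcards that match digits only.
final class DigitWildcard {

    // Builds a regex in which '?' matches one digit and '*' matches zero or
    // more digits; every other character is matched literally.
    static Pattern compile(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '?') {
                regex.append("\\d");
            } else if (c == '*') {
                regex.append("\\d*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True if the whole term matches the digit-only wildcard pattern.
    static boolean matches(String wildcard, String term) {
        return compile(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("ab*", "ab123"));
        System.out.println(matches("ab*", "abc"));
    }
}
```

With this reading, a pattern such as ab* accepts ab123 but rejects abc, which is the kind of narrowing the original question asks for.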





wildcards in fields?

2003-09-25 Thread lucene
Hi!

I search in a field called url. url:www.blah.com does return hits while 
url:blah.com does not. So I tried url:*blah.com, but this even throws a 
ParseException.

What am I doing wrong?

Timo





stop words in index

2003-09-25 Thread lucene
Hi!

I use a GermanAnalyzer for indexing and searching. A search for "der" ("the") 
does not return any hits, but examining the index with Luke shows "der" as 
the top-ranked word. Other words that are probably stop words as well 
(e.g. "zum") do return hits.

bug?
Timo





Re: NLucene up to date ?

2003-08-04 Thread lucene
Yes, given the lack of updates to the c# version, I thought users would be maintaining 
their own versions in line with current developments.  I too had to add the items you 
mentioned.

What I would like to see is all these 'implementations' consolidated and maintained 
regularly, as the java version is.
I am not sure how widely known Lucene is in the .NET community; my guess is that it isn't. 
A tried and tested Lucene .NET version will definitely help it reach other audiences.

I am pleased to hear Pasha is (re)taking up the reins.

Brendon

[EMAIL PROTECTED] wrote:
 I talked to one of the maintainers of NLucene and he said that he was
 planning on releasing a 1.2 version (not beta apparently) in two months.
 That was back in June and I haven't heard or seen anything since then so I
 cant really say if it is still being actively developed.  Sounds like you
 are doing the same thing I am doing which is adding functionality that you
 need on your own.  I've also added a few things to NLucene like multifield
 queries and the default boolean operator setting.
 
 Brian
 
  Hi all,
 
  http://sourceforge.net/projects/nlucene/ has a version numbered 1.2b2.
  Does anyone know if this source is still being maintained to be closer to
 the java developments ?
  Was this an external project to Apache Jakarta ?
 
  I (we) have just successfully released a search engine using a c#
 implementation of Lucene.  Code had to be brought up to date in line with
 recent java builds, and enhanced with additional features (eg field sorting,
 term position score factoring, etc).
 
  Any other c# users who would like to see NLucene kept in line with the
 java version ?
 
  Maybe I'm just being lazy with having to maintain my own version of Lucene
 =).
  Surely there are others out there who are c# users and follow the mailing
 lists (I remember a Brian somewhere !) but seldom post.
 
  Brendon
 
 
 
 
 
 
 
 





RE: NLucene up to date ? Lucene.Net is up to date.

2003-08-04 Thread lucene
Excellent news.

Will you be keeping the source up to date with the java developments ?
Can't wait to get my hands on the source, yes that damn bit shift operator (unsigned 
?) always worried me =)

Just by the way, would the .NET version have a similar style sandbox area where users 
can submit small add-on type functionality ?
For example, field sorting.
Would love to share the code for use and comment as this seems to be a common request.

Big thanks Pasha,
Brendon

[EMAIL PROTECTED] wrote:
 Hi,
 
  I talked to one of the maintainers of NLucene and he said 
  that he was planning on releasing a 1.2 version (not beta 
  apparently) in two months. That was back in June and I 
  haven't heard or seen anything since then so I cant really 
  say if it is still being actively developed.  Sounds like you 
  are doing the same thing I am doing which is adding 
  functionality that you need on your own.  I've also added a 
  few things to NLucene like multifield queries and the default 
  boolean operator setting.
 
 By the way, I hope that Lucene.Net 1.3rc1 will be available 
 from http://sourceforge.net/ this week. Lucene.Net is ready, but
 SourceForge is not :)
 
 Lucene.Net is a complete, up-to-date port of Lucene 1.3rc1 and
 includes samples and demos (the web demo too).
 
 A few differences between nLucene and Lucene.Net:
 1. Version of Lucene: Lucene.Net is a port of 1.3rc1, nLucene of 1.2.
 2. Java code compatibility: Lucene.Net only changes the naming notation
 (as in IndexWriter); nLucene implements some methods as attributes,
 among other changes.
 3. Demos: Lucene.Net contains all of the Lucene demos and tests, including
 the web demos. nLucene does not.
 4. Compatible with .NET Framework 1.1 and VS 2003.
 5. (For internal developers only): a correct implementation of the Java >>>
 operator :)
 
 Pasha
 
 
 
 





RE: NLucene up to date ?

2003-08-01 Thread lucene
No additional classes have been created.
The functionality was simply implemented via new properties and method overloading, so 
original signatures remain intact.

As far as supporting future versions, I cannot say as I will no longer be using it at 
work.  Keeping the c# version in line with java would have to be done in my own time, 
so no guarantees.

Taking the 1.2b2 source I only brought in the fixes, enhancements, etc that affected 
how I was using Lucene.  I keep up with the nightly builds on a regular basis and 
update the c# source where appropriate, so any bugs should have been rectified.

Brendon

[EMAIL PROTECTED] wrote:
 Hi,
 
  From: [EMAIL PROTECTED] 
  
  I (we) have just successfully released a search engine using 
  a c# implementation of Lucene.  Code had to be brought up to 
  date in line with recent java builds, and enhanced with 
  additional features (eg field sorting, term position score 
  factoring, etc).
 
 Are the additions hard-coded, or did you add new classes?
 Are you going to support new versions of lucene? 
 
 Pasha
 
 P.S. nLucene is based on Lucene 1.2, has old bugs, and is not supported.
 
 
 
 





NLucene up to date ?

2003-07-31 Thread lucene
Hi all,

http://sourceforge.net/projects/nlucene/ has a version numbered 1.2b2.
Does anyone know if this source is still being maintained to be closer to the java 
developments ?
Was this an external project to Apache Jakarta ?

I (we) have just successfully released a search engine using a c# implementation of 
Lucene.  Code had to be brought up to date in line with recent java builds, and 
enhanced with additional features (eg field sorting, term position score factoring, 
etc).

Any other c# users who would like to see NLucene kept in line with the java version ?

Maybe I'm just being lazy with having to maintain my own version of Lucene =).
Surely there are others out there who are c# users and follow the mailing lists (I 
remember a Brian somewhere !) but seldom post.

Brendon








Re: NLucene up to date ?

2003-07-31 Thread lucene
Replies to Erik and Scott inline.


[EMAIL PROTECTED] wrote:
 Do these implementations maintain file compatibility with the Java version?
 
 Scott

Yes and no; some explanation will help.

The field ordering functionality required additional files to be created at index time 
if the Document.Field property indicates so.
At search time, the entire contents of the 'field sorting' files are read in.  As the 
IndexReader is shared for all client calls (for a pre-defined period of time as the 
index has been implemented 'incremental' style) this cost is only incurred once.

Code-wise, the technique follows the pattern for the Normalisation byte writing and 
reading, the difference being an Int being written.  Yes, there is a memory usage hit, 
but the performance and functionality offered offsets this.

All other file formats remain identical.
I have coded LuceNET (!) so that it gracefully continues if the index segments do not 
have these additional 'sorting' files (naming convention like the normalisation files).
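
The scheme described above (one int per document, written at index time and read in whole at search time, in the spirit of the normalisation files) might look roughly like the following. This is a sketch in Java rather than the c# of the actual port. The class name, the method names, and the in-memory byte array standing in for the per-segment file are all assumptions made for illustration, not the real code.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical illustration of a per-document sort-value file.
final class SortValueFile {

    // Index time: write one int per document, in document-number order.
    static byte[] write(int[] valuePerDoc) {
        ByteBuffer buf = ByteBuffer.allocate(4 * valuePerDoc.length);
        for (int value : valuePerDoc) {
            buf.putInt(value);
        }
        return buf.array();
    }

    // Search time: read the whole file into memory once (4 bytes per document).
    static int[] read(byte[] file) {
        ByteBuffer buf = ByteBuffer.wrap(file);
        int[] values = new int[file.length / 4];
        for (int doc = 0; doc < values.length; doc++) {
            values[doc] = buf.getInt();
        }
        return values;
    }

    // Orders matching document numbers by their cached sort value (ascending).
    static Integer[] sortHits(Integer[] hitDocs, int[] values) {
        Integer[] sorted = hitDocs.clone();
        Arrays.sort(sorted, (a, b) -> Integer.compare(values[a], values[b]));
        return sorted;
    }
}
```

The memory cost mentioned above is visible here: the cached array holds four bytes for every document in the segment, whether or not the document ever appears in a result.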


 Erik Hatcher wrote:
 
  I'd love to see there be quality implementations of the Lucene API in 
  other languages, that are up to date with the latest Java codebase.
 
  I'm embarking on a Ruby port, which I'm hosting at rubyforge.org.  
  There is a Python version called Lupy.
 
  A related question I have is what about performance comparisons 
  between the different language implementations?  Will Java be the 
  fastest?  Is there a test suite already available that can demonstrate 
  the performance characteristics of a particular implementation?  I'd 
  love to see the numbers and see if even the Java version can be beat.
 
  Erik


Performance-wise, queries typically run in hundredths of a second.
Including term position in the scoring impacted the timings as expected.

Indexing takes time, but then this wasn't really part of the design goals.

As far as comparing to the java implementation in terms of performance, I haven't 
tried, as this workplace is an MS shop.

Java vs c# all over ?  Just kidding =)


 
 
  On Thursday, July 31, 2003, at 08:43  AM, 
  [EMAIL PROTECTED] wrote:
 
  Hi all,
 
  http://sourceforge.net/projects/nlucene/ has a version numbered 1.2b2.
  Does anyone know if this source is still being maintained to be 
  closer to the java developments ?
  Was this an external project to Apache Jakarta ?
 
  I (we) have just successfully released a search engine using a c# 
  implementation of Lucene.  Code had to be brought up to date in line 
  with recent java builds, and enhanced with additional features (eg 
  field sorting, term position score factoring, etc).
 
  Any other c# users who would like to see NLucene kept in line with 
  the java version ?
 
  Maybe I'm just being lazy with having to maintain my own version of 
  Lucene =).
  Surely there are others out there who are c# users and follow the 
  mailing lists (I remember a Brian somewhere !) but seldom post.
 
  Brendon
 
 
 
 
 
 





Distribution of junit.jar with Lucene Binaries

2003-06-15 Thread lucene
Why is the JUnit library packaged within the binary distribution of 
Lucene?

Greetings
Manfred






RE: Stress/scalability testing Lucene

2002-11-20 Thread roy-lucene-user
Ah, for some reason I thought none of the Lucene methods were thread safe,
or is that only the case when reading and writing at the same time?  I
thought I read this in the FAQ.

Roy.

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 20, 2002 5:04 PM
To: Lucene Users List
Subject: Re: Stress/scalability testing Lucene



Justin Greene wrote:
 We created a thread pool to read and parse the email
 messages.  10 threads seems to be the magic number here for us.  We then
 created a queue of messages to be indexed onto which we push the parsed
 messages and have a single thread adding messages to the index.

IndexWriter.addDocument(Document) is thread safe, so you don't need a 
separate indexing thread.  So long as your analyzer is thread safe, you 
can index each message in the thread that parses it, for even greater 
parallelism.

Doug
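
The arrangement Doug describes, with each pool thread both parsing and indexing its message, can be sketched as follows. This is a minimal sketch in present-day Java. `DocumentSink` is a made-up stand-in for a writer whose add method is thread safe, not the Lucene API, and `parse` is a placeholder for real e-mail parsing.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustration only: DocumentSink stands in for a thread-safe index writer.
final class ParallelIndexing {

    interface DocumentSink {
        void addDocument(String parsedMessage);
    }

    // Parses and indexes each message on the same pool thread, with no
    // intermediate queue and no dedicated indexing thread.
    static void indexAll(List<String> rawMessages, DocumentSink sink, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String raw : rawMessages) {
            pool.execute(() -> sink.addDocument(parse(raw)));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Placeholder for real e-mail parsing.
    static String parse(String raw) {
        return raw.trim().toLowerCase();
    }
}
```

The sketch is only safe if the sink really is thread safe and, as noted above, if whatever analysis happens inside it is thread safe as well.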




This email and any attachments are confidential and may be 
legally privileged. No confidentiality or privilege is waived 
or lost by any transmission in error.  If you are not the 
intended recipient you are hereby notified that any use, 
printing, copying or disclosure is strictly prohibited.  
Please delete this email and any attachments, without 
printing, copying, forwarding or saving them and notify the 
sender immediately by reply e-mail.  Zurich Capital Markets 
and its affiliates reserve the right to monitor all e-mail 
communications through its networks.  Unless otherwise 
stated, any pricing information in this e-mail is indicative 
only, is subject to change and does not constitute an offer 
to enter into any transaction at such price and any terms in 
relation to any proposed transaction are indicative only and 
subject to express final confirmation.



the order of fields in Document.fields()

2002-11-13 Thread roy-lucene-user
Quick question about Document.fields().

Lucene provides you with a method to retrieve the value of a field or grab
all fields as an Enumeration.  It does not, however, allow you to grab all
values of one field for a document, it will only return the last value added
for that field.  

For example, I am indexing email messages that might have multiple To/CC/BCC
fields in the message header.  Currently to grab all the values when I
display an email that has been indexed, I must use the fields() method to
grab an Enumeration of all fields in a document.  I then separate them into
different arrays based on the field names.  However I am concerned about the
order of the fields since I consider the first To or CC or BCC to be the
main value for each field.  

Is the order of the fields returned in the order that they are added?  Or is
there no order?  If there is no order, can someone suggest a solution?

Thanks!

Roy.





RE: the order of fields in Document.fields()

2002-11-13 Thread roy-lucene-user
Shouldn't there be at least one method that returns an array of fields in
the correct order?

Roy.

-Original Message-
The order is preserved (or reversed, actually), so it's not random.
It's the reverse of the order in which the fields were added
to the document.

This would be easy to test...
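
If the reversal described above holds for your version (worth confirming with a quick test first), the values can be regrouped per field name and put back into the order in which they were added. The sketch below is plain Java. The class name is made up, and {name, value} string arrays stand in for the field objects that the enumeration would actually yield.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; {name, value} arrays stand in for Lucene Field objects.
final class FieldGrouper {

    // Takes fields in the (reversed) order the enumeration yields them and
    // returns, per field name, the values in the order they were added.
    static Map<String, List<String>> groupInAddedOrder(List<String[]> reversedFields) {
        Map<String, List<String>> byName = new LinkedHashMap<>();
        // Walk backwards so the field that was added first is seen first.
        for (int i = reversedFields.size() - 1; i >= 0; i--) {
            String[] field = reversedFields.get(i);
            byName.computeIfAbsent(field[0], k -> new ArrayList<>()).add(field[1]);
        }
        return byName;
    }
}
```

The first element of each list is then the "main" To, CC or BCC value in the sense of the original question.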





Deleting a document found in a search

2002-10-09 Thread lucene . user

I am just getting started with Lucene and I think I have a problem
understanding some basic concepts.

I am using two-part identifiers to uniquely identify a document in the
index.  So whenever I want to index a document, I first want to find and
delete the old form.

To find it, I intend to use:

BooleanQuery findOurs = new BooleanQuery();
findOurs.add(new TermQuery(new Term("Id", id)), true, false);
findOurs.add(new TermQuery(new Term("Domain", domain)), true, false);

System.out.println("Deleting document matching: \"" +
   findOurs.toString() + '"');

Searcher searcher = new IndexSearcher(directory);
Hits hits = searcher.search(findOurs);

// Assert: hits.length() = 1

for (int i = 0; i < hits.length() && i < 10; i++) {
  Document d = hits.doc(i);

  // Now what can I do to find document id?

  int id = ??
  searcher.delete(id);
}

But I can't discover how to convert a search result into a document id.  It
is recorded in the private HitDoc class, but since it is not publicly
accessible, there must be a reason why it would not work to add a public
getter for it.

Is there an alternative way that I can do this?  My first thought is to
define a Field.Keyword(composite-key, domain + \u + id).  This
would allow me to use the delete(Term) interface to delete the key.
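
The composite-key idea can be sketched in plain Java as below. The separator is an assumption: the escape in the post is cut short, so U+0000 is used here purely as an example of a character that should not occur in either part. The class and method names are made up. The resulting string is what would be stored in the keyword field and later handed to the delete-by-term call mentioned above.

```java
// Hypothetical helper for building a single unique key from two parts.
final class CompositeKey {

    // Assumed separator (U+0000); any character that cannot appear in a
    // domain or an id would serve equally well.
    static final char SEPARATOR = '\u0000';

    // Joins domain and id so that different (domain, id) pairs can never
    // produce the same key.
    static String of(String domain, String id) {
        if (domain.indexOf(SEPARATOR) >= 0 || id.indexOf(SEPARATOR) >= 0) {
            throw new IllegalArgumentException("separator must not occur in key parts");
        }
        return domain + SEPARATOR + id;
    }
}
```

The separator is what keeps ("ab", "c") and ("a", "bc") apart; plain concatenation would map both to the same key.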

-- 
Thanks, Adrian.





Enumerating all Terms

2002-10-09 Thread lucene . user

Is there a way of getting a list of all Terms that have been indexed?  I
guess it would approximate a wildcard query of the form *:* if that were
valid, and instead of returning matching documents, just returning the
fields and values.
-- 
Thanks, Adrian.




