Software License

2002-02-25 Thread Rafael Luque

Hi all,

I know Lucene is a free project; however, I think its use falls under the Apache Software 
License (ASL) terms, so someone using Lucene should reference the project, use the 
'Powered by Lucene' logo, ...

I suspect that a company is releasing a commercial search engine based on Lucene 
without mentioning Lucene at all. What kind of action can we take to protect Open 
Source projects like Lucene from this kind of malicious use?

Thanks, 



RE: Googlifying lucene querys

2002-02-25 Thread Howk, Michael

In the Lucene build that we've got (2/21) the question mark does not do a
single-character replace. Does anyone know why? We're using the
StandardAnalyzer and the default QueryParser.

-Original Message-
From: Peter Carlson [mailto:[EMAIL PROTECTED]]
Sent: Saturday, February 23, 2002 5:23 PM
To: Lucene Users List
Subject: Re: Googlifying lucene querys


Hi Jari,

Lucene is designed as an API with its components broken out, so a
developer can build whatever custom behavior is required.

One part of Lucene is the QueryParser. The QueryParser takes a search string
and, following the grammar in the current QueryParser.jj implementation,
turns it into a set of classes that make up a Lucene Query. This is meant to
be a good solution for most people, but it is just a sample of what can be done.

In the current implementation of QueryParser,

'george bush "white house"'
will create an OR query of
george OR bush OR "white house"
Basically, the default is an OR between words unless otherwise specified.

You can use other boolean operators such as AND and NOT.
So, for example:
'george AND bush OR "white house" NOT ford'

Lucene and the current QueryParser support
wildcards with the * character,
single-character replacement with the ? character,
fuzzy searches with the ~ character placed next to a single-word term, and
proximity searches (just added to QueryParser) with ~3 placed next to a phrase
term.

Again, you can create your own QueryParser to create your desired
implementation.
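
As a minimal sketch of the kind of pre-processing discussed in this thread, the helper below rewrites raw user input so that every word or double-quoted phrase becomes a required (+) clause, instead of splitting a quoted phrase into separate words. The class name `RequireAllTerms` and the method `rewrite` are hypothetical and not part of Lucene; the sketch only produces a query string and does not escape other special query characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: prefixes each word or
// double-quoted phrase with '+' so that all of them are required.
final class RequireAllTerms {
    // Matches either a double-quoted phrase or a run of non-space characters.
    private static final Pattern CLAUSE = Pattern.compile("\"[^\"]*\"|\\S+");

    static String rewrite(String userInput) {
        List<String> clauses = new ArrayList<>();
        Matcher m = CLAUSE.matcher(userInput);
        while (m.find()) {
            clauses.add("+" + m.group());
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        // prints: +george +bush +"white house"
        System.out.println(rewrite("george bush \"white house\""));
    }
}
```

The resulting string could then be handed to whatever query parser is in use; an unbalanced quote simply falls back to word-by-word handling.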

I hope this helps.

--Peter




On 2/23/02 8:19 AM, Jari Aarniala [EMAIL PROTECTED] wrote:

 +george +bush +white +house
 
 Well, that's pretty obvious even for me :) If you have separate words,
 just tokenize the string and add a plus in front of each of the words.
 But what I'm trying to do here is this:
 
 Let's say I have a more complicated query, say
 
 'george bush "white house"'
 
 There you have two separate words, george and bush and then
 white house enclosed in quotes. If I use a piece of simple
 tokenization code, the above query becomes
 
 +george +bush +white +house
 
 See what I mean? That won't work the way expected.
 Anyway, I'm still a bit confused about the inner workings of Lucene,
 so maybe I'll come up with something myself.
 
 Jari Aarniala
 [EMAIL PROTECTED] 
 
 
 
 --
 To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
 For additional commands, e-mail:
mailto:[EMAIL PROTECTED]
 
 






RE: Googlifying lucene querys

2002-02-25 Thread Doug Cutting

If you put the title in a separate field from the contents, and search both
fields, matches in the title will usually be stronger, without explicit
boosting.  This is because the scores are normalized by the length of the
field, and the title tends to be much shorter than the contents.  So even
without boosting, title matches usually come before contents matches.
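
To illustrate the effect described above, here is a small sketch that assumes the length normalization factor is 1 / sqrt(number of terms in the field). That formula is an assumption made for illustration; the message above only says that scores are normalized by field length. The class and method names are hypothetical.

```java
// Illustrative only: assumes a length normalization of 1 / sqrt(termCount).
final class LengthNormDemo {
    static double lengthNorm(int termCount) {
        return 1.0 / Math.sqrt(termCount);
    }

    public static void main(String[] args) {
        double titleNorm = lengthNorm(4);       // a four-word title
        double contentsNorm = lengthNorm(400);  // a 400-word body
        System.out.println("title norm:    " + titleNorm);
        System.out.println("contents norm: " + contentsNorm);
        // The same term match is weighted more heavily in the shorter field.
        System.out.println("title is favored: " + (titleNorm > contentsNorm));
    }
}
```

Under this assumption a match in a four-word title carries roughly ten times the weight of the same match in a 400-word body, before any explicit boost.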

Doug

 -Original Message-
 From: Spencer, Dave [mailto:[EMAIL PROTECTED]]
 Sent: Monday, February 25, 2002 10:22 AM
 To: Lucene Users List
 Subject: RE: Googlifying lucene querys
 
 
 I'm pretty sure google gives priority to the words appearing in the
 title and URL.
 
 I believe sect 4.2.5 says this here:
 http://citeseer.nj.nec.com/cache/papers/cs/13017/http:zSzzSzwww-db.stanford.eduzSzpubzSzpaperszSzgoogle.pdf/brin98anatomy.pdf
 from here: 
 http://citeseer.nj.nec.com/brin98anatomy.html
 
 So you have to have Lucene store the title as a separate field.
 
 This is then what you'd have if, like me, you boost (the caret is boost)
 the title by *5 and the URL by *2:
 
 +(title:george^5.0 url:george^2.0 contents:george) +(title:bush^5.0
 url:bush^2.0 contents:bush) +(title:white^5.0 url:white^2.0
 contents:white) +(title:house^5.0 url:house^2.0 contents:house)
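
A query string of this shape can be generated rather than typed by hand. The sketch below is a hypothetical helper (the names `BoostedQueryBuilder` and `build` are not part of Lucene) that expands each word into a required clause across several fields, using the caret syntax for the boost. The field names and boost values are taken from the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: expands each word into a required clause that
// searches several fields, applying a per-field boost with the caret syntax.
final class BoostedQueryBuilder {
    static String build(String[] words, Map<String, Float> fieldBoosts) {
        StringJoiner query = new StringJoiner(" ");
        for (String word : words) {
            StringJoiner clause = new StringJoiner(" ", "+(", ")");
            for (Map.Entry<String, Float> e : fieldBoosts.entrySet()) {
                float boost = e.getValue();
                // A boost of 1.0 is the default, so it is left implicit.
                clause.add(boost == 1.0f
                        ? e.getKey() + ":" + word
                        : e.getKey() + ":" + word + "^" + boost);
            }
            query.add(clause.toString());
        }
        return query.toString();
    }

    public static void main(String[] args) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("title", 5.0f);
        boosts.put("url", 2.0f);
        boosts.put("contents", 1.0f);
        // prints: +(title:george^5.0 url:george^2.0 contents:george) +(title:bush^5.0 url:bush^2.0 contents:bush)
        System.out.println(build(new String[] {"george", "bush"}, boosts));
    }
}
```

A LinkedHashMap is used so the fields appear in insertion order; the words are assumed to be already tokenized and free of special query characters.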
 
 
 -Original Message-
 From: Ian Lea [mailto:[EMAIL PROTECTED]]
 Sent: Saturday, February 23, 2002 8:15 AM
 To: Lucene Users List
 Subject: Re: Googlifying lucene querys
 
 
 +george +bush +white +house
 
 
 --
 Ian.
 
 Jari Aarniala wrote:
  
  Hello,
  
  Despite the confusing subject ;) my question is simple. I'm just
  trying out Lucene for the first time and would like to know how one
  would go about implementing the search on the index with the same logic
  that Google uses.
  For example, if the user input is george bush white house, how
  do I easily construct a query that searches for ALL of the words above? If I
  have understood correctly, passing the search string above to the
  QueryParser creates a query that searches for ANY of the words above.
  
  Thanks for any help,
 




RE: Googlifying lucene querys

2002-02-25 Thread Joshua O'Madadhain

You cannot, in general, structure a Lucene query such that it will yield
the same document rankings that Google would for that (query, document
set).  The reason for this is that Google employs a scoring algorithm that
includes information about the topology of the pages (i.e., how the
pages are linked together).  (An overview of what Google does in this
regard may be found at http://www.google.com/technology/index.html .)
Thus, in order to get Lucene to do what Google does, you'd have to
rewrite large chunks of it.

Joshua

 [EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.

On Mon, 25 Feb 2002, Spencer, Dave wrote:

 I'm pretty sure google gives priority to the words appearing in the
 title and URL.
 
 I believe sect 4.2.5 says this here:
 http://citeseer.nj.nec.com/cache/papers/cs/13017/http:zSzzSzwww-db.stanford.eduzSzpubzSzpaperszSzgoogle.pdf/brin98anatomy.pdf
 from here: 
 http://citeseer.nj.nec.com/brin98anatomy.html
 
 So you have to have Lucene store the title as a separate field.
 
 This is then what you'd have if like me you boost (the caret is boost)
 the title by *5 and the URL by *2:
 
 +(title:george^5.0 url:george^2.0 contents:george) +(title:bush^5.0
 url:bush^2.0 contents:bush) +(title:white^5.0 url:white^2.0
 contents:white) +(title:house^5.0 url:house^2.0 contents:house)
 
 
 -Original Message-
 From: Ian Lea [mailto:[EMAIL PROTECTED]]
 Sent: Saturday, February 23, 2002 8:15 AM
 To: Lucene Users List
 Subject: Re: Googlifying lucene querys
 
 
 +george +bush +white +house
 
 
 --
 Ian.
 
 Jari Aarniala wrote:
  
  Hello,
  
  Despite the confusing subject ;) my question is simple. I'm just
  trying out Lucene for the first time and would like to know how one
  would go about implementing the search on the index with the same logic
  that Google uses.
  For example, if the user input is george bush white house, how
  do I easily construct a query that searches for ALL of the words above? If I
  have understood correctly, passing the search string above to the
  QueryParser creates a query that searches for ANY of the words above.
  
  Thanks for any help,
 




Build index using RAMDirectory out of memory errors

2002-02-25 Thread Kurt Vaag

I have been using Lucene for 3 weeks and it rules.

The indexing process can be slow. So I searched the mailgroup archives
and found example code using RAMDirectory to improve indexing speed.
The example code I found was indexing 100,000 files at a time to the
RAMDirectory before writing to disk.

I tried indexing 10,000 files at a time to the RAMDirectory before writing
to disk. This drastically improved indexing times but sometimes I get
out of memory errors. I am indexing text files and adding 9 fields from
an Oracle database.

Environment:
Solaris 2.8 with 1G of ram and 2G of swap
Java 1.3.1
Lucene 1.2-rc4

Any ideas for eliminating the out-of-memory errors?








RE: Googlifying lucene querys

2002-02-25 Thread Doug Cutting

 From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED]]
 
 You cannot, in general, structure a Lucene query such that it will yield
 the same document rankings that Google would for that (query, document
 set).  The reason for this is that Google employs a scoring algorithm that
 includes information about the topology of the pages (i.e., how the
 pages are linked together).  (An overview of what Google does in this
 regard may be found at http://www.google.com/technology/index.html .)
 Thus, in order to get Lucene to do what Google does, you'd have to
 rewrite large chunks of it.

I don't agree with your conclusion: you would not have to re-write much of
Lucene to incorporate this sort of information.  To my understanding, Google
uses linking information as a factor in scoring.  Thus every document in the
index has a factor computed from its links that is multiplied into its
score.

Lucene already keeps a factor per document that is multiplied into its
score, but one that is computed from the document's length, not its links.
Thus, once one has computed link scores, to add them to Lucene we just need
to permit applications to affect this factor, with something like a
Document.setBoost(float) method.  The representation of the per-document
factor would also need to change a little internally.  It is currently
stored as a single byte, and multiplying in an arbitrary factor would cause
overflow.  But enlarging it to 16 bits would be a small change.

So adding such a capability would require re-writing only a very small chunk
of Lucene.  Computing a link-based factor would also take some code, but
that's writing, not re-writing.
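
The overflow concern can be sketched with a toy fixed-point encoding. This is a hypothetical encoding chosen purely for illustration and is NOT Lucene's actual storage format: a factor of 1.0 maps to 255, so any boosted factor above 1.0 no longer fits in 8 bits but fits comfortably in 16.

```java
// Hypothetical encoding, NOT Lucene's actual on-disk format: a factor is
// scaled so that 1.0 maps to 255, then stored as an unsigned integer of
// the given bit width.
final class BoostRangeDemo {
    static int encode(double factor, int bits) {
        int max = (1 << bits) - 1;
        long scaled = Math.round(factor * 255.0);
        if (scaled > max) {
            throw new ArithmeticException(
                "factor " + factor + " does not fit in " + bits + " bits");
        }
        return (int) scaled;
    }

    public static void main(String[] args) {
        double lengthFactor = 0.5;  // assumed length-based factor
        double linkBoost = 5.0;     // assumed link-based boost
        System.out.println("8-bit, length only: " + encode(lengthFactor, 8));
        try {
            encode(lengthFactor * linkBoost, 8);
        } catch (ArithmeticException e) {
            System.out.println("8-bit, boosted: overflow (" + e.getMessage() + ")");
        }
        System.out.println("16-bit, boosted: " + encode(lengthFactor * linkBoost, 16));
    }
}
```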

Doug





Re: Build index using RAMDirectory out of memory errors

2002-02-25 Thread Winton Davies

java  -Xmx1000m  

Sorry if you already tried resizing your heap. Actually with 1.3.1 
you could go up above a gig, but really swapping ain't gonna help much.

Winton


I have been using Lucene for 3 weeks and it rules.

The indexing process can be slow. So I searched the mailgroup archives
and found example code using RAMDirectory to improve indexing speed.
The example code I found was indexing 100,000 files at a time to the
RAMDirectory before writing to disk.

I tried indexing 10,000 files at a time to the RAMDirectory before writing
to disk. This drastically improved indexing times but sometimes I get
out of memory errors. I am indexing text files and adding 9 fields from
an Oracle database.

Environment:
Solaris 2.8 with 1G of ram and 2G of swap
Java 1.3.1
Lucene 1.2-rc4

Any ideas for eliminating the out of memory errors ?






-- 

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/






is there any way to create and manage a controlled vocabulary in lucene?

2002-02-25 Thread Philipp Chudinov

subj?







Re: Performance Tuning

2002-02-25 Thread Otis Gospodnetic

You could try playing with the merge factor...

Otis

--- Aruna Raghavan [EMAIL PROTECTED] wrote:
 Hi,
 Are there any ways to fine-tune the CPU performance with Lucene? I know of
 the usage of optimize() calls but I am wondering if there are any other ways
 to improve the CPU time/disk space performance.
 Thanks!
 
 






Re: Build index using RAMDirectory out of memory errors

2002-02-25 Thread Ian Lea

Have you tried different values for IndexWriter.mergeFactor?
Setting it to 1000 gave me a 10x speed improvement on a
large index some time ago. That was not with RAMDirectory, though.
Your mileage may vary.


--
Ian.


Kurt Vaag wrote:
 
 I have been using Lucene for 3 weeks and it rules.
 
 The indexing process can be slow. So I searched the mailgroup archives
 and found example code using RAMDirectory to improve indexing speed.
 The example code I found was indexing 100,000 files at a time to the
 RAMDirectory before writing to disk.
 
 I tried indexing 10,000 files at a time to the RAMDirectory before writing
 to disk. This drastically improved indexing times but sometimes I get
 out of memory errors. I am indexing text files and adding 9 fields from
 an Oracle database.
 
 Environment:
 Solaris 2.8 with 1G of ram and 2G of swap
 Java 1.3.1
 Lucene 1.2-rc4
 
 Any ideas for eliminating the out of memory errors ?





RE: Build index using RAMDirectory out of memory errors

2002-02-25 Thread Kurt Vaag

Thanks Winton,

That's what it was. I just assumed java would take all the 1G that
it needed. Didn't realize the default was 64M. Also thanks for not
saying RTFM (which I had done but didn't know what TF to do with the
-Xmx option).
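
For anyone else hitting this, a quick way to confirm what heap limit the JVM is actually running with is to print it. The sketch below uses `Runtime.maxMemory()`, which requires Java 1.4 or later, so it would not run on the Java 1.3.1 mentioned in this thread; the class name `HeapCheck` is just for illustration.

```java
// Prints the maximum heap the JVM will attempt to use.
// Note: Runtime.maxMemory() requires Java 1.4 or later.
final class HeapCheck {
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it once as `java HeapCheck` and once as `java -Xmx1000m HeapCheck` shows whether the option is taking effect.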

-Kurt

-Original Message-
From: Winton Davies [mailto:[EMAIL PROTECTED]]
Sent: Monday, February 25, 2002 12:22 PM
To: Lucene Users List
Subject: Re: Build index using RAMDirectory out of memory errors


java  -Xmx1000m  

Sorry if you already tried resizing your heap. Actually with 1.3.1
you could go up above a gig, but really swapping ain't gonna help much.

Winton


I have been using Lucene for 3 weeks and it rules.

The indexing process can be slow. So I searched the mailgroup archives
and found example code using RAMDirectory to improve indexing speed.
The example code I found was indexing 100,000 files at a time to the
RAMDirectory before writing to disk.

I tried indexing 10,000 files at a time to the RAMDirectory before writing
to disk. This drastically improved indexing times but sometimes I get
out of memory errors. I am indexing text files and adding 9 fields from
an Oracle database.

Environment:
Solaris 2.8 with 1G of ram and 2G of swap
Java 1.3.1
Lucene 1.2-rc4

Any ideas for eliminating the out of memory errors ?






--

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/






RE: Index Locked For Write

2002-02-25 Thread Hayes, Mark

I am not a Lucene expert but I would like to understand the threading issues
also, and I'm wondering if the following is true when using Lucene in a
multithreaded application.

I understand there are three modes for using IndexReader and IndexWriter:

A- IndexReader for reading only, not deleting
B- IndexReader for deleting (and reading)
C- IndexWriter (for adding and optimizing)

Any number of readers may be used concurrently in mode A.  But for B and C
the reader or writer may not be kept open for long periods.  Write
operations create a lock, and closing the reader or writer is the only way
to release the lock.  In theory a single writer could be kept open, but its
lock will prevent deletions (which are performed with a separate reader).

Therefore for B and C each set of changes should be made inside a
synchronized block where the reader or writer is opened and closed.  This
prevents multiple writers (or readers used for deleting) from being open at
once.  The synchronization should be done on an object that identifies a
particular index, e.g., on a global object if there is only one index.  For
example:

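import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

class myindex {
  // One lock per index; every writer or deleting reader must hold it.
  static final Object INDEX_LOCK = new Object();

  void delete(int[] docs) throws IOException {
    synchronized (INDEX_LOCK) {
      IndexReader reader = IndexReader.open(...);
      try {
        for (int i = 0; i < docs.length; i++) {
          reader.delete(docs[i]);
        }
      } finally {
        reader.close();  // closing releases the write lock
      }
    }
  }

  void add(Document[] docs) throws IOException {
    synchronized (INDEX_LOCK) {
      IndexWriter writer = new IndexWriter(...);
      try {
        for (int i = 0; i < docs.length; i++) {
          writer.addDocument(docs[i]);
        }
        writer.optimize();
      } finally {
        writer.close();  // closing releases the write lock
      }
    }
  }
}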

Of course there are other techniques for global locking such as 'static
synchronized' methods.  Locking on a separate object per index is the
general case (where multiple indexes are present).
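
For the multi-index case, one possible sketch is a small registry that hands out a single lock object per index path. The names `IndexLocks`, `lockFor`, and `withIndexLock` are hypothetical; the sketch assumes callers pass the same path string for the same index (for example a canonical path) and it only coordinates threads inside one JVM.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical registry: one lock object per index path, so that writers and
// deleting readers for the same index never overlap within this JVM.
final class IndexLocks {
    private static final ConcurrentMap<String, Object> LOCKS =
            new ConcurrentHashMap<>();

    // Returns the same lock object every time for a given path.
    static Object lockFor(String indexPath) {
        return LOCKS.computeIfAbsent(indexPath, p -> new Object());
    }

    // Runs one set of changes while holding the lock for that index.
    static void withIndexLock(String indexPath, Runnable change) {
        synchronized (lockFor(indexPath)) {
            change.run(); // open the reader or writer, apply changes, close it
        }
    }
}
```

Changes to different indexes can then proceed in parallel, while changes to the same index are serialized.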

Is this correct?  Or should Lucene be waiting on the write lock instead of
throwing an exception?
mark

 -Original Message-
 From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
 Sent: Sunday, February 24, 2002 9:22 PM
 To: Lucene Users List
 Subject: RE: Index Locked For Write
 
 
 
 --- Howk, Michael [EMAIL PROTECTED] wrote:
  Out of curiosity, why didn't we need to close the writer in rc2 or rc3?
  
  When you suggest a synchronized keyword, are you suggesting that the
  writer is not inherently thread-safe? Do we need to write our own thread
  management on top of Lucene?
 
 Sorry, that might have been a wrong suggestion, IndexWriter (at least
 the add method) is supposed to be thread safe.
 
 Otis
 
 
  -Original Message-
  From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
  Sent: Thursday, February 21, 2002 4:07 PM
  To: Lucene Users List
  Subject: RE: Index Locked For Write
  
  
  You could use synchronized keyword and use IndexReader.isLocked() or
  something like that, no?
  
  Otis
  
  --- Howk, Michael [EMAIL PROTECTED] wrote:
   Thank you for your quick responses. But in our application, we're working in
   a transactional environment where multiple threads are accessing a single
   writer using the recommended singleton pattern. Since no thread has
   exclusive access to the writer, how can we have one thread arbitrarily
   decide to close the writer?
   
   Michael
   
   -Original Message-
   From: Mark Tucker [mailto:[EMAIL PROTECTED]]
   Sent: Thursday, February 21, 2002 3:51 PM
   To: Lucene Users List
   Subject: RE: Index Locked For Write
   
   
   You forgot to close your writer after the call to optimize.
   
   -Original Message-
   From: Howk, Michael [mailto:[EMAIL PROTECTED]]
   Sent: Thursday, February 21, 2002 2:49 PM
   To: Lucene Mailing List (E-mail)
   Subject: Index Locked For Write
   
   
   We just got the newest daily build (to try to fix some NullPointer errors
   with ? and _ characters), and we're getting the same problem that Daniel
   Calvo mentioned: Index Locked for Write. Here's basically what our code is
   doing:
 IndexWriter writer = new IndexWriter(path, 
 analyzer, create);
 try {
 Document doc = new Document();
 doc.add(Field.Keyword(DOC_ID, 14));
 doc.add(Field.UnStored(ANY, mushu));
 writer.addDocument(doc);
 writer.optimize();
   
 // Search the document for our keyword
 {   
 IndexReader reader = IndexReader.open(path);
 IndexSearcher searcher = new IndexSearcher(reader);
 Vector returnStuff = searcher.search(mushu);
 }
   
 // Verify that we got one record back
 assertNotNull(returnStuff);
 assertEquals(1, returnStuff.size());
 }
 finally {
 // Clean up after ourselves
 IndexReader reader = IndexReader.open(path);
 

Re: is there any way to create and manage a controlled vocabulary in lucene?

2002-02-25 Thread Peter Carlson

Hi,
Are you just trying to have Lucene index only terms that are in your vocabulary?

If so, then you can create your own analyzer that returns only words in your
vocabulary.

Also, you could use the StandardAnalyzer, and then you could create your own
Lucene Document and only add words that match your vocabulary.
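
As a rough sketch of the "only add words that match your vocabulary" idea, the class below filters raw text against a fixed set of allowed terms before anything is added to a document. It is plain Java and does not use the Lucene API; the class name `VocabularyFilter`, the simple split-on-non-alphanumerics tokenization, and the lowercasing are all assumptions for illustration, and the vocabulary is expected to contain lowercase terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical pre-filter: keeps only tokens that appear in a controlled
// vocabulary. The vocabulary is expected to hold lowercase terms.
final class VocabularyFilter {
    private final Set<String> vocabulary;

    VocabularyFilter(Set<String> vocabulary) {
        this.vocabulary = vocabulary;
    }

    // Lowercases the text, splits on anything that is not a letter or digit,
    // and returns the tokens that are in the vocabulary, in original order.
    List<String> keepKnownTerms(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (vocabulary.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("lucene", "index"));
        VocabularyFilter filter = new VocabularyFilter(vocab);
        // prints: [lucene, index]
        System.out.println(filter.keepKnownTerms("Lucene builds an index, fast."));
    }
}
```

The surviving terms could then be joined back into the text of the field that gets indexed.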

If you just want to see if it works, you might try to just add code on top
of your own document. There are many examples of Lucene Documents, such as the
HTMLDocument in the demo or just the text document.

Hope this helps

--Peter

On 2/25/02 11:29 AM, Philipp Chudinov [EMAIL PROTECTED] wrote:

 subj?
 
 
 
 
 

