Bug 23650 (aka "docs out of order")?

2005-02-25 Thread petite_abeille
Re: http://issues.apache.org/bugzilla/show_bug.cgi?id=23650
Hello,
I'm pretty confident that I'm misusing Lucene one way or another... and of course it was just a question of time before I ran into this "docs out of order" exception:

java.lang.IllegalStateException: docs out of order
	at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:353)
	at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:316)
	at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:290)
	at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:254)
	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:93)
	at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:487)
	at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
	at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:389)

Still... the question is... which sort of misuse would trigger such an exception?

For the record, this is using Lucene 1.4.3 on the following platform:
JVM: Java HotSpot(TM) Client VM 1.5.0-beta
Language: English (United States)
Encoding: Cp1252
Memory: 17 MB
Implementation: Sun Microsystems Inc.
OS: Windows XP 5.1
Architecture: X86
Any insight much appreciated :)
Thanks!
Cheers
--
PA, Onnay Equitursay
http://alt.textdrive.com/


Re: ngramj

2005-02-24 Thread petite_abeille
On Feb 24, 2005, at 14:50, Gusenbauer Stefan wrote:
Does anyone know a good tutorial or the javadoc for ngramj? I need it for guessing the language of the documents which should be indexed.
http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/languageidentifier/

Cheers
--
PA, Onnay Equitursay
http://alt.textdrive.com/


Re: Opening up one large index takes 940M of memory?

2005-01-23 Thread petite_abeille
On Jan 24, 2005, at 00:10, Vic wrote:
(Is there a btree serialization impl in java?)
http://jdbm.sourceforge.net/
Cheers
--
PA
http://alt.textdrive.com/


Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread petite_abeille
On Jan 22, 2005, at 23:50, Kevin A. Burton wrote:
The problem I think for everyone right now is that 32bits just doesn't 
cut it in production systems...   2G of memory per process and you 
really start to feel it.
Hmmm... no... no pain at all... or perhaps you are implying that your 
entire system is running on one puny JVM instance... in that case, this 
is perhaps more of a design problem than an implementation one... 
YMMV...

Cheers
--
PA
http://alt.textdrive.com/


Re: Lucene appreciation

2004-12-16 Thread petite_abeille
On Dec 16, 2004, at 17:26, Rony Kahan wrote:
If you are interested in Lucene work you can set up an rss feed
or email alert from here: 
http://www.indeed.com/search?q=lucene&sort=date
Looks great :)
One thing though, the web search returns 14 hits for the above query. 
Using the RSS feed only returns 4 of them. What gives?

Cheers,
PA.


Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread petite_abeille
On Dec 14, 2004, at 15:40, Kevin L. Cobb wrote:
Was wondering if anyone out there was doing the same or if there are any dissenting opinions on using Lucene for this purpose.
ZOE [1] [2] takes the same approach and uses Lucene as a relational engine of sorts.

However, for both practical and ideological reasons, it does not store any raw data in the Lucene indices themselves but instead uses JDBM [3] for that purpose.

All things considered, update issues aside, Lucene turns out to be a 
very flexible thin database.

Cheers,
PA.
[1] http://zoe.nu/
[2] http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/
[3] http://jdbm.sourceforge.net/



[RFE] IndexWriter.updateDocument()

2004-12-14 Thread petite_abeille
Well, the subject says it all...
If there is one thing which is overly cumbersome in Lucene, it's 
updating documents, therefore this Request For Enhancement:

Please consider enhancing the IndexWriter API to include an 
updateDocument(...) method to take care of all the gory details 
involved in such an operation.
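In the meantime, for the record, a minimal sketch of what such a method could wrap (the helper itself is hypothetical, not part of the Lucene 1.x API; an update boils down to a delete by keyword identifier followed by an add):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Hypothetical helper: delete the previous incarnation of the document,
// identified by its unique keyword field, then add the new one.
public static void updateDocument(String anIndexPath, Analyzer anAnalyzer,
    String aKeyField, String anID, Document aDocument) throws IOException
{
	IndexReader aReader = IndexReader.open( anIndexPath );

	aReader.delete( new Term( aKeyField, anID ) );
	aReader.close();

	IndexWriter aWriter = new IndexWriter( anIndexPath, anAnalyzer, false );

	aWriter.addDocument( aDocument );
	aWriter.close();
}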

Thanks in advance.
Cheers,
PA.


Re: GETVALUES +SEARCH

2004-12-01 Thread petite_abeille
On Dec 01, 2004, at 13:37, Karthik N S wrote:
We create an ArrayList object and load all the hit values into it and return the same for display purposes in a servlet.
Talking of which...
It would be very handy if org.apache.lucene.search.Hits would implement 
the java.util.List interface... in addition, 
org.apache.lucene.document.Document could implement java.util.Map...

That way, the rest of the application could pretend to simply have to 
deal with a List of Maps, without having to get exposed to any Lucene 
internals...
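To make the List idea concrete, a minimal read-only adapter could look like this (a sketch against the Lucene 1.x Hits API; the class name is mine, and it is untested):

import java.io.IOException;
import java.util.AbstractList;
import org.apache.lucene.search.Hits;

// Sketch: expose Hits as an unmodifiable java.util.List of Documents.
public class HitsList extends AbstractList
{
	private final Hits hits;

	public HitsList(Hits someHits)
	{
		hits = someHits;
	}

	public int size()
	{
		return hits.length();
	}

	public Object get(int anIndex)
	{
		try
		{
			return hits.doc( anIndex ); // an org.apache.lucene.document.Document
		}
		catch (IOException anException)
		{
			throw new RuntimeException( anException.toString() );
		}
	}
}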

Thoughts?
Cheers,
PA.


Re: GETVALUES +SEARCH

2004-12-01 Thread petite_abeille
On Dec 01, 2004, at 20:06, Erik Hatcher wrote:
I also extensively use multiple fields of the same name.
Odd... on the other hand... perhaps this is a matter of taste (une affaire de goût)...
 So does this rule out implementing the Map interface on Document?
Why? Nobody mentioned what value such a Map would hold... in the worst-case scenario it could hold a Collection... or perhaps it's not worth bothering with such esotericism and simply state that the DocumentMap only supports one value per key... after all... the purpose of providing standard interfaces such as List and Map is to simplify things... not to make them more cumbersome...

PA.


Re: GETVALUES +SEARCH

2004-12-01 Thread petite_abeille
On Dec 01, 2004, at 20:43, Erik Hatcher wrote:
Sure, I could put it all together as a space separated String and use  
the WhitespaceAnalyzer, but why not do it this way?  What other  
suggestions do you have for doing this?
If this works for you, I don't see any problem with it.
In general, I avoid storing any raw data in a Lucene Document. And only use Lucene for, er, indexing... but this is just me :)

But let's go back to that fabled Map interface for Document... if the purpose of such an interface is to keep things simple, it could behave just like Document.get() [1]:

"Returns the string value of the field with the given name if any exist in this document, or null. If multiple fields exist with this name, this method returns the first value added."

If for some reason(s) you need multiple values per field, stick with  
getFields()...

What's wrong with that?
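To make it concrete, a sketch of such a read-only view with exactly those semantics (the class name is mine, and the whole thing is untested):

import java.util.AbstractMap;
import java.util.Collections;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch: a read-only Map view of a Document with Document.get()
// semantics: one value per key, first value added wins.
public class DocumentMap extends AbstractMap
{
	private final Map values = new HashMap();

	public DocumentMap(Document aDocument)
	{
		for ( Enumeration anEnumeration = aDocument.fields(); anEnumeration.hasMoreElements(); )
		{
			Field aField = (Field) anEnumeration.nextElement();

			if ( !values.containsKey( aField.name() ) )
			{
				values.put( aField.name(), aField.stringValue() );
			}
		}
	}

	public Set entrySet()
	{
		return Collections.unmodifiableSet( values.entrySet() );
	}
}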
PA.
[1] http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Document.html#get(java.lang.String)



Re: GETVALUES +SEARCH

2004-12-01 Thread petite_abeille
On Dec 01, 2004, at 21:14, Chris Hostetter wrote:
The real question in my mind is not "how should we implement 'get' given that we allow multiple values?", a better question is how should we implement 'put'?
Yes, retrofitting Document.add() in the Map interface would be a pain. 
But this is not really what I was getting at. This is more about Hits 
and accessing its values. One problem at a time :)

If you think you know how to satisfy 90% of the users, I would still suggest that instead of making Document implement Map, you instead add a toMap() function that returns a wrapper with the rules that you think make sense (and leave the Document API uncluttered with the Map functions that people who don't care about Map don't need to see).
Agree. Document is fine as it is. It would be nice though to have a 
more or less standard interface to access the result set (e.g. 
Collection)... as consumers of Hits are more likely to be built in
terms of the Collection API than anything specific to Lucene...

PA.


[OT] Re: Lots Of Interest in Lucene Desktop

2004-10-29 Thread petite_abeille
On Oct 28, 2004, at 20:26, Kevin A. Burton wrote:
http://www.peerfear.org/rss/permalink/2004/10/28/LotsOfInterestInLuceneDesktop/
Many people, few ideas :)
http://www.popsearch.net/index.html
PA.


Re: Google Desktop Could be Better

2004-10-15 Thread petite_abeille
On Oct 15, 2004, at 16:10, Tom Cunningham wrote:
I'd be interested in trying to implement some of these ideas on Mac OS  
X, mostly because it's not already covered by Google Desktop, and I  
think the screensaver idea would work pretty well there.  Anyone else  
want to give this a shot?
Google invades (Windows) desktops: what's the Mac plan?
http://www.bmannconsulting.com/node/1350
Google Desktop Search - It's About Time, But Not Complete
http://bradnickel.com/?q=node/view/105
On the other hand, Apple is introducing Spotlight in their next Mac OS  
X iteration:

http://www.apple.com/macosx/tiger/spotlight.html
http://www.apple.com/macosx/tiger/spotlighttech.html
While waiting for Godot, you may want to consider the existing Search  
Kit framework as an alternative to Lucene for Mac OS X specific tasks:

http://developer.apple.com/documentation/UserExperience/Reference/SearchKit/

Cheers,
PA.
--
http://zoe.nu/


Re: Encrypted indexes

2004-10-13 Thread petite_abeille
On Oct 13, 2004, at 15:26, Nader Henein wrote:
Well, are you storing any data for retrieval from the index, because 
you could encrypt the actual data and then encrypt the search string 
public key style.
Alternatively, write your index to an encrypted volume... something along the lines of FileVault and PGP Disk [1] [2].

PA.
[1] http://www.apple.com/macosx/features/filevault/
[2] http://www.pgp.com/products/desktop/index.html


Re: indexing size

2004-09-01 Thread petite_abeille
Hi Niraj,
On Sep 01, 2004, at 06:45, Niraj Alok wrote:
If I make some of them Field.UnStored, I can see from the javadocs that it will be indexed and tokenized but not stored. If it is not stored, how can I use it while searching?
The different types of fields don't impact how you do your search. That is always the same.

Using UnStored fields simply means that you use Lucene as a pure index for search purposes only, not for storing any data.

Specifically, the assumption is that your original data lives somewhere  
else, outside of Lucene. If this assumption is true, then you can index  
everything as Unstored with the addition of one Keyword per document.  
The Keyword field holds some sort of unique identifier which allows you  
to retrieve the original data if necessary (e.g. a primary key, a URI, whatnot).

Here is an example of this approach:
(1) For indexing, check the indexValuesWithID() method
http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/SZIndex.java?view=markup

Note the addition of a Field.Keyword for each document and the use of  
Field.UnStored for everything else

(2) For fetching, check objectsWithSpecificationAndHitsInStore()
http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/SZFinder.java?view=markup
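In a nutshell, the pattern looks like this (a fragment to drop inside your indexing method; the field names and values are made up for illustration):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

Document aDocument = new Document();

// one Keyword per document: the handle back to the original data
aDocument.add( Field.Keyword( "id", "42" ) );

// everything else is indexed and tokenized, but never stored
aDocument.add( Field.UnStored( "title", "Some title" ) );
aDocument.add( Field.UnStored( "body", "Some content..." ) );

IndexWriter aWriter = new IndexWriter( "/tmp/index", new StandardAnalyzer(), true );

aWriter.addDocument( aDocument );
aWriter.close();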

HTH.
Cheers,
PA.


alternative query syntax?

2004-08-31 Thread petite_abeille
Hello,
I would like to provide an alternative query syntax for ranges by using 
a colon (':') or two dots ('..') instead of ' TO '.

For example:
mod_date:[20020101:20030101]
Or
mod_date:[20020101..20030101]
What would be the correct procedure to modify the QueryParser to 
achieve this? Should I simply change QueryParser.jj's RANGEIN_TO and 
RANGEEX_TO to the appropriate character sequence and regenerate the 
corresponding Java classes with JavaCC?
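For what it's worth, my naive guess is that the change would look something like this in QueryParser.jj before regenerating with JavaCC (untested, and the exact token definitions vary between Lucene versions, so treat it as a sketch only):

// In the range lexical states of QueryParser.jj, allow ".." (or ":")
// wherever "TO" is currently accepted, then regenerate the parser:
< RANGEIN_TO : "TO" | ".." >
< RANGEEX_TO : "TO" | ".." >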

Any pointers appreciated as I'm not familiar with JavaCC :)
TIA.
Cheers,
PA.


Re: indexing size

2004-08-31 Thread petite_abeille
On Aug 31, 2004, at 17:17, Otis Gospodnetic wrote:
You also have a large number of
fields, and it looks like a lot (all?) of them are stored and indexed.
That's what that large .fdt file indicated.  That file is  206 MB in
size.
Try using Field.UnStored() to avoid storing all that data in your indices, as it's usually not necessary.

PA.


Re: Lucene and MVC (was Re: Bad file descriptor (IOException) using SearchBean contribution)

2004-05-19 Thread petite_abeille
On May 20, 2004, at 04:38, Erik Hatcher wrote:
OffTopic: havoc and Struts go well together ;)  Pick up Tapestry 
instead!
Nah. Keep it really Simple [1] instead :o)
http://simpleweb.sourceforge.net/
PA.


index update (was Re: Large InputStream.BUFFER_SIZE causes OutOfMemoryError.. FYI)

2004-04-13 Thread petite_abeille
On Apr 13, 2004, at 02:45, Kevin A. Burton wrote:

He mentioned that I might be able to squeeze 5-10% out of index merges 
this way.
Talking of which... what strategy(ies) do people use to minimize 
downtime when updating an index?

My current strategy is as follows:

(1) use a temporary RAMDirectory for ongoing updates.
(2) perform a copy-on-write when flushing the RAMDirectory into the persistent index.

The second step means that I create an offline copy of a live index before invoking addIndexes() and then substitute the old index with the new, updated one. While this effectively increases the time it takes to update an index, it nonetheless reduces the *perceived* downtime for it.
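In code, step (2) looks more or less like this (copyIndex() and swapIndexes() are hypothetical file-shuffling helpers, and the paths are made up):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Sketch: merge the transient RAMDirectory into an offline copy of the
// live index, then substitute the old index with the new one.
public static void flush(RAMDirectory aRamDirectory) throws IOException
{
	copyIndex( "/index/live", "/index/next" ); // hypothetical offline copy

	IndexWriter aWriter = new IndexWriter( "/index/next", new StandardAnalyzer(), false );

	aWriter.addIndexes( new Directory[] { aRamDirectory } );
	aWriter.close();

	swapIndexes( "/index/next", "/index/live" ); // hypothetical substitution
}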

Thoughts? Alternative strategies?

TIA.

Cheers,

PA.





Re: Did you mean...

2004-02-12 Thread petite_abeille
On Feb 12, 2004, at 16:42, Abhay Saswade wrote:

How about creating a spellcheck dictionary with all the words in the Lucene index? That way you ensure that the word really exists in the index.
You can indeed use the terms identified by Lucene as the dictionary words and apply traditional spell-checking tricks like phonetic encodings, Levenshtein distance and so on.

This approach works reasonably well in practice.
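For the record, the edit distance itself is a ten-liner, nothing Lucene-specific (classic dynamic programming; run it between a query term and the terms enumerated from the index):

// Classic Levenshtein distance between two strings.
public static int levenshtein(String a, String b)
{
	int[][] d = new int[a.length() + 1][b.length() + 1];

	for ( int i = 0; i <= a.length(); i++ ) d[i][0] = i;
	for ( int j = 0; j <= b.length(); j++ ) d[0][j] = j;

	for ( int i = 1; i <= a.length(); i++ )
	{
		for ( int j = 1; j <= b.length(); j++ )
		{
			int cost = ( a.charAt( i - 1 ) == b.charAt( j - 1 ) ) ? 0 : 1;

			d[i][j] = Math.min( Math.min( d[i - 1][j] + 1, d[i][j - 1] + 1 ), d[i - 1][j - 1] + cost );
		}
	}

	return d[a.length()][b.length()];
}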

Cheers,

PA.





Re: Index advice...

2004-02-10 Thread petite_abeille
On Feb 10, 2004, at 14:03, Scott ganyo wrote:

I have.  While document.add() itself doesn't increase over time, the 
merge does.  Ways of partially overcoming this include increasing the 
mergeFactor (but this will increase the number of file handles used), 
or building blocks of the index in memory and then merging them to 
disk.  This has been discussed before, so you should be able to find 
additional information on this fairly easily.
This is what I noticed also: adding documents by itself is a fairly 
benign operation, but anything that triggers an index merge in one form 
or another is a killer as an index grows in size.

So, overall, adding more documents does slow down the indexing.

At least this is the impression I get. But I would love to be proven 
wrong on this :)

Cheers,

PA.



Re: index: how to store binary data or objects ?

2004-02-10 Thread petite_abeille
On Feb 10, 2004, at 14:53, Markus Brosch wrote:

My application will deal with small data sets. The problem is that I want to index the content (String) of some objects. I want to refer to that object once I find it by a keyword or whatever. So, should I use a simple map or tree?
Something along these lines:

- When indexing your object, you create one Lucene document for it and store its unique identifier as a keyword alongside whatever you want to index.

- When retrieving your documents, you can use this keyword to reference 
your object.

Another problem is that my objects can change their content and must be reindexed. Is it possible to remove the single index for that object and build a new one without reindexing all?
Yes: delete the old document by its keyword identifier via IndexReader.delete(Term), then add the reindexed document again.

Cheers,

PA.



[OT] Re: Need Advices and Help

2004-02-05 Thread petite_abeille
On Feb 05, 2004, at 13:01, Otis Gospodnetic wrote:

I believe it would be the value of a 'Message-ID' or 'Reference' or
'Reference-ID' message header.
However, I remember reading that mail readers are not very good at
sticking to a standard (some RFC, I guess), so they don't always
provide the correct ID, or they store it under non-standard names, etc.
My suggestion: Look up Zoe (see Lucene Powered By page), download it,
check its source and learn from it.
http://zoe.nu/itstories/story.php?data=stories&num=24&sec=3

And be ready for a lot of pain and suffering ;)

Trying to normalize email is not for the faint-hearted.

Just my 2¢.

Cheers,

PA.





[OT] Digital Format-Specific Validation

2003-12-06 Thread petite_abeille
http://hul.harvard.edu/jhove/

Might be of interest to some :)

Cheers,

PA.



moving documents from one index to another?

2003-11-20 Thread petite_abeille
Hello,

I'm trying to move a Document from one Index to another, without 
necessarily reindexing it...

The Document is composed of one Field.Keyword and a bunch of 
Field.UnStored.

Reading such a Document from one index and then adding it to another one doesn't seem to have the expected effect though.

Assuming that 'aReader' and 'aWriter' work on different indices:

aDocument = aReader.document( index );

aWriter.addDocument( aDocument );

The Document added to the second index doesn't seem to preserve its information...

What gives? Should I do that at a lower level? Does it make sense in the first place to try to move a raw Lucene Document between indices?

TIA.

Cheers,

PA.



Re: moving documents from one index to another?

2003-11-20 Thread petite_abeille
On Nov 20, 2003, at 13:45, Eric Jain wrote:

If the document contains unstored fields, the only way to reconstruct
the document is by iterating through all terms in the index and picking
out those that reference the document.
Hmmm... how would you do that? Something along the lines of 
aReader.terms() and then for each Term use aReader.termDocs() to try to 
figure out which document it belongs to? Something else altogether? How 
do you move the doc/terms to the other index then?
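For the record, here is roughly what I have in mind (a fragment assuming 'aReader' is an IndexReader and 'anIndex' the document number; Lucene 1.3-era API, brute force, untested):

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

// Sketch: gather every term in aReader that references document anIndex
// by walking all terms and probing each posting list.
TermEnum someTerms = aReader.terms();
List someCollectedTerms = new ArrayList();

while ( someTerms.next() )
{
	Term aTerm = someTerms.term();
	TermDocs someDocs = aReader.termDocs( aTerm );

	while ( someDocs.next() )
	{
		if ( someDocs.doc() == anIndex )
		{
			someCollectedTerms.add( aTerm );
			break;
		}
	}

	someDocs.close();
}

someTerms.close();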

This is likely to be too inefficient for any practical purposes...
That's ok :)

Alternatively, would it be possible to use FieldsReader/FieldsWriter or 
such to move the raw data from one index to the other without ill side 
effects?

TIA.

Cheers,

PA.



Re: moving documents from one index to another?

2003-11-20 Thread petite_abeille
On Nov 20, 2003, at 14:13, Eric Jain wrote:

That's what I had in mind, but maybe there is a better way. Once all terms are collected, they can be reassembled into a new document that can then be indexed again.
I see. Assuming I have the relevant terms for a given document, how would I build a new document based on those terms? Something like
adding each term's field and text to the new document? What would a 
term's text hold for an unstored field?

TIA.

PA.



Re: moving documents from one index to another?

2003-11-20 Thread petite_abeille
On Nov 20, 2003, at 14:34, Eric Jain wrote:

I believe a term always contains its own text. (It must be somewhere,
after all...) Documents on the other hand may or may not contain the
original text, depending on whether a field is stored or not.
This seems to be the case: the term's text holds the correct value.

Thanks.

Cheers,

PA.



Re: moving documents from one index to another?

2003-11-20 Thread petite_abeille
On Nov 20, 2003, at 14:34, Eric Jain wrote:

I see. Assuming I have the relevant terms for a given document, how
would a build a new document based on those terms? Something like
adding each term's field and text to the new document?
Yes.
Ok. Retrieving the terms for a document turns out to be pretty
straightforward, but building a new document turns out to be slightly 
more convoluted than expected... I basically need to know which kind of 
field to create (Stored, Indexed, Tokenized), but this information 
doesn't seem to be available in the document I'm trying to clone. I 
thought I could use the original Document's getField() method to 
retrieve this information, but aside from the Keyword field, none of 
the other fields are available... where can I get this info at this 
stage?

Here is the problematic method for cloning a document:

	private Document cloneDocumentWithTerms(final Document aDocument, final Collection someTerms)
	{
		if ( aDocument != null )
		{
			if ( someTerms != null )
			{
				Document anotherDocument = new Document();

				anotherDocument.setBoost( aDocument.getBoost() );

				for ( Iterator anIterator = someTerms.iterator(); anIterator.hasNext(); )
				{
					Term    aTerm = (Term) anIterator.next();
					String  aKey = aTerm.field();
					String  aValue = aTerm.text();
					Field   aField = aDocument.getField( aKey );
					boolean isStored = aField.isStored();
					boolean isIndexed = aField.isIndexed();
					boolean isTokenized = aField.isTokenized();
					Field   anotherField = new Field( aKey, aValue, isStored, isIndexed, isTokenized );

					anotherField.setBoost( aField.getBoost() );

					anotherDocument.add( anotherField );
				}

				return anotherDocument;
			}

			throw new IllegalArgumentException( "Index.cloneDocumentWithTerms: null terms." );
		}

		throw new IllegalArgumentException( "Index.cloneDocumentWithTerms: null document." );
	}

The problem is that aDocument.getField( aKey ) returns null most of the 
time. What gives?

TIA.

PA.



Re: Document ID's and duplicates

2003-11-19 Thread petite_abeille
On Nov 19, 2003, at 18:14, Don Kaiser wrote:

If you do this will the old version of the document be replaced by the 
new one?
No. They will coexist. In Lucene, an update implies a delete/insert 
sequence.

PA.



Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread petite_abeille
On Nov 14, 2003, at 19:50, Chong, Herb wrote:

if you are handling inter-term correlation properly, then terms can't cross sentence boundaries.
Could you not break down your document along sentence boundaries? If you manage to figure out what a sentence is, that is.

if you are not paying attention to sentence boundaries, then you are 
not following rules of linguistics.
Rules of linguistics? Is there such a thing? :)

PA.





Re: Vector Space Model in Lucene?

2003-11-14 Thread petite_abeille
On Nov 14, 2003, at 20:27, Dror Matalon wrote:

I might be the only person on the list who's having a hard time
following this discussion.
Nope. I don't understand a word of what those guys are talking about 
either :)

 Would one of you wise folks care to point me
to a good dummies, also known as an executive summary, resource about
the theoretical background of all of this. I understand the basic
premise of collecting the words and having pointers to documents and
weights, but beyond that ...
That's good enough :)

PA.



Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread petite_abeille
On Nov 14, 2003, at 20:29, Philippe Laflamme wrote:

Rules of linguistics? Is there such a thing? :)
Actually, yes there is. Natural Language Processing (NLP) is a very 
broad
research subject but a lot has come out of it.
A lot of what? If statements? :)

More specifically, rule-based taggers have become very popular since Eric Brill published his work on trainable rule-based tagging.

Essentially, it comes down to analysing sentences to determine the role (noun, verb, etc.) of each word. It's very helpful for extracting noun phrases such as "cardiovascular disease" or "magnetic resonance imaging" from documents.
I would agree with that. But it's easier said than done. And the results are never, er, clear cut.

So, yep... you can definitely derive rules to analyse natural 
language...
Well... beyond the jargon and the impressive math... this all boils 
down to fuzzy heuristics and judgment calls... but perhaps this is just 
me :)

I'm sure you already know about all of this...
Not really. I'm more of a dilettante than a NLP expert.

just thought it might be
interesting for some...
Sure. But my take on this, is that pigs will fly before NLP turns into 
a predictable science :)

PA.



Re: Vector Space Model in Lucene?

2003-11-14 Thread petite_abeille
On Nov 14, 2003, at 21:16, Chong, Herb wrote:

if you know what TREC is, you know what I meant earlier. This isn't exotic technology, this is close to 15-year-old technology.
This is not really what I asked. What I would be interested to know is what approach you consider to provide the biggest bang for your buck?

PA.



Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread petite_abeille
On Nov 14, 2003, at 21:14, Philippe Laflamme wrote:

Rules of linguistics? Is there such a thing? :)
Actually, yes there is. Natural Language Processing (NLP) is a very
broad
research subject but a lot has come out of it.
A lot of what? If statements? :)
Yes... just like every software boils down to branching and while 
loops for
the processor... ;o)
Hehe... ;) But NLP seems to suffer more from heuristics disguised in 
fancy jargon than other fields...


I would agree with that. But it's easier said than done.
Yes, of course this is very complex. That's why NLP is a very popular 
field
of research: it's challenging!
Indeed.


And the result are never, er, clear cut.
You're correct, results are not 100% perfect. But getting 95% is pretty impressive when you're dealing with computer software. Don't forget, even with many years (decades even) of experience with our own language, we humans still manage to misunderstand certain sentences... can you really expect software to be 100% correct all the time?
Nope. Therefore my tongue-in-cheek comments...


Sure. But my take on this, is that pigs will fly before NLP turns into
a predictable science :)
Maybe you're right, technologies derived from NLP may never be 
perfect. But
it doesn't make them useless. Quite the contrary I think.
Perhaps. I'm not saying it's utterly useless as a whole. But... NLP has a noted tendency to over-promise and under-deliver. Plus, it's marred with too much jargon, which is suspicious in and of itself :)

I'm not a Lucene expert, but I'm sure it could benefit from using 
derived
NLP methods for text analysis.
For hardcore text analysis, perhaps. But Lucene is a low-level indexing library. You can build something much more, er, esoteric on top of it. But I don't think that the core library would benefit from any bizarre additions. Plus, the core elements of the library already provide more than enough room to play with whatever scheme you may have in mind.

Maybe someone out there has some experience
they might want to share with us?
Perhaps. But one way or another, and as far as Lucene is concerned, you 
will be better off building something exotic on top of Lucene than 
messing around with its internals.

PA.





Re: fuzzy searches

2003-11-13 Thread petite_abeille
On Nov 11, 2003, at 21:02, Bruce Ritchie wrote:

Just a note that LSI is encumbered by US patents 4,839,853 and 5,301,109. It would be wise to make sure that any implementation is either blessed by the patent holders or does not infringe on the patents.
Since when did developers turn into armchair IP lawyers? Is it a 
national game?

PA.



Re: Objection to using /tmp for lock files.

2003-11-13 Thread petite_abeille
On Nov 13, 2003, at 19:00, Dror Matalon wrote:

I've been experimenting with it and it seems to work as advertised. It
has the advantage of not requiring *any* write capability in /tmp or
anywhere else.
There is a system property to turn off the lock files altogether.
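If memory serves, it's something like this, but double check the property name against your Lucene version (and only do it when you can guarantee single-writer access by other means):

// Assumed property name, from memory; verify against your version.
System.setProperty( "disableLuceneLocks", "true" );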

PA.



Re: Query Filters on term A in query A AND (B OR C OR D)

2003-11-13 Thread petite_abeille
On Nov 13, 2003, at 22:32, Jie Yang wrote:

I am trying to optimise the 500 OR terms so that it does not do a full 2 million docs search but only on the 1000 returned.
Would it be beneficial to move the first result set into its own 
(transient) index to perform the second part of your query?

PA.



Re: Overview to Lucene

2003-11-12 Thread petite_abeille
Hi Ralf,

On Nov 12, 2003, at 14:06, [EMAIL PROTECTED] wrote:

Does anybody know good articles which demonstrate parts of that or 
give a
good start into Lucene?
Otis Gospodnetic's articles are a good starting point:

Introduction to Text Indexing with Apache Jakarta Lucene
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html
Advanced Text Indexing with Lucene
http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html
Cheers,

PA.



Re: Document Clustering

2003-11-11 Thread petite_abeille
On Nov 11, 2003, at 16:05, Marcel Stör wrote:

As everybody seems to be so excited about it, would someone please be so kind as to explain what document-based clustering is?
This mostly means finding documents which are similar in some way(s). The similarity is mostly in the eye of the beholder. In such a world, a "cluster" would be a pile of documents sharing "something". As far as Lucene goes, a straightforward way of approaching this could be to use an entire document's content to query an index. Lucene's result set could be construed as a document cluster. Admittedly, this is ground zero of document clustering, but here you go anyway :)

Here is an illustration:

Patterns in Unstructured Data
Discovery, Aggregation, and Visualization
http://javelina.cet.middlebury.edu/lsa/out/cover_page.htm
Cheers,

PA.



Re: Document Clustering

2003-11-11 Thread petite_abeille
On Nov 11, 2003, at 16:58, Tate Avery wrote:

Categorization typically assigns documents to a node in a pre-defined 
taxonomy.

For clustering, however, the categorization 'structure' is emergent... 
i.e. the clusters (which are analogous to taxonomy nodes) are created 
dynamically based on the content of the documents at hand.
Another way to look at it is this:

"An attempt to apply the Dewey Decimal system to an orgy." [1]

Without a Dewey Decimal system, that is.

Cheers,

PA.

[1] http://www.eod.com/devil/archive/semantic_web.html



Re: Document Clustering

2003-11-11 Thread petite_abeille
On Nov 11, 2003, at 21:32, maurits van wijland wrote:

There is the carrot project :
http://www.cs.put.poznan.pl/dweiss/carrot/
Leo Galambos, author of the Egothor project, constantly supports us 
with fresh ideas and includes Carrot components in his own project!

http://www.cs.put.poznan.pl/dweiss/carrot/xml/authors.xml?lang=en

Small world :)

PA.



Re: The best way forward

2003-11-04 Thread petite_abeille
On Nov 04, 2003, at 13:04, Otis Gospodnetic wrote:

Eventually I am going to try to implement something similar to Google Groups, indexing lots of NNTP traffic. Has anyone done this before with Lucene?
Not that I know, but people have used Lucene to index their email,
which is somewhat similar.
Very similar indeed :)

Perhaps you should take a look at ZOE:

http://zoe.nu/

It uses Lucene quite extensively to index emails and that type of thing.

NNTP support could be a stone's throw away as you would only need to plug in the appropriate JavaMail Store to handle NNTP specifics.

On the other hand, I doubt that anyone has tried to index anything on 
the scale of Google's data set... NNTP or not :)

Cheers,

PA.



Re: Relational Search

2003-11-04 Thread petite_abeille
On Nov 04, 2003, at 19:28, Tate Avery wrote:

Does anyone have any creative ideas for tackling this problem with 
Lucene?
Perhaps... Not sure if this is quite what you are after, but you could take a look at ZOE's SZObject framework. It's built on top of Lucene to provide lightweight ODBMS-like functionality.

Cheers,

PA.

--
http://zoe.nu/


Re: The best way forward

2003-11-04 Thread petite_abeille
Hi Dror,

On Nov 04, 2003, at 19:33, Dror Matalon wrote:

By the way, we're also thinking of integrating newsgroups into RSS
aggregator which you can see at  www.fastbuzz.com.
ZOE does something similar already.

It can vend messages as RSS feeds:

http://zoe.nu/itstories/story.php?data=stories&num=43&sec=2

And also aggregate RSS feeds:

http://zoe.nu/itstories/story.php?data=stories&num=67&sec=2

Are you interested in comparing notes, or possibly pooling resources?
Who? ZOE? Perhaps. You should drop by its mailing list:

https://lists.sourceforge.net/lists/listinfo/zoe-develop
https://lists.sourceforge.net/lists/listinfo/zoe-general
Archives available here:

http://news.gmane.org/gmane.mail.zoe.devel/
http://news.gmane.org/gmane.mail.zoe.general/
 We
have plenty of technical resources, and we've run news servers before,
although it's been a few years.
Cheers,

PA.



Re: Term out of order.

2003-10-30 Thread petite_abeille
On Oct 30, 2003, at 13:36, Pasha Bizhan wrote:

I think that it's a problem of the Java version of Lucene, because all core algorithms of Lucene and Lucene.Net are identical.
Talking of which... it appears... that... something... is... wrong...  
somewhere...

This definitely needs some additional investigation on my side as I'm quite at a loss about this sudden exception and I cannot reproduce it myself... sigh...

Trace: java.io.IOException: term out of order
	at org.apache.lucene.index.TermInfosWriter.add(TermInfosWriter.java:103)
	at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:249)
	at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:225)
	at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:188)
	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
	at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:425)
	at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:301)
	at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:316)



Exotic format indexing?

2003-10-30 Thread petite_abeille
Hello,

Indexing a multitude of esoteric formats (MS Office, PDF, etc) is a 
popular question on this list...

The traditional approach seems to be to try to find some kind of format-specific reader to properly extract the textual part of such documents for indexing. The drawback of such an approach is that it's complicated and cumbersome: many different formats, not that many Java libraries to understand them all.

An alternative to such a mess could be perhaps to convert this multitude of formats into something more or less standard and then extract the text from that. But again, this doesn't seem to be such a straightforward proposition. For example, one could imagine printing every document to PDF and then converting the resulting PDF to text. Not a piece of cake in Java.

Finally, a while back, somebody on this list mentioned quite a different approach: simply read the raw binary document and go fishing for what looks like text. I would like to try that :)
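Something along these lines is what I have in mind: scan the raw bytes and keep runs of printable characters above some minimum length (the threshold is arbitrary):

import java.io.FileInputStream;
import java.io.IOException;

// Sketch: fish printable ASCII runs of at least MIN_RUN characters
// out of an arbitrary binary file.
public static String fishForText(String aPath) throws IOException
{
	final int MIN_RUN = 4;
	StringBuffer aResult = new StringBuffer();
	StringBuffer aRun = new StringBuffer();
	FileInputStream aStream = new FileInputStream( aPath );
	int aByte;

	while ( ( aByte = aStream.read() ) != -1 )
	{
		if ( aByte >= 0x20 && aByte < 0x7f )
		{
			aRun.append( (char) aByte );
		}
		else
		{
			if ( aRun.length() >= MIN_RUN )
			{
				aResult.append( aRun ).append( ' ' );
			}

			aRun.setLength( 0 );
		}
	}

	if ( aRun.length() >= MIN_RUN )
	{
		aResult.append( aRun );
	}

	aStream.close();

	return aResult.toString();
}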

Does anyone remember this proposal? Has anyone tried such an approach?

Thanks for any pointers.

Cheers,

PA.



Re: 182 file formats for lucene!!! was: Re: Exotic format indexing?

2003-10-30 Thread petite_abeille
Hi Stefan,

On Oct 30, 2003, at 21:02, Stefan Groschupf wrote:

just to let you know, I had implemented for the Nutch project a plugin that can parse 182 file formats, including M$ Office. I simply use OpenOffice and its available Java API.
Yes, I saw that. Great work :)

Unfortunately, using OpenOffice is not an option in my case :(

Cheers,

PA.



Re: Exotic format indexing?

2003-10-30 Thread petite_abeille
On Oct 30, 2003, at 20:48, Ben Litchfield wrote:

Unfortunately, it is not quite so easy.  I am not sure about Word
documents
The raw text is visible.

but PDFs usually have their contents compressed
Yep. PDF is really an image format ;)

so a raw
fishing around for text would be pointless.
That's alright. I can handle PDF separately if the need arise.

 Your best bet is to use a
package like the one from textmining.org that handles various formats 
for
you.
Perhaps. But I'm only looking for a good enough solution, not a 
perfect one :)

Cheers,

PA.



Re: java.nio.channels.FileLock

2003-10-29 Thread petite_abeille
On Oct 29, 2003, at 19:08, Ronald Muller wrote:

What is the advantage of using a FileLock object instead of the way 
Lucene
does it? (I do not see it)
Less code. Less worries.

Also note an important limitation: "File locks are held on behalf of the entire Java virtual machine. They are not suitable for controlling access to a file by multiple threads within the same virtual machine."
Perhaps. Have you used it? Any practical experience with it? For or 
against?

Cheers,

PA.



Re: Weird NPE in RAMInputStream when merging indices

2003-10-22 Thread petite_abeille
Hi Otis,

On Wednesday, Oct 22, 2003, at 18:06 Europe/Amsterdam, Otis Gospodnetic 
wrote:

Since 'files' is a Hashtable, neither the key nor the value (file) can
be null, even though the NPE in RAMInputStream constructor implies that
file was null.
Yep... pretty weird... but looking at openFile(String name)... could it somehow be possible that the name is invalid for some reason and therefore doesn't exist in the Hashtable? So files.get(name) would return null and new RAMInputStream(file) would then raise an NPE?

This would not explain why the name is invalid in the first place... 
but that could be a start for an investigation... what do you think?

Cheers,

PA.



Re: new release: 1.3 RC2

2003-10-22 Thread petite_abeille
Hello,

On Wednesday, Oct 22, 2003, at 18:13 Europe/Amsterdam, Doug Cutting 
wrote:

A new Lucene release is available.
Very nice. Thanks :)

Quick question regarding release note number 11:

What's the difference between IndexWriter.addIndexes(IndexReader[]) and IndexWriter.addIndexes(Directory[]) besides the fact that one takes an array of IndexReader and the other an array of Directory? Any functional differences? Is one way recommended over the other?

Cheers,

PA.





Weird NPE in RAMInputStream when merging indices

2003-10-21 Thread petite_abeille
Hello,

What could cause such weird exception?

RAMInputStream.<init>: java.lang.NullPointerException
java.lang.NullPointerException
	at org.apache.lucene.store.RAMInputStream.<init>(RAMDirectory.java:217)
	at org.apache.lucene.store.RAMDirectory.openFile(RAMDirectory.java:182)
	at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:78)
	at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:116)
	at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:378)
	at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:298)
	at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:313)

I don't know if this is a one off as I cannot reproduce this problem 
nor I have seen this before, but I thought I could as well ask.

This is triggered by merging a RAMDirectory into a FSDirectory. Looking 
at the RAMDirectory source code, this exception seems to indicate that 
the file argument to the RAMInputStream constructor is null... how 
could that ever happen?

Here is the code which triggers this weirdness:

this.writer().addIndexes( new Directory[] { aRamDirectory } );

The RAM writer is checked before invoking this code to make sure there 
is some content in the RAM directory:

aRamWriter.docCount() > 0

This has been working very reliably since the dawn of time, so I'm a little bit at a loss as to how to diagnose this weird exception...

Any ideas?

Thanks.

Cheers,

PA.



[OT] Open Source Goes to COMDEX

2003-10-20 Thread petite_abeille
Hello,

This is pretty much off topic, but...

ZOE has been nominated as one of the candidate projects to go to the Open Source Innovation Area on the COMDEX Exhibit Floor.

http://www.oreillynet.com/contest/comdex/

ZOE is one of the few Java projects shortlisted and it uses Lucene quite extensively.

Show your support by voting for ZOE :)

Cheers,

PA.

--
http://zoe.nu/


Index locked for write

2003-10-04 Thread petite_abeille
[Posted to Dev by mistake]
[Reposted to User]
[Sorry for the mess]
Hello,

I recently updated from 1.3 RC1 to the latest CVS version. RC1 has proven very reliable for me, but I needed Dmitry's compound index functionality. Therefore the move to the CVS version.

I have been using 1.3 RC1 without any problem. But... since updating to 
the cvs version, I'm getting a lot of apparently random IOException 
related to locking:

java.io.IOException: Index locked for write: Lock@/tmp/lucene-5b228139f8fe55f7c74441a7d59f8f89-write.lock
	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:173)
	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:150)

This is most likely due to some problem on my side, but for the life of 
me I cannot track it down nor reproduce it :(

Also, the only change related to Lucene on my side was the update from 1.3 RC1 to the CVS version. Perhaps this has triggered a dormant bug in my app. Or perhaps something has changed in the CVS version which impacts me negatively. Either way, I'm at a loss.

My guess would be that this is most likely a threading issue. On my side, I use a very conservative threading model which supposedly synchronizes any access to Lucene. And this hasn't changed for a good while.

Any idea where I should look in such a situation? Any significant 
changes related to locking on Lucene side?

For the record, this problem seems to mostly manifest itself under Mac 
OS X, running Java 1.4.1_01.

Thanks.

Cheers,

PA.



Re: which lock belong to which index?

2003-10-02 Thread petite_abeille
Hi Otis,

On Thursday, Oct 2, 2003, at 13:56 Europe/Amsterdam, Otis Gospodnetic 
wrote:

I cannot remember the answer I got, but I asked the same question after
the code was changed to put locks in java.io.tmpdir.
Because I have an application that deals with a lot of indices
simultaneously, I felt like this will make things more difficult in
cases where you have stale locks, etc.
Try the archive, though, as I seem to recall that somebody, Doug or
Scott gave me the answer.
I see... I'm sure I could get to the lock name and scan the tmp directory for a match... but why such a complication in the first place? The only thing I can think of is applications running on read-only media... but in such a case there is no need for a lock in the first place...
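For what it's worth, the lock name looks like an MD5 digest of the index path, so one could presumably compute the match instead of scanning for it... a guess from the observed file names, not from the source:

import java.io.File;
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Guesswork sketch: reconstruct a write lock file name from an index
// path, assuming the name is "lucene-" + MD5(path) + "-write.lock".
public static String lockNameFor(String anIndexPath) throws IOException, NoSuchAlgorithmException
{
	String aPath = new File( anIndexPath ).getCanonicalPath();
	byte[] aDigest = MessageDigest.getInstance( "MD5" ).digest( aPath.getBytes() );
	StringBuffer aName = new StringBuffer( "lucene-" );

	for ( int i = 0; i < aDigest.length; i++ )
	{
		String aHex = Integer.toHexString( aDigest[i] & 0xff );

		if ( aHex.length() == 1 ) aName.append( '0' );

		aName.append( aHex );
	}

	return aName.append( "-write.lock" ).toString();
}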

Cheers,

PA. Very confused.



Re: Is the lucene index serializable?

2003-09-23 Thread petite_abeille
Can I send a small lucene index by SOAP/TCP/HTTP/RMI? Is there a way 
to serialize a Lucene Index?
I want to send it from the Indexer server to the Search Server, and then do a merge operation in the Search Server with the previous index file.
Well, what about a very old-fashioned way instead? Something like tar.gz.ftp? Not very glamorous, but workable...

Cheers,

PA.



Re: Design question

2003-09-23 Thread petite_abeille
I, like a lot of other people, am new to Lucene. Practical examples are pretty scarce.
If you don't mind learning by example, take a look at the "Powered by Lucene" page. A fair number of those projects are open source.

http://jakarta.apache.org/lucene/docs/powered.html

PA.



Re: Lucene app to index Java code

2003-09-04 Thread petite_abeille
Hi Otis,

On Thursday, Sep 4, 2003, Otis Gospodnetic wrote:

Has anyone written an application that uses Lucene to index Java code,
either from the source .java files, or compiled .class files?
If you are talking about my ultra-secret project "Zapata: Coding Mexican Style", then yes ;)

But... it uses runtime information to reach its devious ends and is 
more like a documentation tool than anything else...

Anyway, this is how it goes:

Given a set of binary jar files, it builds an object graph of the bytecode: packages, classes, methods and so on. Complete with interdependencies and other handy information. The bytecode is also run through a decompiler and pretty-printed to normalize the source. Code segments are attached and indexed alongside their owners (class or method). All this fully indexed, searchable and cross-referenced.

This is built upon the same engine used by ZOE, so the end result is 
very much along the lines of what ZOE does for email, but for code 
instead... fun, fun, fun ;)

Cheers,

PA.



Re: Lucene app to index Java code

2003-09-04 Thread petite_abeille
Hi Erik,

On Thursday, Sep 4, 2003, at 15:03 Europe/Zurich, Erik Hatcher wrote:

- XDoclet could be used to sweep through Java code and build a 
text/XML file as richly as you'd like from the information there 
(complete with JavaDoc tags, which Zapata will miss :)),
Correct. This happens to be on purpose :) Does XDoclet build an intertwingled object graph of your code along the way? Performing a plain search on a code base is pretty trivial... what seems to be more interesting would be to put that in context.

Zapata does something along the line of what MagicHat does for 
Objective-C:

http://homepage.mac.com/petite_abeille/MagicHat/

But from the sound of what Otis is saying this is not what you guys are 
looking for... back to the pampa then...

Cheers,

PA.



Re: StandardTokenizer problem

2003-09-04 Thread petite_abeille
On Thursday, Sep 4, 2003, at 16:07 Europe/Zurich, Nicolas Maisonneuve 
wrote:

"I.B.M" can be a host or acronym, so there is a problem, no?
Perhaps as far as this parser goes... but... in practice... '.M' is not 
a valid TLD.

PA.



Re: IndexReader.delete(Term)?

2003-08-27 Thread petite_abeille
Hi Erik,

On Wednesday, Aug 27, 2003, Erik Hatcher wrote:

What you are doing looks fine to me. I'm sure these are obvious questions, kinda like "is your computer plugged in?", but here goes:

- How are you determining that the document is still there?  With an 
IndexReader?  IndexSearcher?
- A freshly created (i.e. after the delete) Index[Searcher|Reader]?
- And finally, did you remember to recompile?!  :))  (just kidding)
Thanks for the moral support :)

In any case, hard liquor and coding don't always mix well together, so obviously I was shooting myself in the foot...

For the record, I'm using a RAMDirectory which then gets flushed into a 
FSDirectory.

Deleting something means checking both the RAM and FS directory. Which 
is what I do.

But... because of the internal caching done by the IndexWriter, a 
document is not made available straight away... therefore 
IndexReader.delete(Term) returning zero and me banging my head against 
the wall... adjusting the order of operations did solve the problem...

Which brings up a question: is there a way to influence the IndexWriter's internal RAM cache, besides closing or optimizing a writer?

Cheers,

PA.



IndexReader.delete(Term)?

2003-08-26 Thread petite_abeille
Hello,

This is more a sanity check, than anything else, but...

I'm trying to delete a document using IndexReader.delete(Term)... (for 
the record I'm using 1.3-rc1)

The document was created with a Field.Keyword() to uniquely identify it.

The document exists, was saved, can be queried, life is good :)

But then... when trying to delete the same document later on... 
IndexReader.delete(Term) returns 0 and the document doesn't get 
deleted... which is driving me crazy 8}

Here is what I'm doing to delete the document:

Term aTerm = new Term( aKey, anID );

aReader.delete( aTerm );
aReader.close();

The term looks like the following:

szid:3FA7168800F7FDE8ECAA35500A00012D

But this doesn't seem to do anything... The document is still there no 
matter what...

I'm sure I'm doing something very wrong, but for the life of me I 
cannot see what... anything obvious I'm missing?

Thanks.

Cheers,

PA.



Advanced Text Indexing with Lucene

2003-03-06 Thread petite_abeille
Another fine article by Otis:

http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html

PA.



Re: Indexing Tips and Hints

2003-02-25 Thread petite_abeille
On Tuesday, Feb 25, 2003, at 11:48 Europe/Zurich, Andrzej Bialecki 
wrote:

This is strange, or at least counter-intuitive - if you buffer larger 
parts of data in RAM than the standard implementation does, it should 
definitely be faster... Let's wait and see what Terry comes up with.

BTW. how large indexes did you use for testing?
A small testing set: around 100 MB.

Also, it could be that the indexing process is bound by some other 
bottleneck,
Most definitely.

and buffering helps only when searching an already existing index.
Oops... forgot to mention that the purpose of my testing was to test
searching... I don't mind indexing speed that much... in any case... 
more generally I wanted to see if a buffered random access file would 
help in my peculiar situation... but no noticeable differences in my 
case one way or another... on the other hand... that could be just me 
as there is much more than straightforward Lucene indexing/searching 
going on. Let that not discourage you :-) In any case, Lucene itself is 
pretty speedy overall. The only bottleneck is index merging in my 
experience.

Cheers,

PA.



Re: Best HTML Parser !!

2003-02-25 Thread petite_abeille
On Monday, Feb 24, 2003, at 20:28 Europe/Zurich, Lukas Zapletal wrote:

I have some good experiences with JTidy. It works like a DOM XML parser and cleans up the HTML along the way.
I use jtidy also. Both for parsing and clean-up. Works pretty nicely.

This is VERY useful, because EVERY HTML page has at least ONE error.
This rule should be tattooed on every parser's head: out of the laboratory, nothing is compliant. Which renders the race to more compliance among the different parsers somewhat ridiculous.

Cheers,

PA.



read past EOF?

2003-01-07 Thread petite_abeille
Hello,

Here is a pretty fatal exception I get from time to time in Lucene...

java.io.IOException: read past EOF
	at org.apache.lucene.store.FSInputStream.readInternal(FSDirectory.java:277)
	at org.apache.lucene.store.InputStream.readBytes(Unknown Source)
	at org.apache.lucene.index.SegmentReader.norms(Unknown Source)
	at org.apache.lucene.index.SegmentReader.norms(Unknown Source)
	at org.apache.lucene.search.TermQuery.scorer(Unknown Source)
	at org.apache.lucene.search.BooleanQuery.scorer(Unknown Source)
	at org.apache.lucene.search.Query.scorer(Unknown Source)
	at org.apache.lucene.search.IndexSearcher.search(Unknown Source)
	at org.apache.lucene.search.Hits.getMoreDocs(Unknown Source)
	at org.apache.lucene.search.Hits.<init>(Unknown Source)
	at org.apache.lucene.search.Searcher.search(Unknown Source)
	at org.apache.lucene.search.Searcher.search(Unknown Source)

Any idea what could cause such, er, misbehavior?

PA.





Re: Heuristics on searching HTML Documents ?

2002-12-30 Thread petite_abeille

On Monday, Dec 30, 2002, at 15:01 Europe/Zurich, Erik Hatcher wrote:


If you have control over the HTML, how about marking the navbar pieces 
with a certain CSS class and then filtering that out from what you 
index?  It seems like that would be a reasonable way to filter it - 
but this is of course provided it's your HTML and not someone else's.

Alternatively, if the documents' creation is out of your hands, you could try to compute the longest common prefix/suffix of a set of documents and discount that from your indexing.
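A sketch of the prefix half of the idea (the suffix case is symmetric):

// Sketch: longest common prefix of a set of documents; whatever is
// common to all of them is likely boilerplate (navbar, header) and can
// be discounted from indexing.
public static String commonPrefix(String[] someDocuments)
{
	if ( someDocuments.length == 0 ) return "";

	String aPrefix = someDocuments[0];

	for ( int i = 1; i < someDocuments.length; i++ )
	{
		String aDocument = someDocuments[i];
		int aLength = Math.min( aPrefix.length(), aDocument.length() );
		int j = 0;

		while ( j < aLength && aPrefix.charAt( j ) == aDocument.charAt( j ) ) j++;

		aPrefix = aPrefix.substring( 0, j );
	}

	return aPrefix;
}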

PA.





Re: powered by lucene question

2002-12-27 Thread petite_abeille

On Friday, Dec 27, 2002, at 18:22 Europe/Zurich, Otis Gospodnetic wrote:


It would be nice to make that Lucene image clickable, which should be a
piece of cake, since Zoe uses HTML for rendering the UI.
Doable?


Well... yes. This is how it works in the application itself: you can 
click on the Lucene logo... The screenshot was simply to give you a 
preview of what to expect...

Am I missing something?-)

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]



Re: write.lock file

2002-12-20 Thread petite_abeille

On Friday, Dec 20, 2002, at 19:48 Europe/Zurich, Doug Cutting wrote:


Can you provide a reproducible test case that demonstrates index 
corruption?

I honestly wish I could. Unfortunately, because of the nature of the 
application (Otis is familiar with it), I never seem to be able to come 
up with a consistent test case. I might be using Lucene in a very 
peculiar way (?) which includes a lot of concurrent read/write/delete 
on multiple indexes/threads. So I haven't managed to can the problem 
in a nice and tidy batch-oriented test case. Sigh...

In any case, the external symptoms are: bad file descriptor, read past 
EOF and array out of bounds. Next time around, I will capture the full 
stack trace and forward it to the list if you guys are interested.

Cheers,

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]



powered by lucene question

2002-12-20 Thread petite_abeille
Hello,

I'm in the process of creating the about page for my app and I was 
wondering what the requirements are to get included in the Powered by 
Lucene page?

The app is a desktop application... it's not a web site. The only 
requirement I see is "Please include something like the following with 
your search results: search powered by lucene".

My question is: is it good enough to put the above in the about page, 
or does it _have_ to be in the search results page? I'm fine with the 
first scenario (about page) but I'm very reluctant about the second 
(search page). This reluctance has nothing to do with branding or failure 
to give Lucene credit: it's simply that the search page is already very 
crowded and I don't think adding anything to it will improve the 
situation :-| On the contrary :-(

Comments?

Thanks.

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]



package information?

2002-12-20 Thread petite_abeille
Hi,

Would it be possible for Lucene to provide package information? 
Basically all the java.lang.Package attributes... Things like 
implementation vendor, name, version and so on... This would make it 
easier to identify which packages/versions are used.

Thanks.

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]



Re: package information?

2002-12-20 Thread petite_abeille

On Friday, Dec 20, 2002, at 21:44 Europe/Zurich, Eric Isakson wrote:


I think this info is available via the Manifest that is created during 
the build. This is cut from the build.xml from the latest CVS...


Great! I must have overlooked it somehow.

Thanks.

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: write.lock file

2002-12-17 Thread petite_abeille

On Tuesday, Dec 17, 2002, at 17:43 Europe/Zurich, Doug Cutting wrote:


Index updates are atomic, so it is very unlikely that the index is 
corrupted, unless the underlying file system itself is corrupted.

Ummm... Perhaps in theory... In practice, indexes seem to get 
corrupted quite easily in my experience. On the other hand, I seldom 
get a file system corruption. As always, YMMV.

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]



Re: Indexing in a CBD Environment

2002-12-11 Thread petite_abeille

On Wednesday, Dec 11, 2002, at 15:21 Europe/Zurich, Cohan, Sean wrote:


Is there a better way to provide an acceptable searching mechanism 
using the
relational database engine?

Well, it depends on what you mean by acceptable... but if you are using 
Oracle, you should look into Oracle Text:

http://otn.oracle.com/products/text/content.html
http://www.searchtools.com/tools/oracle-search.html

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]



Re: Indexing in a CBD Environment

2002-12-10 Thread petite_abeille

On Wednesday, Dec 11, 2002, at 07:16 Europe/Zurich, Otis Gospodnetic 
wrote:

It uses Lucene as an
object store, of sort, I believe, with variuos relations between
objects (I did not look at the source, but I suspect it does this based
on the functionality it offers).


Yep. The basic approach ZOE takes is to create one index per class and 
index the primary and foreign keys as keywords. It then queries the 
different indexes to simulate a relational storage... Which is all 
handy dandy... On the other hand, if you already have a relational 
database, there is no reason to go through this circus in the first 
place...
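
For the curious, the scheme boils down to something like this (a sketch 
against the Lucene 1.x API; the Order/customer names are made up for 
illustration):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class KeywordJoin {
    public static void main(String[] args) throws Exception {
        // one index per class; keys indexed as untokenized keywords
        IndexWriter writer = new IndexWriter("indexes/Order", new StandardAnalyzer(), true);
        Document order = new Document();
        order.add(Field.Keyword("pk", "order-1"));           // primary key
        order.add(Field.Keyword("customer", "customer-42")); // foreign key
        writer.addDocument(order);
        writer.close();

        // the "join": fetch every order belonging to a given customer
        IndexSearcher searcher = new IndexSearcher("indexes/Order");
        Hits hits = searcher.search(new TermQuery(new Term("customer", "customer-42")));
        System.out.println(hits.length() + " order(s)");
        searcher.close();
    }
}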

  You may want to look at its source.


If you are so inclined, you can check the alt.dev.szobject package for 
more gory details. In particular, SZIndex deals with Lucene directly.

You can find the app and its source here:

http://guests.evectors.it/zoe/

Cheers,

PA.



--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]



Re: Indexing email messages?

2002-12-06 Thread petite_abeille

On Friday, Dec 6, 2002, at 11:12 Europe/Zurich, Ashley Collins wrote:


I'm using Lucene to index MIME messages and have a couple of questions.


You should take a look at ZOE as it does all that and more. It's open 
source and uses Lucene to index every single bit of email.

http://guests.evectors.it/zoe/

Cheers,

PA.



--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]



Re: Readability score?

2002-11-23 Thread petite_abeille

On Friday, Nov 22, 2002, at 20:46 Europe/Zurich, petite_abeille wrote:


Does anyone have a handy library to compute readability score?


Here is an extract from a paper describing the Flesch index and an  
algorithm to count syllables... Does that make any sense?

Thanks.

The Flesch index: An easily programmable readability analysis  
algorithm
-- John Talburt

 ... Each vowel (a, e, i, o, u, y) in a word counts as one syllable, 
subject to the following sub-rules:

- Ignore final -ES, -ED, -E (except for -LE)
- Words of three letters or less count as one syllable
- Consecutive vowels count as one syllable

Although there are many exceptions to these rules, it works in a 
remarkable number of cases. ...

http://portal.acm.org/citation.cfm?id=10583&coll=portal&dl=ACM&CFID=5876721&CFTOKEN=58538732
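
For what it's worth, those rules translate into something like the 
following sketch (the Flesch Reading Ease formula itself is 
206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)):

public class Flesch {
    // syllable count per the sub-rules quoted above
    public static int countSyllables(String word) {
        String w = word.toLowerCase();
        if (w.length() <= 3) return 1;   // three letters or less: one syllable
        if (!w.endsWith("le")) {         // ignore final -ES, -ED, -E (except -LE)
            w = w.replaceAll("(es|ed|e)$", "");
        }
        int count = 0;
        boolean previousWasVowel = false;
        for (int i = 0; i < w.length(); i++) {
            boolean vowel = "aeiouy".indexOf(w.charAt(i)) >= 0;
            if (vowel && !previousWasVowel) count++;  // consecutive vowels count once
            previousWasVowel = vowel;
        }
        return Math.max(count, 1);
    }

    public static double readingEase(int sentences, int words, int syllables) {
        return 206.835 - 1.015 * ((double) words / sentences)
                       - 84.6 * ((double) syllables / words);
    }
}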


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]



Readability score?

2002-11-22 Thread petite_abeille
Hello,

This is slightly off topic but...

Does anyone have a handy library to compute readability score?

Something like the Flesch Reading Ease score & Co:

http://thibs.menloschool.org/~djwong/docs/wordReadabilityformulas.html

Would you like to share?-)

Thanks.

R.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




using lucene as a lookup table?

2002-09-27 Thread petite_abeille

Hello,

I would like to use Lucene as a kind of lookup table (aka Map):

A document would have two fields:

- the first field would represent a random lookup key in the form of a 
Field.Keyword
- the second field would be an object id also stored as a Field.Keyword
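
Concretely, the setup would look something like this (a sketch against 
the Lucene 1.x API; the key/oid field names are just for illustration):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class LookupTable {
    public static void main(String[] args) throws Exception {
        // writing one entry of the "map"
        IndexWriter writer = new IndexWriter("lookup", new StandardAnalyzer(), true);
        Document entry = new Document();
        entry.add(Field.Keyword("key", "some-random-key")); // the lookup key
        entry.add(Field.Keyword("oid", "object-42"));       // the object id
        writer.addDocument(entry);
        writer.close();

        // reading it back: a searcher opened *before* the write won't see
        // the new entry, so it has to be (re)opened after the writer closes
        IndexSearcher searcher = new IndexSearcher("lookup");
        Hits hits = searcher.search(new TermQuery(new Term("key", "some-random-key")));
        String oid = hits.length() > 0 ? hits.doc(0).get("oid") : null;
        System.out.println("oid: " + oid);
        searcher.close();
    }
}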

Which sounds fine in theory. Unfortunately it doesn't seem to quite 
work in practice: when inserting a new document and trying to look it 
up straight away, I usually don't get any result back for a while.

Maybe I'm simply missing something very obvious, but how does one 
lookup a document that was just inserted in an index?

Thoughts?

Thanks.

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: using lucene as a lookup table?

2002-09-27 Thread petite_abeille


On Friday, Sep 27, 2002, at 13:27 Europe/Zurich, petite_abeille wrote:

 - the first field would represent a random lookup key in the form of a 
 Field.Keyword

Ooops... I should have mentioned that the key field is stored as Field( 
aKey, aValue, false, true, false): eg not stored, indexed, not 
tokenized. It's basically only indexed, as I don't need its value for 
lookup purposes.

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: text format and scoring

2002-08-03 Thread petite_abeille

Hi Alex,

On Saturday, August 3, 2002, at 11:13 , Alex Murzaku wrote:

 Hi PA! How are things going?

Doing all right :-)


 It's an interesting question but I don't think Lucene
 (as it is today) could change weights based on
 semantics (either assigned by formatting tags or maybe
 looked up in some dictionary like WordNet)...

Ummm... I see.


 Some time ago, Doug sent to this list the formula for
 the score computation which is:

Thanks.


 The only thing that counts is the frequency of the
 terms in the document and among documents.

 A way to influence the final score might be to tweak
 the real frequencies during indexing with some
 parameters configured externally. Let's say if the
 word is underlined then multiply its count by X. This
 modified TF should influence the final score
 accordingly.

 Just a thought...

I see. That's basically what I'm doing right now: I index a 
document multiple times (eg an email could be indexed by subject, first 
sentence and body content). Then I do multiple searches, and use a 
ranking comparator to evaluate the results based on how many times I get 
a specific document, plus its Lucene scores and other funky heuristics. 
Which seems to work ok, but is kind of cumbersome :-( Same deal for 
finding related documents. Lucene is very good at finding similar 
documents, but for related ones (think cluster ;-), I basically end up 
doing some term categorization and assigning a multiplying factor to 
each term category, which I then feed to Lucene to get something more 
akin to a cluster of documents...
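
For illustration, the merging part looks roughly like this (a sketch 
only; the field names match the example above, and uuid is assumed to be 
a stored keyword identifying the object):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class MergedRanking {
    // run the same term against several fields and pile up the scores
    public static Map rank(IndexSearcher searcher, String token) throws Exception {
        String[] fields = { "subject", "firstSentence", "body" };
        Map scores = new HashMap();  // uuid -> accumulated score
        for (int i = 0; i < fields.length; i++) {
            Hits hits = searcher.search(new TermQuery(new Term(fields[i], token)));
            for (int j = 0; j < hits.length(); j++) {
                String uuid = hits.doc(j).get("uuid");
                Float previous = (Float) scores.get(uuid);
                float score = hits.score(j) + (previous == null ? 0f : previous.floatValue());
                scores.put(uuid, new Float(score));  // hits in several fields add up
            }
        }
        return scores;  // sort by value to get the final ranking
    }
}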

In any case, I was simply wondering if there was a more straightforward 
way of doing things.

Cheers,

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




text format and scoring

2002-08-02 Thread petite_abeille

Hello,

I was wondering what would be a good way to incorporate text format 
information in Lucene word/document scoring. For example, when turning 
HTML into plain text for indexing purposes, a lot of potentially useful 
information is lost: eg tags like bold, strong and so on could be 
understood as conveying emphasis information about some words. If 
somebody took the pains to underline some words, why throw that away? 
Assuming there is some interesting meaning in a document's format/layout, 
and a way to understand and weight it, how could one incorporate this 
information into document scoring?
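
One way to go about it, sketched below (just an idea, not Lucene 
doctrine, and assuming your Lucene version has Query.setBoost): pull the 
emphasized words into their own field at index time, then boost that 
field at query time. The field names are made up.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class EmphasisScoring {
    public static Document index(String plainText, String emphasizedText) {
        Document doc = new Document();
        doc.add(Field.Text("body", plainText));
        doc.add(Field.Text("emphasis", emphasizedText)); // text found inside b/strong tags
        return doc;
    }

    public static BooleanQuery query(String token) {
        TermQuery body = new TermQuery(new Term("body", token));
        TermQuery emphasized = new TermQuery(new Term("emphasis", token));
        emphasized.setBoost(3.0f);            // emphasized occurrences weigh more
        BooleanQuery query = new BooleanQuery();
        query.add(body, false, false);        // optional clause
        query.add(emphasized, false, false);  // optional clause
        return query;
    }
}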

Thanks for any insights :-)

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Lucene for OSX?

2002-07-16 Thread petite_abeille

Hello,

I was wondering if anybody knows of a Lucene port to straight C or 
Objective C...?!?

I need something equivalent to Lucene (but native if possible) on Mac OS 
X...

Thanks for any pointers!-)

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: Lucene for OSX?

2002-07-16 Thread petite_abeille


On Tuesday, July 16, 2002, at 03:41 , Otis Gospodnetic wrote:

 The only thing that I can think of right now is omseek on sf.net, but
 that project seems somewhat dead. I think that is in C or C++.

Thanks. I also found something called Onix (http://www.lextek.com/onix/)

Anybody have any experience with it? And how does it compare to Lucene?

Thanks.

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: Lucene for OSX?

2002-07-16 Thread petite_abeille

Hi James,

On Tuesday, July 16, 2002, at 03:52 , Brook, James wrote:

 How about this? I think it's what they use for Sherlock.

 Apple Information Access Toolkit (AIAT)
 http://www.devworld.apple.com/dev/aiat/

Well, that's basically the first incarnation of Lucene :-) And in fact I 
was thinking to use it. However it seems to be missing from the latest 
OS X... If you know otherwise, where is it hidden then?

 I have an Objective C WebObjects 4.5 application running on Mac OS X 
 Server
 1.2 that uses it to directly search the blobs of an OpenBase database.

Wow... I briefly used AIAT myself... but as I said, I cannot find it 
anymore.

 I believe that it is written in C++, but it can easily be wrapped.

Thanks.

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: Lucene for OSX?

2002-07-16 Thread petite_abeille


On Tuesday, July 16, 2002, at 04:04 , Brook, James wrote:

 It looks like it's available for FTP download as an 'SDK' on this page
 http://developer.apple.com/sdk/

 I have no idea whether this is up-to-date or compatible with the latest 
 OS
 X.

Thanks. I will take a look into it.

Cheers,

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




[OT] Zoe open source

2002-06-03 Thread petite_abeille

Hello,

I'm releasing Zoe under the Apple Public Source License and putting 
together a SourceForge project to coordinate the future development of 
Zoe. Our plan is to choose a handful of experienced developers to form 
the core development team for Zoe. Anyone is free to contribute code, 
which members of the development team will review and add back into the 
codebase. Over time we will invite developers who have demonstrated 
their interest and abilities to join our core team. We'll keep the 
mailing lists public and encourage everyone to sign up and throw in any 
comments they may have. There has been a tremendous amount of interest 
in having Zoe released as an open source project and we think that this 
will be the best way to manage all the different voices.

Right now we need to know which of you are interested in contributing 
code to Zoe and how interested you are. If you are interested in being 
on the core team of developers and helping zoe become the best e-mail 
client out there, we want to hear from you. Members of the core team 
will be expected to watch the mailing lists,  regularly contribute to 
the codebase, and review and integrate contributions by other 
developers. If you would like to be considered as a member of the core 
team we would like you to e-mail a resume, if you have one, some code 
snippets, and a note letting us know what you are interested in doing 
with zoe, i.e. is there a particular set of tools you would like to 
implement or enhance. Email these to Kate at [EMAIL PROTECTED]. She 
will be helping me out and coordinating the SourceForge project for zoe.

Thanks.

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: [OT] Zoe open source

2002-06-03 Thread petite_abeille


On Monday, June 3, 2002, at 04:44 PM, Peter Carlson wrote:

 Good luck with your project.

 It looks very exciting and refreshing. I haven't tried it yet, but the
 screen shots look useful and beautiful.

Thanks.

 I hope that you will stay active in the Lucene user community and 
 contribute
 any new features in Lucene back into the core or sandbox projects.

Sure. There is already a kind of generic persistency layer built on top 
of Lucene. People interested can take a look at the alt.dev.szobject 
package under Frameworks.

Cheers,

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




[OT] An Open Letter

2002-05-27 Thread petite_abeille

FYI.

Begin forwarded message:

 From: Alex Horovitz [EMAIL PROTECTED]
 Date: Mon May 27, 2002  01:58:27 PM Europe/Zurich
 To: [EMAIL PROTECTED], [EMAIL PROTECTED], 
 [EMAIL PROTECTED], [EMAIL PROTECTED]
 Cc: Steve Jobs [EMAIL PROTECTED], [EMAIL PROTECTED], toni Trujillo-Vian 
 [EMAIL PROTECTED], Bob Fraser [EMAIL PROTECTED], WebObjects 
 [EMAIL PROTECTED]
 Subject: [OT]  An Open Letter

 An open letter to Apple on why many people want an open source 
 WebObjects and EOF.

 ---Reader's Digest Version-

 Four reasons why Apple should open source WO/EOF:

 REASON #1: WO/EOF cannot be legitimate extensions of the Apple brand, 
 its value to the marketplace is only achieved through independence from 
 the Apple brand proper. Placing WO/EOF under an open source license 
 allows Apple to retain control. It also allows legitimacy to and 
 adoption by those who would not normally accept or adopt an Apple 
 product in this space.

 REASON #2: Because debugging is highly parallelizable, an open source 
 WO/EOF will increase the number of debuggers and therefore increase the 
 stability of the product over the long run by applying the skills of 
 many more engineers than Apple could ever hope to support as employees. 
 With a large enough user/co-developer community, all bugs can be 
 quickly quantified and understood allowing a fix to become obvious to 
 at least one member of the community.

 REASON #3: If Apple will treat the WO/EOF user community as if we were 
 their most valuable resource in terms of current and future development 
 of the product, we will become their most valuable resource. Trusting 
 us enough to share the source in an open source fashion, will benefit 
 Apple (and the application server market) in ways they cannot even 
 begin to imagine.

 REASON #4: Because there is no accounting for taste. That was the first 
 lesson of applied microeconomics my college professor taught me, and it 
 holds true today. Apple, as smart and cutting edge as it may be, cannot 
 anticipate the ways in which WO/EOF will be utilized or improved upon 
 by people in the field. Open source allows for faster innovation and 
 the ability to capture truly useful and novel ideas.

 -Unabridged Version--

 The first question Apple must address is one of business sense. Does it 
 make good business sense to open source any technology, let alone 
 WO/EOF. We have some evidence that at least in one case, Darwin, it 
 made sense to open source a key Apple technology.

 Now granted, this is an attempt to position Mac OS X against Linux in 
 some key market segments. That being said, a case was effectively made 
 and bought off on by key Apple people. Can we do the same for 
 WebObjects? Sure we can.

 WebObjects is not a clear legitimate extension of the Apple brand. I 
 suspect that everyone knows this to be true. I also suspect that this 
 gives Apple some pause in terms of being able to evangelize/market the 
 product at the level which would allow it to attain a respectable 
 position in the application server market. And, before you say that a 
 large company like Apple can't really afford to open source a software 
 project like WO/EOF, consider that IBM has done it for WebSphere.

 An open sourced WO/EOF could avoid the traditional problems Apple faces 
 in the area of brand extension. This is because Apple Engineering 
 enjoys legitimacy as an outstanding software organization. As 
 technologies, WO/EOF both enjoy reputations for being excellent 
 products. However, in terms of adoption, they suffer due to the 
 disconnect between the enterprise application server market and Apple's 
 traditional self branding.

 REASON #1: WO/EOF cannot be legitimate extensions of the Apple brand, 
 its value to the marketplace is only achieved through independence from 
 the Apple brand proper. Placing WO/EOF under an open source license 
 allows Apple to retain control. It also allows legitimacy to and 
 adoption by those who would not normally accept or adopt an Apple 
 product in this space.

 From experience, we all know there has _never_ been a bug free release 
 of WO/EOF.  Apple's WO/EOF customers face the same challenge in this 
 respect: given the new release, what bugs will it have that will 
 prevent me from moving to that release; and, what bugs in my current 
 release does it fix that would encourage me to move to that release. 
 Also from experience we know there to be a significant time in between 
 releases.

 The non-open source development style is the culprit here.  Apple is 
 passionate about releasing good, stable software. Before a product can go 
 out the door there is an extensive amount of QA and testing. This being 
 the case, and with a goal of minimizing shipping bugs and maximizing 
 stability of releases, it takes time to get to a point where 
 collectively Apple feels it can ship the product.

 The experience of the open source community is quite the opposite. 
 

source code available

2002-05-27 Thread petite_abeille

For entertainment purposes only, ZOË's source code is available at:

http://guests.evectors.it/zoe/

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: indexing PDF files

2002-05-03 Thread petite_abeille


On Wednesday, May 1, 2002, at 05:41 PM, Otis Gospodnetic wrote:

 Wouldn't you want to convert to XML instead and use XSLT to transform
 the XML representation to any desired format by just applying a style
 sheet?
 Sounds like less work with bigger document type coverage.

Sounds good... But what does it mean? I'm not that familiar with any of 
the XML/XSLT hype, so I don't really understand what you are getting 
at... I just want to convert any type of document to text for indexing 
purposes... I'm not planning to do anything else with it... However, 
converting everything to PDF as a first step allows you to provide a 
preview of any document, even if you happen not to understand the 
original format (eg MS Office)...

PA


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: indexing PDF files

2002-05-03 Thread petite_abeille

On Friday, May 3, 2002, at 03:16 PM, Moturu,Praveen wrote:

 Can I assume none of the people on the lucene user group had 
 implemented indexing a pdf document using lucene.

Who knows...?!? In any case, it's not public knowledge...

  If some one has.. Please help me by providing the solution.

I use to believe in Santa Claus also... ;-)

All that said, there seems to be a real demand to do something about pdf 
to text conversion (in Java preferably). I'm willing to invest some time 
and brain cells to nail it down, but I'm not sure where to start...

I'm aware of the PJ library, but it's really a pig as far as resources 
go. Anything else?

Any (concrete) pointer appreciated.

Thanks.

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: Homogeneous vs Heterogeneous indexes (was: FileNotFoundException)

2002-05-01 Thread petite_abeille

On Wednesday, May 1, 2002, at 12:41 AM, Dmitry Serebrennikov wrote:

 - the number of files that Lucene uses depends on the number of 
 segments in the index and the number of *stored* fields
 - if your fields are not stored but only indexed, they do not require 
 separate files. Otherwise, an .fnn file is created for each field.

Ok. That's good as all my fields are indexed but not stored in Lucene. 
Only one field is stored in any one index: the uuid of an object (as a 
Keyword).

 - if at least one document uses a given field name in an index, that 
 index requires the .fnn file for that field

Ok. So, in theory, a more homogeneous index should use fewer files, all 
things being equal?

 - index segments are created when documents are added to the index. For 
 each 10 docs you get a new segment.
 - optimizing the index removes all segments are replaces them with one 
 new segment that contains all of the documents
 - optimization is done periodically as more documents are added 
 (controlled by IndexWriter.mergeFactor), but can be done manually 
 whenever needed

Ok. When doing the optimization, are there any temporary files getting 
created?

 With all this, I think Lucene does use too many files...

That's my impression also...

 Some additional info: there is a field on IndexWriter called 
 infoStream. If this is set to a PrintStream (such as System.out), 
 various diagnostic messages about the merging process will be printed 
 to that stream.

Yep. I guess I overlooked that.

 You might find this helpful in tuning the merge parameters.

Just to make sure: will using a small merge factor (eg 2) reduce the 
number of files, or just optimize (aka merge) the index more often?
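
For reference, both knobs appear to be plain public fields on 
IndexWriter in this version (a tiny sketch, assuming the 1.x API):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class MergeTuning {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), false);
        writer.mergeFactor = 2;          // merge more eagerly: fewer segments on disk
        writer.infoStream = System.out;  // print diagnostics about the merging process
        // ... addDocument() calls ...
        writer.close();
    }
}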

 Hope this helps.
 Good luck.

Thanks. Very helpful indeed :-)

R.



--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: indexing PDF files

2002-05-01 Thread petite_abeille

On Tuesday, April 30, 2002, at 10:46 PM, Otis Gospodnetic wrote:

 Hm, this should be a FAQ.

Maybe it should... ;-)

 Check Lucene contributions page, there are some starting points there,

Well, this seems to be a very popular request... In fact I need 
something like that also. Unfortunately, there seems to be no 
authoritative answer as far as converting pdf files to text in a pure 
Java environment goes... Maybe I'm missing something here, as usual?

Also, on a related note, what would be a good approach to convert any 
random document into pdf? I was thinking of a two-step process for 
document indexing in Lucene:

- First, convert everything to pdf (with Acrobat or something)
- Second, convert the pdf to text and index it.

Any practical suggestions about how to do that in a pure Java 
environment very welcome.

Thanks :-)

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: Homogeneous vs Heterogeneous indexes (was: FileNotFoundException)

2002-04-30 Thread petite_abeille

On Tuesday, April 30, 2002, at 01:57 AM, Steven J. Owens wrote:

 Just be glad you aren't doing this on Solaris with JDK 1.1.6

I know... In fact I'm looking forward to porting my stuff to 1.4... As my 
app is very much IO bound, I'm really excited by this nio madness... :-)

 Yes and no.  Setting ulimit to a reasonable number of open files is not 
 only not a patch, it's the right way to do it.

Of course... Nothing is really black or white... What I wanted to say is 
that -as a first strike- *I* prefer not to mess around with system 
parameters.

  I understand where you're coming from, really, and in a certain way, 
 it makes sense

Thanks. I already feel less alone... ;-)

 BUT... sometimes the impulse for clean, good design takes you too far 
 down a blind alley.

Sure. At the end of the day, everything is a tradeoff...

  Sometimes there is no elegant solution. Sometimes there is no best 
 way, only one of a limited set of options with different tradeoffs.

Absolutely.

 Most serious applications have to have some sort of OS variable 
 tweaking, you're just used to having it done invisibly and painlessly.

Agree. In fact this is my first desktop application in nearly a decade. 
I usually work on large scale systems. And let me tell you, it's a very 
different pair of sleeves... ;-)

  You could figure out the right way to set the system configuration 
 on install or launch.

One of my design goals is to try to avoid this sort of tweaking as 
much as I can.

  You could look at the alternative techniques for indexing in Lucene

That's another one of those nasty tradeoffs... ;-) Memory is even more 
precious than file descriptors in my situation... Especially with a JVM 
that has this funky notion of constraining your memory usage...

 if there's anything you're doing wrong (perhaps opening files and not 
 closing them, and leaving them for the garbage collector to eventually 
 get around to closing?)

Sure. I went through all those sanity checks. Also, in my case, the 
garbage collector is my friend as I'm using the java.lang.ref API 
extensively.
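
(For the curious, by the java.lang.ref API I mean the soft-reference 
style of caching, roughly like this hypothetical sketch: entries 
evaporate under memory pressure instead of blowing the heap.)

import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class SoftCache {
    private final Map map = new HashMap();

    public void put(Object key, Object value) {
        map.put(key, new SoftReference(value));
    }

    public Object get(Object key) {
        SoftReference ref = (SoftReference) map.get(key);
        return (ref == null) ? null : ref.get(); // null once the collector reclaimed it
    }
}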

  or if you have a pessimal usage pattern that exacerbates the situation.

U...?!? You lost me here... What's a pessimal usage pattern?

 if you can come up with a scheme to run Lucene indexing with modified 
 code for keeping track of file resources.

Sure, there are many things one could do... However, I have to 
balance how much time I want to invest in any one of those alleys. One 
thing I really like about Lucene is its very simple API and usage. So far 
it has worked out pretty well for me, as I'm using it pretty extensively. 
And I seem to have found -at last- a good balance between the different 
constraints I'm operating under.

 an anomalous situation (use on a client/desktop machine)

Anomalous situation?!?! Ummm... Lucene is just an API... Hopefully 
it's not bundled with some dogma attached to it... However, I'm kind 
of starting to wonder about that considering some of the -very 
defensive- responses I got to my postings... Oh, well... I will just go 
back to my cave... :-(

 could configure lucene to be careful about how many files it keeps open 
 at any given time.

That will be great! On a somewhat related note, I have decided to stick 
with the com.lucene package for the time being. I was pretty excited 
when the rc stuff came out, but it just didn't work out for me. My 
resources problem just went from bad to worse. And I also have two 
issues with the release candidate: locking and reference counting.

Locking. I don't have anything against locking per se. However, I 
really don't like how it's implemented in the rc. Using files just does 
not work for me. It creates too many problems when something goes wrong 
(eg the app is killed without warning and I have to clean up all those 
locks by myself). What about using sockets or something to rendezvous 
on an index? Or, at a bare minimum, being able to disable the locking 
altogether? I understand that most people are using Lucene under a very 
different setup than I do, but nevertheless it should not hurt to make 
it configurable. Anyway, it does not work for older jvms, as noted in the 
source code. Last, but not least, I always get very scared when I see 
some platform dependent code somewhere (eg if version 1 then ) ;-)
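
(The socket idea amounts to something like this hypothetical sketch: 
binding a well-known local port succeeds for exactly one process, so it 
can stand in for a lock file, and it goes away by itself if the app 
dies.)

import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

public class SocketLock {
    private ServerSocket socket;

    public boolean acquire(int port) {
        try {
            // only one process can bind the port at any given time
            socket = new ServerSocket(port, 1, InetAddress.getByName("127.0.0.1"));
            return true;   // we own the "lock"
        } catch (IOException e) {
            return false;  // somebody else holds it
        }
    }

    public void release() throws IOException {
        if (socket != null) socket.close();  // released automatically if the app dies
    }
}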

Reference counting. Well, as noted in a comment in the source code, the 
reference API is really the way to go... And trying to be backward 
compatible with version 0.9 is somehow missing the forest for the 
trees... Just my two cents in any case. And yes, I'm well aware that I 
can fix all these issues myself... And start contributing to Lucene 
instead of just ranting left and right... But also keep in mind that I'm 
just a humble Lucene user. And there seems to be a very clear distinction 
between user and developer in Lucene's world... ;-)

Thanks for your response in any case. I hope I didn't offend too many 
people with my ramblings ;-)

PA.


--
To 

Re: rc4 and FileNotFoundException: an update

2002-04-29 Thread petite_abeille

 I don't know what environment you're using Lucene in. However, we had 
 this too
 many open files problem on our Solaris box, and increasing the number 
 of file
 descriptors through the ulimit -n command fixed it.

Thanks. That should help. However, I have a little desktop app and it 
will be very cumbersome to require users to change some system 
parameters just to run it... :-(

Thanks in any case.

PA


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: too many open files in system

2002-04-29 Thread petite_abeille

 On Tuesday, 9. April 2002 14:08, you wrote:
 root wrote:
 Doesn't Lucene releases the filehandles??

 because I get too many open files in system after running lucene a
 while!

 Are you closing the readers and writers after you've finished using 
 them?

 cheers,

 Chris


 Yes I close the readers and writers!


By the way, did you ever solve this problem? I went through that thread 
and everybody seems to be passing the buck to somebody else... :-(

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: FileNotFoundException: code example

2002-04-29 Thread petite_abeille

  I would add some logging to the code

You lost me here... Where should I add some logging?

  to get more idea of which Lucene methods are
 actually being called, when, in what sequence.

A typical sequence looks like this:

- search()
- deleteIndexWithID()
- indexValuesWithID()

PA


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: too many open files in system

2002-04-29 Thread petite_abeille

 how many open files you think can be used at your process??

Not sure. It varies with usage pattern. I will check it out in any case.

 cat /proc/sys/fs/file-max

cat: /proc/sys/fs/file-max: No such file or directory

 echo 5 > /proc/sys/fs/file-max

Unfortunately, I cannot use this kind of quick fix, as my app is a 
desktop app and only has access to the user account.

PA.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]



