Re: Lucene vs. in-DB-full-text-searching

2005-02-24 Thread Kevin A. Burton
Otis Gospodnetic wrote:
The most obvious answer is that the full-text indexing features of
RDBMS's are not as good (as fast) as Lucene.  MySQL, PostgreSQL,
Oracle, MS SQL Server etc. all have full-text indexing/searching
features, but I always hear people complaining about the speed.  A
person from a well-known online bookseller told me recently that Lucene
was about 10x faster than MySQL for full-text searching, and I am
currently helping someone get away from MySQL and into Lucene for
performance reasons.
 

Also... MySQL full-text search isn't perfect. If you're not a Java 
programmer it would be difficult to hack on. Another downside is that FT 
in MySQL only works with MyISAM tables, which aren't transaction-aware 
and use global table locks (not fun).

I'm sure, though, that MySQL would do a better job at online index 
maintenance than Lucene; Lucene falls down a bit in this area...

Kevin



Re: Lucene vs. in-DB-full-text-searching

2005-02-24 Thread Kevin A. Burton
David Sitsky wrote:
On Sat, 19 Feb 2005 09:31, Otis Gospodnetic wrote:
 

You are right.
Since there are C++ and now C ports of Lucene, it would be interesting
to integrate them directly with DBs, so that the RDBMS full-text search
under the hood is actually powered by one of the Lucene ports.
   

Or to see Lucene + Derby (a 100% Java embedded database donated by IBM, 
currently in Apache incubation) integrated together... that would be 
really nice and powerful.

Does anyone know if there are any integration plans?
 

Don't forget Berkeley DB Java Edition... that would be interesting too...
Kevin



Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Kevin A. Burton
Kevin A. Burton wrote:
I finally had some time to take Doug's advice and reburn our indexes 
with a larger TermInfosWriter.INDEX_INTERVAL value.
You know... it looks like the problem is that TermInfosReader uses 
INDEX_INTERVAL during seeks and is probably just jumping RIGHT past the 
offsets that I need.

If this is going to be a practical way of reducing the Lucene memory 
footprint for HUGE indexes, then it's going to need a way to change this 
value based on the current index that's being opened.

Is there any way to determine the INDEX_INTERVAL from the file? According to

http://jakarta.apache.org/lucene/docs/fileformats.html

the .tis file (and, per the docs, the .tii file is very similar to the 
.tis file) should have this data:

So according to this:
TermInfoFile (.tis) --> TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos

The only problem is that the .tii and .tis files I have on disk don't 
have a constant preamble, and it doesn't look like there's an index 
interval here...

Kevin



Re: Term Weights and Clustering

2005-02-24 Thread Dawid Weiss
Hi Owen,
I'm from the Carrot2 project, so I feel called to the blackboard:
One source for how to do this is the thesis of Stanislaw Osinski and 
others like it:
http://www.dcs.shef.ac.uk/teaching/eproj/msc2004/abs/m3so.htm
And the Carrot2 project which uses similar techniques.
http://www.cs.put.poznan.pl/dweiss/carrot/
Staszek Osinski is the author of Lingo, the best clustering algorithm 
available in Carrot2 -- we still work together on that project... In 
other words, Carrot2 doesn't use 'similar' techniques. It uses _the_ 
techniques described in the above thesis (and various other papers; see 
my Web page).

My problem is simple: I need a fairly clear discussion on exactly how to 
generate the labels, and to assign documents to them.  The thesis is 
quite good, but I'm not sure I can reduce it to practice in the 2-3 days 
I have to evaluate it!  Lucene has made the TDM easy to calculate, but I 
basically don't know what to do next!
You can use Carrot2 directly for that. There are a few options. One is 
that you can feed your input collection directly to the clustering 
component (it will take a while, but it should work) -- you need to write a 
custom input component, but that is a very simple thing to do, and I'm sure 
that if you write to the Carrot2 mailing list there will be somebody willing 
to help (like myself or Staszek ;).

Another option is: use Lucene to index your documents. Set up Carrot2 to 
use Lucene (described somewhere on this list, see David Spencer's message).

a quick way to get a demo on the air?  For example, I don't seem to be 
able to ask Carrot2 to do a Google site search.  
Yep, there is a problem with it. Please post a bug report to the Carrot2 
Bugzilla; I'll investigate it when I have time.

simply aim Carrot2 at my collection with a very general search and see 
what clusters it discovers.  This may be a gross misuse of Carrot2's 
clustering anyway, so could easily be a blind alley.
It kind of is, because the Carrot2 clustering components work primarily with 
_short_, scarce information sources, such as snippets. We don't intend 
to work on large, raw document collections... Having said that, 1200 
documents isn't that much, and you should be able to get your clusters.

D.


ngramj

2005-02-24 Thread Gusenbauer Stefan
Does anyone know of a good tutorial or the javadoc for ngramj? I need it 
for guessing the language of the documents that should be indexed.
thx
stefan



Re: ngramj

2005-02-24 Thread petite_abeille
On Feb 24, 2005, at 14:50, Gusenbauer Stefan wrote:
Does anyone know a good tutorial or the javadoc for ngramj because i  
need it for guessing the language of the documents which should be  
indexed?
http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/languageidentifier/

Cheers
--
PA, Onnay Equitursay
http://alt.textdrive.com/


RE: Custom filters & document numbers

2005-02-24 Thread Vanlerberghe, Luc
An IndexReader will always see the same set of documents.
Even if another process deletes some documents, adds new ones or
optimizes the complete index, your IndexReader instance will not see
those changes.

If you detect that the Lucene index changed (e.g. by calling
IndexReader.getCurrentVersion(...) once in a while), you should close
and reopen your 'current' IndexReader and recalculate any data that
relies on the Lucene document numbers.
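
In code it boils down to something like the sketch below (just a sketch 
against the 1.4 API; the ReaderHolder class name and the index path are 
made up):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

// Hypothetical holder that keeps one IndexReader open and replaces it
// whenever the index on disk has moved on to a newer version.
public class ReaderHolder {
    private final String indexPath;   // e.g. "/data/index" (made up)
    private IndexReader reader;
    private long version;

    public ReaderHolder(String indexPath) throws IOException {
        this.indexPath = indexPath;
        this.version = IndexReader.getCurrentVersion(indexPath);
        this.reader = IndexReader.open(indexPath);
    }

    // Call this once in a while (or before each search).
    public synchronized IndexReader current() throws IOException {
        if (IndexReader.getCurrentVersion(indexPath) != version) {
            // In production you'd wait for in-flight searches to finish first.
            reader.close();                        // old snapshot no longer needed
            reader = IndexReader.open(indexPath);  // new snapshot, new document numbers
            version = IndexReader.getCurrentVersion(indexPath);
            // ...recalculate any cached BitSets/Filters here...
        }
        return reader;
    }
}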

Regards, Luc.

-Original Message-
From: Stanislav Jordanov [mailto:[EMAIL PROTECTED] 
Sent: donderdag 24 februari 2005 14:18
To: Lucene Users List
Subject: Custom filters & document numbers

Given an IndexReader, a custom filter is supposed to create a bit set
that maps each document number to {'visible', 'invisible'}. On the other
hand, it is stated that Lucene is allowed to change document numbers.
Is it guaranteed that this BitSet's view of document numbers won't
change while the BitSet is still in use (or perhaps while the corresponding
IndexReader is still open)?
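
(For concreteness, my filters have roughly the shape below; the 
VisibilityFilter name and the visibility:public term are made up for the 
example.)

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

public class VisibilityFilter extends Filter {
    // Sets the bit for every document that contains visibility:public;
    // all other document numbers stay 'invisible'.
    public BitSet bits(IndexReader reader) throws IOException {
        BitSet visible = new BitSet(reader.maxDoc());
        TermDocs td = reader.termDocs(new Term("visibility", "public"));
        try {
            while (td.next()) {
                visible.set(td.doc()); // td.doc() is the reader's internal document number
            }
        } finally {
            td.close();
        }
        return visible;
    }
}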

And another (more low-level) question:
when may Lucene change document numbers?
Is it only when the index is optimized after there has been a delete
operation?

Regards: StJ








Re: Custom filters & document numbers

2005-02-24 Thread Stanislav Jordanov
The first statement is clear to me:
I know that an IndexReader sees a 'snapshot' of the document set, taken
at the moment of the Reader's creation.

What I don't know is whether this 'snapshot' also has its doc numbers fixed,
or whether they may change asynchronously.
Another thing I don't know is which index operations may cause the
(doc -> doc number) mapping to change.
Is it only after a delete, are there other occasions, or had I better not
count on this at all?

StJ

- Original Message - 
From: Vanlerberghe, Luc [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Thursday, February 24, 2005 4:07 PM
Subject: RE: Custom filters & document numbers


 An IndexReader will always see the same set of documents.
 Even if another process deletes some documents, adds new ones or
 optimizes the complete index, your IndexReader instance will not see
 those changes.

 If you detect that the Lucene index changed (e.g. by calling
 IndexReader.getCurrentVersion(...) once in a while), you should close
 and reopen your 'current' IndexReader and recalculate any data that
 relies on the Lucene document numbers.

 Regards, Luc.

 -Original Message-
 From: Stanislav Jordanov [mailto:[EMAIL PROTECTED]
 Sent: donderdag 24 februari 2005 14:18
 To: Lucene Users List
 Subject: Custom filters  document numbers

 Given an IndexReader a custom filter is supposed to create a bit set,
 that maps each document numbers to {'visible', 'invisible'} On the other
 hand, it is stated that Lucene is allowed to change document numbers.
 Is it guaranteed that this BitSet's view of document numbers won't
 change while the BitSet is still in use (or perhaps the corresponding
 IndexReader is still opened) ?

 And another (more low-level) question.
 When Lucene may change document numbers?
 Is it only when the index is optimized after there has been a delete
 operation?

 Regards: StJ









Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Doug Cutting
Kevin A. Burton wrote:
I finally had some time to take Doug's advice and reburn our indexes 
with a larger TermInfosWriter.INDEX_INTERVAL value.
It looks like you're using a pre-1.4 version of Lucene.  Since 1.4 this 
is no longer called TermInfosWriter.INDEX_INTERVAL, but rather 
TermInfosWriter.indexInterval.

Is this setting incompatible with older indexes burned with the lower 
value?
Prior to 1.4, yes.  After 1.4, no.
Doug


Re: sorted search

2005-02-24 Thread Daniel Naber
On Thursday 24 February 2005 19:01, Yura Smolsky wrote:

       sort.setSort(new SortField[] { new SortField("modified", SortField.STRING, true) });

You should store the date as a number, e.g. days since 1970 (or weeks if 
that is precise enough), and then tell the sort that it's an integer. 
DateField always stores the date in milliseconds, which leads to a large 
number of terms; it also turns the date into a string. Both make searching, 
and especially sorting, slower.
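
Something like this, for example (only a sketch against the 1.4 API; the 
field name "modified" is taken from your mail, the rest is illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// At indexing time: store the day count as an untokenized keyword,
// so there is only one small term per document.
long days = System.currentTimeMillis() / (24L * 60 * 60 * 1000);
Document doc = new Document();
doc.add(Field.Keyword("modified", Long.toString(days)));

// At search time: tell Lucene the field holds integers, so it caches
// an int per document instead of a String.
Sort sort = new Sort(new SortField("modified", SortField.INT, true));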

Regards
 Daniel




Re: sorted search

2005-02-24 Thread Erik Hatcher
Sorting by String uses up lots more RAM than a numeric sort.  If you 
use a numeric (yet lexicographically orderable) date format (e.g. 
YYYYMMDD) you'll most likely see better performance.

Erik
On Feb 24, 2005, at 1:01 PM, Yura Smolsky wrote:
Hello, lucene-user.
I have index with many documents, more than 40 Mil.
Each document has DateField (It is time stamp of document)
I need the most recent results only. I use single instance of 
IndexSearcher.
When I perform sorted search on this index:
  Sort sort = new Sort();
  sort.setSort(new SortField[] { new SortField("modified", SortField.STRING, true) });
  Hits hits = searcher.search(
      QueryParser.parse("good", "content", new StandardAnalyzer()), sort);

then the search speed is not good.
Today I tried a search without sorting by modified, but with sorting by
relevance. The speed was much better!
I think that sorting by DateField is very slow. Maybe I am doing something
wrong with this kind of sorted search? Can you give me advice about
this?
Thanks.
Yura Smolsky.



Re[2]: sorted search

2005-02-24 Thread Yura Smolsky
Hello, Erik.

If I need to store hour and minute, then I need to put the date into the
following integer format: YYYYMMDDHHMM?
Will it be faster than the current solution?
And will I still be able to do range queries (from date A to date B)?
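
e.g. would something like this still work? (just a sketch; the field name 
and the dates are made up)

import org.apache.lucene.index.Term;
import org.apache.lucene.search.RangeQuery;

// Range over a lexicographically ordered date field in YYYYMMDDHHMM format:
RangeQuery range = new RangeQuery(
    new Term("modified", "200501010000"),  // from 2005-01-01 00:00
    new Term("modified", "200502242359"),  // to   2005-02-24 23:59
    true);                                 // inclusive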

EH Sorting by String uses up lots more RAM than a numeric sort.  If you
EH use a numeric (yet lexicographically orderable) date format (e.g. 
EH MMDD) you'll see better performance most likely.

EH Erik


EH On Feb 24, 2005, at 1:01 PM, Yura Smolsky wrote:

 Hello, lucene-user.

 I have index with many documents, more than 40 Mil.
 Each document has DateField (It is time stamp of document)

 I need the most recent results only. I use single instance of 
 IndexSearcher.
 When I perform sorted search on this index:
   Sort sort = new Sort();
   sort.setSort( new SortField[] { new SortField (modified, 
 SortField.STRING, true) } );
   Hits hits =
 searcher.search(QueryParser.parse(good, content,
   StandardAnalyzer()), sort);

 then search speed is not good.

 Today I have tried search without sort by modified, but with sort by
 Relevance. Speed was much better!

 I think that Sort by DateField is very slow. Maybe I do something
 wrong about this kind of sorted search? Can you give me advices about
 this?

 Thanks.

 Yura Smolsky.








Yura Smolsky.






Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
I finally had some time to take Doug's advice and reburn our indexes 
with a larger TermInfosWriter.INDEX_INTERVAL value.

It looks like you're using a pre-1.4 version of Lucene.  Since 1.4 
this is no longer called TermInfosWriter.INDEX_INTERVAL, but rather 
TermInfosWriter.indexInterval.
Yes... we're trying to be conservative and haven't migrated yet, though 
doing so might be required for this move, I think...

Is this setting incompatible with older indexes burned with the lower 
value?

Prior to 1.4, yes.  After 1.4, no.
What happens after 1.4?  Can I take indexes burned with 256 (a greater 
value) in 1.3 and open them up correctly with 1.4?

Kevin

PS. Once I get this working I'm going to create a wiki page documenting 
this process.



Re[2]: sorted search

2005-02-24 Thread Yura Smolsky
Hello, Erik.

About memory usage...
DateField produces a 9-character string (e.g. '000ic64p7').
How much memory will be taken by this string?

How much memory will be taken by an integer?
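
My own rough back-of-envelope, assuming ~40 million documents and typical 
JVM per-object overhead (the byte counts are guesses, not measurements):

public class SortMemoryEstimate {
    public static void main(String[] args) {
        long docs = 40000000L;               // ~40 million documents
        // String sort: roughly one cached String per (unique) timestamp:
        // ~9 chars stored as UTF-16, plus String/char[] object overhead,
        // plus an array slot holding the reference.
        long perStringEntry = 9 * 2 + 56 + 4;
        // Integer sort: a plain int[] entry per document.
        long perIntEntry = 4;
        System.out.println("String sort cache ~" + (docs * perStringEntry >> 20) + " MB");
        System.out.println("int sort cache    ~" + (docs * perIntEntry >> 20) + " MB");
    }
}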

EH Sorting by String uses up lots more RAM than a numeric sort.  If you
EH use a numeric (yet lexicographically orderable) date format (e.g. 
EH MMDD) you'll see better performance most likely.

EH Erik


EH On Feb 24, 2005, at 1:01 PM, Yura Smolsky wrote:

 Hello, lucene-user.

 I have index with many documents, more than 40 Mil.
 Each document has DateField (It is time stamp of document)

 I need the most recent results only. I use single instance of 
 IndexSearcher.
 When I perform sorted search on this index:
   Sort sort = new Sort();
   sort.setSort( new SortField[] { new SortField (modified, 
 SortField.STRING, true) } );
   Hits hits =
 searcher.search(QueryParser.parse(good, content,
   StandardAnalyzer()), sort);

 then search speed is not good.

 Today I have tried search without sort by modified, but with sort by
 Relevance. Speed was much better!

 I think that Sort by DateField is very slow. Maybe I do something
 wrong about this kind of sorted search? Can you give me advices about
 this?

 Thanks.

 Yura Smolsky.








Yura Smolsky.






Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Doug Cutting
Kevin A. Burton wrote:
Is this setting incompatible with older indexes burned with the lower 
value?
Prior to 1.4, yes.  After 1.4, no.
What happens after 1.4?  Can I take indexes burned with 256 (a greater 
value) in 1.3 and open them up correctly with 1.4?
Not without hacking things.  If your 1.3 indexes were generated with 256 
then you can modify your version of Lucene 1.4+ to use 256 instead of 
128 when reading a Lucene 1.3 format index (SegmentTermEnum.java:54 today).

Prior to 1.4 this was a constant, hardwired into the index format.  In 
1.4 and later each index segment stores this value as a parameter.  So 
once 1.4 has re-written your index you'll no longer need a modified version.

Doug


Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Kevin A. Burton
Doug Cutting wrote:
Not without hacking things.  If your 1.3 indexes were generated with 
256 then you can modify your version of Lucene 1.4+ to use 256 instead 
of 128 when reading a Lucene 1.3 format index (SegmentTermEnum.java:54 
today).

Prior to 1.4 this was a constant, hardwired into the index format.  In 
1.4 and later each index segment stores this value as a parameter.  So 
once 1.4 has re-written your index you'll no longer need a modified 
version.
Thanks for the feedback, Doug.

This makes more sense now. I didn't understand why the website 
documented the fact that the .tii file was storing the index interval.

I think I'm going to investigate just moving to 1.4 ...  I need to do it 
anyway.  Might as well bite the bullet now.

Kevin



1.4.x TermInfosWriter.indexInterval not public static ?

2005-02-24 Thread Kevin A. Burton
What's the desired pattern for using TermInfosWriter.indexInterval?
Do I have to compile my own version of Lucene to change this? The old 
API was public static final, but this field is neither public nor static.

I'm wondering if we should just make this a value that can be set at 
runtime. Considering the memory savings for larger installs, this 
can/will be important.

Kevin



Re: Not entire document being indexed?

2005-02-24 Thread [EMAIL PROTECTED]
Hi Otis
Thanks for the reply; what exactly should I be looking for with Luke?
What would setting the max value to the max Integer do? Is this some 
arbitrary value or...?
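
Do you mean something like the sketch below? (the index path is made up; 
maxFieldLength is the public field on IndexWriter that I assume you're 
referring to)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

// Raise the per-field term limit so long documents aren't silently truncated.
IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
writer.maxFieldLength = Integer.MAX_VALUE; // default is 10,000 terms per field
// ... writer.addDocument(doc) calls ...
writer.close();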

-pedja
Otis Gospodnetic said the following on 2/24/2005 2:24 PM:
Use Luke to peek in your index and find out what really got indexed.
You could also try the extreme case and set that max value to the max
Integer.
Otis
--- [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 

Hi everyone
I'm having a bizarre problem with a few of the documents here that do
not seem to get indexed entirely.
I use the textmining WordExtractor to convert M$ Word to plain text and
then index that text.
For example, one document is about 230KB in size when converted to plain
text; when indexed and later searched for a phrase in the last 2-3
paragraphs, it returns no hits, yet searching for anything above those
paragraphs works just fine. WordExtractor does convert the entire
document to text; I've checked that.

I've tried increasing the number of terms per field from the default
10,000 to 20,000 with writer.maxFieldLength, but that didn't make any
difference; I still can't find phrases from the last 2-3 paragraphs.

Any ideas as to why this could be happening and how I could rectify
it?
thanks,
-pedja


