RE: GETVALUES +SEARCH

2004-11-30 Thread Karthik N S
Hi guys,

Apologies...

Is there any API in Lucene which can retrieve all the matching values in a
single fetch, into some sort of array, WITHOUT using the loop below? [This
would make the search and display faster.]

for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    String path = doc.get("path");
    // ...
}



Thx in Advance
Karthik


-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 30, 2004 8:06 PM
To: Lucene Users List
Subject: Re: GETVALUES +SEARCH



On Nov 30, 2004, at 7:10 AM, Karthik N S wrote:
> On Search API the command  [ package
> org.apache.lucene.document.Document ]
>
> Will this 'public final String[] getValues(String name)' return me
> all the docs without looping through?

getValues(fieldName) returns a String[] of the values of the field.
It's similar to get(fieldName).  If you index a field multiple
times:

doc.add(Field.Keyword("keyword", "one"));
doc.add(Field.Keyword("keyword", "two"));

get("keyword") will return "one", but getValues("keyword") will
return a String[] {"one", "two"}

If you want to retrieve all documents, use IndexReader's various API
methods.

Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





Re: literal search in quotes on non-tokenized field

2004-11-30 Thread Erik Hatcher
On Nov 30, 2004, at 6:01 PM, Allen Atamer wrote:
>> It doesn't work that way.  A TermQuery must match *exactly* what was
>> indexed (either directly as a Keyword, or as tokens emitted from the
>> analyzer).  Since you're building the query up yourself from, I'm
>> assuming, user input, you may need to pre-process what the user
>> entered to get the right term to query on.  Only the term origi would
>> match.
>
> Yeah but it doesn't. The exact text in the database is ORIGI.

But you lowercased what you indexed (in the code you sent).

> Keyword doesn't work if you supply more than one word.

Depends on what you mean by "doesn't work".  It works as expected.
Keyword fields are not tokenized and thus a TermQuery on it has to be
exactly the value you supplied.  But it sounds like you've got a handle
on the situation now.

Erik


RE: literal search in quotes on non-tokenized field

2004-11-30 Thread Allen Atamer
Erik,


> -Original Message-
> > Here's a log of the parsed query before going to the searcher:
> >
> > Parsed query: (Build:"origi") for the first search
> > Parsed query: (Build:origi) for the second search
> 
> What do you mean by "parsed", since below you say you're not using
> QueryParser/Analyzer.


Sorry, that's residual log text. The lines of code are 

BooleanQuery totalQuery = new BooleanQuery();

.. logic to build totalQuery ...

log.debug("Parsed query: " + totalQuery.toString());
dbSearchHits = searcher.search(totalQuery);


> > Right now we're not using a query parser / analyzer system to build the
> > query. We're building the query up.
> > The query mentioned above is a TermQuery object
> 
> Let me hopefully clarify what you've said you've indexed (I'm not
> using quotes on purpose) origi, but you're doing a TermQuery on "origi"
> (with the quotes) and expecting it to match?
> 
> It doesn't work that way.  A TermQuery must match *exactly* what was
> indexed (either directly as a Keyword, or as tokens emitted from the
> analyzer).  Since you're building the query up yourself from, I'm
> assuming, user input, you may need to pre-process what the user entered
> to get the right term to query on.  Only the term origi would match.

Yeah but it doesn't. The exact text in the database is ORIGI. Keyword
doesn't work if you supply more than one word. In fact we're doing it wrong.
Fields with a small number of terms should not be indexed as keyword, but
tokenized. I'm going to change the indexing strategy to only use keyword
when there's one and only one keyword in the data itself. Fields with two to
three words will be tokenized with the NoTokenizingTokenizer that was posted
earlier, and fields with four or more words will be tokenized with
MyTokenizer.

All we need to do for searching keyword fields is remove the double quotes
to be consistent with searching in a tokenized field. Then use QueryParser
to parse the tokenized fields with the appropriate parser for the field.
This should solve the problem.

Thanks





Re: similarity matrix - more clear

2004-11-30 Thread Chris Hostetter
: A possible solution would be to initialize in turn each document as a
: query, do a search using an IndexSearcher and to take from the search
: result the similarity between the query (which is in fact a document)
: and all the other documents. This is highly redundant, because the
: similarity between a pair of documents is computed multiple times.

A simpler approach that I can think of would be to iterate over a complete
TermEnum of the index, and for each Term, get the corresponding TermDocs
enumerator to list every document that contains that term.  Assuming that
every pair of docs initially has a similarity of "0", this would allow you
to increment the similarity of each pair every time you find a term that
multiple docs have in common.  (The amount you increment the score for
each pair could be based on TermEnum.docFreq() and TermDocs.freq().)

A very simple approach might be something like...

   IndexReader r = ...;
   int[][] scores = new int[r.maxDoc()][r.maxDoc()];
   TermEnum enumerator = r.terms();
   TermDocs termDocs = r.termDocs();
   do {
      // term() may be null until next() has been called once
      Term term = enumerator.term();
      if (term != null) {
         termDocs.seek(term);
         // document number -> frequency of this term in that document
         Map<Integer, Integer> docs = new HashMap<Integer, Integer>();
         while (termDocs.next()) {
            docs.put(termDocs.doc(), termDocs.freq());
         }
         for (Integer ii : docs.keySet()) {
            for (Integer jj : docs.keySet()) {
               if (ii < jj) {
                  continue; // do each pair only once
               }
               scores[jj][ii] += (docs.get(ii) + docs.get(jj)) / 2;
            }
         }
      }
   } while (enumerator.next());
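
For readers who want to try the idea without an index at hand, here is a minimal sketch in plain Java. It is an illustration under stated assumptions, not Lucene code: the `postings` map (term -> document number -> frequency) stands in for the TermEnum/TermDocs walk, and the names `PairwiseOverlap`, `score` and `docs` are hypothetical. Unlike the loop above, it skips self-pairs and stores each pair under the lower document number first.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java stand-in for the TermEnum/TermDocs walk: the postings map plays
// the role of the index (term -> (document number -> term frequency)).
final class PairwiseOverlap {

    private PairwiseOverlap() {}

    static int[][] score(Map<String, Map<Integer, Integer>> postings, int maxDoc) {
        int[][] scores = new int[maxDoc][maxDoc];
        for (Map<Integer, Integer> docs : postings.values()) {
            for (Map.Entry<Integer, Integer> a : docs.entrySet()) {
                for (Map.Entry<Integer, Integer> b : docs.entrySet()) {
                    int lo = a.getKey();
                    int hi = b.getKey();
                    if (lo >= hi) {
                        continue; // each unordered pair once; self-pairs skipped
                    }
                    // Increment by the mean of the two term frequencies (integer division).
                    scores[lo][hi] += (a.getValue() + b.getValue()) / 2;
                }
            }
        }
        return scores;
    }

    // Convenience for building toy postings: (doc, freq, doc, freq, ...).
    static Map<Integer, Integer> docs(int... docFreqPairs) {
        Map<Integer, Integer> m = new HashMap<>();
        for (int i = 0; i < docFreqPairs.length; i += 2) {
            m.put(docFreqPairs[i], docFreqPairs[i + 1]);
        }
        return m;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> postings = new HashMap<>();
        postings.put("lucene", docs(0, 2, 1, 4));
        postings.put("index", docs(1, 1, 2, 3));
        postings.put("search", docs(0, 1, 1, 1, 2, 1));
        int[][] scores = score(postings, 3);
        // prints: 4 3 1
        System.out.println(scores[0][1] + " " + scores[1][2] + " " + scores[0][2]);
    }
}
```

Note that the score matrix is maxDoc x maxDoc, so for a large collection a sparse structure would probably be needed instead of a dense int[][].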





Re: literal search in quotes on non-tokenized field

2004-11-30 Thread Erik Hatcher
On Nov 30, 2004, at 4:42 PM, Allen Atamer wrote:
> A search in quotes on a field named Build with the query "\"orig\""
> does not work but the query "origi" yields 62 hits
>
> I have run indexing on the field with the following method
> doc.add(Field.Keyword(data.getColumnName(j),
> fieldValue.toString().toLowerCase()));
> so even though the original data has "ORIGI" in the "Build" field,
> lowercase is not the problem
>
> Here's a log of the parsed query before going to the searcher:
> Parsed query: (Build:"origi") for the first search
> Parsed query: (Build:origi) for the second search

What do you mean by "parsed", since below you say you're not using
QueryParser/Analyzer.

> Right now we're not using a query parser / analyzer system to build the
> query. We're building the query up.
> The query mentioned above is a TermQuery object

Let me hopefully clarify what you've said you've indexed (I'm not
using quotes on purpose) origi, but you're doing a TermQuery on "origi"
(with the quotes) and expecting it to match?

It doesn't work that way.  A TermQuery must match *exactly* what was 
indexed (either directly as a Keyword, or as tokens emitted from the 
analyzer).  Since you're building the query up yourself from, I'm 
assuming, user input, you may need to pre-process what the user entered 
to get the right term to query on.  Only the term origi would match.
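
A minimal sketch of that pre-processing step, in plain Java with no Lucene dependency. The class and method names (`TermNormalizer`, `normalize`) are hypothetical, and the sketch assumes the field was indexed as a lowercased Keyword, so the user's text has to be unquoted and lowercased the same way before it is used as a term.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene: turns raw user input into the
// term text that would match a Keyword field indexed in lowercase.
final class TermNormalizer {

    private TermNormalizer() {}

    static String normalize(String userInput) {
        String text = userInput.trim();
        // Strip one pair of surrounding double quotes, if present.
        if (text.length() >= 2 && text.startsWith("\"") && text.endsWith("\"")) {
            text = text.substring(1, text.length() - 1);
        }
        // Lowercase to mirror what was done at indexing time
        // (assumption: locale-independent lowercasing is acceptable).
        return text.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("\"ORIGI\"")); // origi
        System.out.println(normalize("origi"));     // origi
    }
}
```

The normalized string would then be the text handed to something like new TermQuery(new Term("Build", normalized)).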

I'm in the process of building a search system for some library 
archives and the single hardest thing about my work is building the 
user interface and making it work like users would like.  Humanities 
academics are by far the most particular crowd ever.  My suggestion of 
making a single text box that they type words into to query which I 
would lowercase and tokenize into a Boolean OR query was shot down 
quickly as being insufficient.   :)  Case-sensitive and insensitive 
searches are desired with all sort of other accommodations.   In fact, 
they suggested I write an article on my experience of building a search 
interface.  The underlying indexing and searching is trivial compared 
to the interface!

Erik


literal search in quotes on non-tokenized field

2004-11-30 Thread Allen Atamer
Here is a problem I am experiencing with Lucene searches on non-tokenized
fields:
 
A search in quotes on a field named Build with the query "\"orig\"" does not
work but the query "origi" yields 62 hits
 
I have run indexing on the field with the following method
 
doc.add(Field.Keyword(data.getColumnName(j),
fieldValue.toString().toLowerCase()));
 
so even though the original data has "ORIGI" in the "Build" field, lowercase
is not the problem
 
Here's a log of the parsed query before going to the searcher:
 
Parsed query: (Build:"origi") for the first search
Parsed query: (Build:origi) for the second search
 
Right now we're not using a query parser / analyzer system to build the
query. We're building the query up.
The query mentioned above is a TermQuery object
 
Thanks 


Re: What is the best file system for Lucene?

2004-11-30 Thread Sanyi
Thanks to you all for the replies.
I was looking for someone with the same experience as mine, but it seems
that I'll have to test this myself.
I'll try out my own ideas and the most interesting ideas from you guys.

Regards,
Sanyi






Re: Does Lucene perform ranking in the retrieved set?

2004-11-30 Thread Paul Elschot
On Tuesday 30 November 2004 18:46, Xiangyu Jin wrote:
> 
> This might be a stupid question.
> 
> When performing retrieval for a query, does Lucene first get
> a subset of candidate matches and then perform the ranking
> on that set? That is, is the similarity calculation performed only
> on a subset of the documents for the query?

Yes, Lucene uses an inverted index for this.

> If so, from which module could I get those candidate docs,
> so that I can perform my own similarity calculations? (I
> might need to rewrite the normalization factor, so
> modifying only the "similarity" model seems unlikely to
> work.)

To change the normalisation you may consider implementing
your own Weight:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Weight.html
For some example implementations of Weight the Lucene source
code in the org.apache.lucene.search package is the best resource.

Using your own Weight also requires a subclass of Query that returns
this weight in the createWeight() method.

> Or is there a document describing the procedure by which Lucene
> performs a search?

This describes the scoring:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html
See also the DefaultSimilarity.

Regards,
Paul Elschot





Re: Does QueryParser uses Analyzer ?

2004-11-30 Thread Otis Gospodnetic
QueryParser does use Analyzer, see this:

  static public Query parse(String query, String field, Analyzer analyzer)
      throws ParseException {
    QueryParser parser = new QueryParser(field, analyzer); // <<<
    return parser.parse(query);
  }

Otis
P.S.
Use lucene-user list, please.


--- Ricardo Lopes <[EMAIL PROTECTED]> wrote:

> Does the QueryParser class really use the Analyzer passed to the
> parse method?
> 
> I looked at the code and I don't see the object being used anywhere in
> the class. The problem is that I am writing an application with Lucene
> that searches using a foreign language with Latin characters. The
> indexing works fine, but the search apparently doesn't call the Analyzer.
> 
> Here is an example:
> I have a file that contains the following word: memória
> If I search for: memoria (without the accent mark on the o)
> it finds the word, which is correct.
> If I search for: memória (the exact same word) it doesn't find the
> word, because the QueryParser splits the word into "mem ria", but if the
> analyzer were called the "ó" would be replaced with "o". I guess the
> analyzer isn't called, is this right?
> 
> Thanks in advance,
> Ricardo Lopes
> 



Does Lucene perform ranking in the retrieved set?

2004-11-30 Thread Xiangyu Jin

This might be a stupid question.

When performing retrieval for a query, does Lucene first get
a subset of candidate matches and then perform the ranking
on that set? That is, is the similarity calculation performed only
on a subset of the documents for the query?

If so, from which module could I get those candidate docs,
so that I can perform my own similarity calculations? (I
might need to rewrite the normalization factor, so
modifying only the "similarity" model seems unlikely to
work.)

Or is there a document describing the procedure by which Lucene
performs a search?

thanks

xiangyu jin




Re: similarity matrix - more clear

2004-11-30 Thread Otis Gospodnetic
Hello,

I don't think Lucene can spit out the similarity matrix for you, but
perhaps you can use Lucene's Term Vector support to help you build the
matrix yourself:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html

The other relevant sections of the Lucene API to look at are:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#getTermFreqVectors(int)
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader,%20boolean)
...

This should let you tell Lucene to compute and store term vectors
during indexing, and then you will be able to retrieve a Term Vector
for each Document in the index/collection.  Armed with this data you
should be able to compute similarities between Documents with TV dot
products/cosines, which should be enough for you to build your
similarity matrix.
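
As a rough illustration of the dot product/cosine step, here is a minimal sketch in plain Java. It assumes each document's term vector has already been copied into a map from term text to raw frequency (for example from the parallel arrays returned by TermFreqVector's getTerms() and getTermFrequencies()); the class name `TermVectorCosine` is hypothetical, and no idf weighting is applied.

```java
import java.util.HashMap;
import java.util.Map;

// Cosine similarity between two term-frequency vectors held as maps.
// Raw frequencies only; any tf/idf weighting would have to be added.
final class TermVectorCosine {

    private TermVectorCosine() {}

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector is treated as similar to nothing
        }
        return dot / (normA * normB);
    }

    // Euclidean length of the frequency vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = new HashMap<>();
        d1.put("lucene", 3);
        d1.put("index", 4);
        Map<String, Integer> d2 = new HashMap<>();
        d2.put("lucene", 3);
        System.out.println(cosine(d1, d2)); // 9 / (5 * 3), about 0.6
    }
}
```

Because this measure is symmetric, only the N(N-1)/2 pairs with i < j would need to be computed to fill the matrix.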

This sounds like something that would be nice to have in the Lucene
Sandbox, so if you end up with some code that you are allowed to share,
please contribute it back to Lucene.

Otis

--- Roxana Angheluta <[EMAIL PROTECTED]> wrote:

> Dear all,
> 
> Yesterday I asked a question about getting the similarity matrix of
> a collection of documents from an index, but I got only one answer, so
> perhaps my question was not very clear.
> 
> I will try to reformulate:
> 
> I want to use Lucene to have efficient access to an index of a
> collection of documents. My final purpose is to cluster documents.
> Therefore I need to have for each pair of documents a number
> signifying the similarity between them.
> A possible solution would be to initialize in turn each document as a
> query, do a search using an IndexSearcher and to take from the search
> result the similarity between the query (which is in fact a document)
> and all the other documents. This is highly redundant, because the
> similarity between a pair of documents is computed multiple times.
> 
> I was wondering whether there is a simpler way to do it, since the
> index file contains all the information needed. Can anyone help me here?
> 
> Thanks,
> roxana
> 
> PS I know about the project Carrot2, which deals with document 
> clustering, but I think is not appropriate for me because of 2
> reasons:
> 1) I need to keep the index on the disk for further reusage
> 2) I need to be able to search efficiently in the index
> I thought Lucene can help me here, am I wrong?
> 



Re: What is the best file system for Lucene?

2004-11-30 Thread Otis Gospodnetic
Hello,

> Lucene indexing completes in 13-15 hours on the desktop system while
> it completes in about 29-33
> hours on the notebook.
> 
> Now, combine it with the DROP INDEX tests completing in the same
> amount of time on both and find
> out why is the search only slightly faster :)
> 
> > Until then, all your measurements are subjective and you
> > don't gain much by comparing the two indexing processes.
> 
> I'm worried about searching. Indexing is a lot faster on the desktop
> config.

This tells you that your problem is not the disk itself, and not the
filesystem.  The bottleneck is elsewhere.

Why not run your search under a profiler?  That will tell you where the
JVM is spending its time.  It may even be in some weird InetAddress
call, like another person already pointed out.

Otis





Lucene's ranking function VS Standard VSM model

2004-11-30 Thread Xiangyu Jin

I have seen different versions of Lucene's ranking function
in the Similarity documentation and on the Lucene user list.

Since I need to get document-document similarities,
what I do is issue the document as a query directly.
I found that the result is different if we issue "computer computer"
to Lucene versus to a standard VSM. The latter
will treat "computer computer" as "computer", but Lucene
doesn't.

To make my question clearer, I have written up
a more formal document

http://www.cs.virginia.edu/~xj3a/lucene_ranking.pdf

so that there is no ambiguity about the formulas.

I am not sure whether I understand correctly, but the
major reason seems to come from Lucene's query parser. By default it
assumes each term appears once. If we issue a query term multiple
times in the query string, it will produce some unexpected
results.

For detailed information, please refer to the link above.

thanks

xiangyu jin




RE: What is the best file system for Lucene?

2004-11-30 Thread Armbrust, Daniel C.
As I understand hyperthreading, this is not true: 

>Also, unless you take your hyperthreading off, with just one index you are
>searching with just one half of the CPU - so your desktop is actually using
>a 1.5GHz CPU for the search.

You still have the full speed of the processor available - the processor itself 
just keeps switching between different threads of execution.  Some people have 
noted that some (single threaded) applications will run 5-10% slower when 
hyperthreading is turned on - but that depends on the app.  It certainly won't 
be running at half speed.

Dan




RE: What is the best file system for Lucene?

2004-11-30 Thread Armbrust, Daniel C.
You may want to give the IBM JVM a try - I've found it faster in some cases...

http://www-106.ibm.com/developerworks/java/jdk/linux140/


Dan 




Re: similarity matrix - more clear

2004-11-30 Thread Xiangyu Jin


I also have the same task as you do. According to my understanding,
if there are N documents, your approach will take N^2 similarity
calculations.

Although there are N(N-1)/2 distinct document pairs,
the similarity calculation in Lucene is (according to my understanding)
asymmetric, which means you have to calculate N(N-1) similarities.
Therefore, your approach does not seem so redundant, since you have to
calculate on the order of O(N^2) similarities anyway.

On Tue, 30 Nov 2004, Roxana Angheluta wrote:

> Dear all,
>
> Yesterday I asked a question about getting the similarity matrix of a
> collection of documents from an index, but I got only one answer, so
> perhaps my question was not very clear.
>
> I will try to reformulate:
>
> I want to use Lucene to have efficient access to an index of a
> collection of documents. My final purpose is to cluster documents.
> Therefore I need to have for each pair of documents a number signifying
> the similarity between them.
> A possible solution would be to initialize in turn each document as a
> query, do a search using an IndexSearcher and to take from the search
> result the similarity between the query (which is in fact a document)
> and all the other documents. This is highly redundant, because the
> similarity between a pair of documents is computed multiple times.
>
> I was wondering whether there is a simpler way to do it, since the index
> file contains all the information needed. Can anyone help me here?
>
> Thanks,
> roxana
>
> PS I know about the project Carrot2, which deals with document
> clustering, but I think is not appropriate for me because of 2 reasons:
> 1) I need to keep the index on the disk for further reusage
> 2) I need to be able to search efficiently in the index
> I thought Lucene can help me here, am I wrong?
>



Re: GETVALUES +SEARCH

2004-11-30 Thread Erik Hatcher
On Nov 30, 2004, at 7:10 AM, Karthik N S wrote:
> On Search API the command  [ package
> org.apache.lucene.document.Document ]
>
> Will this 'public final String[] getValues(String name)' return me
> all the docs without looping through?

getValues(fieldName) returns a String[] of the values of the field.
It's similar to get(fieldName).  If you index a field multiple
times:

doc.add(Field.Keyword("keyword", "one"));
doc.add(Field.Keyword("keyword", "two"));

get("keyword") will return "one", but getValues("keyword") will
return a String[] {"one", "two"}

If you want to retrieve all documents, use IndexReader's various API 
methods.
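
To make the first-value versus all-values distinction concrete, here is a small plain-Java model. It is only a sketch: `MultiValuedDoc`, `firstValue` and `allValues` are hypothetical names standing in for a document with repeatable fields, and returning null for a missing field is a choice made for this illustration rather than a statement about Lucene.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a document whose fields may be added more than
// once under the same name; it models only the lookup behaviour.
final class MultiValuedDoc {

    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // First value added under this name, or null if the field is absent.
    String firstValue(String name) {
        List<String> values = fields.get(name);
        return values == null ? null : values.get(0);
    }

    // All values added under this name, in insertion order, or null if absent.
    String[] allValues(String name) {
        List<String> values = fields.get(name);
        return values == null ? null : values.toArray(new String[0]);
    }

    public static void main(String[] args) {
        MultiValuedDoc doc = new MultiValuedDoc();
        doc.add("keyword", "one");
        doc.add("keyword", "two");
        System.out.println(doc.firstValue("keyword"));              // one
        System.out.println(String.join(",", doc.allValues("keyword"))); // one,two
    }
}
```

Either way, both lookups work on a single document; neither returns values across all the hits of a search.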

Erik


similarity matrix - more clear

2004-11-30 Thread Roxana Angheluta
Dear all,
Yesterday I asked a question about getting the similarity matrix of a 
collection of documents from an index, but I got only one answer, so 
perhaps my question was not very clear.

I will try to reformulate:
I want to use Lucene to have efficient access to an index of a 
collection of documents. My final purpose is to cluster documents. 
Therefore I need to have for each pair of documents a number signifying 
the similarity between them.
A possible solution would be to initialize in turn each document as a 
query, do a search using an IndexSearcher and to take from the search 
result the similarity between the query (which is in fact a document) 
and all the other documents. This is highly redundant, because the 
similarity between a pair of documents is computed multiple times.

I was wondering whether there is a simpler way to do it, since the index 
file contains all the information needed. Can anyone help me here?

Thanks,
roxana
PS I know about the project Carrot2, which deals with document 
clustering, but I think is not appropriate for me because of 2 reasons:
1) I need to keep the index on the disk for further reusage
2) I need to be able to search efficiently in the index
I thought Lucene can help me here, am I wrong?



Re: What is the best file system for Lucene?

2004-11-30 Thread Sanyi
> simply load your index into a
> RAMDirectory instead of using FSDirectory. 

I have 3GByte of RAM and my index is currently 3GByte big. (It will soon be
about 4GByte.)
So, I have to find this out another way.

> First off, 1.8GHz Pentium-M machines are supposed to run at about the
> speed of a 2.4GHz machine.  The clock speeds on the mobile chips are
> lower, but they tend to perform much better than rated.   I recommend
> you take a general benchmark of both machines testing both disk speed
> and cpu speed to get a baseline performance comparision.

I think it is a good general benchmark that almost everything runs at least
twice as fast on the 3.0GHz P4, except Lucene search.

I can give one more interesting piece of information:
I have a MySQL table with ~20 million records.
When I run a DROP INDEX on that table, MySQL rebuilds the whole huge table
into a tempfile.
It completes in 30 minutes on both systems.
Again, it doesn't matter that the 15kRPM U320 HDD is 2x-3x as fast.
Very surprising again.
Hmm... reiserfs must be very, very slow, or I'm completely lost :)

> I also suggest turning of HT for your benchmarks and performance testing.

I'll try this later and I really hope it won't be the reason.

> Secondly, while the second machine appears to be twice as fast, the
> disk could actually perform slower on the Linux box, especially if the
> notebook drive has a big (8M) cache like most 7200RPM ata disk drives
> do. 

Both drives have 8M cache.

> I imagine that if you hit the index with lots of simultaneous
> searches, that the Linux box would hold its own for much longer than
> the XP box simply due to the random seek performance of the scsi disk
> combined with scsi command queueing.

Are you saying that SCSI command queuing wastes more time than a 15kRPM 3.9ms
HDD can gain over a 7.2kRPM 8-9ms HDD?
It sounds terrible and I hope it isn't true.

> RAM speed is a factor too.  Is the p4 a xeon processor?  The older HT
> xeons have a much slower bus than the newer p4-m processors.  Memory
> speed will be affected accordingly.

It is not a Xeon, just a P4 3.0GHz HT.

> I haven't heard of a hard disk referred to as a winchester disk in a
> very long time :)

;)

> Once you have an idea of how the two machines actually compare
> performance-wise, you can then judge how they perform index
> operations.

Lucene indexing completes in 13-15 hours on the desktop system, while it
completes in about 29-33 hours on the notebook.

Now, combine that with the DROP INDEX tests completing in the same amount of
time on both, and find out why the search is only slightly faster :)

> Until then, all your measurements are subjective and you
> don't gain much by comparing the two indexing processes.

I'm worried about searching. Indexing is a lot faster on the desktop config.

Regards,
Sanyi







Re: What is the best file system for Lucene?

2004-11-30 Thread Sanyi
> How large is the index?   If it's less than a couple of GByte then it 
> will be entirely in memory

It is 3GBytes big and it will grow a lot.
I have to search from the HDD which is very fast compared to the notebook's HDD.

Average seek time:
Notebook: 8-9ms
Desktop: 3.9ms

Data read:
Notebook: max. ~20MBytes/sec
Desktop: 60-80MBytes/sec

So, if the bottleneck is the HDD, it has to be 2x-3x faster on the desktop 
system.
Except if reiserfs is a lot slower than NTFS.

> For example (and this is only an example) looking up a hostname in the 
> DNS will take about the same time on almost any machine you can get hold of.

OK, but I have very simple and pure tests, and everything is measured
part by part.
Every part speeds up a lot on the desktop system, except the Lucene
search part.

> You don't say how you're measuring search performance and you don't say 
> what you're seeing.

I call my java program from command line on both systems, like:
search hello
Then it searches for bravo and collects the elapsed milliseconds between every 
call to anything.
Then it displays the results. It is very simple.
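
One thing to keep in mind with this kind of measurement is that each command-line run starts a fresh JVM, so the first search also pays for class loading and cold caches. A minimal sketch of timing repeated runs inside one JVM is shown below; `SearchTimer` and `time` are hypothetical names, and the `Runnable` is assumed to wrap whatever call is being measured (for example the search itself).

```java
// Times a task several times in the same JVM, after some untimed warm-up runs.
final class SearchTimer {

    private SearchTimer() {}

    // Runs the task `warmups` times untimed, then `runs` times timed, and
    // returns the elapsed milliseconds of each timed run.
    static long[] time(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] elapsedMs = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            elapsedMs[i] = (System.nanoTime() - start) / 1_000_000L;
        }
        return elapsedMs;
    }

    public static void main(String[] args) {
        // Placeholder workload; a real test would run the search here.
        long[] ms = time(() -> Math.sqrt(12345.678), 5, 10);
        for (long m : ms) {
            System.out.println(m + " ms");
        }
    }
}
```

Comparing the first timed run against the later ones on both machines would show how much of the difference is start-up cost rather than steady-state search speed.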

> Also, what's the load on the system while you're 
> running the tests?   gkrellm on Linux is very useful as an overall view 
> -- are you CPU bound, are you seeing lots of disk traffic?   Is the 
> system actually more-or-less idle?

Thanks for the hint. Since my search only fetches 30 hits, it completes too
quickly to let me monitor it in real time.
Anyway, if reiserfs proves to be fast enough, I'll look for other causes
and will run longer tests for real-time monitoring.

Regards,
Sanyi






Re: What is the best file system for Lucene?

2004-11-30 Thread Justin Swanhart
As a generalisation, SuSE itself is not a lot slower than Windows XP. 
I also very much doubt that filesystem is a factor.  If you want to
test w/out filesystem involvement, simply load your index into a
RAMDirectory instead of using FSDirectory.  That precludes filesystem
overhead in searches.

There are quite a number of factors involved that could be affecting
performance.

First off, 1.8GHz Pentium-M machines are supposed to run at about the
speed of a 2.4GHz machine.  The clock speeds on the mobile chips are
lower, but they tend to perform much better than rated.   I recommend
you take a general benchmark of both machines testing both disk speed
and CPU speed to get a baseline performance comparison.  I also
suggest turning off HT for your benchmarks and performance testing.

Secondly, while the second machine appears to be twice as fast, the
disk could actually perform slower on the Linux box, especially if the
notebook drive has a big (8M) cache like most 7200RPM ata disk drives
do.  I imagine that if you hit the index with lots of simultaneous
searches, that the Linux box would hold its own for much longer than
the XP box simply due to the random seek performance of the scsi disk
combined with scsi command queueing.

RAM speed is a factor too.  Is the p4 a xeon processor?  The older HT
xeons have a much slower bus than the newer p4-m processors.  Memory
speed will be affected accordingly.

I haven't heard of a hard disk referred to as a winchester disk in a
very long time :)

Once you have an idea of how the two machines actually compare
performance-wise, you can then judge how they perform index
operations.  Until then, all your measurements are subjective and you
don't gain much by comparing the two indexing processes.

Justin

On Tue, 30 Nov 2004 02:04:46 -0800 (PST), Sanyi <[EMAIL PROTECTED]> wrote:
> Hi!
> 
> I'm testing Lucene 1.4.2 on two very different configs, but with the same
> index.
> I'm very surprised by the results: both systems are searching at about the
> same speed, but I'd expect (and I really need) Lucene to run a lot faster
> on my stronger config.
> 
> Config #1 (a notebook):
> WinXP Pro, NTFS, 1.8GHz Pentium-M, 768Megs memory, 7200RPM winchester
> 
> Config #2 (a desktop PC):
> SuSE 9.1 Pro, reiserfs, 3.0GHz P4 HT (virtually two 3.0GHz P4s), 3GByte RAM,
> 15000RPM U320 SCSI winchester
> 
> You can see that the hardware of #2 is at least twice as good/fast as #1.
> I'm looking for the reason, and for a way to take advantage of the better
> hardware compared to the poor notebook.
> Currently #2 can't dramatically outperform the notebook (#1).
> 
> The question is: what can be worse in #2 than on the poor notebook?
> 
> I can imagine only software problems.
> Which are the software parts then?
> 1. The OS
> Is SuSE 9.1 a LOT slower than WinXP Pro?
> 2. The file system
> Is reiserfs a LOT slower than NTFS?
> 
> Regards,
> Sanyi
> 



Re: AW: What is the best file system for Lucene?

2004-11-30 Thread Sanyi
> The notebook is quite good, e.g. the Pentium-M might be faster than
> your Pentium 4. At least it has a similar speed, because of its better
> internal design. Never compare cpus of different types by their
> frequency. 

Ok, this might be true, but:

All of my other CPU-bound tests run a LOT faster on the desktop PC with
the 3GHz P4.
Even other Java parts run a LOT faster (nearly twice as fast).
So we can't even say that the Java VM takes no advantage of the 3GHz P4
compared to the 1.8GHz Pentium-M.
Everything is a LOT faster, except searching with Lucene (which is also
faster, but only slightly).

> Maybe your index is small enough to fit into the cache provided by the 
> operating systems. So you wouldn't recognize any difference between your
> hard disks.

It is a 3GByte index and I always reboot between tests, so caching cannot
be the reason.

> I don't think so. I'm using Windows 2000 pro and SuSE 9.0 and 
> (from my memory) Linux seems to be slightly faster, but I can't
> provide any benchmark now.

Are you using reiserfs with SuSE?

Regards,
Sanyi






Re: What is the best file system for Lucene?

2004-11-30 Thread Sanyi
> Could you try XP on your desktop

Sure, but I'll only do that if I run out of ideas.

> so your desktop is actually using
> a 1.5GHz CPU for the search.

No, this is not true. It uses a 3.0GHz P4 then.
(HT means that you have two 3.0GHz P4s)

So, it is still surprising to me.

Regards,
Sanyi







Re: What is the best file system for Lucene?

2004-11-30 Thread Justin Swanhart
On Tue, 30 Nov 2004 12:07:46 -, Pete Lewis <[EMAIL PROTECTED]> wrote:
> Also, unless you take your hyperthreading off, with just one index you are
> searching with just one half of the CPU - so your desktop is actually using
> a 1.5GHz CPU for the search.  So, taking account of this it's not too
> surprising that they are searching at comparable speeds.
> 
> HTH
> Pete

Actually, that isn't how hyperthreading works.  The "second" CPU in a
hyperthreaded system should only run threads when the "main" cpu is
waiting on another task, like a memory access.  The second, or sub CPU
is only a virtual processor.  There aren't really two chips on board. 
New multicore processors will actually have more than one processor 
in one chip.

Problems can arise when you are using a HT processor on an operating
system that doesn't know about HT technology.  The OS should only
schedule jobs to run on the sub CPU under very specific circumstances.
 This is one of the major reasons for the scheduler overhaul in Linux
2.6.  The default scheduler in 2.4 would assign threads to the sub CPU
that shouldn't have been, and those threads would suffer from resource
starvation.




Re: What is the best file system for Lucene?

2004-11-30 Thread John Haxby

Sanyi wrote:
> I'm testing Lucene 1.4.2 on two very different configs, but with the same
> index.
> I'm very surprised by the results: Both systems are searching at about the
> same speed, but I'd expect (and I really need) to run Lucene a lot faster
> on my stronger config.
>
> Config #1 (a notebook):
> WinXP Pro, NTFS, 1.8GHz Pentium-M, 768Megs memory, 7200RPM winchester
>
> Config #2 (a desktop PC):
> SuSE 9.1 Pro, reiserfs, 3.0GHz P4 HT (virtually two 3.0GHz P4s), 3GByte RAM,
> 15000RPM U320 SCSI winchester
>
> You can see that the hardware of #2 is at least twice better/faster than #1.
> I'm searching the reason and the solution to take advantage of the better
> hardware compared to the poor notebook.
> Currently #2 can't amazingly outperform the notebook (#1).

How large is the index?   If it's less than a couple of GBytes then it 
will be entirely in memory after you've done a few searches on the Linux 
box.  You can force it into memory by cat'ing all the index files to 
/dev/null a couple of times (cat * > /dev/null).   A 3GHz system should 
now perform dramatically faster than a 1.5GHz system no matter what the 
file system. (And it's still 3GHz whether or not hyperthreading is 
turned on -- hyperthreading simply makes use of some under-used silicon 
to give you somewhere between 1 and 2 CPUs.  In some pathological cases 
it can give you less than one CPU, but I don't think Lucene falls into 
that category.  And it's going to be a helluva lot faster than any 
Pentium M because it has a nice healthy cache.)
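
[Sketch: the "cat * > /dev/null" trick above can also be done from Java, for example on a box without a shell. This is only an illustration under stated assumptions, not something from the thread: the class name WarmIndexCache and the method warm are made up, it needs Java 7 or later (newer than the 1.4.2 JDK discussed here), and it assumes the index is a flat directory of files. Whether the files actually stay in the OS cache depends on free RAM.]

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: reads every index file once so the OS can cache it.
public class WarmIndexCache {

    /** Reads every regular file in indexDir and returns the total bytes read. */
    static long warm(Path indexDir) throws IOException {
        byte[] buffer = new byte[1 << 16];
        long total = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip subdirectories and special files
                }
                try (InputStream in = Files.newInputStream(file)) {
                    int n;
                    // The data is discarded; reading it is the whole point.
                    while ((n = in.read(buffer)) != -1) {
                        total += n;
                    }
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // The index path is an assumption; pass the real one as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println("read " + warm(dir) + " bytes from " + dir);
    }
}
```

Running it twice and comparing wall-clock time gives a rough hint of whether the second pass was served from memory.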

However, I don't believe that the hardware, OS or file system have 
anything to do with it.   Normally if you're seeing similar performance 
on widely differing platforms you're seeing latency somewhere else.   
For example (and this is only an example) looking up a hostname in the 
DNS will take about the same time on almost any machine you can get hold of.

You don't say how you're measuring search performance and you don't say 
what you're seeing.   Also, what's the load on the system while you're 
running the tests?   gkrellm on Linux is very useful as an overall view 
-- are you CPU bound, are you seeing lots of disk traffic?   Is the 
system actually more-or-less idle?
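
[Sketch: one way to make "how you're measuring" concrete is a small harness that runs the same query repeatedly, throws away warm-up runs, and reports min/median/max latency. This is an illustration only: SearchTimer, measure and median are made-up names, the workload in main is a placeholder standing in for the real Lucene search call, and it needs Java 8 or later.]

```java
import java.util.Arrays;
import java.util.concurrent.Callable;

// Hypothetical timing harness; not part of Lucene.
public class SearchTimer {

    /** Runs task `warmup` times untimed, then `runs` times timed; returns sorted latencies in ms. */
    static double[] measure(Callable<?> task, int warmup, int runs) throws Exception {
        for (int i = 0; i < warmup; i++) {
            task.call();
        }
        double[] millis = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.call();
            millis[i] = (System.nanoTime() - start) / 1e6;
        }
        Arrays.sort(millis);
        return millis;
    }

    /** Median of an already sorted array. */
    static double median(double[] sorted) {
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder workload (assumption); replace the body with the real search.
        Callable<Long> placeholder = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i % 7;
            }
            return sum;
        };
        double[] ms = measure(placeholder, 5, 20);
        System.out.printf("min %.3f ms, median %.3f ms, max %.3f ms%n",
                ms[0], median(ms), ms[ms.length - 1]);
    }
}
```

Running the same harness on both machines, with the same queries and with a system monitor open, would show whether the time goes to CPU, disk, or waiting on something else.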

jch



AW: What is the best file system for Lucene?

2004-11-30 Thread Wolf-Dietrich.Materna
Hello,
Sanyi [mailto:[EMAIL PROTECTED] wrote:
> I'm testing Lucene 1.4.2 on two very different configs, but 
> with the same index.
> I'm very surprised by the results: Both systems are searching 
> at about the same speed, but I'd expect (and I really need) 
> to run Lucene a lot faster on my stronger config.
> 
> Config #1 (a notebook):
> WinXP Pro, NTFS, 1.8GHz Pentium-M, 768Megs memory, 7200RPM winchester
> 
> Config #2 (a desktop PC):
> SuSE 9.1 Pro, reiserfs, 3.0GHz P4 HT (virtually two 3.0GHz 
> P4s), 3GByte RAM, 15000RPM U320 SCSI winchester
> 
> You can see that the hardware of #2 is at least twice 
> better/faster than #1.
> I'm searching the reason and the solution to take advantage 
> of the better hardware compared to the poor notebook.
> Currently #2 can't amazingly outperform the notebook (#1).
> 
> The question is: What can be worse in #2 than on the poor notebook?
The notebook is quite good, e.g. the Pentium-M might be faster than
your Pentium 4. At least it has a similar speed, because of its better
internal design. Never compare CPUs of different types by their
frequency.
Use benchmarks, e.g. SpecInt_2000, to compare CPUs, but keep in mind
that these ratings will be different from your "real world" application.
SPECint2000(base) rating of a [EMAIL PROTECTED],06Ghz: 1085,
 Details:

SPECint2000(base) rating of Pentium M [EMAIL PROTECTED]: 1541 (!)
  Details:

Note: this is a workstation using a faster version of your notebook cpu.
I haven't found any Pentium M system with 1,8Ghz in the list.

Maybe your index is small enough to fit into the cache provided by the
operating system. So you wouldn't notice any difference between your
hard disks.

> I can imagine only software problems.
> Which are the software parts then?
> 1. The OS. Is SuSE 9.1 a LOT slower than WinXP pro?
> 2. The file system. Is reiserfs a LOT slower than NTFS?
I don't think so. I'm using Windows 2000 pro and SuSE 9.0 and
(from my memory) Linux seems to be slightly faster, but I can't
provide any benchmark now.
You should re-run your tests on the same hardware.
Regards,
Wolf-Dietrich




Re: What is the best file system for Lucene?

2004-11-30 Thread Pete Lewis
Hi Sanyi

Could you try XP on your desktop - that would take some variables out.  The
problem is that you are comparing operating systems, filesystems and
hardware configs all at the same time.

Also, unless you take your hyperthreading off, with just one index you are
searching with just one half of the CPU - so your desktop is actually using
a 1.5GHz CPU for the search.  So, taking account of this it's not too
surprising that they are searching at comparable speeds.

HTH
Pete

- Original Message - 
From: "Sanyi" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 30, 2004 11:28 AM
Subject: Re: What is the best file system for Lucene?


> > Interesting, what are your merge settings
>
> Sorry, I didn't mention that I was talking about search performance.
> I'm using the same, fully optimized index on both systems.
> (I've generated both indexes with the same code from the same database on
> the actual OS)
>
> > which JDK are you using?
>
> I'm using the same Sun JDK on both systems.
> I've tried so far:
> j2sdk1.4.2_04 _05 and _06.
> I didn't notice speed differences between these subversions.
> Do you know about significant speed differences between them I should
> notice?
>
> > Have you tried with hyperthreading turned off on #2?
>
> No, but I will try it if the problem isn't in the file system.
> I hope that the reason of slowness is reiserfs, because it is the easiest
> to change.
>
> What file systems are you people using Lucene on? And what are your
> experiences?
>
> Regards,
> Sanyi
>
>
>
>
>
>





GETVALUES +SEARCH

2004-11-30 Thread Karthik N S

Hi Guys


Apologies.




In the search API [ class org.apache.lucene.document.Document ]:

Will 'public final String[] getValues(String name)' return
all the docs without looping through them?

Please explain with an example.



Thx in advance



  WITH WARM REGARDS
  HAVE A NICE DAY
  [ N.S.KARTHIK]







Re: SEARCH CRITERIA

2004-11-30 Thread Nader Henein
They probably create a list of similar results by doing some sort of 
data mining on the search criteria that people use in succession. Or 
they have a list of searches that are too general (a search for the 
word kid is at best stupid), but you can't call your users stupid, so 
you try to guess what they're searching for based on other searches 
conducted (kid rock, kid games, star wars kid, karate kid) that contain 
the initial search string "kid". You can use fuzzy search in Lucene, 
but that won't really do this. The short answer is DIY, depending on 
your needs.
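
[Sketch: the DIY approach described above could look roughly like this, assuming you keep a log of past queries. Everything here is hypothetical: the names AlsoTry and suggest, the in-memory list standing in for a real query log, and the choice to match on whole words (so "kidney stones" is not suggested for "kid") and rank by how often each query was seen. It uses only the Java standard library (Java 8+), not Lucene.]

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical "Also try:" suggester built from a query log.
public class AlsoTry {

    /** Returns up to `limit` logged queries that contain `term` as a word, most frequent first. */
    static List<String> suggest(String term, List<String> queryLog, int limit) {
        String needle = term.trim().toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (String query : queryLog) {
            String normalized = query.trim().toLowerCase(Locale.ROOT);
            if (normalized.equals(needle)) {
                continue; // do not suggest the search the user just ran
            }
            if (Arrays.asList(normalized.split("\\s+")).contains(needle)) {
                counts.merge(normalized, 1, Integer::sum);
            }
        }
        return counts.entrySet().stream()
                // most frequent first; ties broken alphabetically for stable output
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "kid rock", "karate kid", "kid rock", "kidney stones",
                "kid", "Kid Games", "star wars kid", "kid rock");
        System.out.println("Also try: " + String.join(", ", suggest("kid", log, 4)));
    }
}
```

In a real system the log would come from recorded user searches, and the counting could be done offline instead of on every request.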

My two galiuns
Nader Henein

Karthik N S wrote:
> Hi Guys
> Apologies.
> On yahoo and Altavista ,if searched upon a word like 'kid'  returns the
> search with
> similar as below.
>   Also try: kid rock, kid games, star wars kid, karate kid   More...
>
>  How to obtain the similar search criteria using Lucene.
> Thx in advance
> Warm regards
> Karthik


SEARCH CRITERIA

2004-11-30 Thread Karthik N S

Hi Guys

Apologies.


On Yahoo and AltaVista, a search for a word like 'kid' returns
suggestions for similar searches, as below:

   Also try: kid rock, kid games, star wars kid, karate kid   More...

How can I obtain similar search suggestions using Lucene?


Thx in advance


Warm regards
Karthik






Re: What is the best file system for Lucene?

2004-11-30 Thread sg
> What file systems are you people using Lucene on? And what are your
> experiences?

http://www.apple.com/xsan/

Actually it is a beta version and has some small issues, but it is very fast
and easy to manage once you get it installed.
The installation itself is tricky, since it depends heavily on your network
setup and needs well-working DNS, routing etc.
However, it is fast as the wind. :-)

HTH
Stefan




Re: What is the best file system for Lucene?

2004-11-30 Thread Sanyi
> Interesting, what are your merge settings

Sorry, I didn't mention that I was talking about search performance.
I'm using the same, fully optimized index on both systems.
(I've generated both indexes with the same code from the same database on the 
actual OS)

> which JDK are you using?

I'm using the same Sun JDK on both systems.
I've tried so far:
j2sdk1.4.2_04 _05 and _06.
I didn't notice speed differences between these subversions.
Do you know about significant speed differences between them I should notice?

> Have you tried with hyperthreading turned off on #2?

No, but I will try it if the problem isn't in the file system.
I hope that the reason for the slowness is reiserfs, because it is the
easiest to change.

What file systems are you people using Lucene on? And what are your experiences?

Regards,
Sanyi







Re: What is the best file system for Lucene?

2004-11-30 Thread John Moylan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1
Interesting, what are your merge settings, and which JDK are you
using? (There are big differences between versions.) Have you tried with
hyperthreading turned off on #2? If so, did it fare any differently?
Regards,
John
Sanyi wrote:
| Hi!
|
| I'm testing Lucene 1.4.2 on two very different configs, but with the
| same index.
| I'm very surprised by the results: Both systems are searching at about
| the same speed, but I'd expect (and I really need) to run Lucene a lot
| faster on my stronger config.
|
| Config #1 (a notebook):
| WinXP Pro, NTFS, 1.8GHz Pentium-M, 768Megs memory, 7200RPM winchester
|
| Config #2 (a desktop PC):
| SuSE 9.1 Pro, reiserfs, 3.0GHz P4 HT (virtually two 3.0GHz P4s), 3GByte
| RAM, 15000RPM U320 SCSI winchester
|
| You can see that the hardware of #2 is at least twice better/faster
| than #1.
| I'm searching the reason and the solution to take advantage of the
| better hardware compared to the poor notebook.
| Currently #2 can't amazingly outperform the notebook (#1).
|
| The question is: What can be worse in #2 than on the poor notebook?
|
| I can imagine only software problems.
| Which are the software parts then?
| 1. The OS
| Is SuSE 9.1 a LOT slower than WinXP pro?
| 2. The file system
| Is reiserfs a LOT slower than NTFS?
|
| Regards,
| Sanyi
|
|
|   
|   
|
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.4 (GNU/Linux)
iD8DBQFBrFLWHgDLUzVQ7OARAraNAJ96DcMxVGYZQCmbjTpnaNJHlBEDRwCfcYoa
1UVJ37tcsNRp2m7h42265QA=
=BP6l
-END PGP SIGNATURE-

What is the best file system for Lucene?

2004-11-30 Thread Sanyi
Hi!

I'm testing Lucene 1.4.2 on two very different configs, but with the same
index.
I'm very surprised by the results: both systems are searching at about the
same speed, but I'd expect (and I really need) Lucene to run a lot faster
on my stronger config.

Config #1 (a notebook):
WinXP Pro, NTFS, 1.8GHz Pentium-M, 768Megs memory, 7200RPM winchester

Config #2 (a desktop PC):
SuSE 9.1 Pro, reiserfs, 3.0GHz P4 HT (virtually two 3.0GHz P4s), 3GByte RAM,
15000RPM U320 SCSI winchester

You can see that the hardware of #2 is at least twice as good/fast as #1.
I'm looking for the reason, and for a way to take advantage of the better
hardware compared to the poor notebook.
Currently #2 does not dramatically outperform the notebook (#1).

The question is: What can be worse in #2 than on the poor notebook?

I can imagine only software problems.
Which are the software parts then?
1. The OS
Is SuSE 9.1 a LOT slower than WinXP pro?
2. The file system
Is reiserfs a LOT slower than NTFS?

Regards,
Sanyi



