Re: Searching with words that contain % , / and the like

2005-01-27 Thread Robinson Raju
Hi Jason,
yes, the documentation does mention escaping, but that's only for special
characters used in the query syntax, right?
I've tried escaping too.
To answer your question, I am sure it is not the HTTP request which is eating it up.

Query query = MultiFieldQueryParser.parse("test/s",
 value, analyzer);

  the resulting query is just value:test

  I am using StandardAnalyzer


On Thu, 27 Jan 2005 17:53:39 +1100, Jason Polites
[EMAIL PROTECTED] wrote:
 Lucene doco mentions escaping, but doesn't include the / char...
 
 --
 Lucene supports escaping special characters that are part of the query
 syntax. The current list of special characters is
 
 + - && || ! ( ) { } [ ] ^ " ~ * ? : \
 
 To escape these characters, use the \ before the character. For example, to
 search for (1+1):2 use the query:
 
 \(1\+1\)\:2
 --
 
 You could try escaping it anyway?
 
 Are you sure it's not an HTTP request which is screwing with the parameter?
 
 
 - Original Message -
 From: Robinson Raju [EMAIL PROTECTED]
 To: Lucene Users List lucene-user@jakarta.apache.org
 Sent: Thursday, January 27, 2005 5:42 PM
 Subject: Searching with words that contain % , / and the like
 
  Hi ,
 
   Is there a way to search for words that contain "/" or "%"?
   if my query is "test/s", it is just taken as "test";
   if my query is "test/p", it is just taken as "test p".
   has anyone done this / faced such an issue?
 
  Regards
  Robin
 
 
 
 
 


-- 
Regards,
Robin
9886394650
The merit of an action lies in finishing it to the end




Re: Searching with words that contain % , / and the like

2005-01-27 Thread Chris Lamprecht
Without looking at the source, my guess is that StandardAnalyzer (and
StandardTokenizer) is the culprit.  The StandardAnalyzer grammar (in
StandardTokenizer.jj) is probably defined so "x/y" parses into two
tokens, "x" and "y".  "s" is a default stopword (see
StopAnalyzer.ENGLISH_STOP_WORDS), so it gets filtered out, while "p"
does not.

To get what you want, you can use a WhitespaceAnalyzer, write your own
custom Analyzer or Tokenizer, or modify the StandardTokenizer.jj
grammar to suit your needs.  WhitespaceAnalyzer is much simpler than
StandardAnalyzer, so you may see some other things being tokenized
differently.
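
A quick way to compare them is to dump the tokens each analyzer produces;
here is a minimal sketch (Lucene 1.4 analysis API; the field name is
arbitrary):

import java.io.StringReader;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class TokenDump {
    public static void main(String[] args) throws Exception {
        dump("StandardAnalyzer:  ", new StandardAnalyzer(), "test/s");
        dump("WhitespaceAnalyzer:", new WhitespaceAnalyzer(), "test/s");
    }

    // Prints each token the analyzer emits for the given text.
    static void dump(String label, Analyzer a, String text) throws Exception {
        TokenStream ts = a.tokenStream("field", new StringReader(text));
        System.out.print(label);
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.print(" [" + t.termText() + "]");
        }
        System.out.println();
    }
}

StandardAnalyzer should print only [test] (the "s" gets stopped out), while
WhitespaceAnalyzer prints [test/s] as a single token.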

-Chris

On Thu, 27 Jan 2005 12:12:16 +0530, Robinson Raju
[EMAIL PROTECTED] wrote:
 Hi ,
 
  Is there a way to search for words that contain "/" or "%"?
  if my query is "test/s", it is just taken as "test";
  if my query is "test/p", it is just taken as "test p".
  has anyone done this / faced such an issue?
 
 Regards
 Robin
 
 





Re: text highlighting

2005-01-27 Thread Youngho Cho
Hello,

When I used the code with CJKAnalyzer and searched English text
(because the text is mixed with Korean and English),
sometimes the returned String is empty.
Other cases work well.

Is the code analyzer-dependent?

Thanks.

Youngho

---  Test Code ( Just copy of the Book code ) -

private static final String HIGH_LIGHT_OPEN = "<span class=\"highlight\">";
private static final String HIGH_LIGHT_CLOSE = "</span>";

public static String highLight(String value, String queryString)
    throws IOException
{
    if (StringUtils.isEmpty(value) || StringUtils.isEmpty(queryString))
    {
        return value;
    }

    TermQuery query = new TermQuery(new Term("h", queryString));
    QueryScorer scorer = new QueryScorer(query);
    SimpleHTMLFormatter formatter = new SimpleHTMLFormatter(HIGH_LIGHT_OPEN,
        HIGH_LIGHT_CLOSE);
    Highlighter highlighter = new Highlighter(formatter, scorer);

    Fragmenter fragmenter = new SimpleFragmenter(50);

    highlighter.setTextFragmenter(fragmenter);

    TokenStream tokenStream = new CJKAnalyzer().tokenStream("h",
        new StringReader(value));

    return highlighter.getBestFragments(tokenStream, value, 5, "...");
}

- Original Message - 
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Thursday, January 27, 2005 8:37 AM
Subject: Re: text highlighting


 Also, there are some examples in the Lucene in Action source code (grab  
 it from http://www.lucenebook.com) (see HighlightIt.java).
 
 Erik
 
 On Jan 26, 2005, at 5:52 PM, markharw00d wrote:
 
  Michael Celona wrote:
 
  Does any have a working example of the highlighter class found in the
  sandbox?
 
 
  There are several in the accompanying Junit test:
  http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/ 
  contributions/highlighter/src/test/org/apache/lucene/search/highlight/
 
 
  Cheers
  Mark
 
 
 
 

Re: text highlighting

2005-01-27 Thread Youngho Cho
More test results:

If the text contains "... Family ...",
then

a "family" query string works OK,
but if the query string is "Family" then the highlighter returns nothing.


Thanks.

Youngho

- Original Message - 
From: Youngho Cho [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Cc: Che Dong [EMAIL PROTECTED]
Sent: Thursday, January 27, 2005 6:10 PM
Subject: Re: text highlighting


 Hello,
 
  When I used the code with CJKAnalyzer and searched English text
  (because the text is mixed with Korean and English),
  sometimes the returned String is empty.
  Other cases work well.
  
  Is the code analyzer-dependent?
 
 Thanks.
 
 Youngho
 
 ---  Test Code ( Just copy of the Book code ) -
 
  private static final String HIGH_LIGHT_OPEN = "<span class=\"highlight\">";
  private static final String HIGH_LIGHT_CLOSE = "</span>";
  
  public static String highLight(String value, String queryString)
      throws IOException
  {
      if (StringUtils.isEmpty(value) || StringUtils.isEmpty(queryString))
      {
          return value;
      }
  
      TermQuery query = new TermQuery(new Term("h", queryString));
      QueryScorer scorer = new QueryScorer(query);
      SimpleHTMLFormatter formatter = new SimpleHTMLFormatter(HIGH_LIGHT_OPEN,
          HIGH_LIGHT_CLOSE);
      Highlighter highlighter = new Highlighter(formatter, scorer);
  
      Fragmenter fragmenter = new SimpleFragmenter(50);
  
      highlighter.setTextFragmenter(fragmenter);
  
      TokenStream tokenStream = new CJKAnalyzer().tokenStream("h",
          new StringReader(value));
  
      return highlighter.getBestFragments(tokenStream, value, 5, "...");
  }
 
 - Original Message - 
 From: Erik Hatcher [EMAIL PROTECTED]
 To: Lucene Users List lucene-user@jakarta.apache.org
 Sent: Thursday, January 27, 2005 8:37 AM
 Subject: Re: text highlighting
 
 
  Also, there are some examples in the Lucene in Action source code (grab  
  it from http://www.lucenebook.com) (see HighlightIt.java).
  
  Erik
  
  On Jan 26, 2005, at 5:52 PM, markharw00d wrote:
  
   Michael Celona wrote:
  
   Does any have a working example of the highlighter class found in the
   sandbox?
  
  
   There are several in the accompanying Junit test:
   http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/ 
   contributions/highlighter/src/test/org/apache/lucene/search/highlight/
  
  
   Cheers
   Mark
  
  

Re: text highlighting

2005-01-27 Thread mark harwood
sometimes the returned String is empty.
Is the code analyzer-dependent?

When the highlighter.getBestFragments returns nothing
this is because there was no match found for query
terms in the TokenStream supplied.
This is nearly always because of Analyzer issues.
Check the post-analysis tokens produced for the query
and the tokens produced in the TokenStream passed to
the highlighter. The highlighter simply looks for
matches in the two sources of terms and uses the token
offsets to select the best sections of the supplied
text.
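
For example, a TermQuery is never analyzed, so new Term("h", "Family")
cannot match the lower-cased token "family" that the analyzer emits;
running the query string through QueryParser with the same analyzer avoids
the mismatch. A minimal sketch (Lucene 1.4 plus the sandbox highlighter;
the field name "h" and the sample text are made up):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.*;

public class HighlightSketch {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new CJKAnalyzer();
        String text = "... Family ...";

        // QueryParser analyzes "Family" down to "family", matching the
        // tokens the same analyzer will produce from the text below.
        Query query = QueryParser.parse("Family", "h", analyzer);

        Highlighter highlighter = new Highlighter(
            new SimpleHTMLFormatter("<b>", "</b>"), new QueryScorer(query));
        TokenStream tokens = analyzer.tokenStream("h", new StringReader(text));
        System.out.println(highlighter.getBestFragments(tokens, text, 5, "..."));
    }
}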

Cheers
Mark








LuceneRAR nearing first release

2005-01-27 Thread Joseph Ottinger
https://lucenerar.dev.java.net

LuceneRAR is now working, verified, on two containers: the J2EE 1.4 RI and
Orion. WebSphere testing is underway, with JBoss to follow.

LuceneRAR is a resource adapter for Lucene, allowing J2EE components to
look up an entry in a JNDI tree, using that reference to add and search
for documents. It's much like RemoteSearcher would be, except using JNDI
semantics for communication instead of RMI, which is a little more elegant
in a J2EE environment (where JNDI communication is very common).

LuceneRAR was created to allow J2EE components to legitimately use the
filesystem indexes (for speed) while not violating J2EE's suggestion to
not rely on filesystem access. It also allows distributed access to the
index (as remote servers would simply establish a JNDI connection to the
LuceneRAR home.)

Please take a look at it, if you're interested; the feature set isn't
complete, but it's workable. The distribution includes a sample application
that demonstrates index creation, searches, and statistics about the search.

Any comments are welcomed.

---
Joseph B. Ottinger http://enigmastation.com
IT Consultant[EMAIL PROTECTED]





Different Documents (with fields) in one index?

2005-01-27 Thread Karl Koch
Hello all,

perhaps not such a sophisticated question: 

I would like to have a very diverse set of documents in one index. Depending
on the content of the text documents, I would like to put parts of the text in
different fields. This means that in searches, when searching a particular
field, some of those documents won't be matched at all.

Is it possible to have different kinds of Documents with different index
fields in ONE index? Or do I need one index for each set?

Karl





Re: Different Documents (with fields) in one index?

2005-01-27 Thread Otis Gospodnetic
Karl,

This is completely fine.  You can have documents with different fields
in the same index.
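
For example (a minimal sketch against the Lucene 1.4 Field API; the index
path and field names are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class MixedIndex {
    public static void main(String[] args) throws Exception {
        IndexWriter writer =
            new IndexWriter("/tmp/mixed", new StandardAnalyzer(), true);

        Document article = new Document();
        article.add(Field.Keyword("type", "article"));
        article.add(Field.Text("title", "Indexing heterogeneous documents"));
        writer.addDocument(article);

        Document message = new Document();            // no "title" field at all
        message.add(Field.Keyword("type", "message"));
        message.add(Field.Text("body", "Messages have a body instead"));
        writer.addDocument(message);

        writer.close();  // a title:... query simply never matches the message
    }
}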

Otis

--- Karl Koch [EMAIL PROTECTED] wrote:

 Hello all,
 
 perhaps not such a sophisticated question: 
 
 I would like to have a very diverse set of documents in one index.
 Depending
 on the inside of text documents, I would like to put part of the text
 in
 different fields. This means in the searches, when searching a
 particular
 field, some of those documents won't be addressed at all.
 
 Is it possible to have different kinds of Documents with different
 index
 fields in ONE index? Or do I need one index for each set?
 
 Karl
 
 





Re: Different Documents (with fields) in one index?

2005-01-27 Thread Aad Nales
Nope, you don't need one index per set;
it is very possible. We have an index that holds the search info for 
documents, messages in discussion threads, filled in forms etc. etc. 
each having their own structure.

cheers,
Aad
Karl Koch wrote:
Hello all,
perhaps not such a sophisticated question: 

I would like to have a very diverse set of documents in one index. Depending
on the inside of text documents, I would like to put part of the text in
different fields. This means in the searches, when searching a particular
field, some of those documents won't be addressed at all.
Is it possible to have different kinds of Documents with different index
fields in ONE index? Or do I need one index for each set?
Karl
 




Index Layout Question

2005-01-27 Thread Jerry Jalenak
I am in the process of indexing about 1.5 million documents, and have
started down the path of indexing these by month.  Each month has between
100,000 and 200,000 documents.  From a performance standpoint, is this the
right approach?  This allows me to use MultiSearcher (or
ParallelMultiSearcher), but I'm not sure if the performance gains are really
there.  Would one monolithic index be better?
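
For context, this is roughly how the monthly layout gets searched (a
minimal sketch against the Lucene 1.4 API; the index paths are made up):

import org.apache.lucene.search.*;

public class MonthlySearch {
    public static void main(String[] args) throws Exception {
        // One searcher per monthly index, merged behind a single interface.
        Searchable[] months = new Searchable[] {
            new IndexSearcher("/indexes/2004-12"),
            new IndexSearcher("/indexes/2005-01")
        };
        Searcher searcher = new MultiSearcher(months);
        // searcher.search(query) now merges and re-ranks hits across months.
        searcher.close();
    }
}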

Thanks.

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496

[EMAIL PROTECTED]







Reloading an index

2005-01-27 Thread Greg Gershman
I have an index that is frequently updated.  When
indexing is completed, an event triggers a new
Searcher to be opened.  When the new Searcher is
opened, incoming searches are redirected to the new
Searcher, the old Searcher is closed and nulled, but I
still see about twice the amount of memory in use well
after the original searcher has been closed.   Is
there something else I can do to get this memory
reclaimed?  Should I explicitly call garbage
collection?  Any ideas?

Thanks.

Greg Gershman 






Re: Index Layout Question

2005-01-27 Thread Ian Soboroff
Jerry Jalenak [EMAIL PROTECTED] writes:

 I am in the process of indexing about 1.5 million documents, and have
 started down the path of indexing these by month.  Each month has between
 100,000 and 200,000 documents.  From a performance standpoint, is this the
 right approach?  This allows me to use MultiSearcher (or
 ParallelMultiSearcher), but I'm not sure if the performance gains are really
 there.  Would one monolithic index be better?

Depends on your search infrastructure.  Doug Cutting has sent out some
basic optimization guidelines on this list which should be in the
archives... simply, you need to think about how many CPUs and spindles
are involved.

1.5m documents isn't a challenge for Lucene to index or search on a
single machine with a monolithic index.  I indexed about 1.6m web
pages in 22 hours on a single machine with all data local, and search
with a single IndexSearcher was instantaneous.  We've also done some
testing with a larger collection (25m pages) and
ParallelMultiSearchers on several machines, and likewise on a fast
network haven't felt a slowdown, but we haven't actually benchmarked
it.

Ian






RE: Reloading an index

2005-01-27 Thread Cocula Remi
Make sure that the older searcher is not referenced elsewhere; otherwise the
garbage collector cannot
delete it.
Just remember that the garbage collector runs when memory is needed, not
immediately after changing a reference to null.


-Original Message-
From: Greg Gershman [mailto:[EMAIL PROTECTED]
Sent: Thursday, 27 January 2005 17:29
To: lucene-user@jakarta.apache.org
Subject: Reloading an index


I have an index that is frequently updated.  When
indexing is completed, an event triggers a new
Searcher to be opened.  When the new Searcher is
opened, incoming searches are redirected to the new
Searcher, the old Searcher is closed and nulled, but I
still see about twice the amount of memory in use well
after the original searcher has been closed.   Is
there something else I can do to get this memory
reclaimed?  Should I explicitly call garbage
collection?  Any ideas?

Thanks.

Greg Gershman 






Boosting Questions

2005-01-27 Thread Luke Shannon
Hi All;

I just want to make sure I have the right idea about boosting.

So if I boost a document (Document A) after I index it (let's say with a boost
of 2.0), Lucene will now consider this document relatively more important than
other documents in the index with a boost factor less than 2.0. This boost
factor will also be applied to all the fields in Document A. Therefore,
if I do a TermQuery on a field that all my documents share (title), then in the
returned Hits (assuming Document A was among the returned documents), Document
A will score higher than other documents with a lower boost factor, because
the title field in A would have been boosted along with all its other fields.
Correct?

Now if at indexing time I decide to boost a particular field, let's say
"address" in Document A (this is a field which all documents have), the boost
factor is only applied to the address field of Document A. Nothing else is
boosted by this operation. This means that if a TermQuery on the address field
returns Document A along with a collection of other documents, Document A
will score higher than the others because of the boosting. Correct?

Thanks,

Luke






Re: Boosting Questions

2005-01-27 Thread Otis Gospodnetic
Luke,

Boosting is only one of the factors involved in Document/Query scoring.
 Assuming that by applying your boosts to Document A or a single field
of Document A increases the total score enough, yes, that Document A
may have the highest score.  But just because you boost a single
Document and not others, it does not mean it will emerge at the top.
You should check out the Explanation class, which can dump all scoring
factors in text or HTML format.
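
For example (a minimal sketch against the Lucene 1.4 API; the index path
and term are made up):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class ExplainScores {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/tmp/index");
        Query query = new TermQuery(new Term("title", "lucene"));
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            // Shows tf, idf, lengthNorm and any index-time boosts per hit;
            // Explanation also has toHtml() for HTML output.
            Explanation explanation = searcher.explain(query, hits.id(i));
            System.out.println(explanation.toString());
        }
        searcher.close();
    }
}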

Otis


--- Luke Shannon [EMAIL PROTECTED] wrote:

 Hi All;
 
 I just want to make sure I have the right idea about boosting.
 
 So if I boost a document (Document A) after I index it (lets say a
 score of
 2.0) Lucene will now consider this document relatively more important
 than
 other documents in the index with a boost factor less than 2.0. This
 boost
 factor will also be applied to all the fields in the Document A.
 Therefore,
 if I do a TermQuery on a field that all my documents share (title),
 in the
 returned Hits (assuming Document A was among the return documents),
 Document
 A will score higher than other documents with a lower boost factor
 because
 the title field in A would have been boosted with all its other
 fields.
 Correct?
 
 Now if at indexing time I decided to boost a particular field, lets
 say
 address in Document A (this is a field which all documents have)
 the boost
 factor is only applied to the address field of Document A. Nothing
 else is
 boosted by this operation. This means if a TermQuery on the address
 field
 returns Document A along with a collection of other documents,
 Document A
 will score higher than the others because of boosting. Correct?
 
 Thanks,
 
 Luke
 
 
 
 
 





Re: Boosting Questions

2005-01-27 Thread Luke Shannon
Thanks Otis.

- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Thursday, January 27, 2005 12:11 PM
Subject: Re: Boosting Questions


 Luke,
 
 Boosting is only one of the factors involved in Document/Query scoring.
  Assuming that by applying your boosts to Document A or a single field
 of Document A increases the total score enough, yes, that Document A
 may have the highest score.  But just because you boost a single
 Document and not others, it does not mean it will emerge at the top.
 You should check out the Explanation class, which can dump all scoring
 factors in text or HTML format.
 
 Otis
 
 
 --- Luke Shannon [EMAIL PROTECTED] wrote:
 
  Hi All;
  
  I just want to make sure I have the right idea about boosting.
  
  So if I boost a document (Document A) after I index it (lets say a
  score of
  2.0) Lucene will now consider this document relatively more important
  than
  other documents in the index with a boost factor less than 2.0. This
  boost
  factor will also be applied to all the fields in the Document A.
  Therefore,
  if I do a TermQuery on a field that all my documents share (title),
  in the
  returned Hits (assuming Document A was among the return documents),
  Document
  A will score higher than other documents with a lower boost factor
  because
  the title field in A would have been boosted with all its other
  fields.
  Correct?
  
  Now if at indexing time I decided to boost a particular field, lets
  say
  address in Document A (this is a field which all documents have)
  the boost
  factor is only applied to the address field of Document A. Nothing
  else is
  boosted by this operation. This means if a TermQuery on the address
  field
  returns Document A along with a collection of other documents,
  Document A
  will score higher than the others because of boosting. Correct?
  
  Thanks,
  
  Luke
  
  
  
  
  
 
 
 
 





XML index

2005-01-27 Thread Karl Koch
Hi,

I want to use kXML with Lucene to index XML files. I think it is possible to
dynamically assign node names as Document field names and node text as the
field values (after running the text through an Analyzer).
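
Roughly what I have in mind (an untested sketch using kXML2's XmlPull API
and the Lucene 1.4 Field API; all names here are assumptions):

import java.io.FileReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.kxml2.io.KXmlParser;
import org.xmlpull.v1.XmlPullParser;

public class PullIndexer {
    // Element names become field names, element text becomes field values.
    public static Document toDocument(String file) throws Exception {
        XmlPullParser parser = new KXmlParser();
        parser.setInput(new FileReader(file));
        Document doc = new Document();
        String currentElement = null;
        for (int event = parser.next();
             event != XmlPullParser.END_DOCUMENT;
             event = parser.next()) {
            if (event == XmlPullParser.START_TAG) {
                currentElement = parser.getName();
            } else if (event == XmlPullParser.TEXT && currentElement != null) {
                doc.add(Field.Text(currentElement, parser.getText()));
            }
        }
        return doc;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toDocument(args[0]));
    }
}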

I have seen some XML indexing in the Sandbox. Is there anybody here who has done
something with a thin pull parser (perhaps even kXML)? Does anybody know of
a project or some source code available that covers this topic?

Karl

 





RE: Index Layout Question

2005-01-27 Thread Jerry Jalenak
That's good to know.

I'm indexing on 11 fields (9 keyword, 2 text).  The documents themselves are
between 1K to 2K in size.

Is there a point at which IndexSearcher performance begins to fall off?  (in
terms of the number of index records?)

Jerry Jalenak
Senior Programmer / Analyst, Web Publishing
LabOne, Inc.
10101 Renner Blvd.
Lenexa, KS  66219
(913) 577-1496

[EMAIL PROTECTED]


-Original Message-
From: Ian Soboroff [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 27, 2005 10:31 AM
To: Lucene Users List
Subject: Re: Index Layout Question


Jerry Jalenak [EMAIL PROTECTED] writes:

 I am in the process of indexing about 1.5 million documents, and have
 started down the path of indexing these by month.  Each month has between
 100,000 and 200,000 documents.  From a performance standpoint, is this the
 right approach?  This allows me to use MultiSearcher (or
 ParallelMultiSearcher), but I'm not sure if the performance gains are
really
 there.  Would one monolithic index be better?

Depends on your search infrastructure.  Doug Cutting has sent out some
basic optimization guidelines on this list which should be in the
archives... simply, you need to think about how many CPUs and spindles
are involved.

1.5m documents isn't a challenge for Lucene to index or search on a
single machine with a monolithic index.  I indexed about 1.6m web
pages in 22 hours on a single machine with all data local, and search
with a single IndexSearcher was instantaneous.  We've also done some
testing with a larger collection (25m pages) and
ParallelMultiSearchers on several machines, and likewise on a fast
network haven't felt a slowdown, but we haven't actually benchmarked
it.

Ian










Re: XML index

2005-01-27 Thread Otis Gospodnetic
Hello Karl,

Grab the source code for Lucene in Action, it's got code that parses
and indexes XML with DOM and SAX.  You can see the coverage of that
stuff here: 
http://lucenebook.com/search?query=indexing+XML+section%3A7*
I haven't used kXML, but I imagine the LIA code should get you going
quickly, and you are free to adapt the code to work with kXML.

Otis

--- Karl Koch [EMAIL PROTECTED] wrote:

 Hi,
 
 I want to use kXML with Lucene to index XML files. I think it is
 possible to
 dynamically assign Node names as Document fields and Node texts as
 Text
 (after using an Analyser). 
 
 I have seen some XML indexing in the Sandbox. Is there anybody here who
 has done
 something with a thin pull parser (perhaps even kXML)? Does anybody
 know of
 a project or some source code available that covers this topic?
 
 Karl
 
  
 
 
 
 





Re: Opening up one large index takes 940M or memory?

2005-01-27 Thread Doug Cutting
Kevin A. Burton wrote:
Is there any way to reduce this footprint?  The index is fully 
optimized... I'm willing to take a performance hit if necessary.  Is 
this documented anywhere?
You can increase TermInfosWriter.indexInterval.  You'll need to re-write 
the .tii file for this to take effect.  The simplest way to do this is 
to use IndexWriter.addIndexes(), adding your index to a new, empty, 
directory.  This will of course take a while for a 60GB index...

Doubling TermInfosWriter.indexInterval should half the Term memory usage 
and double the time required to look up terms in the dictionary.  With 
an index this large the the latter is probably not an issue, since 
processing term frequency and proximity data probably overwhelmingly 
dominate search performance.

Perhaps we should make this public by adding an IndexWriter method?
Also, you can list the size of your .tii file by using the main() from 
CompoundFileReader.

Doug


Re: Sort Performance Problems across large dataset

2005-01-27 Thread Doug Cutting
Peter Hollas wrote:
Currently we can issue a simple search query and expect a response back 
in about 0.2 seconds (~3,000 results) with the Lucene index that we have 
built. Lucene gives a much more predictable and faster average query 
time than using standard fulltext indexing with mySQL. This however 
returns result in score order, and not alphabetically.

To sort the resultset into alphabetical order, we added the species 
names as a seperate keyword field, and sorted using it whilst querying. 
This solution works fine, but is unacceptable since a query that returns 
thousands of results can take upwards of 30 seconds to sort them.
Are you using a Lucene Sort?  If you reuse the same IndexReader (or 
IndexSearcher) then perhaps the first query specifying a Sort will take 
30 seconds (although that's much slower than I'd expect), but subsequent 
searches that sort on the same field should be nearly as fast as results 
sorted by score.
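
For example (a sketch against the 1.4 Sort API; the paths and field names
are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;

public class SortedSearch {
    public static void main(String[] args) throws Exception {
        // Reuse one searcher: the field cache behind the Sort is built once.
        IndexSearcher searcher = new IndexSearcher("/tmp/species-index");
        Query query =
            QueryParser.parse("acacia", "contents", new StandardAnalyzer());
        Sort byName = new Sort("speciesName"); // the keyword field for sorting
        Hits first = searcher.search(query, byName); // pays one-time warm-up
        Hits later = searcher.search(query, byName); // fast from here on
        System.out.println(first.length() + " hits, " + later.length() + " hits");
        searcher.close();
    }
}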

Doug


query term frequency

2005-01-27 Thread Jonathan Lasko
What do I call to get the term frequencies for terms in the Query?  I 
can't seem to find it in the Javadoc...
Thanks.

Jonathan


Re: query term frequency

2005-01-27 Thread David Spencer
Jonathan Lasko wrote:
What do I call to get the term frequencies for terms in the Query?  I 
can't seem to find it in the Javadoc...
Do you mean the # of docs that have a term?
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#docFreq(org.apache.lucene.index.Term)
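For example (a minimal sketch; the index path and term are made up):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class DocFreqDemo {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/tmp/index");
        // Number of documents containing the term, not total occurrences.
        int df = reader.docFreq(new Term("contents", "lucene"));
        System.out.println("docFreq: " + df);
        reader.close();
    }
}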
Thanks.
Jonathan


Disk space used by optimize

2005-01-27 Thread Kauler, Leto S

Just a quick question:  after writing an index and then calling
optimize(), is it normal for the index to expand to about three times
the size before finally compressing?

In our case the optimise grinds the disk, expanding the index into many
files of about 145MB total, before compressing down to three files of
about 47MB total.  That must be a lot of disk activity for the people
with multi-gigabyte indexes!

Regards,
Leto





LuceneReader.delete (term t) Failure ?

2005-01-27 Thread akedar
Hi,

I am trying to delete a document from a Lucene index using:

 Term aTerm = new Term( "uid", path );
 aReader.delete( aTerm );
 aReader.close();

If the variable path="xxx/foo.txt" then I am able to delete the document.

However, if the path variable has a "-" in the string, the delete method does not work,

  e.g. path="xxx-yyy/foo.txt"  // Does Not work!!


Can I get around this problem?  I cannot substitute the minus character with '.' as
it has other implications.

Is this a bug? I am using Lucene 1.4-final version.

Thanks for the help
Atul





google mini? who needs it when Lucene is there

2005-01-27 Thread jian chen
Hi,

I was searching using google and just found that there was a new
feature called google mini. Initially I thought it was another free
service for small companies. Then I realized that it costs quite some
money ($4,995) for the hardware and software. (I guess the proprietary
software costs a whole lot more than actual hardware.)

The nice feature is that, you can only index up to 50,000 documents
with this price. If you need to index more, sorry, send in the
check...

It seems to me that any small biz will be ripped off if they install
this google mini thing, compared to using Lucene to implement easy-to-use
search software, which could search up to whatever number of
documents you could imagine.

I hope the lucene project could get exposed more to the enterprise so
that people know that they have not only cheaper but more importantly,
BETTER alternatives.

Jian




Re: Disk space used by optimize

2005-01-27 Thread Otis Gospodnetic
Hello,

Yes, that is how optimize works - copies all existing index segments
into one unified index segment, thus optimizing it.

see hit #1: http://www.lucenebook.com/search?query=optimize+disk+space

However, three times the space sounds a bit too much, or I made a
mistake in the book. :)

You said you end up with 3 files - .cfs is one of them, right?

Otis


--- Kauler, Leto S [EMAIL PROTECTED] wrote:

 
 Just a quick question:  after writing an index and then calling
 optimize(), is it normal for the index to expand to about three times
 the size before finally compressing?
 
 In our case the optimise grinds the disk, expanding the index into
 many
 files of about 145MB total, before compressing down to three files of
 about 47MB total.  That must be a lot of disk activity for the people
 with multi-gigabyte indexes!
 
 Regards,
 Leto
 
 



Re: google mini? who needs it when Lucene is there

2005-01-27 Thread Xiaohong Yang \(Sharon\)
Hi,
 
I agree that Google mini is quite expensive.  It might be similar to the 
desktop version in quality.  Does anyone know Google's ratio of index to text?
Is it true that Lucene's index is about 500 times the original text size (not 
including image size)?  I don't have one installed, so I cannot measure.
 
Best,
 
Sharon

jian chen [EMAIL PROTECTED] wrote:
Hi,

I was searching using google and just found that there was a new
feature called google mini. Initially I thought it was another free
service for small companies. Then I realized that it costs quite some
money ($4,995) for the hardware and software. (I guess the proprietary
software costs a whole lot more than actual hardware.)

The nice feature is that, you can only index up to 50,000 documents
with this price. If you need to index more, sorry, send in the
check...

It seems to me that any small biz will be ripped off if they install
this google mini thing, compared to using Lucene to implement easy-to-use
search software, which could search up to whatever number of
documents you could imagine.

I hope the lucene project could get exposed more to the enterprise so
that people know that they have not only cheaper but more importantly,
BETTER alternatives.

Jian




RE: Disk space used by optimize

2005-01-27 Thread Kauler, Leto S
Our copy of LIA is in the mail ;)

Yes the final three files are: the .cfs (46.8MB), deletable (4 bytes),
and segments (29 bytes).

--Leto



 -Original Message-
 From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
 
 Hello,
 
 Yes, that is how optimize works - copies all existing index 
 segments into one unified index segment, thus optimizing it.
 
 see hit #1: http://www.lucenebook.com/search?query=optimize+disk+space
 
 However, three times the space sounds a bit too much, or I 
 make a mistake in the book. :)
 
 You said you end up with 3 files - .cfs is one of them, right?
 
 Otis
 
 
 --- Kauler, Leto S [EMAIL PROTECTED] wrote:
 
  
  Just a quick question:  after writing an index and then calling 
  optimize(), is it normal for the index to expand to about 
 three times 
  the size before finally compressing?
  
  In our case the optimise grinds the disk, expanding the index into 
  many files of about 145MB total, before compressing down to three 
  files of about 47MB total.  That must be a lot of disk activity for 
  the people with multi-gigabyte indexes!
  
  Regards,
  Leto





rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-01-27 Thread David Spencer
This reminds me, has anyone ever discussed something similar:
- rackmount server ( or for coolness factor, that mini mac)
- web i/f for config/control
- of course the server would have the following s/w:
-- web server
-- lucene / nutch
Part of the work here I think is having a decent web i/f to configure 
the thing and to customize the L&F of the search results.


jian chen wrote:
Hi,
I was searching using google and just found that there was a new
feature called google mini. Initially I thought it was another free
service for small companies. Then I realized that it costs quite some
money ($4,995) for the hardware and software. (I guess the proprietary
software costs a whole lot more than actual hardware.)
The nice feature is that, you can only index up to 50,000 documents
with this price. If you need to index more, sorry, send in the
check...
It seems to me that any small biz will be ripped off if they install
this google mini thing, compared to using Lucene to implement easy-to-use
search software, which could search up to whatever number of
documents you could imagine.
I hope the lucene project could get exposed more to the enterprise so
that people know that they have not only cheaper but more importantly,
BETTER alternatives.
Jian


Re: google mini? who needs it when Lucene is there

2005-01-27 Thread John Wang
I think Google mini also includes crawling and a server wrapper, so it
is not entirely a 1-to-1 comparison.

Of course, extending Lucene to have those features is not at all
difficult anyway.

-John


On Thu, 27 Jan 2005 16:04:54 -0800 (PST), Xiaohong Yang (Sharon)
[EMAIL PROTECTED] wrote:
 Hi,
 
 I agree that Google mini is quite expensive.  It might be similar to the 
 desktop version in quality.  Anyone knows google's ratio of index to text?   
 Is it true that Lucene's index is about 500 times the original text size (not 
 including image size)?  I don't have one installed, so I cannot measure.
 
 Best,
 
 Sharon
 
 jian chen [EMAIL PROTECTED] wrote:
 Hi,
 
 I was searching using google and just found that there was a new
 feature called google mini. Initially I thought it was another free
 service for small companies. Then I realized that it costs quite some
 money ($4,995) for the hardware and software. (I guess the proprietary
 software costs a whole lot more than actual hardware.)
 
 The nice feature is that, you can only index up to 50,000 documents
 with this price. If you need to index more, sorry, send in the
 check...
 
 It seems to me that any small biz will be ripped off if they install
 this google mini thing, compared to using Lucene to implement easy-to-use
 search software, which could search up to whatever number of
 documents you could imagine.
 
 I hope the lucene project could get exposed more to the enterprise so
 that people know that they have not only cheaper but more importantly,
 BETTER alternatives.
 
 Jian
 
 





Re: LuceneReader.delete (term t) Failure ?

2005-01-27 Thread Erik Hatcher
How did you index the "uid" field?  Field.Keyword?  If not, that may be 
the problem, in that the field was analyzed.  For a key field like this, 
it needs to be unanalyzed/untokenized.
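
A minimal sketch of the round trip (Lucene 1.4 API; the paths and uid
value are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UidDelete {
    public static void main(String[] args) throws Exception {
        IndexWriter writer =
            new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        Document doc = new Document();
        // Field.Keyword stores the value as a single untokenized term,
        // so "-" and "/" survive intact.
        doc.add(Field.Keyword("uid", "xxx-yyy/foo.txt"));
        writer.addDocument(doc);
        writer.close();

        IndexReader reader = IndexReader.open("/tmp/index");
        // The exact same string now matches the indexed term.
        reader.delete(new Term("uid", "xxx-yyy/foo.txt"));
        reader.close();
    }
}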

Erik
On Jan 27, 2005, at 6:21 PM, [EMAIL PROTECTED] wrote:
Hi,
I am trying to delete a document from a Lucene index using:
 Term aTerm = new Term( "uid", path );
 aReader.delete( aTerm );
 aReader.close();
If the variable path="xxx/foo.txt" then I am able to delete the 
document.

However, if the path variable has a "-" in the string, the delete method 
does not work,

  e.g. path="xxx-yyy/foo.txt"  // Does Not work!!
Can I get around this problem?  I cannot substitute the minus character 
with '.' as
it has other implications.

Is this a bug? I am using Lucene 1.4-final version.
Thanks for the help
Atul


Re: text highlighting

2005-01-27 Thread Youngho Cho
Thanks for your reply.

I used QueryParser instead of TermQuery,
and now it all works!

Thanks.

Youngho

- Original Message - 
From: mark harwood [EMAIL PROTECTED]
To: lucene-user@jakarta.apache.org
Sent: Thursday, January 27, 2005 7:05 PM
Subject: Re: text highlighting


 sometimes the returned String is empty.
 Is the code analyzer-dependent?
 
 When the highlighter.getBestFragments returns nothing
 this is because there was no match found for query
 terms in the TokenStream supplied.
 This is nearly always because of Analyzer issues.
 Check the post-analysis tokens produced for the query
 and the tokens produced in the TokenStream passed to
 the highlighter. The highlighter simply looks for
 matches in the two sources of terms and uses the token
 offsets to select the best sections of the supplied
 text.
 
 Cheers
 Mark
 
 
 
 
 

Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-01-27 Thread Erik Hatcher
I've often said that there is a business to be had in packaging up 
Lucene (and now Nutch) into a cute little box with user friendly 
management software to search your intranet.  SearchBlox is already 
there (except they don't include the box).

I really hope that an application like SearchBlox/Zilverline can be 
created as part of the Lucene project itself, replacing the sad demos 
that currently ship with Lucene.  I've got so many things on my plate 
that I don't foresee myself getting to this as soon as I'd like, but I 
would most definitely support and contribute what time I could to such 
an effort.  If the web UI used Tapestry, I'd be very inclined to dig in 
hardcore to it.  Any other web UI technology would likely turn me off.  
One of these days I'll Tapestry-ify Nutch just for grins and submit it 
as a replacement for the JSPs.

And I'm even more sold on it if Mac Mini's are involved!  :)
Erik
On Jan 27, 2005, at 7:16 PM, David Spencer wrote:
This reminds me, has anyone ever discussed something similar:
- rackmount server ( or for coolness factor, that mini mac)
- web i/f for config/control
- of course the server would have the following s/w:
-- web server
-- lucene / nutch
Part of the work here I think is having a decent web i/f to configure 
the thing and to customize the L&F of the search results.


jian chen wrote:
Hi,
I was searching using google and just found that there was a new
feature called google mini. Initially I thought it was another free
service for small companies. Then I realized that it costs quite some
money ($4,995) for the hardware and software. (I guess the proprietary
software costs a whole lot more than actual hardware.)
The nice feature is that, you can only index up to 50,000 documents
with this price. If you need to index more, sorry, send in the
check...
It seems to me that any small biz will be ripped off if they install
this google mini thing, compared to using Lucene to implement easy-to-use
search software, which could search up to whatever number of
documents you could imagine.
I hope the lucene project could get exposed more to the enterprise so
that people know that they have not only cheaper but more importantly,
BETTER alternatives.
Jian


Re: Reloading an index

2005-01-27 Thread Chris Lamprecht
I just ran into a similar issue.  When you close an IndexSearcher, it
doesn't necessarily close the underlying IndexReader.  It depends
which constructor you used to create the IndexSearcher.  See the
constructors javadocs or source for the details.
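
A minimal sketch of the distinction (Lucene 1.4; the path is made up):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class CloseDemo {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/tmp/index");
        IndexSearcher searcher = new IndexSearcher(reader);
        // ... search, then swap in a new searcher ...
        searcher.close();  // does NOT close the reader you passed in
        reader.close();    // this is what actually releases the file handles
    }
}

Had the searcher been built with new IndexSearcher("/tmp/index") instead,
its close() would have closed the reader it opened internally.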

In my case, we were updating and optimizing the index from another
process, and reopening IndexSearchers.  We would eventually run out of
disk space because it was leaving open file handles to deleted files,
so the disk space was never being made available, until the JVM
processes ended.  If you're under linux, try running the 'lsof'
command to see if there are any handles to files marked (deleted).

-Chris

On Thu, 27 Jan 2005 08:28:30 -0800 (PST), Greg Gershman
[EMAIL PROTECTED] wrote:
 I have an index that is frequently updated.  When
 indexing is completed, an event triggers a new
 Searcher to be opened.  When the new Searcher is
 opened, incoming searches are redirected to the new
 Searcher, the old Searcher is closed and nulled, but I
 still see about twice the amount of memory in use well
 after the original searcher has been closed.   Is
 there something else I can do to get this memory
  reclaimed?  Should I explicitly call garbage
 collection?  Any ideas?
 
 Thanks.
 
 Greg Gershman
 
 __
 Do you Yahoo!?
 Meet the all-new My Yahoo! - Try it today!
 http://my.yahoo.com
 



Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-01-27 Thread Otis Gospodnetic
I discuss this with myself a lot inside my head... :)
Seriously, I agree with Erik.  I think this is a business opportunity.
How many people are hating me now and going shh?  Raise your
hands!

Otis

--- David Spencer [EMAIL PROTECTED] wrote:

 This reminds me, has anyone ever discussed something similar:
 
 - rackmount server ( or for coolness factor, that mini mac)
 - web i/f for config/control
 
 - of course the server would have the following s/w:
 -- web server
 -- lucene / nutch
 
 Part of the work here I think is having a decent web i/f to configure
 
 the thing and to customize the L&F of the search results.
 
 
 
 jian chen wrote:
  Hi,
  
  I was searching using google and just found that there was a new
  feature called google mini. Initially I thought it was another
 free
  service for small companies. Then I realized that it costs quite
 some
  money ($4,995) for the hardware and software. (I guess the
 proprietary
  software costs a whole lot more than actual hardware.)
  
  The nice feature is that, you can only index up to 50,000
 documents
  with this price. If you need to index more, sorry, send in the
  check...
  
  It seems to me that any small biz will be ripped off if they install
  this google mini thing, compared to using Lucene to implement easy-to-use
  search software, which could search up to whatever number of
  documents you could imagine.
  
  I hope the lucene project could get exposed more to the enterprise
 so
  that people know that they have not only cheaper but more
 importantly,
  BETTER alternatives.
  
  Jian
  
 



RE: Disk space used by optimize

2005-01-27 Thread Otis Gospodnetic
Have you tried using the multifile index format?  Now I wonder if there
is actually a difference in the disk space consumed by optimize() when you
use the multifile and compound index formats...

Otis

--- Kauler, Leto S [EMAIL PROTECTED] wrote:

 Our copy of LIA is in the mail ;)
 
 Yes the final three files are: the .cfs (46.8MB), deletable (4
 bytes),
 and segments (29 bytes).
 
 --Leto
 
 
 
  -Original Message-
  From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
  
  Hello,
  
  Yes, that is how optimize works - copies all existing index 
  segments into one unified index segment, thus optimizing it.
  
  see hit #1:
 http://www.lucenebook.com/search?query=optimize+disk+space
  
  However, three times the space sounds a bit too much, or I 
  make a mistake in the book. :)
  
  You said you end up with 3 files - .cfs is one of them, right?
  
  Otis
  
  
  --- Kauler, Leto S [EMAIL PROTECTED] wrote:
  
   
   Just a quick question:  after writing an index and then calling 
   optimize(), is it normal for the index to expand to about 
  three times 
   the size before finally compressing?
   
   In our case the optimise grinds the disk, expanding the index
 into 
   many files of about 145MB total, before compressing down to three
 
   files of about 47MB total.  That must be a lot of disk activity
 for 
   the people with multi-gigabyte indexes!
   
   Regards,
   Leto
 
 



Re: google mini? who needs it when Lucene is there

2005-01-27 Thread Otis Gospodnetic
500 times the original data?  Not true! :)

Otis

--- Xiaohong Yang (Sharon) [EMAIL PROTECTED] wrote:

 Hi,
  
 I agree that Google mini is quite expensive.  It might be similar to
 the desktop version in quality.  Anyone knows google's ratio of index
 to text?   Is it true that Lucene's index is about 500 times the
 original text size (not including image size)?  I don't have one
 installed, so I cannot measure.
  
 Best,
  
 Sharon
 
 jian chen [EMAIL PROTECTED] wrote:
 Hi,
 
 I was searching using google and just found that there was a new
 feature called google mini. Initially I thought it was another free
 service for small companies. Then I realized that it costs quite some
 money ($4,995) for the hardware and software. (I guess the
 proprietary
 software costs a whole lot more than actual hardware.)
 
 The nice feature is that, you can only index up to 50,000 documents
 with this price. If you need to index more, sorry, send in the
 check...
 
 It seems to me that any small biz will be ripped off if they install
 this google mini thing, compared to using Lucene to implement a easy
 to use search software, which could search up to whatever number of
 documents you could image.
 
 I hope the lucene project could get exposed more to the enterprise so
 that people know that they have not only cheaper but more
 importantly,
 BETTER alternatives.
 
 Jian
 



Re: Reloading an index

2005-01-27 Thread Chris Hostetter

: processes ended.  If you're under linux, try running the 'lsof'
: command to see if there are any handles to files marked (deleted).

:  Searcher, the old Searcher is closed and nulled, but I
:  still see about twice the amount of memory in use well
:  after the original searcher has been closed.   Is
:  there something else I can do to get this memory
:  reclaimed?  Should I explicitly call garbage
:  collection?  Any ideas?

In addition to the previous advice, keep in mind that depending on the
implementation of your JVM, it may never actually free memory back to
the OS.  And even the JVMs that can will only do it after a GC which results
in a ratio of unused/used memory that they deem worthy of freeing (usually
based on tuning parameters).

assuming you are using a Sun JVM, take a look at...

http://java.sun.com/docs/hotspot/gc1.4.2/index.html

...and search for MinHeapFreeRatio and MaxHeapFreeRatio


-Hoss





Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-01-27 Thread Chris Lamprecht
As they say, nothing lasts forever ;)

I like the idea.  If a project like this gets going, I think I'd be
interested in helping.

The Google mini looks very well done (they have two demos on the web
page).  For $5000, it's probably a very good solution for many
businesses.  If the demos are accurate, it seems like you almost
literally plug it in, configure a few things using the web interface,
and you're in business.   Demos are at
http://www.google.com/enterprise/mini/product_tours_demos.html

-chris

On Thu, 27 Jan 2005 17:40:53 -0800 (PST), Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 I discuss this with myself a lot inside my head... :)
 Seriously, I agree with Erik.  I think this is a business opportunity.
 How many people are hating me now and going shh?  Raise your
 hands!
 
 Otis
 
 --- David Spencer [EMAIL PROTECTED] wrote:
 
  This reminds me, has anyone ever discussed something similar:
 
  - rackmount server ( or for coolness factor, that mini mac)
  - web i/f for config/control
 
  - of course the server would have the following s/w:
  -- web server
  -- lucene / nutch
 
  Part of the work here I think is having a decent web i/f to configure
 
  the thing and to customize the L&F of the search results.
 
 
 
  jian chen wrote:
   Hi,
  
   I was searching using google and just found that there was a new
   feature called google mini. Initially I thought it was another
  free
   service for small companies. Then I realized that it costs quite
  some
   money ($4,995) for the hardware and software. (I guess the
  proprietary
   software costs a whole lot more than actual hardware.)
  
   The nice feature is that, you can only index up to 50,000
  documents
   with this price. If you need to index more, sorry, send in the
   check...
  
   It seems to me that any small biz will be ripped off if they install
   this google mini thing, compared to using Lucene to implement easy-to-use
   search software, which could search up to whatever number of
   documents you could imagine.
  
   I hope the lucene project could get exposed more to the enterprise
  so
   that people know that they have not only cheaper but more
  importantly,
   BETTER alternatives.
  
   Jian
  
  



Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-01-27 Thread Jason Polites
I think everyone agrees that this would be a very neat application of 
opensource technology like Lucene... however (opens drawer, pulls out 
devil's advocate hat, places on head)... there are several complexities here 
not addressed by Lucene (et al.).  Not because Lucene isn't damn fantastic, 
just because it's not its job.

One of the big ones is security.  Enterprise search is no good if it doesn't 
match up with the authentication and authorization paradigms existing in the 
organisation.  How useful is it to return a whole bunch of search results 
for documents to which you don't have access? Not to mention the issues 
around whether you are even authorized to know they exist.

The other prickly one is file types.  It's all well and good to index HTML, 
XML and text but when you start looking at PDF, MS Office (OLE docs, PSTs, 
Outlook MSG files, MS Project files etc), Lotus Notes databases etc etc, 
things begin to look less simple and far less elegant than a nice clean 
lucene rackmount.  Sure, there are great projects like Apache POI, but they 
still have a bit of a way to go before they mature to a point of really 
solving these problems.  After which time Microsoft will probably be rolling 
out Longhorn and everyone may need to start from scratch.

This is not to say that it's not a great idea, but as with most great ideas 
the challenge is not the formation of the idea, but its implementation.

I think a great first step would be to start developing good, reliable, 
opensource extensions to Lucene which strive to solve some of these issues.

end rant.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Re: LuceneReader.delete (term t) Failure ?

2005-01-27 Thread akedar
Erik,

I am using the keyword field 
doc.add(Field.Keyword("uid", pathRelToArea));
anything else I can check on ?

thanks
atul 

PS we worked together for Darden project 


 
 From: Erik Hatcher [EMAIL PROTECTED]
 Date: 2005/01/27 Thu PM 07:46:40 EST
 To: Lucene Users List lucene-user@jakarta.apache.org
 Subject: Re: LuceneReader.delete (term t) Failure ?
 
 How did you index the uid field?  Field.Keyword?  If not, that may be 
 the problem in that the field was analyzed.  For a key field like this, 
 it needs to be unanalyzed/untokenized.
 
   Erik
 
 On Jan 27, 2005, at 6:21 PM, [EMAIL PROTECTED] wrote:
 
  Hi,
 
  I am trying to delete a document from Lucene index using:
 
   Term aTerm = new Term( "uid", path );
   aReader.delete( aTerm );
   aReader.close();
 
  If the variable path=xxx/foo.txt then I am able to delete the 
  document.
 
  However, if the path variable has a - in the string, the delete method 
  does not work
 
e.g. path=xxx-yyy/foo.txt  // Does Not work!!
 
 
  Can I get around this problem?  I cannot substitute the minus character 
  with '.' as
  it has other implications.
 
  is this a bug ? I am using Lucene 1.4-final version.
 
  Thanks for the help
  Atul
 
 
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: LuceneReader.delete (term t) Failure ?

2005-01-27 Thread Erik Hatcher
Could you work up a self-contained RAMDirectory-using example that 
demonstrates this issue?

Erik
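
For anyone following along, a self-contained sketch of such a test against
the Lucene 1.4 API (the uid value just mirrors the report; this should
print "deleted: 1"):

   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.Field;
   import org.apache.lucene.index.IndexReader;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.index.Term;
   import org.apache.lucene.store.RAMDirectory;

   public class DeleteByUidTest {
       public static void main(String[] args) throws Exception {
           RAMDirectory dir = new RAMDirectory();
           IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
           Document doc = new Document();
           // Field.Keyword is stored and indexed but NOT analyzed, so the
           // Term used for the delete must match the string exactly.
           doc.add(Field.Keyword("uid", "xxx-yyy/foo.txt"));
           writer.addDocument(doc);
           writer.close();

           IndexReader reader = IndexReader.open(dir);
           int deleted = reader.delete(new Term("uid", "xxx-yyy/foo.txt"));
           System.out.println("deleted: " + deleted);
           reader.close();
       }
   }

If this prints 1 but the real application fails, the difference is almost
certainly in how the uid field was originally indexed.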

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: google mini? who needs it when Lucene is there

2005-01-27 Thread jian chen
Overall, even if google mini gives a lot of cool features compared to
a bare-bones lucene project, what good is it with the 50,000-document
limit? It is useless with that limit. That is just their way of trying
to turn it into another cash cow.

Jian


On Thu, 27 Jan 2005 17:45:03 -0800 (PST), Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 500 times the original data?  Not true! :)
 
 Otis
 
 --- Xiaohong Yang (Sharon) [EMAIL PROTECTED] wrote:
 
  Hi,
 
  I agree that Google mini is quite expensive.  It might be similar to
  the desktop version in quality.  Does anyone know google's ratio of index
  to text?   Is it true that Lucene's index is about 500 times the
  original text size (not including image size)?  I don't have one
  installed, so I cannot measure.
 
  Best,
 
  Sharon
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-01-27 Thread David Spencer
Jason Polites wrote:
I think everyone agrees that this would be a very neat application of 
opensource technology like Lucene... however (opens drawer, pulls out 
devil's advocate hat, places on head)... there are several complexities 
here not addressed by Lucene (et al.).  Not because Lucene isn't damn 
fantastic, just because it's not its job.

One of the big ones is security.  Enterprise search is no good if it 
doesn't match up with the authentication and authorization paradigms 
existing in the organisation.  How useful is it to return a whole bunch 
of search results for documents to which you don't have access? Not to 
mention the issues around whether you are even authorized to know they 
exist.
I was gonna mention this - you beat me to the punch.  I suspect that 
LDAP/JNDI integration is a start, but you need hooks for an arbitrary 
auth plugin. And once we address this it might be the case that a user 
has to *log in* to the search server.  We have Verity where I work and 
this is all the case, along w/ the fact that a sale seems to involve 
mandatory consulting work (not that that's bad, but if you're trying to 
ship a shrink-wrapped search engine in a box then this is an issue).
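
To make the hook idea concrete, a minimal post-filter sketch (AccessChecker
is made up here, not a Lucene or Verity API; raw collections because this
targets the Lucene 1.4 / JDK 1.4 era):

   import java.io.IOException;
   import java.util.ArrayList;
   import java.util.List;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.search.Hits;

   // The hook an auth plugin (LDAP or otherwise) would implement.
   interface AccessChecker {
       boolean canRead(String user, String docId);
   }

   class SecureSearch {
       // Drop hits the user may not see before rendering results.
       static List filterHits(Hits hits, String user, AccessChecker acl)
               throws IOException {
           List visible = new ArrayList();
           for (int i = 0; i < hits.length(); i++) {
               Document doc = hits.doc(i);
               if (acl.canRead(user, doc.get("uid"))) {
                   visible.add(doc);
               }
           }
           return visible;
       }
   }

Post-filtering is the crude version -- it can leak hit counts and gets slow
when most hits are denied; a per-user Filter applied at query time would be
the more serious approach.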

The other prickly one is file types.  It's all well and good to index 
HTML, XML and text but when you start looking at PDF, MS Office (OLE 
docs, PSTs, Outlook MSG files, MS Project files etc), Lotus Notes 
databases etc etc, things begin to look less simple and far less elegant 
than a nice clean lucene rackmount.  Sure there are great projects like 
Apache POI but they still have a bit of a way to go before they 
mature to a point of really solving these problems.  After which time 
Microsoft will probably be rolling out Longhorn and everyone may need to 
start from scratch.
Also need http://jcifs.samba.org/ so you can spider windows file shares.
This is not to say that it's not a great idea, but as with most great 
ideas the challenge is not the formation of the idea, but its 
implementation.
Indeed.
I think a great first step would be to start developing good, reliable, 
opensource extensions to Lucene which strive to solve some of these issues.

end rant.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Search results excerpt similar to Google

2005-01-27 Thread Ben
Hi

Is it hard to implement a function that displays search result
excerpts similar to Google's?

Is it just string manipulation, or is there some logic behind it? I
like their excerpts.

Thanks

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: google mini? who needs it when Lucene is there

2005-01-27 Thread David Spencer
Xiaohong Yang (Sharon) wrote:
Hi,
 
I agree that Google mini is quite expensive.  It might be similar to the
desktop version in quality.  Does anyone know google's ratio of index to
text?  Is it true that Lucene's index is about 500 times the original
text size (not including image size)?  I don't have one installed, so I
cannot measure.
500:1 for Lucene?  I don't think so.
In my wikipedia search engine the data in the MySQL DB I index from is 
approx 1.0 GB (sum of lengths of title and body), while the Lucene index 
of just these 2 fields is 250MB, thus in this case the Lucene index is 
25% of the corpus size.


 
Best,
 
Sharon



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Search results excerpt similar to Google

2005-01-27 Thread Jason Polites
I think they do a proximity result based on keyword matches.  So... If you 
search for lucene and the document returned has this word at the very 
start and the very end of the document, then you will see the two sentences 
(sequences of words) surrounding the two keyword matches, one from the start 
of the document and one from the end.

How you determine which words from the result you include in the summary is 
up to you.  The problem with this is that in Lucene-land you have to store 
the content of the document inside the index verbatim (so you can get 
arbitrary portions of it out).  This means your index will be larger than it 
really needs to be.

I usually just store the first 255 characters in the index and use this as a 
summary.  It's not as good as Google, but it seems to work ok.
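
For what it's worth, that approach needs only a stored-only field at index
time -- a sketch against the Lucene 1.4 field types (field names are
arbitrary):

   String summary = text.length() > 255 ? text.substring(0, 255) : text;
   doc.add(Field.UnIndexed("summary", summary)); // stored, never searched
   doc.add(Field.UnStored("contents", text));    // searched, never stored

For true keyword-in-context excerpts there is also the Highlighter package
in the Lucene sandbox, which does need the full text available (stored or
re-fetched) at display time.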



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]