[ANNOUNCE] : Lucene Server

2004-09-23 Thread Cocula Remi
I am glad to introduce a new project on SourceForge that is related to Lucene.

Lucene Server is a Java server application for simply creating and managing Jakarta 
Lucene indexes. It is designed to help you integrate Lucene in distributed 
environments.
The first release, 0.1, is available for download.
I hope it will be useful to somebody.
http://sourceforge.net/projects/luceneserver/

Remi COCULA.



Strange search results with wildcard - Bug?

2004-09-23 Thread Ulrich Mayring
Hi all,
first, here's how to reproduce the problem:
Go to http://www.denic.de/en/special/index.jsp and enter "obscure 
service" in the search field. You'll get 132 hits. Now enter "obscure 
service*" - and you only get 1 hit.

The above website is running Lucene 1.3rc3, but I was able to reproduce 
this locally with 1.4.1. Here are my local results with controlled 
pseudo-documents; perhaps you can see a pattern:

searching for "00700*" gets two documents:
007001 action and 007002 handle
searching for "handle" gets two documents:
007002 handle and 011010 handle
searching for "00700* handle" gets two documents:
007002 handle and 011010 handle
But where is 007001 action?
searching for "handle 00700*" gets two documents:
007001 action and 007002 handle
But where is 011010 handle?
We're using the MultiFieldQueryParser and the Snowball Stemmers, if that 
makes any difference.

Many thanks in advance for any pointers,
Ulrich
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Strange search results with wildcard - Bug?

2004-09-23 Thread Morus Walter
Ulrich Mayring writes:
 Hi all,
 
 first, here's how to reproduce the problem:
 
 Go to http://www.denic.de/en/special/index.jsp and enter obscure 
 service in the search field. You'll get 132 hits. Now enter obscure 
 service* - and you only get 1 hit.
 
 The above website is running Lucene 1.3rc3, but I was able to reproduce 
 this locally with 1.4.1. Here are my local results with controlled 
 pseudo documents, perhaps you can see a pattern:
 
 searching for 00700* gets two documents:
 007001 action and 007002 handle
 
 
 searching for handle gets two documents:
 007002 handle and 011010 handle
 
 
 searching for 00700* handle gets two documents:
 007002 handle and 011010 handle
 But where is 007001 action?
 
 
 searching for handle 00700* gets two documents:
 007001 action and 007002 handle
 But where is 001010 handle?
 
 
 We're using the MultiFieldQueryParser and the Snowball Stemmers, if that 
 makes any difference.
 
Your number/handle samples look ok to me if the default operator is AND.

Note that wildcard expressions are not analyzed, so if "service" is 
stemmed to anything different from "service", it's not surprising that
"service*" doesn't find it.

I think you should look at a) the analyzed form of your terms
and b) what the rewritten query looks like (there's a rewrite() method
on Query that expands wildcard queries into basic queries).

HTH
Morus
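Morus's point (b) can be sketched as follows, assuming Lucene 1.4, where Query.rewrite(IndexReader) expands a wildcard query into the matching basic term queries. The index directory and field name here are made-up placeholders:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class ShowRewrite {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("index");  // hypothetical index directory
        Query query = QueryParser.parse("service*", "contents", new StandardAnalyzer());
        // rewrite() expands the wildcard against the actual terms in the index,
        // so you can see exactly which terms the prefix matched
        Query expanded = query.rewrite(reader);
        System.out.println(expanded.toString("contents"));
        reader.close();
    }
}
```

Printing the rewritten query makes it obvious whether the stemmed form of a term is even reachable by the wildcard.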




Re: Strange search results with wildcard - Bug?

2004-09-23 Thread Ulrich Mayring
Morus Walter wrote:
Your number/handle samples look ok to me if the default operator is AND.
But it's OR ;-)
Using AND explicitly I get different results and using OR explicitly I 
get the same results as documented.

Note that wildcard expressions are not analyzed so if service is 
stemmed to anything different from service, it's not surprising that
service* doesn't find it.
Ok, I didn't know that, but it makes sense. Perhaps the phenomenon on 
the live pages is different from my local test installation. I was just 
looking for a comparable case on our live pages, but the real problem is 
in pages that I'm just developing locally and which look similar to the 
number/handle example.

I think you should look at a) what's the analyzed form of your terms
and b) how does the rewritten query look like (there's a rewrite method
for query that expands wildcard queries into basic queries).
Will do, thank you very much. However, how do I get at the analyzed form 
of my terms?

Ulrich


Re: Strange search results with wildcard - Bug?

2004-09-23 Thread Morus Walter
Ulrich Mayring writes:
 
 Will do, thank you very much. However, how do I get at the analyzed form 
 of my terms?
 
Instantiate the analyzer, create a token stream fed from your input,
loop over the tokens, and output the results.

Morus
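As a rough sketch of those steps against the Lucene 1.4 API (the Snowball analyzer and the field name are assumptions; substitute whatever you actually index with):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

public class ShowTokens {
    public static void main(String[] args) throws Exception {
        // Use the same analyzer you index with; SnowballAnalyzer is assumed here.
        Analyzer analyzer = new SnowballAnalyzer("English");
        TokenStream stream =
            analyzer.tokenStream("contents", new StringReader("obscure service"));
        Token token;
        while ((token = stream.next()) != null) {
            // prints the analyzed (stemmed) form of each input term
            System.out.println(token.termText());
        }
    }
}
```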




MultiSearcher + Sort

2004-09-23 Thread Karthik N S


Guys


Apologies


Am I doing something wrong, or is there a bug in Lucene on the Linux OS when using
MultiSearcher with Sort?

Please reply ASAP.

Tested with both lucene-1.4-final.jar and lucene-1.4.1.jar.

hits = multiSearcher.search(query, sortField);


Exception raised on the Linux OS only [on Windows it works perfectly].


Query String  : (contents:gifts contents:articles) (path:gifts
path:articles) (modified:gifts modified:articles) (filename:gifts
filename:articles) (bookid:gifts bookid:articles) (creation:gifts
creation:articles) (chapNme:gifts chapNme:articles) (itmName:gifts
itmName:articles) (urltext:gifts urltext:articles) (itemCode:gifts
itemCode:articles) (itemPrice:gifts itemPrice:articles) (pageid:gifts
pageid:articles)

--- EXCEPTION START-
The Exception Raised file = SearchCreateArrayDataFiles.createArray1
Centralized Boolean Factor =false
  SYSTEM IS STOPPING COMPILATION
-- EXCEPTION END-

---

java.lang.RuntimeException: no terms in field bookid - cannot determine sort type
        at org.apache.lucene.search.FieldCacheImpl.getAuto(FieldCacheImpl.java:319)
        at org.apache.lucene.search.FieldSortedHitQueue.comparatorAuto(FieldSortedHitQueue.java:326)
        at org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(FieldSortedHitQueue.java:167)
        at org.apache.lucene.search.FieldSortedHitQueue.<init>(FieldSortedHitQueue.java:58)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:118)
        at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:141)
        at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
        at org.apache.lucene.search.Hits.<init>(Hits.java:51)
        at org.apache.lucene.search.Searcher.search(Searcher.java:41)

---
-


/* at com.controlnet.indexing.search.SearchCreateArrayDataFiles.createArray1(SearchCreateArrayDataFiles.java:263)
 * at com.controlnet.indexing.search.SearchCreateArrayDataFiles.main(SearchCreateArrayDataFiles.java:308)
 */




  WITH WARM REGARDS
  HAVE A NICE DAY
  [ N.S.KARTHIK]







Clustering lucene's results

2004-09-23 Thread Dawid Weiss
Dear all,
I saw a post about an attempt to integrate Carrot2 with Lucene. It was a 
while ago, so I'm curious if any outcome has been achieved.

Anyway, as the project coordinator I can offer my help with such 
integration; if you're looking for some ready-to-use code then there is 
a clustering plugin for Nutch that integrates one of the clustering 
algorithms from Carrot2 with Nutch; I'm sure porting it to Lucene 
wouldn't be a big problem.

Regards,
Dawid
_
List sprawdzony skanerem poczty mks_vir ( http://www.mks.com.pl )


Re: problem with get/setBoost of document fields

2004-09-23 Thread Bastian Grimm [Eastbeam GmbH]
Hmm, OK,
but how will I be able to set different boosts on fields if this value 
is not stored? I don't really understand why I can set a boost factor 
that is then not stored and used.
What I want to do is weight my searchable index fields (type: 
Field.UnStored) with different factors, and if I'm not totally wrong 
this is done with setBoost when I create the doc and write it to the 
index... or is there another way to do this?

thanks, Bastian

Daniel Naber wrote:
See the documentation for getBoost:
Note: this value is not stored directly with the document in the index. 
Documents returned from IndexReader.document(int) and Hits.doc(int) may 
thus not have the same value present as when this field was indexed.

Regards
Daniel


Re: problem with get/setBoost of document fields

2004-09-23 Thread Erik Hatcher
The boost is not thrown away, but rather combined with the length 
normalization factor during indexing.  So while your actual boost value 
is not stored directly in the index, it is taken into consideration for 
scoring appropriately.

Erik
On Sep 23, 2004, at 8:17 AM, Bastian Grimm [Eastbeam GmbH] wrote:
hmm ok,
but how will i be able to set different boosts to fields, if this 
value is not stored?! i dont really understand why i can set a boost 
factor and it is not stored and used.
what i want to do, is to weight my searchable index fields (type: 
Field.UnStored) with a different factors for those fields and if am 
not totally wrong this is done with set boost when i create the doc 
and write it to the index... or is there another way to do this?

thanks, bastian

Daniel Naber wrote:
See the documentation for getBoost:
Note: this value is not stored directly with the document in the 
index. Documents returned from IndexReader.document(int) and 
Hits.doc(int) may thus not have the same value present as when this 
field was indexed.

Regards
Daniel


RE: Clustering lucene's results

2004-09-23 Thread William W
Hi Dawid,
I would like to use Carrot2 with lucene. Do you have examples ?
Thanks a lot,
William.

From: Dawid Weiss [EMAIL PROTECTED]
Reply-To: Lucene Users List [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Clustering lucene's results
Date: Thu, 23 Sep 2004 13:36:03 +0200
Dear all,
I saw a post about an attempt to integrate Carrot2 with Lucene. It was a 
while ago, so I'm curious if any outcome has been achieved.

Anyway, as the project coordinator I can offer my help with such 
integration; if you're looking for some ready-to-use code then there is a 
clustering plugin for Nutch that integrates one of the clustering 
algorithms from Carrot2 with Nutch; I'm sure porting it to Lucene wouldn't 
be a big problem.

Regards,
Dawid


Re: Clustering lucene's results

2004-09-23 Thread Dawid Weiss
Hi William,
No, I don't have examples, because I have never used Lucene directly. If 
you can, provide me with a sample index and an API that executes a query 
on this index (I need document titles, summaries or snippets, and an 
anchor (identifier), which can be a URL).

Send me such a snippet and I'll try to write the integration code with 
Lucene. It is only a matter of writing a simple InputComponent instance, 
and this is really trivial (see Nutch's plugin code).

Dawid
William W wrote:
Hi Dawid,
I would like to use Carrot2 with lucene. Do you have examples ?
Thanks a lot,
William.

From: Dawid Weiss [EMAIL PROTECTED]
Reply-To: Lucene Users List [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Clustering lucene's results
Date: Thu, 23 Sep 2004 13:36:03 +0200
Dear all,
I saw a post about an attempt to integrate Carrot2 with Lucene. It was 
a while ago, so I'm curious if any outcome has been achieved.

Anyway, as the project coordinator I can offer my help with such 
integration; if you're looking for some ready-to-use code then there 
is a clustering plugin for Nutch that integrates one of the clustering 
algorithms from Carrot2 with Nutch; I'm sure porting it to Lucene 
wouldn't be a big problem.

Regards,
Dawid


Re: problem with get/setBoost of document fields

2004-09-23 Thread Bastian Grimm [Eastbeam GmbH]
thanks for your reply, Erik.
So am I right that it's not possible to change the boost without 
reindexing all files? That's not good... or is it OK to just change the 
boosts and then optimize the index for the changes to take effect?

If not, will I be able to boost those fields in the searcher?
thanks, Bastian
-
The boost is not thrown away, but rather combined with the length 
normalization factor during indexing.  So while your actual boost value 
is not stored directly in the index, it is taken into consideration for 
scoring appropriately.



RE: MultiSearcher + Sort

2004-09-23 Thread Wermus Fernando
Karthik,
I have a kind of similar problem. Try the following: when you
create a field, don't use Field(String), instead use Field(String, int),
where int is a constant for the field's type. Maybe this could help.

-Mensaje original-
De: Karthik N S [mailto:[EMAIL PROTECTED] 
Enviado el: Jueves, 23 de Septiembre de 2004 06:42 a.m.
Para: LUCENE
Asunto: MultiSearcher + Sort



Guys


Apologies


Am I doing something wrong, or is there a bug in Lucene on the Linux OS when
using MultiSearcher with Sort?

Please reply ASAP.

Tested with both lucene-1.4-final.jar and lucene-1.4.1.jar.

hits = multiSearcher.search(query, sortField);


Exception raised on the Linux OS only [on Windows it works perfectly].




RE: Questions related to closing the searcher

2004-09-23 Thread Aviran
The best way is to use IndexReader's getCurrentVersion() method to check
whether the index has changed. If it has, just get a new Searcher
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#getCurrentVersion(java.lang.String)

Aviran
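A minimal sketch of that approach, assuming Lucene 1.4; the surrounding class, field names, and index directory are hypothetical:

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class SearcherHolder {
    private long lastVersion = -1;
    private IndexSearcher searcher;

    // Reopen the searcher only when the index version has changed,
    // closing the old one so its reader's resources are released.
    public synchronized IndexSearcher getSearcher(String indexDir) throws IOException {
        long current = IndexReader.getCurrentVersion(indexDir);
        if (searcher == null || current != lastVersion) {
            if (searcher != null)
                searcher.close();  // release the old reader before replacing it
            searcher = new IndexSearcher(indexDir);
            lastVersion = current;
        }
        return searcher;
    }
}
```

Closing the previous searcher before dropping the reference, rather than relying on System.gc(), is also what keeps memory from growing across iterations.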

-Original Message-
From: Edwin Tang [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 22, 2004 11:38 AM
To: [EMAIL PROTECTED]
Subject: Fwd: Questions related to closing the searcher


Hello,

In my testing, it seems that if the searcher (in my
case a ParallelMultiSearcher) is not closed, it will
not pick up any new data that has been added to the
index since it was opened. I'm wondering if this is a
correct statement.

Assuming the above is true, I went about closing the
searcher with searcher.close(), then setting both the
searcher and the QueryParser to null, then doing a
System.gc(). The application sleeps for a set period
of time, then resumes to process another batch of
queries against the index. When the application
resumes, the following method is run:

/**
 * Creates a {@link ParallelMultiSearcher} and {@link QueryParser} if they
 * do not already exist.
 *
 * @return 0 if successful or the objects already exist; -1 if failed.
 */
private int getSearcher() {
    Analyzer analyzer;
    IndexSearcher[] searchers;
    int iReturn;
    Vector vector;

    if (logger.isDebugEnabled())
        logger.debug("Entering getSearcher()");

    if (searcher == null || parser == null) {
        analyzer = new CIAnalyzer(utility.sStopWordsFile);
        try {
            vector = new Vector();
            if (utility.bSearchAMX)
                vector.add(new IndexSearcher(utility.amxIndexDir));
            if (utility.bSearchCOMTEX)
                vector.add(new IndexSearcher(utility.comtexIndexDir));
            if (utility.bSearchDJNW)
                vector.add(new IndexSearcher(utility.djnwIndexDir));
            if (utility.bSearchMoreover)
                vector.add(new IndexSearcher(utility.moreoverIndexDir));
            searchers = (IndexSearcher[]) vector.toArray(new IndexSearcher[vector.size()]);
            searcher = new ParallelMultiSearcher(searchers);
            parser = new QueryParser("body", analyzer);
            iReturn = 0;
        } catch (IOException ioe) {
            logger.error("Error creating searcher", ioe);
            iReturn = -1;
        } catch (Exception e) {
            logger.error("Unexpected error while creating searcher", e);
            iReturn = -1;
        }
    } else {
        iReturn = 0;
    }

    if (logger.isDebugEnabled())
        logger.debug("Exiting getSearcher() with " + iReturn);

    return iReturn;
} // End method getSearcher()

This seems to get me around the problem where the
searcher was not picking up new data from the index.
However, I run out of memory after 8 iterations of
processing a batch of queries, sleeping, processing
another batch, sleeping, and so on.

I'm probably missing something completely obvious, but
I'm just not seeing it. Can someone please tell me
what I'm doing wrong?

Thanks,
Ed









Re: Strange search results with wildcard - Bug?

2004-09-23 Thread Ulrich Mayring
Erik Hatcher wrote:
Look at AnalysisDemo referred to here:
http://wiki.apache.org/jakarta-lucene/AnalysisParalysis
Keep in mind that phrase queries do not support wildcards - they are 
analyzed and any wildcard characters are likely stripped and cause 
tokens to split.
Ok, I did all that and identified a basic case:
If the user searches for "007001 handle", the MultiFieldQueryParser, 
which searches in the fields "title" and "contents", changes that query to:

(title:007001 +title:handl) (contents:007001 +contents:handl)

So it actually has nothing to do with the wildcard; the problem comes 
from the + modifier - where does it originate? Obviously, this way I can 
never find a document without the term "handle" but with the number 007001.

Kind regards,
Ulrich


Re: problem with get/setBoost of document fields

2004-09-23 Thread Doug Cutting
You can change field boosts without re-indexing.
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#setNorm(int,%20java.lang.String,%20byte)
Doug
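A hedged sketch of what that might look like in Lucene 1.4; the index directory, field name, and boost value are invented for illustration. Note that setNorm() overwrites the stored norm, which already combines the index-time boost with the length normalization factor:

```java
import org.apache.lucene.index.IndexReader;

public class Reboost {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("index");  // hypothetical index directory
        // Set a new boost for the "title" field of every live document
        // without touching the postings, i.e. without reindexing.
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (!reader.isDeleted(i))
                reader.setNorm(i, "title", 2.0f);
        }
        reader.close();
    }
}
```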
Bastian Grimm [Eastbeam GmbH] wrote:
thanks for your reply, eric.
so i am right that its not possible to change the boost without 
reindexing all files? thats not good... or is it ok only to change the 
boosts an optimize the index to take changes effecting the index?

if not, will i be able to boost those fields in the searcher?
thanks, bastian
-
The boost is not thrown away, but rather combined with the length 
normalization factor during indexing.  So while your actual boost value 
is not stored directly in the index, it is taken into consideration for 
scoring appropriately.



Re: Strange search results with wildcard - Bug?

2004-09-23 Thread Ulrich Mayring
Ulrich Mayring wrote:
If the user searches for 007001 handle, the MultiFieldQueryParser, 
which searches in the fields title and contents, changes that query to:

(title:007001 +title:handl) (contents:007001 +contents:handl)
Ok, I cleared this up, there was some invisible magic going on in the 
code, sorry for the inconvenience. Anyway:

field1:foo field2:bar AND field3:true
turns into
field1:foo +field2:bar +field3:true
If I lose the AND and use a + instead, then everything works as 
expected. Now, is this a bug or a feature that I haven't quite grasped? :)

Ulrich
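One way to see what the parser is doing is simply to print the parsed query. This is a sketch against the Lucene 1.4 QueryParser, with the field names taken from the example above:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class ShowParse {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("field1", new StandardAnalyzer());
        Query q = parser.parse("field1:foo field2:bar AND field3:true");
        // The AND binds to both neighbouring clauses, which is why
        // field2:bar and field3:true each pick up a '+' in the output.
        System.out.println(q.toString("field1"));
    }
}
```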


Re: Clustering lucene's results

2004-09-23 Thread Andrzej Bialecki
Dawid Weiss wrote:
Hi William,
No, I don't have examples because I never used Lucene directly. If you 
provide me with a sample index and an API that executes a query on this 
index (I need document titles, summaries, or snippets and an anchor 
(identifier), can be an URL).
Hi Dawid :-)
I believe the approach to this component should be that you first 
initialize it by reading a mapping of Lucene index field names to 
logical names (metadata) like title, url, body, etc. The reason is 
that each index uses its own metadata schema, i.e. in Lucene-speak, the 
field names.

Moreover, when you execute a query you get just a document id plus its 
score. It's up to you to build a snippet. There is a code in the 
jakarta-lucene-sandbox CVS repo. (highlighter) to create snippets from 
the query and the hit list, take a look at this...

Send me such a snippet and I'll try to write the integration code with 
Lucene. It is only a matter of writing a simple InputComponent instance 
and this is really trivial (see Nutch's plugin code).
The basic usage scenario is that you open the IndexReader (either using 
directory name as a String or a Directory instance), and then create a 
Query instance, usually using QueryParser, and finally you search using 
IndexSearcher. You get a list of Hits, which you can use to get scores, 
and the contents of the documents. Take a look at the IndexFiles and 
SearchFiles classes in org.apache.lucene.demo package (under /src/demo).

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)


Re: demo HTML parser question

2004-09-23 Thread roy-lucene-user
Hi Fred,

We were originally attempting to use the demo HTML parser (Lucene 1.2), but as
you know, it's for a demo.  I think it's threaded to optimize for time, to allow
the calling thread to grab the title or top message even before it's done
parsing the entire HTML document.  That's just a guess; I would love to hear
from others about this.  Anyway, since it is a separate thread, a token error
can kill it and there is no way for the calling thread to know about it.

We had to create our own HTML parser, since we only cared about grabbing the
entire text from the HTML document and we also wanted to avoid the extra
thread.  We also do a lot of SKIPping for minimal EOF errors (HTML documents
in email almost never follow standards).  For your HTML needs, you might want
to check out other JavaCC HTML parsers on the JavaCC web site.

Roy.

On Wed, 22 Sep 2004 22:42:55 -0400, Fred Toth wrote
 Hi,
 
 I've been working with the HTML parser demo that comes with
 Lucene and I'm trying to understand why it's multi-threaded,
 and, more importantly, how to exit gracefully on errors.
 
 I've discovered if I throw an exception in the front-end static
 code (main(), etc.), the JVM hangs instead of exiting. Presumably
 this is because there are threads hanging around doing something.
 But I'm not sure what!
 
 Any pointers? I just want to exit gracefully on an error such as
 a required meta tag is missing or similar.
 
 Thanks,
 
 Fred
 








compiling 1.4 source

2004-09-23 Thread roy-lucene-user
Hi guys,

So we started upgrading to 1.4, and we need to add some of our own custom code.
After compiling with ant, I noticed that the 1.4 ant script builds a jar
called lucene-1.5-rc1-dev.jar, not lucene-1.4-final.jar.  I'm pretty sure I
did not download the wrong source.  Is this just a wrong name in the
properties, or does the source code actually contain Lucene 1.5 rc1 code?

Roy.




Re: compiling 1.4 source

2004-09-23 Thread Erik Hatcher
If you obtained the 1.4.1 source distribution, then you're fine, and 
it's simply an issue with the properties.  We keep the properties set to 
the _next_ version of Lucene (or a beta/rc version label) to avoid 
having the CVS HEAD codebase build under a release label when it is very 
likely not the same.

If you obtained the source from CVS HEAD, you're using code that has 
been greatly modified since the 1.4.1 release.

Erik
On Sep 23, 2004, at 12:13 PM, [EMAIL PROTECTED] wrote:
Hi guys,
So we started upgrading to 1.4 and we need to add some of our own 
custom code.
 After compiling with ant, I noticed that the 1.4 ant script builds a 
jar
called lucene-1.5-rc1-dev.jar, not lucene-1.4-final.jar.  I'm pretty 
sure I
did not download the wrong source.  Is this just a wrong name in the
properties or does the source code actually contain lucene 1.5 rc1 
code?

Roy.


Re: Clustering lucene's results

2004-09-23 Thread Dawid Weiss
Hi Andrzej :)
Yep, OK, I'll take a look at it after I come back from abroad (next 
week). I just wanted to save myself some time and have some 
already-written code that fetches the information we need for 
clustering; you know what I mean, I'm sure. But I'll start from scratch 
when I get back.

D.
Andrzej Bialecki wrote:
Dawid Weiss wrote:
Hi William,
No, I don't have examples because I never used Lucene directly. If you 
provide me with a sample index and an API that executes a query on 
this index (I need document titles, summaries, or snippets and an 
anchor (identifier), can be an URL).

Hi Dawid :-)
I believe the approach to this component should be that you first 
initialize it by reading a mapping of Lucene index field names to 
logical names (metadata) like title, url, body, etc. The reason is 
that each index uses its own metadata schema, i.e. in Lucene-speak, the 
field names.

Moreover, when you execute a query you get just a document id plus its 
score. It's up to you to build a snippet. There is a code in the 
jakarta-lucene-sandbox CVS repo. (highlighter) to create snippets from 
the query and the hit list, take a look at this...

Send me such a snippet and I'll try to write the integration code with 
Lucene. It is only a matter of writing a simple InputComponent 
instance and this is really trivial (see Nutch's plugin code).

The basic usage scenario is that you open the IndexReader (either using 
directory name as a String or a Directory instance), and then create a 
Query instance, usually using QueryParser, and finally you search using 
IndexSearcher. You get a list of Hits, which you can use to get scores, 
and the contents of the documents. Take a look at the IndexFiles and 
SearchFiles classes in org.apache.lucene.demo package (under /src/demo).



Power Point Processing

2004-09-23 Thread Zhang, Lisheng
Hi,

Does anyone know a good tool for converting MS PowerPoint
files (*.ppt) into plain text so we can use Lucene to index them?

I looked at Jakarta POI and only see that Word and Excel documents
can be processed; some JavaDoc pages mention ppt, but the
status is not clear to me.

Thanks very much for helps, Lisheng




Re: Clustering lucene's results

2004-09-23 Thread William W
Hi Dawid,
The demos (under /src/demo) are very good. They have the basic usage 
scenario.
Thanks Andrzej.
William.


Dawid Weiss wrote:
Hi William,
No, I don't have examples because I never used Lucene directly. If you 
provide me with a sample index and an API that executes a query on this 
index (I need document titles, summaries, or snippets and an anchor 
(identifier), can be an URL).
Hi Dawid :-)
I believe the approach to this component should be that you first 
initialize it by reading a mapping of Lucene index field names to logical 
names (metadata) like title, url, body, etc. The reason is that each index 
uses its own metadata schema, i.e. in Lucene-speak, the field names.

Moreover, when you execute a query you get just a document id plus its 
score. It's up to you to build a snippet. There is a code in the 
jakarta-lucene-sandbox CVS repo. (highlighter) to create snippets from the 
query and the hit list, take a look at this...

Send me such a snippet and I'll try to write the integration code with 
Lucene. It is only a matter of writing a simple InputComponent instance 
and this is really trivial (see Nutch's plugin code).
The basic usage scenario is that you open the IndexReader (either using 
directory name as a String or a Directory instance), and then create a 
Query instance, usually using QueryParser, and finally you search using 
IndexSearcher. You get a list of Hits, which you can use to get scores, and 
the contents of the documents. Take a look at the IndexFiles and 
SearchFiles classes in org.apache.lucene.demo package (under /src/demo).

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)


Re: demo HTML parser question

2004-09-23 Thread Doug Cutting
[EMAIL PROTECTED] wrote:
We were originally attempting to use the demo HTML parser (Lucene 1.2), but as
you know, it's for a demo.  I think it's threaded to optimize on time, to allow
the calling thread to grab the title or top message even though it's not done
parsing the entire HTML document.
That's almost right.  I originally wrote it that way to avoid having to 
ever buffer the entire text of the document.  The document is indexed 
while it is parsed.  But, as observed, this has lots of problems and was 
probably a bad idea.

Could someone provide a patch that removes the multi-threading?  We'd 
simply use a StringBuffer in HTMLParser.jj to collect the text.  Calls 
to pipeOut.write() would be replaced with text.append().  Then have the 
HTMLParser's constructor parse the page before returning, rather than 
spawn a thread, and getReader() would return a StringReader.  The public 
API of HTMLParser need not change at all and lots of complex threading 
code would be thrown away.  Anyone interested in coding this?
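
A rough sketch of the single-threaded shape Doug describes — not the real HTMLParser.jj (which is JavaCC-generated), just the pattern: the constructor collects all text into a StringBuffer (where pipeOut.write() calls used to go), and getReader() hands back a StringReader, so the public API is unchanged and no threads are involved. The trivial tag-stripper here stands in for the generated parser:

```java
import java.io.Reader;
import java.io.StringReader;

// Illustration only: a naive tag-stripper in place of the JavaCC parser.
// The actual change would replace pipeOut.write() inside HTMLParser.jj
// with text.append(), parse fully in the constructor, and return a
// StringReader from getReader().
public class SimpleHtmlText {
    private final StringBuffer text = new StringBuffer();

    public SimpleHtmlText(String html) {
        // Parse eagerly: everything outside <...> goes into the buffer.
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') inTag = true;
            else if (c == '>') inTag = false;
            else if (!inTag) text.append(c);
        }
    }

    // Public API stays the same: callers still receive a Reader.
    public Reader getReader() {
        return new StringReader(text.toString());
    }

    public static void main(String[] args) throws Exception {
        Reader r = new SimpleHtmlText("<p>Hello <b>world</b></p>").getReader();
        StringBuffer out = new StringBuffer();
        int ch;
        while ((ch = r.read()) != -1) out.append((char) ch);
        System.out.println(out);  // Hello world
    }
}
```

The cost Doug mentions is the one visible here: the whole document text is buffered in memory before indexing, instead of being streamed while parsing.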

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Clustering lucene's results

2004-09-23 Thread Dawid Weiss
yeah... I know there have to be demos... I tried to be lazy, you know :)
Anyway, as I told Andrzej -- I'll take a look at it (and with a 
pleasure) after I come back. I don't think the delay will matter much. 
And if it does, ask Andrzej -- he has excellent experience with both 
projects -- he's just very shy by nature and doesn't talk much, hehe.

D.
William W wrote:
Hi Dawid,
The demos (under /src/demo) are very good. They have the basic usage 
scenario.
Thanks Andrzej.
William.


Dawid Weiss wrote:
Hi William,
No, I don't have examples because I've never used Lucene directly. But 
if you provide me with a sample index and an API that executes a query 
on this index, I can work from that (I need document titles, summaries 
or snippets, and an anchor (identifier), which can be a URL).

Hi Dawid :-)
I believe the approach to this component should be that you first 
initialize it by reading a mapping of Lucene index field names to 
logical names (metadata) like title, url, body, etc. The reason is 
that each index uses its own metadata schema, i.e. in Lucene-speak, 
the field names.

Moreover, when you execute a query you get just a document id plus its 
score. It's up to you to build a snippet. There is code in the 
jakarta-lucene-sandbox CVS repo (the highlighter) to create snippets from 
the query and the hit list; take a look at it.

Send me such a snippet and I'll try to write the integration code 
with Lucene. It is only a matter of writing a simple InputComponent 
instance and this is really trivial (see Nutch's plugin code).

The basic usage scenario is that you open the IndexReader (either 
using a directory name as a String or a Directory instance), and then 
create a Query instance, usually using QueryParser, and finally you 
search using IndexSearcher. You get a list of Hits, which you can use 
to get scores, and the contents of the documents. Take a look at the 
IndexFiles and SearchFiles classes in org.apache.lucene.demo package 
(under /src/demo).

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Document contents split among different Fields

2004-09-23 Thread Greg Langmead
I am working on extending Lucene to support documents with special islands
of an XML language, and I want to index the islands differently from the
text.  My current plan is to break the document's contents into two Fields,
one with all the text and one with all the special islands, and use a
different Analyzer on each Field.

In heading down this road, I realized that this approach breaks the whole
model of Token as it supports highlighting.  Token seems designed to store
offsets within a given Field, so if you break a document up into pieces, the
offsets are meaningless in terms of the original source document.

Am I right in saying that the design of Token's support for highlighting
really only supports having the entire document stored as one monolithic
contents Field?  Has anyone tackled indexing multiple content Fields
before that could shed some light?

Thanks,
Greg Langmead
Design Science, Inc., How Science Communicates
http://www.dessci.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Document contents split among different Fields

2004-09-23 Thread Doug Cutting
Greg Langmead wrote:
Am I right in saying that the design of Token's support for highlighting
really only supports having the entire document stored as one monolithic
contents Field?
No, I don't think so.
Has anyone tackled indexing multiple content Fields
before that could shed some light?
Do you need highlights from all fields?  If so, then you can use:
  TextFragment[] getBestTextFragments(TokenStream, ...);
with a TokenStream for each field, then select the highest scoring 
fragments across all fields.  Would that work for you?

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


RE: Document contents split among different Fields

2004-09-23 Thread Greg Langmead
Doug Cutting wrote:
 Do you need highlights from all fields?  If so, then you can use:
 
TextFragment[] getBestTextFragments(TokenStream, ...);
 
 with a TokenStream for each field, then select the highest scoring 
 fragments across all fields.  Would that work for you?

Thanks for the reply.  I can't find code like this in the lucene or
lucene-demo packages -- is this something implemented, or did you mean it as
an example?

Once I get a text fragment, are you proposing using it to do a secondary
search within the source document, to match the fragment?

I would like to do highlighting on content from either of my Fields, but I
think that even if I didn't I'd have the same problem, because I'll have
punched holes in the text Field and the positional data within the Field no
longer reflects the position in the source.

I think that if I want to pick the document apart into pieces like this,
then I need to do some work to restore global positional data, by
squirreling away the size of the holes I punch (the size of the XML islands,
from the text Field's point of view, and the size of the text runs, from the
island Field's point of view).  If I store a special textual escape within
the Field data that records the length of each gap, then I can read those
escapes when Tokenizing the Field and add the number stored therein to the
Token offset, restoring the global positional data.  Does that make sense?
I'm concerned this does violence to Lucene's model, which I've only been
studying for a couple of weeks now.
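
To make the bookkeeping concrete, here is one way the idea could work (the representation is invented for illustration: each segment kept in the field records where it starts in the field value and how many source characters were punched out before it, and that amount is added back to a token's offset):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of restoring global offsets after splitting a document into a
// "text" field and an "island" field. Each kept segment remembers its
// start within the field value and the total size of the holes punched
// before it in the original source.
public class OffsetRestorer {
    private static class Segment {
        final int fieldStart;    // where this segment begins in the field value
        final int skippedBefore; // source chars removed before this segment
        Segment(int fieldStart, int skippedBefore) {
            this.fieldStart = fieldStart;
            this.skippedBefore = skippedBefore;
        }
    }

    private final List<Segment> segments = new ArrayList<Segment>();

    // Segments must be added in field order.
    public void addSegment(int fieldStart, int skippedBefore) {
        segments.add(new Segment(fieldStart, skippedBefore));
    }

    // Map a token offset within the field back to the source document.
    public int toGlobal(int fieldOffset) {
        int skipped = 0;
        for (Segment s : segments) {
            if (s.fieldStart <= fieldOffset) skipped = s.skippedBefore;
        }
        return fieldOffset + skipped;
    }

    public static void main(String[] args) {
        // Source: "abc<x/>def" -- the 4-char island <x/> is punched out,
        // so the text field holds "abcdef".
        OffsetRestorer r = new OffsetRestorer();
        r.addSegment(0, 0); // "abc" starts at field 0, nothing skipped yet
        r.addSegment(3, 4); // "def" starts at field 3, 4 source chars skipped
        System.out.println(r.toGlobal(1)); // 1 ('b' in the source)
        System.out.println(r.toGlobal(4)); // 8 ('e' in the source)
    }
}
```

Instead of textual escapes embedded in the field data, the gap table could equally be kept on the side, as long as the custom Tokenizer can consult it while emitting Tokens.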

Greg

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]