Re: Need help with filtering

2004-11-17 Thread Paul Elschot
On Wednesday 17 November 2004 01:20, Edwin Tang wrote:
 Hello,
 
 I have been using DateFilter to limit my search results to a certain date
 range. I am now asked to replace this filter with one where my search
 results have document IDs greater than a given document ID. This document
 ID is assigned during indexing and is a Keyword field.
 
 I've browsed around the FAQs and archives and see that I can either use
 QueryFilter or BooleanQuery. I've tried both approaches to limit the
 document ID range, but am getting the BooleanQuery.TooManyClauses exception
 in both cases. I've also tried bumping the max number of clauses via
 setMaxClauseCount(), but that number has gotten pretty big.
 
 Is there another approach to this? ...

Recoding DateFilter to a DocumentIdFilter should be straightforward.

The trick is to use only one document enumerator at a time for all
terms. Document enumerators take buffer space, and that is the
reason why BooleanQuery has an exception for too many clauses.
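A minimal sketch of such a filter, modeled on DateFilter in the Lucene 1.4 API
(the class name DocumentIdFilter is hypothetical, and exact API details may
differ by version):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.Filter;

// Accepts documents whose keyword field is lexicographically greater than
// or equal to a given value. Like DateFilter, it walks the terms with a
// single TermEnum and reuses one TermDocs enumerator, so it never trips
// BooleanQuery.TooManyClauses. Note that keyword doc IDs compare as
// strings, so they should be zero-padded to a fixed width.
public class DocumentIdFilter extends Filter {
  private final String field;
  private final String lowerId;

  public DocumentIdFilter(String field, String lowerId) {
    this.field = field;
    this.lowerId = lowerId;
  }

  public BitSet bits(IndexReader reader) throws IOException {
    BitSet bits = new BitSet(reader.maxDoc());
    TermEnum enumerator = reader.terms(new Term(field, lowerId));
    TermDocs termDocs = reader.termDocs();
    try {
      do {
        Term term = enumerator.term();
        if (term == null || !term.field().equals(field)) break;
        termDocs.seek(term);              // one enumerator, re-seeked per term
        while (termDocs.next()) {
          bits.set(termDocs.doc());
        }
      } while (enumerator.next());
    } finally {
      termDocs.close();
      enumerator.close();
    }
    return bits;
  }
}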

Regards,
Paul





Re: COUNT SUBINDEX [IN MERGERINDEX]

2004-11-17 Thread Paul Elschot
On Wednesday 17 November 2004 07:10, Karthik N S wrote:
 Hi guys
 
 
 Apologies.
 
 
   So a merged index is again a single index [an addition of subIndexes...].
 
  In that case, if one of the field types is of type 'Field.Keyword'
 which is unique across the subIndexes [before merging],
 
  and if I want to count this unique field in the merged index [after it's been
 merged], how do I do this, please?

IndexReader.numDocs() will give the number of docs in an index.

Lucene has no direct support for unique fields. After merging, if the
same unique field value occurs in both source indexes, the merged
index will contain two documents with that value.
If you want the merged index to keep the field values unique, the duplicated
values in one of the source indexes need to be deleted before merging.

See IndexReader.termDocs(term) on how to get the document numbers
for (unique) terms via a TermDocs, and IndexReader.delete(docNum)
for deleting docs.
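A rough sketch of that delete-before-merge step, assuming the unique keyword
field is called "filename" (the field name, index path, and how you obtain the
duplicated value are all hypothetical):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class DeleteDuplicates {
  // Delete every doc in one source index whose unique keyword value
  // also occurs in the other source index, then merge.
  public static void deleteByValue(String indexPath, String value)
      throws IOException {
    IndexReader reader = IndexReader.open(indexPath);
    TermDocs termDocs = reader.termDocs(new Term("filename", value));
    while (termDocs.next()) {
      reader.delete(termDocs.doc()); // IndexReader.delete(int) in Lucene 1.4
    }
    termDocs.close();
    reader.close(); // commits the deletions
  }
}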

Regards,
Paul.





Re: Index Locking Issues Resolved...I hope

2004-11-17 Thread jeichels

I was thinking that perhaps I can pre-stem words before sticking them in a
search field in the database, perhaps using Lucene stemming code, then try to
use the Natural Language Search found in MySQL 4.1.1.   I am confident the
MySQL product can't keep up with Lucene yet, but at least they have improved it
some.  Not even sure if my hosting company will upgrade to 4.1.1 though.  Still
looking for a lot of solutions to make Lucene sit in sync more nicely with
MySQL as the main database... aka an easy-to-use way of handling



- Original Message -
From: Chris Lamprecht [EMAIL PROTECTED]
Date: Wednesday, November 17, 2004 1:38 am
Subject: Re: Index Locking Issues Resolved...I hope

 MySQL does offer a basic fulltext search (with MyISAM tables), but it
 doesn't really approach the functionality of Lucene, such as pluggable
 tokenizers, stemming, etc.  I think MS SQL server has fulltext search
 as well, but I have no idea if it's any good.
 
 See 
 http://www.google.com/search?hl=en&lr=&safe=off&c2coff=1&q=mysql+fulltext
  I have not looked at it closely yet because it is all new.   I wish a
 database Text field could have this sort of mechanism built into
 it.   MySQL does not do this (which is what I am using), but I am going to
 check into other databases now.  OJB will work with almost all of
 them, so that would help if there is a database type of solution
 that will allow that sleep-at-night thing to happen!!!
 
 
 
 





RE: COUNT SUBINDEX [IN MERGERINDEX]

2004-11-17 Thread Karthik N S
Hi Guys


Apologies..

I am still confused.. ;(


Let me ask a simpler question:


   When searching an index without any search word, I would like to
count the total number of documents present in it.

   [ I only have the field type 'Field.Keyword', which stores the unique
filename ]

   Will IndexReader.termDocs(term) give me that count?
   If so, how do I use it... please.
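For the total count, IndexReader.numDocs() is all that is needed; termDocs(term)
only enumerates the docs for one particular term. A minimal sketch (the index
path is hypothetical):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

public class CountDocs {
  public static void main(String[] args) throws IOException {
    IndexReader reader = IndexReader.open("/path/to/mergedIndex");
    System.out.println(reader.numDocs()); // total docs, not counting deletions
    reader.close();
  }
}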

  Thx in advance.
Karthik



-Original Message-
From: Paul Elschot [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 17, 2004 2:02 PM
To: [EMAIL PROTECTED]
Subject: Re: COUNT SUBINDEX [IN MERGERINDEX]


On Wednesday 17 November 2004 07:10, Karthik N S wrote:
 Hi guys


 Apologies.


   So a merged index is again a single index [an addition of subIndexes...].

  In that case, if one of the field types is of type 'Field.Keyword'
 which is unique across the subIndexes [before merging],

  and if I want to count this unique field in the merged index [after it's
 been merged], how do I do this, please?

IndexReader.numDocs() will give the number of docs in an index.

Lucene has no direct support for unique fields. After merging, if the
same unique field value occurs in both source indexes, the merged
index will contain two documents with that value.
If you want the merged index to keep the field values unique, the duplicated
values in one of the source indexes need to be deleted before merging.

See IndexReader.termDocs(term) on how to get the document numbers
for (unique) terms via a TermDocs, and IndexReader.delete(docNum)
for deleting docs.

Regards,
Paul.







tool to check the index field

2004-11-17 Thread lingaraju
HI ALL

I have an index file created by other people.
Now I want to know how many fields there are in the index.
Is there any third-party tool to do this?
I saw a GUI tool for this somewhere but forgot the name.

Regards
LingaRaju 




RE: tool to check the index field

2004-11-17 Thread Viparthi, Kiran (AFIS)
Try using : 

Luke : http://www.getopt.org/luke/
Limo : http://limo.sourceforge.net/

Regards,
Kiran.


-Original Message-
From: lingaraju [mailto:[EMAIL PROTECTED] 
Sent: 17 November 2004 16:00
To: Lucene Users List
Subject: tool to check the index field


HI ALL

I have an index file created by other people.
Now I want to know how many fields there are in the index.
Is there any third-party tool to do this?
I saw a GUI tool for this somewhere but forgot the name.

Regards
LingaRaju 






RE: best ways of using IndexSearcher

2004-11-17 Thread Aviran
Yes, IndexSearcher is thread safe.

Aviran
http://www.aviransplace.com

-Original Message-
From: Abhay Saswade [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 16, 2004 15:16 PM
To: Lucene Users List
Subject: Re: best ways of using IndexSearcher


Hello,
Can I use a single instance of IndexSearcher in multiple threads with sorting?
Thanks, Abhay

- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, June 28, 2004 8:51 PM
Subject: Re: best ways of using IndexSearcher


 Anson,

 Use a single instance of IndexSearcher and, if you want to always
 'see' even the latest index changes (deletes and adds since you opened
 the IndexSearcher), make sure to re-create the IndexSearcher when you
 detect that the index version has changed (see
 http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#getCurrentVersion(org.apache.lucene.store.Directory))

 When you get the new IndexSearcher, leave the old instance alone - let 
 the GC take care of it, and don't call close() on it, in case 
 something in your application is still using that instance.

 This stuff is not really CPU intensive.  Disk I/O tends to be the 
 bottleneck.  If you are working with multiple indices, spread them 
 over multiple disks (not just partitions, real disks), if you can.
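A minimal sketch of that re-create-on-version-change pattern (SearcherHolder is
a hypothetical name; IndexReader.getCurrentVersion and the IndexSearcher(String)
constructor are from the Lucene 1.4 API):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class SearcherHolder {
  private final String indexPath;
  private IndexSearcher searcher;
  private long version;

  public SearcherHolder(String indexPath) throws IOException {
    this.indexPath = indexPath;
    this.searcher = new IndexSearcher(indexPath);
    this.version = IndexReader.getCurrentVersion(indexPath);
  }

  // Reuse one searcher; replace it only when the index version changes.
  public synchronized IndexSearcher getSearcher() throws IOException {
    long current = IndexReader.getCurrentVersion(indexPath);
    if (current != version) {
      // Don't close the old searcher; other threads may still be using
      // it. Let the garbage collector reclaim it.
      searcher = new IndexSearcher(indexPath);
      version = current;
    }
    return searcher;
  }
}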

 Otis


 --- Anson Lau [EMAIL PROTECTED] wrote:
  Hi Guys,
 
  What's the recommended way of using IndexSearcher? Should
  IndexSearcher be a singleton or pooled?  Would pooling provide a
  more scalable solution by allowing you to decide how many
  IndexSearchers to use based on, say, how many CPUs you have on your
  server?
 
  Thanks,
 
  Anson












Re: Need help with filtering

2004-11-17 Thread Edwin Tang
Ah... recoding DateFilter. I will look into this today. Thanks for the help.

Ed

--- Paul Elschot [EMAIL PROTECTED] wrote:

 On Wednesday 17 November 2004 01:20, Edwin Tang wrote:
  Hello,
  
  I have been using DateFilter to limit my search results to a certain date
  range. I am now asked to replace this filter with one where my search
  results have document IDs greater than a given document ID. This document
  ID is assigned during indexing and is a Keyword field.
  
  I've browsed around the FAQs and archives and see that I can either use
  QueryFilter or BooleanQuery. I've tried both approaches to limit the
  document ID range, but am getting the BooleanQuery.TooManyClauses
  exception in both cases. I've also tried bumping the max number of clauses
  via setMaxClauseCount(), but that number has gotten pretty big.
  
  Is there another approach to this? ...
 
 Recoding DateFilter to a DocumentIdFilter should be straightforward.
 
 The trick is to use only one document enumerator at a time for all
 terms. Document enumerators take buffer space, and that is the
 reason why BooleanQuery has an exception for too many clauses.
 
 Regards,
 Paul
 
 
 
 









Re: Whitespace Analyzer not producing expected search results

2004-11-17 Thread lee . a . carroll


Thanks for the suggestions, Erik. Displaying the query string is really
useful, and this is what I've found.

I issue a search using the search term

ResponseHelper.writeNoCachingHeaders\(response\);

The search is parsed using a query parser and produces the following query
string

+contents:ResponseHelper.writeNoCachingHeaders(response);

This looks good and finds two documents

I then try a search using the term

ResponseHelper.writeNoCachingHeaders\(*\);

Now I'm expecting this to be a wider search term, so it should find at
least two docs, possibly more?

The query parser produces the query

+contents:responsehelper.writenocachingheaders(*);

Wow, the query has lost its case and no docs get returned.

Why does the query parser do this (my analyzer is the provided whitespace
one)?

Any ideas to get around this ?

Thanks Lee C


Try using a TermQuery instead of QueryParser to see if you get the
results you expect.  Exact case matters.

Also, when troubleshooting issues with QueryParser, it is helpful to
see what the actual Query returned is - try displaying its toString
output.

 Erik

On Nov 16, 2004, at 6:25 AM, [EMAIL PROTECTED] wrote:

 Hi,

  We have indexed a set of web files (jsp, js, xslt, java properties,
  and html) using the Lucene Whitespace Analyzer.
  The purpose is to allow developers to find where code / functions are
  used and defined across a large and disparate
  content management repository, hopefully to aid code re-use, easier
  refactoring and standards control.

  However, when a query parser search is made using a whitespace analyser
  with a string known to be in an indexed file, the search returns zero
  hits.

  For example, the string  jsp\:include page=\/path1/path2/path3/path4/file1.jsp\ /
  is searched for using the query parser (escaping the meta-chars), and an
  indexed document which contains the following text should be found?

  // include HTML head
 %
  jsp:include page=/path1/path2/path3/path4/file1.jsp /

  script language=JavaScript src
 =/path1/path2/path3/file1.js/script
  !-- script

  I've taken a look at the FAQ advice regarding checking the effects of
  an analyser (in our case whitespace), but our test class returns the
  expected tokens for any given token stream. For example, this string
  % mytoken1 mytoken2 % is tokenised by the whitespace analyzer as
  [%] [mytoken1] [mytoken2] [%].

  I'm sure I've missed something but I can't see what it is. If anyone
  could shed any light on possible reasons why we are getting zero hits
  for text strings which are in our indexed files, I'd be really
  grateful. See below for more info on the index and search set-up.

 Thanks a lot Lee C

  File contents are in a tokenised, indexed, not stored field.
  The index uses the whitespace analyzer which comes with Lucene.

  Searches are performed using a boolean query. The boolean query is
  made up of a query parser which gets its search term from an HTML
  text box entered by the user, and a prefix query which is used to
  limit search scope by directory paths.
  The search uses a whitespace analyzer; no filtering takes place.







Re: tool to check the index field

2004-11-17 Thread Luke Shannon
Try this:

http://www.getopt.org/luke/

Luke
- Original Message - 
From: lingaraju [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 17, 2004 10:00 AM
Subject: tool to check the index field


 HI ALL
 
 I have an index file created by other people.
 Now I want to know how many fields there are in the index.
 Is there any third-party tool to do this?
 I saw a GUI tool for this somewhere but forgot the name.
 
 Regards
 LingaRaju 
 
 
 





Index copy

2004-11-17 Thread Ravi
What's the best way to copy an index from one directory to another? I
tried opening an IndexWriter at the new location and using addIndexes to
read from the old index, but that was very slow.

Thanks in advance,
Ravi.  






Re: Whitespace Analyzer not producing expected search results

2004-11-17 Thread Erik Hatcher
On Nov 17, 2004, at 7:44 AM, [EMAIL PROTECTED] wrote:
I then try a search using the term
ResponseHelper.writeNoCachingHeaders\(*\);
Now I'm expecting this to be a wider search term, so it should find at
least two docs, possibly more?
The query parser produces the query
+contents:responsehelper.writenocachingheaders(*);
Wow, the query has lost its case and no docs get returned.
Why does the query parser do this (my analyzer is the provided
whitespace one)?

Any ideas to get around this ?
Because generally terms are lowercased when indexed by the analyzer 
(but not in your case with WhitespaceAnalyzer), QueryParser defaults to 
lowercasing wildcarded queries.  Wildcard query terms are not analyzed.

To get around this, construct an instance of QueryParser and turn the 
lowercasing of wildcard terms off:

    QueryParser parser = new QueryParser(field, new StandardAnalyzer());
    parser.setLowercaseWildcardTerms(false);

Use the instance of QueryParser instead of the static parse method from 
now on.
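A usage sketch matching the original poster's setup (WhitespaceAnalyzer; the
field name "contents" is an assumption; parse() throws ParseException):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

QueryParser parser = new QueryParser("contents", new WhitespaceAnalyzer());
parser.setLowercaseWildcardTerms(false);
Query query = parser.parse("ResponseHelper.writeNoCachingHeaders\\(*\\);");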

Erik


Re: Whitespace Analyzer not producing expected search results

2004-11-17 Thread lee . a . carroll

Thanks a lot for the solution / explanation. Saved the day Erik.

Summary

Observation: Using a wildcarded search term with QueryParser and the
WhitespaceAnalyzer returned no hits when hits were expected.

Reason: This was caused by QueryParser's default behaviour of
lowercasing wildcarded search terms.

Resolution: Use an instance of QueryParser, setting the instance's
setLowercaseWildcardTerms to false.

Example:
QueryParser parser = new QueryParser(field, new StandardAnalyzer());
parser.setLowercaseWildcardTerms(false);



 Solution provided by Erik Hatcher




 






index document pdf

2004-11-17 Thread Miguel Angel
Hi, I am downloading PDFBox 0.6.4. What do I add to the source code of
Lucene's demo?

-- 
Miguel Angel Angeles R.
Asesoria en Conectividad y Servidores
Telf. 97451277




Re: WildcardTermEnum skipping terms containing numbers?!

2004-11-17 Thread Yonik Seeley
test








Re: WildcardTermEnum skipping terms containing numbers?!

2004-11-17 Thread Sanyi
Enumerating the terms using WildcardTermEnum and an IndexReader seems to be
too buggy to use. I'm now reimplementing my code using
WildcardTermEnum.wildcardEquals, which seems to be better so far.
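A sketch of that manual approach: walk every term in the field with a TermEnum
and test each against the pattern yourself. The index path and field name are
assumptions, and the wildcardEquals signature shown is from Lucene 1.4 and may
differ in other versions:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.WildcardTermEnum;

public class ManualWildcard {
  public static void main(String[] args) throws IOException {
    IndexReader reader = IndexReader.open("/path/to/index");
    // Start at the first term of the field and walk forward.
    TermEnum terms = reader.terms(new Term("contents", ""));
    do {
      Term t = terms.term();
      if (t == null || !t.field().equals("contents")) break;
      if (WildcardTermEnum.wildcardEquals("c?ca", 0, t.text(), 0)) {
        System.out.println(t.text()); // should now include c0ca as well
      }
    } while (terms.next());
    terms.close();
    reader.close();
  }
}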

--- Sanyi [EMAIL PROTECTED] wrote:

 Hi!
 
 I have following problem with 1.4.2:
 I'm searching for c?ca (using StandardAnalyzer) and one of the hits looks 
 something like this:
 blabla c0ca c0la etc.. etc...
 (those big o-s are zero characters)
 Now, I'm enumerating the terms using WildcardTermEnum and all I get is:
 
 caca
 ccca
 ceca
 cica
 coca
 crca
 csca
 cuca
 cyca
 
 It doesn't know about c0ca at all.
 Is there any solution to get around this problem?
 
 Thanks,
 Sanyi
 
 
   
 
 
 
 









Re: Index copy

2004-11-17 Thread Justin Swanhart
You could lock your index for writes, then copy the files using
operating system copy commands.

Another way would be to lock your index, make a filesystem snapshot,
then unlock your index.  You can then safely copy the snapshot without
interrupting further index operations.

On Wed, 17 Nov 2004 11:25:48 -0500, Ravi [EMAIL PROTECTED] wrote:
 What's the best way to copy an index from one directory to another? I
 tried opening an IndexWriter at the new location and using addIndexes to
 read from the old index, but that was very slow.
 
 Thanks in advance,
 Ravi.
 
 





Something missing !!???

2004-11-17 Thread abdulrahman galal
I noticed lately that a lot of people discuss the bugs of Lucene with each
other...

but something is missing... I consider Lucene an indexing tool for text
files and so on...

but there are a lot of tools that do this kind of indexing, like Access...
What about compression... compressing the original text files and their
indexes and performing indexing on them, like the (MG) system, which is
efficient in compression and indexing...

Where is all of that in Lucene? Please help me.
If these requirements are satisfied in Lucene, please notify me and send a
link to the new version...

Thanks a lot...



Re: Something missing !!???

2004-11-17 Thread Justin Swanhart
The HEAD version of CVS supports gz compression.  You will need to
check it out using cvs if you want to use it.


On Wed, 17 Nov 2004 21:43:36 +0200, abdulrahman galal [EMAIL PROTECTED] wrote:
 I noticed lately that a lot of people discuss the bugs of Lucene with each
 other...
 
 but something is missing... I consider Lucene an indexing tool for text
 files and so on...
 
 but there are a lot of tools that do this kind of indexing, like Access...
 
 What about compression... compressing the original text files and their
 indexes and performing indexing on them, like the (MG) system, which is
 efficient in compression and indexing...
 
 Where is all of that in Lucene? Please help me.
 
 If these requirements are satisfied in Lucene, please notify me and send a
 link to the new version...
 
 Thanks a lot...
 
 
 





version documents

2004-11-17 Thread Luke Shannon
Hey all;

I have run into an interesting case.

Our system has notes. These need to be indexed. They are xml files called 
default.xml and are easily parsed and indexed. No problem, have been doing it 
all week.

The problem is if someone edits the note, the system doesn't update the 
default.xml. It creates a new file, default_1.xml (every edit creates a new 
file with an incremented number; the system only displays the content from the 
highest number).

My problem is I index all the documents and end up with terms that were taken 
out of a note several versions ago still showing up in the query. From my point 
of view this makes sense, because the files are still in the content. But to a 
user it is confusing, because they have no idea every change they make to a note 
spawns a new file, and now they are seeing a term they removed from their note 2 
weeks ago showing up in a query.

I have started modifying my incremental update to look for multiple versions 
of the default.xml, but it is more work than I thought and is going to make 
things complex.

Maybe there is an easier way? If I just let it run and create the index, can 
somebody suggest a way I could easily scan the index folder, ensuring only the 
default.xml with the highest number in its filename remains (only for folders 
where there is more than one default.xml file)? Or is this wishful thinking?

Thanks,

Luke

mergeFactor

2004-11-17 Thread Ravi
Can somebody explain the difference between the parameters minMergeDocs
and mergeFactor in IndexWriter? When I read the documentation, it looks
like both of them represent the number of documents to be buffered before
they are merged into a new segment.

Thanks in advance,
Ravi.  
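The two control different stages. A tuning sketch under the Lucene 1.4 API,
where, if memory serves, both are public fields on IndexWriter (the path and
values here are arbitrary):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TuneWriter {
  public static void main(String[] args) throws IOException {
    IndexWriter writer = new IndexWriter("/path/to/index",
        new StandardAnalyzer(), true);
    // minMergeDocs: how many documents are buffered in RAM before being
    // flushed as a new on-disk segment.
    writer.minMergeDocs = 100;
    // mergeFactor: how many segments (per size tier) may accumulate
    // before they are merged into one larger segment.
    writer.mergeFactor = 10;
    writer.close();
  }
}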






Re: version documents

2004-11-17 Thread Justin Swanhart
Split the filename into basefilename and version and make each a keyword.

Sort your query by version descending, and only use the first
basefile you encounter.
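A sketch of that scheme (field names are hypothetical, and the searcher and
query are assumed to exist; Sort and SortField are from Lucene 1.4's sorting
API). Zero-padding the version keeps the string sort in numeric order:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// At index time: default_3.xml becomes basename + zero-padded version.
Document doc = new Document();
doc.add(Field.Keyword("basename", "default"));
doc.add(Field.Keyword("version", "0003"));

// At search time: newest versions first; keep only the first hit seen
// per basename while iterating.
Hits hits = searcher.search(query,
    new Sort(new SortField("version", SortField.STRING, true))); // true = descending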

On Wed, 17 Nov 2004 15:05:19 -0500, Luke Shannon
[EMAIL PROTECTED] wrote:
 Hey all;
 
  I have run into an interesting case.
 
  Our system has notes. These need to be indexed. They are xml files called 
  default.xml and are easily parsed and indexed. No problem, have been doing it 
  all week.
 
  The problem is if someone edits the note, the system doesn't update the 
  default.xml. It creates a new file, default_1.xml (every edit creates a new 
  file with an incremented number; the system only displays the content from 
  the highest number).
 
  My problem is I index all the documents and end up with terms that were taken 
  out of a note several versions ago still showing up in the query. From my 
  point of view this makes sense, because the files are still in the content. 
  But to a user it is confusing, because they have no idea every change they 
  make to a note spawns a new file, and now they are seeing a term they removed 
  from their note 2 weeks ago showing up in a query.
 
  I have started modifying my incremental update to look for multiple 
  versions of the default.xml, but it is more work than I thought and is going 
  to make things complex.
 
  Maybe there is an easier way? If I just let it run and create the index, can 
  somebody suggest a way I could easily scan the index folder, ensuring only the 
  default.xml with the highest number in its filename remains (only for folders 
  where there is more than one default.xml file)? Or is this wishful thinking?
 
 Thanks,
 
 Luke





Re: version documents

2004-11-17 Thread Luke Shannon
That is a good idea. Thanks!

- Original Message - 
From: Justin Swanhart [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 17, 2004 3:38 PM
Subject: Re: version documents


 Split the filename into basefilename and version and make each a
keyword.

 Sort your query by version descending, and only use the first
 basefile you encounter.

 On Wed, 17 Nov 2004 15:05:19 -0500, Luke Shannon
 [EMAIL PROTECTED] wrote:
  Hey all;
 
   I have run into an interesting case.
  
   Our system has notes. These need to be indexed. They are xml files
called default.xml and are easily parsed and indexed. No problem, have been
doing it all week.
  
   The problem is if someone edits the note, the system doesn't update the
default.xml. It creates a new file, default_1.xml (every edit creates a new
file with an incremented number; the system only displays the content from
the highest number).
  
   My problem is I index all the documents and end up with terms that were
taken out of a note several versions ago still showing up in the query. From my
point of view this makes sense, because the files are still in the content.
But to a user it is confusing, because they have no idea every change they
make to a note spawns a new file, and now they are seeing a term they removed
from their note 2 weeks ago showing up in a query.
  
   I have started modifying my incremental update to look for multiple
versions of the default.xml, but it is more work than I thought and is going
to make things complex.
  
   Maybe there is an easier way? If I just let it run and create the index,
can somebody suggest a way I could easily scan the index folder, ensuring
only the default.xml with the highest number in its filename remains (only
for folders where there is more than one default.xml file)? Or is this
wishful thinking?
 
  Thanks,
 
  Luke
 









Lucene and SVD

2004-11-17 Thread DES
Hi,
I need some kind of implementation of SVD (singular value decomposition) or
LSI with the Lucene engine. Does anyone have ideas on how to create a query
table for the decomposition? The table must have documents as rows and terms
as columns; if a term is present in the document, the corresponding field
contains 1, and 0 if not. Then the SVD will be applied to this table, and
with the first 2 columns the documents will be displayed in a 2D space.
Does anyone work on a project like this?
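A rough sketch of building that 0/1 term-document table from a Lucene index,
assuming the vocabulary and the table fit in memory (the index path is an
assumption; the resulting matrix would be handed to an external SVD routine):

import java.io.IOException;
import java.util.ArrayList;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class TermDocMatrix {
  public static int[][] build(String indexPath) throws IOException {
    IndexReader reader = IndexReader.open(indexPath);

    // Collect the vocabulary first: one column per term.
    ArrayList terms = new ArrayList();
    TermEnum termEnum = reader.terms();
    while (termEnum.next()) {
      terms.add(termEnum.term());
    }
    termEnum.close();

    // Documents as rows, terms as columns; 1 if the term occurs in the doc.
    int[][] table = new int[reader.maxDoc()][terms.size()];
    TermDocs termDocs = reader.termDocs();
    for (int j = 0; j < terms.size(); j++) {
      termDocs.seek((Term) terms.get(j));
      while (termDocs.next()) {
        table[termDocs.doc()][j] = 1;
      }
    }
    termDocs.close();
    reader.close();
    return table; // feed this to an external SVD implementation
  }
}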

thank you, and excuse my language skills :)
Anton 



Considering intermediary solution before Lucene question

2004-11-17 Thread jeichels

Is there a way to use Lucene's stemming and stop word removal without using the 
rest of the tool?   I am downloading the code now, but I imagine the answer 
might be deeply buried.  I would like to be able to send in a phrase and get 
back a collection of keywords if possible.

I am thinking of using an intermediary solution before moving fully to Lucene.  
I don't have time to spend a month making a carefully tested, administrable 
Lucene solution for my site yet, but I intend to do so over time.  Funny thing 
is the Lucene code likely would only take up a couple hundred lines, but 
integration and administration would take me much more time.

In the meantime, I am thinking I could use perhaps Lucene's stemming and parsing 
of words, then stick each search word along with the associated primary key in 
an indexed MySQL table.   Each record I would need to do this to is small, with 
maybe only 15 useful words on average.   I would be able to have an in-database 
solution, though ranking, etc. would not exist.   This is better than the exact 
word searching I have currently, which is really bad.

By the way, MySQL 4.1.1 has some Lucene-type handling, but it too does not have 
stemming, and I am sure it is very slow compared to Lucene.   Cpanel is still 
stuck on MySQL 4.0.*, so many people would not have access to even this basic 
ability in production systems for some time yet.

JohnE






Re: Considering intermediary solution before Lucene question

2004-11-17 Thread Otis Gospodnetic
Yes, you can use just the Analysis part.  For instance, I use this for
http://www.simpy.com and I believe we also have this in the Lucene book
as part of the source code package:

/**
 * Gets Tokens extracted from the given text, using the specified
 * Analyzer.
 *
 * @param analyzer the Analyzer to use
 * @param text the text to analyze
 * @param field the field to pass to the Analyzer for tokenization
 * @return an array of Tokens
 * @exception IOException if an error occurs
 */
public static Token[] getTokens(Analyzer analyzer, String text,
                                String field)
    throws IOException
{
    TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
    ArrayList tokenList = new ArrayList();
    while (true) {
        Token token = stream.next();
        if (token == null)
            break;
        tokenList.add(token);
    }
    return (Token[]) tokenList.toArray(new Token[0]);
}
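A usage sketch (the analyzer choice, text, and field name are arbitrary, and
StandardAnalyzer would need its import; the Token class here is
org.apache.lucene.analysis.Token):

Token[] tokens = getTokens(new StandardAnalyzer(),
                           "some phrase to analyze", "contents");
for (int i = 0; i < tokens.length; i++) {
    System.out.println(tokens[i].termText());
}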

Otis

--- [EMAIL PROTECTED] wrote:

 
 Is there a way to use Lucene's stemming and stop word removal without
 using the rest of the tool?   I am downloading the code now, but I
 imagine the answer might be deeply buried.  I would like to be able
 to send in a phrase and get back a collection of keywords if
 possible.
 
 I am thinking of using an intermediary solution before moving fully
 to Lucene.  I don't have time to spend a month making a carefully
 tested, administrable Lucene solution for my site yet, but I intend
 to do so over time.  Funny thing is the Lucene code likely would only
 take up a couple hundred lines, but integration and administration
 would take me much more time.
 
 In the meantime, I am thinking I could use perhaps Lucene's stemming
 and parsing of words, then stick each search word along with the
 associated primary key in an indexed MySQL table.   Each record I
 would need to do this to is small, with maybe only 15 useful words on
 average.   I would be able to have an in-database solution, though
 ranking, etc. would not exist.   This is better than the exact word
 searching I have currently, which is really bad.
 
 By the way, MySQL 4.1.1 has some Lucene-type handling, but it too
 does not have stemming, and I am sure it is very slow compared to
 Lucene.   Cpanel is still stuck on MySQL 4.0.*, so many people would
 not have access to even this basic ability in production systems for
 some time yet.
 
 JohnE
 
 
 
 
 





Re: Considering intermediary solution before Lucene question

2004-11-17 Thread jeichels
This is so cool, Otis.  I was just about to write this based on something in
the FAQ, but this is better than what I was doing.

This rocks!!!  Thank you.

JohnE

P.S.:  I am assuming you use org.apache.lucene.analysis.Token?   There are
three Token classes under Lucene.



- Original Message -
From: Otis Gospodnetic [EMAIL PROTECTED]
Date: Wednesday, November 17, 2004 7:17 pm
Subject: Re: Considering intermediary solution before Lucene question

 Yes, you can use just the Analysis part.  For instance, I use this for
 http://www.simpy.com and I believe we also have this in the Lucene book
 as part of the source code package:
 
 /**
  * Gets Tokens extracted from the given text, using the specified
  * Analyzer.
  *
  * @param analyzer the Analyzer to use
  * @param text the text to analyze
  * @param field the field to pass to the Analyzer for tokenization
  * @return an array of Tokens
  * @exception IOException if an error occurs
  */
 public static Token[] getTokens(Analyzer analyzer, String text,
                                 String field)
     throws IOException
 {
     TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
     ArrayList tokenList = new ArrayList();
     while (true) {
         Token token = stream.next();
         if (token == null)
             break;
         tokenList.add(token);
     }
     return (Token[]) tokenList.toArray(new Token[0]);
 }
 
 Otis
 
 --- [EMAIL PROTECTED] wrote:
 
  
  Is there a way to use Lucene's stemming and stop word removal without
  using the rest of the tool?   I am downloading the code now, but I
  imagine the answer might be deeply buried.  I would like to be able
  to send in a phrase and get back a collection of keywords if
  possible.
  
  I am thinking of using an intermediary solution before moving fully
  to Lucene.  I don't have time to spend a month making a carefully
  tested, administrable Lucene solution for my site yet, but I intend
  to do so over time.  Funny thing is the Lucene code likely would
  only take up a couple hundred lines, but integration and
  administration would take me much more time.
  
  In the meantime, I am thinking I could use perhaps Lucene's stemming
  and parsing of words, then stick each search word along with the
  associated primary key in an indexed MySQL table.   Each record I
  would need to do this to is small, with maybe only 15 useful words
  on average.   I would be able to have an in-database solution,
  though ranking, etc. would not exist.   This is better than the
  exact word searching I have currently, which is really bad.
  
  By the way, MySQL 4.1.1 has some Lucene-type handling, but it too
  does not have stemming, and I am sure it is very slow compared to
  Lucene.   Cpanel is still stuck on MySQL 4.0.*, so many people would
  not have access to even this basic ability in production systems for
  some time yet.
  
  JohnE
  
  
  
  
  
 
 
 
 





Re: Considering intermediary solution before Lucene question

2004-11-17 Thread Chris Lamprecht
John,

It actually should be pretty easy to use just the parts of Lucene you
want (the analyzers, etc) without using the rest.  See the example of
the PorterStemmer from this article:

http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=2

You could feed a Reader to the tokenStream() method of
PorterStemAnalyzer, and get back a TokenStream, from which you pull
the tokens using the next() method.
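A minimal sketch along those lines, following the article's PorterStemAnalyzer
idea rather than its exact code (the class names and the stop-word choice here
are illustrative; PorterStemFilter, LowerCaseTokenizer, and StopFilter are
standard Lucene analysis classes):

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class StemDemo {
  // Lowercase, drop English stop words, then Porter-stem each token.
  static class SimpleStemAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
      return new PorterStemFilter(
          new StopFilter(new LowerCaseTokenizer(reader),
                         StopAnalyzer.ENGLISH_STOP_WORDS));
    }
  }

  public static void main(String[] args) throws IOException {
    TokenStream stream = new SimpleStemAnalyzer()
        .tokenStream("contents", new StringReader("searching indexed documents"));
    for (Token t = stream.next(); t != null; t = stream.next()) {
      System.out.println(t.termText()); // e.g. "search", "index", "document"
    }
  }
}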



On Wed, 17 Nov 2004 18:54:07 -0500, [EMAIL PROTECTED]
[EMAIL PROTECTED] wrote:
 
  Is there a way to use Lucene's stemming and stop word removal without using
  the rest of the tool?   I am downloading the code now, but I imagine the
  answer might be deeply buried.  I would like to be able to send in a phrase
  and get back a collection of keywords if possible.
  
  I am thinking of using an intermediary solution before moving fully to
  Lucene.  I don't have time to spend a month making a carefully tested,
  administrable Lucene solution for my site yet, but I intend to do so over
  time.  Funny thing is the Lucene code likely would only take up a couple
  hundred lines, but integration and administration would take me much more
  time.
  
  In the meantime, I am thinking I could use perhaps Lucene's stemming and
  parsing of words, then stick each search word along with the associated
  primary key in an indexed MySQL table.   Each record I would need to do
  this to is small, with maybe only 15 useful words on average.   I would be
  able to have an in-database solution, though ranking, etc. would not
  exist.   This is better than the exact word searching I have currently,
  which is really bad.
  
  By the way, MySQL 4.1.1 has some Lucene-type handling, but it too does not
  have stemming, and I am sure it is very slow compared to Lucene.   Cpanel
  is still stuck on MySQL 4.0.*, so many people would not have access to
  even this basic ability in production systems for some time yet.
 
 JohnE
 
 





Re: Considering intermediary solution before Lucene question

2004-11-17 Thread jeichels
I thank you both.  I have it already partly implemented here.   It seems easy.

At least this should carry my product through until I can really get to use
Lucene.  I am not sure how far I can take MySQL with stemmed, indexed key
words, but it should give me maybe 6 months at least of something useful, as
opposed to impossible searching.  I need time, and this might just be the
trick.

I always fight for simplicity, but it is hard when you have 2 databases that
have to be kept in sync.  If accuracy is important (people paying money), then
handling all of the edge cases (such as the question that was just asked
about what happens if the machine goes down) is so important.  I understand
this is beyond the scope of Lucene.

Thank you for the help.  This really is an interesting project.

JohnE



- Original Message -
From: Chris Lamprecht [EMAIL PROTECTED]
Date: Wednesday, November 17, 2004 7:08 pm
Subject: Re: Considering intermediary solution before Lucene question

 John,
 
 It actually should be pretty easy to use just the parts of Lucene you
 want (the analyzers, etc) without using the rest.  See the example of
 the PorterStemmer from this article:
 
 http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=2
 
 You could feed a Reader to the tokenStream() method of
 PorterStemAnalyzer, and get back a TokenStream, from which you pull
 the tokens using the next() method.
 
 
 
 On Wed, 17 Nov 2004 18:54:07 -0500, [EMAIL PROTECTED]
 [EMAIL PROTECTED] wrote:
  
  Is there a way to use Lucene's stemming and stop word removal
  without using the rest of the tool?   I am downloading the code
  now, but I imagine the answer might be deeply buried.  I would
  like to be able to send in a phrase and get back a collection of
  keywords if possible.
  
  I am thinking of using an intermediary solution before moving
  fully to Lucene.  I don't have time to spend a month making a
  carefully tested, administrable Lucene solution for my site yet,
  but I intend to do so over time.  Funny thing is the Lucene code
  likely would only take up a couple hundred lines, but
  integration and administration would take me much more time.
  
  In the meantime, I am thinking I could use perhaps Lucene's
  stemming and parsing of words, then stick each search word along
  with the associated primary key in an indexed MySQL table.   Each
  record I would need to do this to is small, with maybe only 15
  useful words on average.   I would be able to have an in-database
  solution, though ranking, etc. would not exist.   This is better
  than the exact word searching I have currently, which is really bad.
  
  By the way, MySQL 4.1.1 has some Lucene-type handling, but it
  too does not have stemming, and I am sure it is very slow compared
  to Lucene.   Cpanel is still stuck on MySQL 4.0.*, so many people
  would not have access to even this basic ability in production
  systems for some time yet.
  
  JohnE
  
  
 
 
 
 





RE: Index copy

2004-11-17 Thread Ravi
 
Thanks. I was looking for an OS-independent way of copying. Probably I
can use the BufferedInputStream and BufferedOutputStream classes to copy
the index to a different location.
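A sketch of that OS-independent copy using buffered streams (the class and
method names are hypothetical; make sure nothing is writing to the index
while this runs):

import java.io.*;

public class IndexCopier {
  // Copy every file in the index directory to the destination directory.
  public static void copyIndex(File src, File dest) throws IOException {
    dest.mkdirs();
    File[] files = src.listFiles();
    byte[] buf = new byte[8192];
    for (int i = 0; i < files.length; i++) {
      InputStream in = new BufferedInputStream(new FileInputStream(files[i]));
      OutputStream out = new BufferedOutputStream(
          new FileOutputStream(new File(dest, files[i].getName())));
      int len;
      while ((len = in.read(buf)) != -1) {
        out.write(buf, 0, len);
      }
      out.close();
      in.close();
    }
  }
}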
 
-Original Message-
From: Justin Swanhart [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 17, 2004 2:35 PM
To: Lucene Users List
Subject: Re: Index copy

You could lock your index for writes, then copy the files using operating
system copy commands.

Another way would be to lock your index, make a filesystem snapshot,
then unlock your index.  You can then safely copy the snapshot without
interrupting further index operations.

On Wed, 17 Nov 2004 11:25:48 -0500, Ravi [EMAIL PROTECTED] wrote:
 What's the best way to copy an index from one directory to another? I
 tried opening an IndexWriter at the new location and using addIndexes
 to read from the old index, but that was very slow.
 
 Thanks in advance,
 Ravi.
 
 





