Re: Need help with filtering
On Wednesday 17 November 2004 01:20, Edwin Tang wrote:
> Hello, I have been using DateFilter to limit my search results to a certain date range. I am now asked to replace this filter with one where my search results have document IDs greater than a given document ID. This document ID is assigned during indexing and is a Keyword field. I've browsed around the FAQs and archives and see that I can either use QueryFilter or BooleanQuery. I've tried both approaches to limit the document ID range, but am getting the BooleanQuery.TooManyClauses exception in both cases. I've also tried bumping the max number of clauses via setMaxClauseCount(), but that number has gotten pretty big. Is there another approach to this?

Recoding DateFilter to a DocumentIdFilter should be straightforward. The trick is to use only one document enumerator at a time for all terms. Document enumerators take buffer space, and that is the reason why BooleanQuery has an exception for too many clauses.

Regards,
Paul

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
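A DocumentIdFilter along the lines Paul describes would, like DateFilter, walk the term dictionary once and set a bit for every document whose keyword term falls in the wanted range, instead of expanding the range into BooleanQuery clauses. The sketch below is a hedged, self-contained illustration of that one-pass idea: the sorted map stands in for the index's term dictionary (Lucene's TermEnum/TermDocs), which is not reproduced here.

```java
import java.util.BitSet;
import java.util.SortedMap;

// Builds the filter bitset the way DateFilter does: one pass over the
// (sorted) term dictionary, one enumerator at a time, no query clauses.
public class DocumentIdFilter {

    // sortedTermToDocs: document-ID term -> Lucene document numbers
    // carrying that term (our stand-in for TermEnum/TermDocs).
    static BitSet bits(SortedMap<String, int[]> sortedTermToDocs,
                       String lowerIdExclusive, int indexSize) {
        BitSet bits = new BitSet(indexSize);
        // tailMap with the next-possible string visits only terms that are
        // strictly greater than the given document ID.
        for (int[] docs : sortedTermToDocs.tailMap(lowerIdExclusive + "\0").values()) {
            for (int doc : docs) {
                bits.set(doc); // document matches the ID-range filter
            }
        }
        return bits;
    }
}
```

Because only one enumeration is in flight at a time, memory stays flat no matter how many terms fall inside the range, which is exactly why the filter approach avoids TooManyClauses.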
Re: COUNT SUBINDEX [IN MERGERINDEX]
On Wednesday 17 November 2004 07:10, Karthik N S wrote:
> Hi guys, apologies. So a merged index is again a single index [an addition of subindexes]. In that case, if one of the fields is of type 'Field.Keyword' and is unique across the subindexes [before merging], and I want to count this unique field in the merged index [after it's been merged], how do I do this, please?

IndexReader.numDocs() will give the number of docs in an index.

Lucene has no direct support for unique fields. After merging, if the same unique field value occurs in both source indexes, the merged index will contain two documents with that value. In case one wants to merge into unique field values, the non-unique values in one of the source indexes need to be deleted before merging. See IndexReader.termDocs(term) on how to get the document numbers for (unique) terms via a TermDocs, and IndexReader.delete(docNum) for deleting docs.

Regards,
Paul.
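Paul's dedup-before-merge step can be sketched in plain Java. This is a hedged, self-contained illustration: maps stand in for the two source indexes' unique-key terms, and the returned document numbers are the ones real code would pass to IndexReader.delete(docNum) after locating them via IndexReader.termDocs(term).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DedupBeforeMerge {

    // Given the unique-key -> docNum maps of two source indexes, return the
    // doc numbers to delete from the second index so that no key occurs in
    // both sources after merging.
    static List<Integer> docsToDelete(Map<String, Integer> first,
                                      Map<String, Integer> second) {
        List<Integer> toDelete = new ArrayList<>();
        for (Map.Entry<String, Integer> e : second.entrySet()) {
            if (first.containsKey(e.getKey())) { // duplicate unique key
                toDelete.add(e.getValue());
            }
        }
        return toDelete;
    }
}
```

After deleting these documents from one source and merging, every unique-key value appears at most once, so IndexReader.numDocs() on the merged index counts the unique field values directly.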
Re: Index Locking Issues Resolved...I hope
I was thinking that perhaps I can pre-stem words before sticking them in a search field in the database, perhaps using Lucene's stemming code, then try to use the Natural Language Search found in MySQL 4.1.1. I am confident the MySQL product can't keep up with Lucene yet, but at least they have improved it some. Not even sure if my hosting company will upgrade to 4.1.1 though. Still looking for a lot of solutions to make Lucene sit in sync more nicely with MySQL as the main database... aka an easy-to-use way of handling it.

- Original Message -
From: Chris Lamprecht [EMAIL PROTECTED]
Date: Wednesday, November 17, 2004 1:38 am
Subject: Re: Index Locking Issues Resolved...I hope

> MySQL does offer a basic fulltext search (with MyISAM tables), but it doesn't really approach the functionality of Lucene, such as pluggable tokenizers, stemming, etc. I think MS SQL Server has fulltext search as well, but I have no idea if it's any good. See http://www.google.com/search?hl=enlr=safe=offc2coff=1q=mysql+fulltext

I have not seen clearly yet because it is all new. I wish a database Text field could have this sort of mechanism built into it. MySQL (what I am using) does not do this, but I am going to check into other databases now. OJB will work with most all of them, so that would help if there is a database type of solution that will allow that sleep-at-night thing to happen!!!
RE: COUNT SUBINDEX [IN MERGERINDEX]
Hi guys, apologies, I am still confused. ;( Let me make the question simpler: when searching an index without any search word, I would like to count the total number of documents present in it. [I only have fields of type 'Field.Keyword', which store the unique filename.] Will IndexReader.termDocs(term) give me that count? If so, how do I use it, please? Thanks in advance.

Karthik

-Original Message-
From: Paul Elschot [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 17, 2004 2:02 PM
To: [EMAIL PROTECTED]
Subject: Re: COUNT SUBINDEX [IN MERGERINDEX]

> IndexReader.numDocs() will give the number of docs in an index. Lucene has no direct support for unique fields. After merging, if the same unique field value occurs in both source indexes, the merged index will contain two documents with that value. In case one wants to merge into unique field values, the non-unique values in one of the source indexes need to be deleted before merging. See IndexReader.termDocs(term) on how to get the document numbers for (unique) terms via a TermDocs, and IndexReader.delete(docNum) for deleting docs.
> Regards, Paul.
tool to check the index field
Hi all,

I have an index created by other people. Now I want to know how many fields there are in the index. Is there any third-party tool to do this? I saw a GUI tool for this somewhere but forgot the name.

Regards
LingaRaju
RE: tool to check the index field
Try using:

Luke : http://www.getopt.org/luke/
Limo : http://limo.sourceforge.net/

Regards,
Kiran.

-Original Message-
From: lingaraju [mailto:[EMAIL PROTECTED]]
Sent: 17 November 2004 16:00
To: Lucene Users List
Subject: tool to check the index field

> I have an index created by other people. Now I want to know how many fields there are in the index. Is there any third-party tool to do this? I saw a GUI tool for this somewhere but forgot the name.
RE: best ways of using IndexSearcher
Yes, IndexSearcher is thread safe.

Aviran
http://www.aviransplace.com

-Original Message-
From: Abhay Saswade [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 16, 2004 15:16 PM
To: Lucene Users List
Subject: Re: best ways of using IndexSearcher

> Hello, Can I use a single instance of IndexSearcher in multiple threads with sorting? Thanks, Abhay

- Original Message -
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, June 28, 2004 8:51 PM
Subject: Re: best ways of using IndexSearcher

> Anson,
>
> Use a single instance of IndexSearcher and, if you want to always 'see' even the latest index changes (deletes and adds since you opened the IndexSearcher), make sure to re-create the IndexSearcher when you detect that the index version has changed (see http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#getCurrentVersion(org.apache.lucene.store.Directory)). When you get the new IndexSearcher, leave the old instance alone - let the GC take care of it, and don't call close() on it, in case something in your application is still using that instance.
>
> This stuff is not really CPU intensive. Disk I/O tends to be the bottleneck. If you are working with multiple indices, spread them over multiple disks (not just partitions, real disks), if you can.
>
> Otis

--- Anson Lau [EMAIL PROTECTED] wrote:
> Hi Guys, What's the recommended way of using IndexSearcher? Should IndexSearcher be a singleton or pooled? Would pooling provide a more scalable solution by allowing you to decide how many IndexSearchers to use based on, say, how many CPUs you have on your server? Thanks, Anson
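Otis's advice (share one searcher, re-create it only when the index version changes, and let the GC collect the old one) can be sketched as a small holder class. This is a hedged, self-contained illustration of the pattern only: the two suppliers stand in for IndexReader.getCurrentVersion(directory) and new IndexSearcher(directory), which are not reproduced here.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Caches a shared searcher and re-creates it when the index version changes.
// The old instance is simply dropped (not closed), as recommended in the
// thread, so threads still searching with it are unaffected.
public class SearcherHolder<S> {
    private final LongSupplier indexVersion; // stand-in for IndexReader.getCurrentVersion(dir)
    private final Supplier<S> openSearcher;  // stand-in for new IndexSearcher(dir)
    private long cachedVersion = Long.MIN_VALUE;
    private S cached;

    public SearcherHolder(LongSupplier indexVersion, Supplier<S> openSearcher) {
        this.indexVersion = indexVersion;
        this.openSearcher = openSearcher;
    }

    public synchronized S get() {
        long v = indexVersion.getAsLong();
        if (cached == null || v != cachedVersion) {
            cached = openSearcher.get(); // old searcher is left to the GC
            cachedVersion = v;
        }
        return cached;
    }
}
```

All request threads call get() and search with the returned instance; re-opening happens at most once per index version, not per query.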
Re: Need help with filtering
Ah... recoding DateFilter. I will look into this today. Thanks for the help.

Ed

--- Paul Elschot [EMAIL PROTECTED] wrote:
> Recoding DateFilter to a DocumentIdFilter should be straightforward. The trick is to use only one document enumerator at a time for all terms. Document enumerators take buffer space, and that is the reason why BooleanQuery has an exception for too many clauses.
> Regards, Paul
Re: Whitespace Analyzer not producing expected search results
Thanks for the suggestions Erik. Displaying the query string is really useful, and this is what I've found.

I issue a search using the search term ResponseHelper.writeNoCachingHeaders\(response\); The search is parsed using a query parser and produces the following query string:

    +contents:ResponseHelper.writeNoCachingHeaders(response);

This looks good and finds two documents. I then try a search using the term ResponseHelper.writeNoCachingHeaders\(*\); Now I'm expecting this to be a wider search term, and it should find at least two, possibly more docs. The query parser produces the query:

    +contents:responsehelper.writenocachingheaders(*);

Wow, the query has lost its case and no docs get returned. Why does the query parser do this (my analyzer is the provided whitespace one)? Any ideas to get around this?

Thanks
Lee C

> Try using a TermQuery instead of QueryParser to see if you get the results you expect. Exact case matters. Also, when troubleshooting issues with QueryParser, it is helpful to see what the actual Query returned is - try displaying its toString output.
> Erik

On Nov 16, 2004, at 6:25 AM, [EMAIL PROTECTED] wrote:

> Hi, We have indexed a set of web files (jsp, js, xslt, java properties and html) using the Lucene WhitespaceAnalyzer. The purpose is to allow developers to find where code / functions are used and defined across a large and disparate content management repository - hopefully to aid code re-use, easier refactoring and standards control. However, when a query parser search is made using a whitespace analyzer with a string known to be in an indexed file, the search returns zero hits. For example, the string jsp\:include page=\/path1/path2/path3/path4/file1.jsp\ / is searched for using the query parser (escaping the meta-chars), and an indexed document which contains the following text should be found:
>
>     // include HTML head % jsp:include page=/path1/path2/path3/path4/file1.jsp / script language=JavaScript src=/path1/path2/path3/file1.js/script !-- script
>
> I've taken a look at the FAQ advice regarding checking the effects of an analyzer (in our case whitespace), but our test class returns the expected tokens for any given token stream. For example, this string % mytoken1 mytoken2 % is tokenised by the whitespace analyzer as [%] [mytoken1] [mytoken2] [%]. I'm sure I've missed something, but I can't see what it is. If anyone could shed any light on possible reasons why we are getting zero hits for text strings which are in our indexed files, I'd be really grateful. See below for more info on the index and search set-up.
>
> Thanks a lot
> Lee C
>
> File contents are in a tokenised, indexed, not stored field. The index uses the whitespace analyzer which comes with Lucene. Searches are performed using a boolean query. The boolean query is made up of a query parser, which gets its search term from an HTML text box entered by the user, and a prefix query, which is used to limit search scope by directory paths. The search uses a whitespace analyzer; no filtering takes place.
Re: tool to check the index field
Try this: http://www.getopt.org/luke/

Luke

- Original Message -
From: lingaraju [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 17, 2004 10:00 AM
Subject: tool to check the index field

> I have an index created by other people. Now I want to know how many fields there are in the index. Is there any third-party tool to do this? I saw a GUI tool for this somewhere but forgot the name.
Index copy
What's the best way to copy an index from one directory to another? I tried opening an IndexWriter at the new location and used addIndexes to read from the old index, but that was very slow.

Thanks in advance,
Ravi.
Re: Whitespace Analyzer not producing expected search results
On Nov 17, 2004, at 7:44 AM, [EMAIL PROTECTED] wrote:
> I then try a search using the term ResponseHelper.writeNoCachingHeaders\(*\); now I'm expecting this to be a wider search term and it should find at least two, possibly more docs? the query parser produces the query +contents:responsehelper.writenocachingheaders(*); wow the query has lost its case and no docs get returned. Why does the query parser do this (my analyzer is the provided whitespace one). Any ideas to get around this ?

Because generally terms are lowercased when indexed by the analyzer (but not in your case with WhitespaceAnalyzer), QueryParser defaults to lowercasing wildcarded queries. Wildcard query terms are not analyzed. To get around this, construct an instance of QueryParser and turn the lowercasing of wildcard terms off:

    QueryParser parser = new QueryParser(field, new StandardAnalyzer());
    parser.setLowercaseWildcardTerms(false);

Use the instance of QueryParser instead of the static parse method from now on.

Erik
Re: Whitespace Analyzer not producing expected search results
Thanks a lot for the solution / explanation. Saved the day, Erik.

Summary

Observation: Using a wildcarded search term with QueryParser and the WhitespaceAnalyzer returned no hits when hits were expected.

Reason: This was caused by the default behaviour of QueryParser, which lowercases wildcarded search terms.

Resolution: Use an instance of QueryParser, setting the instance's setLowercaseWildcardTerms to false.

Example:
    QueryParser parser = new QueryParser(field, new StandardAnalyzer());
    parser.setLowercaseWildcardTerms(false);

Solution provided by Erik Hatcher
index document pdf
Hi, I downloaded PDFBox 0.6.4. What do I add to the source code of the Lucene demo?

--
Miguel Angel Angeles R.
Asesoria en Conectividad y Servidores
Telf. 97451277
Re: WildcardTermEnum skipping terms containing numbers?!
test
Re: WildcardTermEnum skipping terms containing numbers?!
Enumerating the terms using WildcardTermEnum and an IndexReader seems to be too buggy to use. I'm now reimplementing my code using WildcardTermEnum.wildcardEquals, which seems to be better so far.

--- Sanyi [EMAIL PROTECTED] wrote:
> Hi! I have the following problem with 1.4.2: I'm searching for c?ca (using StandardAnalyzer) and one of the hits looks something like this: blabla c0ca c0la etc. (those big o-s are zero characters). Now, I'm enumerating the terms using WildcardTermEnum and all I get is: caca ccca ceca cica coca crca csca cuca cyca. It doesn't know about c0ca at all. Is there any solution to come over this problem?
> Thanks, Sanyi
Re: Index copy
You could lock your index for writes, then copy the files using operating system copy commands. Another way would be to lock your index, make a filesystem snapshot, then unlock your index. You can then safely copy the snapshot without interrupting further index operations.

On Wed, 17 Nov 2004 11:25:48 -0500, Ravi [EMAIL PROTECTED] wrote:
> What's the best way to copy an index from one directory to another? I tried opening an IndexWriter at the new location and used addIndexes to read from the old index, but that was very slow.
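The "lock, OS-copy, unlock" approach can be sketched as a few shell commands. This is only an illustration under stated assumptions: the directory names are made up, the `segments` file is a stand-in for real index files, and simply touching a `write.lock` file is a crude stand-in for pausing your indexing process, which is the safe way to quiesce writes.

```shell
# Sketch of "lock, copy, unlock" for backing up an index directory.
set -e
SRC=./index
DST=./index-backup

mkdir -p "$SRC" "$DST"
touch "$SRC/segments"        # stand-in for real index files

# 1. Quiesce writers so no IndexWriter modifies the index mid-copy.
touch "$SRC/write.lock"

# 2. Plain OS copy -- much faster than addIndexes().
cp -r "$SRC/." "$DST/"
rm -f "$DST/write.lock"      # don't carry the lock into the copy

# 3. Release the lock so indexing can resume.
rm "$SRC/write.lock"
ls "$DST"
```

A filesystem snapshot (LVM, ZFS, etc.) shortens step 2 to an instant, which is why the second suggestion in the thread scales better for large indexes.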
Something missing !!???
I noticed lately that a lot of people discuss the bugs of Lucene with each other, but something is missing. I consider Lucene an indexing tool for text files and so on, but there are a lot of tools that do this kind of indexing. What about compression - compressing the original text files and their indexes, and performing indexing on them, like the MG system, which is efficient in both compression and indexing? Where is all of that in Lucene? Please help me: if these requirements are satisfied in Lucene, please notify me and send a link to the new version.

Thanks a lot.
Re: Something missing !!???
The HEAD version of CVS supports gz compression. You will need to check it out using cvs if you want to use it.

On Wed, 17 Nov 2004 21:43:36 +0200, abdulrahman galal [EMAIL PROTECTED] wrote:
> what about compression ... compressing original text files and its indexes and performing indexing on them like (MG) system which is effecient in compression and indexing ... where all of that in Lucene
version documents
Hey all;

I have run into an interesting case. Our system has notes. These need to be indexed. They are XML files called default.xml and are easily parsed and indexed. No problem, have been doing it all week.

The problem is that if someone edits a note, the system doesn't update default.xml. It creates a new file, default_1.xml (every edit creates a new file with an incremented number; the system only displays the content from the highest number).

My problem is I index all the documents and end up with terms that were taken out of a note several versions ago still showing up in the query. From my point of view this makes sense, because the files are still in the content. But to a user it is confusing, because they have no idea every change they make to a note spawns a new file, and now they are seeing a term they removed from their note two weeks ago showing up in a query.

I have started modifying my incremental update to look for multiple versions of default.xml, but it is more work than I thought and is going to make things complex. Maybe there is an easier way? If I just let it run and create the index, can somebody suggest a way I could easily scan the index folder, ensuring only the default.xml with the highest number in its filename remains (only for folders where there is more than one default.xml file)? Or is this wishful thinking?

Thanks,
Luke
mergeFactor
Can somebody explain the difference between the parameters minMergeDocs and mergeFactor in IndexWriter? When I read the documentation, it looks like both of them represent the number of documents to be buffered before they are merged into a new segment.

Thanks in advance,
Ravi.
Re: version documents
Split the filename into basefilename and version and make each a keyword. Sort your query by version descending, and only use the first basefile you encounter.

On Wed, 17 Nov 2004 15:05:19 -0500, Luke Shannon [EMAIL PROTECTED] wrote:
> If I just let it run and create the index, can somebody suggest a way I could easily scan the index folder ensuring only the default.xml with the highest number in its filename remains (only for folders were there is more than one default.xml file)? Or is this wishful thinking?
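Justin's split can be sketched in plain Java. This is a hedged, self-contained illustration: it shows only the filename-to-version parsing and the "keep the highest version per note folder" selection; indexing the two parts as Field.Keyword values and sorting the query by version is left to the real code.

```java
import java.util.HashMap;
import java.util.Map;

public class NoteVersions {

    // "default_3.xml" -> 3, "default.xml" -> 0 (the unversioned original).
    static int version(String filename) {
        int us = filename.lastIndexOf('_');
        int dot = filename.lastIndexOf('.');
        if (us < 0 || dot < us) return 0;
        return Integer.parseInt(filename.substring(us + 1, dot));
    }

    // Given (folder, filename) pairs, keep only the highest version per folder.
    static Map<String, String> latestPerFolder(String[][] folderAndFile) {
        Map<String, String> latest = new HashMap<>();
        for (String[] pair : folderAndFile) {
            String folder = pair[0], file = pair[1];
            String current = latest.get(folder);
            if (current == null || version(file) > version(current)) {
                latest.put(folder, file);
            }
        }
        return latest;
    }
}
```

During incremental indexing, anything not in the latest-per-folder map is a stale version and can be skipped (or deleted from the index), so removed terms stop matching.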
Re: version documents
That is a good idea. Thanks!

- Original Message -
From: Justin Swanhart [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, November 17, 2004 3:38 PM
Subject: Re: version documents

> Split the filename into basefilename and version and make each a keyword. Sort your query by version descending, and only use the first basefile you encounter.
Lucene and SVD
Hi,

I need some kind of implementation of SVD (singular value decomposition) or LSI with the Lucene engine. Does anyone have any ideas how to create a query table for the decomposition? The table must have documents as rows and terms as columns; if a term is present in the document, the corresponding field contains 1, and 0 if not. Then the SVD will be applied to this table, and using the first 2 columns the documents will be displayed in a 2D space. Does anyone work on a project like this?

Thank you, and excuse my language skills :)

Anton
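The binary term-document table Anton describes can be sketched without Lucene at all; in practice one would walk the index's term dictionary and postings (IndexReader.terms() / termDocs()) instead of pre-tokenized sets. A minimal, self-contained illustration of building the docs-by-terms 0/1 matrix:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

public class TermDocMatrix {

    // Builds the docs x terms 0/1 matrix from already-tokenized documents.
    static int[][] build(List<Set<String>> docs, List<String> vocabulary) {
        int[][] m = new int[docs.size()][vocabulary.size()];
        for (int d = 0; d < docs.size(); d++) {
            for (int t = 0; t < vocabulary.size(); t++) {
                m[d][t] = docs.get(d).contains(vocabulary.get(t)) ? 1 : 0;
            }
        }
        return m; // hand this to any SVD routine; keep the first 2 components for 2D plots
    }

    // Sorted union of all terms, so columns have a stable order.
    static List<String> vocabulary(List<Set<String>> docs) {
        SortedSet<String> vocab = new TreeSet<>();
        for (Set<String> doc : docs) vocab.addAll(doc);
        return new ArrayList<>(vocab);
    }
}
```

The SVD itself would come from a linear-algebra library; Lucene only supplies the matrix entries.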
Considering intermediary solution before Lucene question
Is there a way to use Lucene's stemming and stop word removal without using the rest of the tool? I am downloading the code now, but I imagine the answer might be deeply buried. I would like to be able to send in a phrase and get back a collection of keywords if possible.

I am thinking of using an intermediary solution before moving fully to Lucene. I don't have time to spend a month making a carefully tested, administratable Lucene solution for my site yet, but I intend to do so over time. Funny thing is, the Lucene code likely would only take up a couple hundred lines, but integration and administration would take me much more time.

In the meantime, I am thinking I could perhaps use Lucene's stemming and parsing of words, then stick each search word along with the associated primary key in an indexed MySQL table. Each record I would need to do this to is small, with maybe only 15 useful words on average. I would be able to have an in-database solution, though ranking, etc. would not exist. This is better than the exact-word searching I have currently, which is really bad.

By the way, MySQL 4.1.1 has some Lucene-type handling, but it too does not have stemming, and I am sure it is very slow compared to Lucene. Cpanel is still stuck on MySQL 4.0.*, so many people would not have access to even this basic ability in production systems for some time yet.

JohnE
Re: Considering intermediary solution before Lucene question
Yes, you can use just the Analysis part. For instance, I use this for http://www.simpy.com and I believe we also have this in the Lucene book as part of the source code package:

    /**
     * Gets Tokens extracted from the given text, using the specified Analyzer.
     *
     * @param analyzer the Analyzer to use
     * @param text the text to analyze
     * @param field the field to pass to the Analyzer for tokenization
     * @return an array of Tokens
     * @exception IOException if an error occurs
     */
    public static Token[] getTokens(Analyzer analyzer, String text, String field)
            throws IOException {
        TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
        ArrayList tokenList = new ArrayList();
        while (true) {
            Token token = stream.next();
            if (token == null) break;
            tokenList.add(token);
        }
        return (Token[]) tokenList.toArray(new Token[0]);
    }

Otis

--- [EMAIL PROTECTED] wrote:
> Is there a way to use Lucene stemming and stop word removal without using the rest of the tool? I am downloading the code now, but I imagine the answer might be deeply buried. I would like to be able to send in a phrase and get back a collection of keywords if possible.
Re: Considering intermediary solution before Lucene question
This is so cool, Otis. I was just about to write this off of something in the FAQ, but this is better than what I was doing. This rocks!!! Thank you.

JohnE

P.S.: I am assuming you use org.apache.lucene.analysis.Token? There are three Tokens under Lucene.

- Original Message -
From: Otis Gospodnetic [EMAIL PROTECTED]
Date: Wednesday, November 17, 2004 7:17 pm
Subject: Re: Considering intermediary solution before Lucene question

> Yes, you can use just the Analysis part. For instance, I use this for http://www.simpy.com and I believe we also have this in the Lucene book as part of the source code package: [getTokens example snipped]
> Otis
Re: Considering intermediary solution before Lucene question
John,

It actually should be pretty easy to use just the parts of Lucene you want (the analyzers, etc.) without using the rest. See the example of the PorterStemmer from this article: http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=2

You could feed a Reader to the tokenStream() method of PorterStemAnalyzer, and get back a TokenStream, from which you pull the tokens using the next() method.

On Wed, 17 Nov 2004 18:54:07 -0500, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
Is there a way to use Lucene stemming and stop word removal without using the rest of the tool? [...]

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
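The next()-until-null pull loop Chris describes can be sketched as below. This is a self-contained illustration only: SimpleTokenStream is a stand-in class of my own, not Lucene's TokenStream, and its stop list and suffix stripping are deliberately naive (a real analyzer would chain a StopFilter and a Porter stemmer).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stand-in for a Lucene TokenStream: next() returns the next token,
// or null once the stream is exhausted -- the same pull pattern as Lucene's.
class SimpleTokenStream {
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "of", "to"));
    private final String[] words;
    private int pos = 0;

    SimpleTokenStream(String text) {
        this.words = text.toLowerCase().split("\\s+");
    }

    /** Returns the next stemmed, non-stop-word token, or null at end of stream. */
    String next() {
        while (pos < words.length) {
            String w = words[pos++];
            if (!STOP_WORDS.contains(w)) {
                return stem(w);
            }
        }
        return null;
    }

    // Naive suffix stripping for illustration; not a real Porter stemmer.
    private static String stem(String w) {
        if (w.endsWith("ing") && w.length() > 5) return w.substring(0, w.length() - 3);
        if (w.endsWith("s") && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }
}

public class TokenLoopDemo {
    public static void main(String[] args) {
        SimpleTokenStream stream = new SimpleTokenStream("the searching of documents");
        List<String> tokens = new ArrayList<>();
        String token;
        // Pull tokens one at a time until the stream is exhausted.
        while ((token = stream.next()) != null) {
            tokens.add(token);
        }
        System.out.println(tokens); // [search, document]
    }
}
```

With a real PorterStemAnalyzer the loop body is identical; only the construction of the stream (analyzer.tokenStream(field, reader)) changes.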
Re: Considering intermediary solution before Lucene question
I thank you both. I have it already partly implemented here, and it seems easy. At least this should carry my product through until I can really get to use Lucene. I am not sure how far I can take MySQL with stemmed, indexed keywords, but it should give me at least 6 months of something useful, as opposed to impossible searching. I need time, and this might just be the trick.

I always fight for simplicity, but it is hard when you have two databases that have to be kept in sync. If accuracy is important (people paying money), then handling all of the edge cases (such as the question that was just asked about the machine going down) is so important. I understand this is beyond the scope of Lucene.

Thank you for the help. This really is an interesting project.

JohnE

----- Original Message -----
From: Chris Lamprecht [EMAIL PROTECTED]
Date: Wednesday, November 17, 2004 7:08 pm
Subject: Re: Considering intermediary solution before Lucene question

It actually should be pretty easy to use just the parts of Lucene you want (the analyzers, etc.) without using the rest. [...]

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Index copy
Thanks. I was looking for an O/S-independent way of copying. Probably I can use the BufferedInputStream and BufferedOutputStream classes to copy the index to a different location.

-----Original Message-----
From: Justin Swanhart [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 17, 2004 2:35 PM
To: Lucene Users List
Subject: Re: Index copy

You could lock your index for writes, then copy the files using operating system copy commands. Another way would be to lock your index, make a filesystem snapshot, then unlock your index. You can then safely copy the snapshot without interrupting further index operations.

On Wed, 17 Nov 2004 11:25:48 -0500, Ravi [EMAIL PROTECTED] wrote:
What's the best way to copy an index from one directory to another? I tried opening an IndexWriter at the new location and used addIndexes to read from the old index, but that was very slow.

Thanks in advance,
Ravi.

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
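The buffered-stream copy mentioned above can be sketched as follows. This is a minimal sketch using only java.io: the class and method names (IndexCopy, copyFile, copyIndexDir) are mine for illustration, and it assumes the caller has already locked the index against writes, as Justin suggests.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class IndexCopy {
    /** Copies one file byte-for-byte using buffered streams; works on any platform. */
    static void copyFile(File src, File dest) throws IOException {
        InputStream in = new BufferedInputStream(new FileInputStream(src));
        OutputStream out = new BufferedOutputStream(new FileOutputStream(dest));
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } finally {
            out.close();
            in.close();
        }
    }

    /** Copies every regular file in an index directory to a new location. */
    static void copyIndexDir(File srcDir, File destDir) throws IOException {
        if (!destDir.exists() && !destDir.mkdirs()) {
            throw new IOException("could not create " + destDir);
        }
        File[] files = srcDir.listFiles();
        if (files == null) {
            throw new IOException("not a directory: " + srcDir);
        }
        for (File f : files) {
            if (f.isFile()) {
                copyFile(f, new File(destDir, f.getName()));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo with a temp file; real use would point at the index directory
        // while the index write lock is held.
        File src = File.createTempFile("seg", ".tmp");
        FileOutputStream fos = new FileOutputStream(src);
        fos.write("index bytes".getBytes("UTF-8"));
        fos.close();
        File dest = new File(src.getParentFile(), src.getName() + ".copy");
        copyFile(src, dest);
        System.out.println(dest.length() == src.length());
    }
}
```

Unlike addIndexes(), this copies the raw segment files without re-merging them, which is why it is so much faster for a plain backup.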