RE: Lucene 1.4 RC 3 issue with temp directory
Your catalina.bat script is guessing your CATALINA_HOME environment variable since you don't have one set and is setting java.io.tmpdir based on that guess. You could work around this by setting a CATALINA_HOME environment variable or setting the system property org.apache.lucene.lockdir.

That doesn't solve the problem for Lucene locks when java.io.tmpdir is set to a relative path that does not exist though.

Eric

-----Original Message-----
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Monday, May 17, 2004 2:15 PM
To: [EMAIL PROTECTED]
Subject: Lucene 1.4 RC 3 issue with temp directory

Hi All,

I just upgraded to 1.4 RC 3 and am now unable to open my index. I am getting:

java.io.IOException: The system cannot find the path specified
    at java.io.WinNTFileSystem.createFileExclusively(Native Method)
    at java.io.File.createNewFile(File.java:828)
    at org.apache.lucene.store.FSDirectory$1.obtain(FSDirectory.java:297)
    at org.apache.lucene.store.Lock.obtain(Lock.java:53)
    at org.apache.lucene.store.Lock$With.run(Lock.java:108)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:95)
    at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:38)

I _have_ reindexed using the new lucene jar. I am positive the path is correct as I can open an index in the same directory with the old Lucene with no problems. I notice that the problem only occurs when I am deployed inside of Tomcat. If I run searches on the command line or through JUnit everything functions correctly.

When I print out the lockDir location that is trying to be obtained above, it looks like:

    C:\ENG\index\LDC\trec-ar-dar\..\temp

which is the directory my index resides in, except ..\temp does not exist. When I create the directory, it works. I suppose I could create the temp directory for every index, but I didn't know that was a requirement.

I do notice that Tomcat has a temp directory at the top, so it is probably setting some system property ("java.io.tmpdir") variable to "..\temp" that is being picked up by Lucene? The question is, what changed in RC 3 that would cause this to be used when it wasn't before?

On a side note, would it be useful to create the lock directory if it doesn't exist? If the developers think so, I can submit the patch for it.

Thanks,
Grant

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
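The workaround discussed in this thread can be sketched in plain Java. This is a minimal sketch, not Lucene code: it assumes, as the thread suggests, that the lock directory comes from the org.apache.lucene.lockdir system property and otherwise falls back to java.io.tmpdir. The class and method names (LockDirCheck, resolveLockDir, ensureExists) are hypothetical and exist only for illustration.

```java
import java.io.File;

/** Sketch: work out where lock files would go and make sure that directory exists. */
final class LockDirCheck {

    /**
     * Returns the assumed lock directory: org.apache.lucene.lockdir if set,
     * otherwise java.io.tmpdir (the fallback described in the thread).
     */
    static File resolveLockDir() {
        String dir = System.getProperty("org.apache.lucene.lockdir",
                                        System.getProperty("java.io.tmpdir"));
        // A relative value such as "..\temp" is resolved against the working directory.
        return new File(dir).getAbsoluteFile();
    }

    /** Creates the directory (and any missing parents) if it is not there yet. */
    static File ensureExists(File dir) {
        if (!dir.isDirectory() && !dir.mkdirs() && !dir.isDirectory()) {
            throw new IllegalStateException("Cannot create lock directory: " + dir);
        }
        return dir;
    }

    public static void main(String[] args) {
        File lockDir = ensureExists(resolveLockDir());
        System.out.println("Lock directory: " + lockDir);
    }
}
```

Running a check like this at application start-up, before any index is opened, would avoid the "cannot find the path specified" failure when the temp directory is missing. The property itself can be supplied on the JVM command line as -Dorg.apache.lucene.lockdir=<some existing directory>; setting it that early is probably the safest choice, since the library may read it only once.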
Can documents be appended to?
Is it possible to append to an existing document? Judging by my own tests and this thread, NO.

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED] he.org&msgNo=3971

Wouldn't it be possible to look up an individual document (based upon a uid of sorts), load the Fields off of the old one, delete it, and then add the new document? Is there any hope of doing this efficiently? This would run into problems when merging indexes: you would get duplicates if the document existed in more than one of your original indexes.

Thank you,
Will
Lucene 1.4 RC 3 issue with temp directory
Hi All,

I just upgraded to 1.4 RC 3 and am now unable to open my index. I am getting:

java.io.IOException: The system cannot find the path specified
    at java.io.WinNTFileSystem.createFileExclusively(Native Method)
    at java.io.File.createNewFile(File.java:828)
    at org.apache.lucene.store.FSDirectory$1.obtain(FSDirectory.java:297)
    at org.apache.lucene.store.Lock.obtain(Lock.java:53)
    at org.apache.lucene.store.Lock$With.run(Lock.java:108)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:95)
    at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:38)

I _have_ reindexed using the new lucene jar. I am positive the path is correct as I can open an index in the same directory with the old Lucene with no problems. I notice that the problem only occurs when I am deployed inside of Tomcat. If I run searches on the command line or through JUnit everything functions correctly.

When I print out the lockDir location that is trying to be obtained above, it looks like:

    C:\ENG\index\LDC\trec-ar-dar\..\temp

which is the directory my index resides in, except ..\temp does not exist. When I create the directory, it works. I suppose I could create the temp directory for every index, but I didn't know that was a requirement.

I do notice that Tomcat has a temp directory at the top, so it is probably setting some system property ("java.io.tmpdir") variable to "..\temp" that is being picked up by Lucene? The question is, what changed in RC 3 that would cause this to be used when it wasn't before?

On a side note, would it be useful to create the lock directory if it doesn't exist? If the developers think so, I can submit the patch for it.

Thanks,
Grant
RE: SELECTIVE Indexing
Lucene has no plug-in architecture, and does not assume you are indexing web pages, so your use of JTidy is all up to you, and independent of Lucene. Just feed Lucene the resulting text that you want to index and search.

Otis

--- Karthik N S <[EMAIL PROTECTED]> wrote:
> Hi
>
> Can I Use TIDY [as plug in ] with Lucene ...
>
> with regards
> Karthik
>
> -----Original Message-----
> From: Viparthi, Kiran (AFIS) [mailto:[EMAIL PROTECTED]
> Sent: Monday, May 17, 2004 3:27 PM
> To: 'Lucene Users List'
> Subject: RE: SELECTIVE Indexing
>
> Try using Tidy.
> Creates a Document of the html and allows you to apply xpath.
> Hope this helps.
>
> Kiran.
>
> -----Original Message-----
> From: Karthik N S [mailto:[EMAIL PROTECTED]
> Sent: 17 May 2004 11:59
> To: Lucene Users List
> Subject: SELECTIVE Indexing
>
> Hi all
>
> Can Some Body tell me How to Index CERTAIN PORTION OF THE HTML FILE Only
>
> ex:-
>
> with regards
> Karthik
SearchBlox J2EE Search Component Version 1.3 released
SearchBlox is a J2EE Search Component that delivers out-of-the-box search functionality for quick integration with your websites, applications, intranets and portals. SearchBlox uses the Lucene Search API and incorporates integrated HTTP/HTTPS and File System crawlers, support for various document formats including HTML, Word, PDF, PowerPoint and Excel, support for indexing and searching content in 17 languages and customizable search results, all controlled from a browser-based Admin Console. SearchBlox is available as a Web Archive (WAR) and is deployable on any Servlet 2.3/JSP 1.2 compliant server.

Main features in this release:

- Support for HTTPS: SearchBlox can index HTTPS content without any special configuration
- Support for Form-Based Authentication: SearchBlox spiders can index restricted content protected with form-based authentication
- Performance enhancements

SearchBlox Getting-Started Guides are available for the following servers:

JBoss - http://www.searchblox.com/gettingstarted_jboss.html
Jetty - http://www.searchblox.com/gettingstarted_jetty.html
JRun - http://www.searchblox.com/gettingstarted_jrun.html
Pramati - http://www.searchblox.com/gettingstarted_pramati.html
Resin - http://www.searchblox.com/gettingstarted_resin.html
Sun - http://www.searchblox.com/gettingstarted_sun.html
Tomcat - http://www.searchblox.com/gettingstarted_tomcat.html
Weblogic - http://www.searchblox.com/gettingstarted_weblogic.html
Websphere - http://www.searchblox.com/gettingstarted_websphere.html

The SearchBlox FREE Edition is available free of charge and can index up to 1000 documents. The software can be downloaded from http://www.searchblox.com
RE: hierarchical search
Hi Matt, thanks for your reply!

Indeed your proposed solution would work for the simple case I described. However, I failed to mention that I must be able to combine the described queries into a complex one and therefore can't make any assumptions based on the size attribute. I'm sorry if I wasted your time.

On the bright side though, I found that creating a specialized query wasn't that difficult at all. A quick scan through the WildcardQuery class and the related WildcardTermEnum gave me some valuable hints regarding this. If there is any interest I would happily contribute the code back to the community.

Regards
/Fredrik

-----Original Message-----
From: Matt Quail [mailto:[EMAIL PROTECTED]
Sent: 17 May 2004 12:19
To: Lucene Users List
Subject: Re: hierarchical search

Fredrik,

I would tackle your problem like this:

Say that the field you want to index is "path". I would turn this into *three* indexed fields:

1) multiple path prefixes ("pre-paths")
2) multiple path suffixes ("post-paths")
3) the number of "components" in the path ("path-size").

For example, for a "path" of "/foo/bar/dog/cat/fish" I would index it like this:

    doc.add(Field.Keyword("pre-paths", "/foo/bar/dog/cat/fish/"));
    doc.add(Field.Keyword("pre-paths", "/foo/bar/dog/cat/"));
    doc.add(Field.Keyword("pre-paths", "/foo/bar/dog/"));
    doc.add(Field.Keyword("pre-paths", "/foo/bar/"));
    doc.add(Field.Keyword("pre-paths", "/foo/"));
    doc.add(Field.Keyword("pre-paths", "/"));

    doc.add(Field.Keyword("post-paths", "/foo/bar/dog/cat/fish/"));
    doc.add(Field.Keyword("post-paths", "/bar/dog/cat/fish/"));
    doc.add(Field.Keyword("post-paths", "/dog/cat/fish/"));
    doc.add(Field.Keyword("post-paths", "/cat/fish/"));
    doc.add(Field.Keyword("post-paths", "/fish/"));
    doc.add(Field.Keyword("post-paths", "/"));

    doc.add(Field.Keyword("path-size", "5"));

And to do your "type 2" search for (prefix="/p1/p2/p3/" and suffix="/s1/s2/s3/") I would use a query like this:

    Query q = QueryParser.parse("pre-paths:'/p1/p2/p3/' AND post-paths:'/s1/s2/s3/' AND (path-size:7)");

The trick is to lock down the prefix and suffix, then define the amount of "slack" between the prefix and the suffix using the path-size. If you wanted the "slack" between either end to be zero or one segments, then change the size clause to something like

    (path-size:6 OR path-size:7)

I think that should work.

=Matt

Fredrik Lindner wrote:
> Hi all!
>
> I'm currently developing an application in which text searching is a
> main component. Among other things, a document will contain a field
> denoting hierarchical information. The information is stored as a string
> using the common path syntax, /x/y/z/etc/...
>
> I would like to be able to search documents based on the path field
> using two different selection criteria,
>
> 1. given a prefix path and a suffix path, select all documents for which
> the path starts with the supplied prefix, ends with the suffix and has
> "some path" in between.
>
> 2. like (1) but with the requirement that "some path" spans one and only
> one level, i.e. it defines a strict grandparent/grandchild relationship
> between the last path entry of the prefix and the first of the suffix.
>
> For example, with prefix /p1/p2/p3/ and suffix /s1/s2/s3/ and three
> documents with the path field values
>
> a) /p1/p2/p3/x/s1/s2/s3/
> b) /p1/p2/p3/y/s1/s2/s3/
> c) /p1/p2/p3/x/y/s1/s2/s3/
>
> case one should select them all whereas case two should select only a)
> and b).
>
> My problem is that I am uncertain on how to implement the second case. I
> guess I have to extend the Lucene internals somehow but I am too
> inexperienced regarding Lucene to do so directly. Any pointers, hints or
> comments are most welcome.
>
> Regards
> /Fredrik
RE: SELECTIVE Indexing
Hi

Can I Use TIDY [as plug in ] with Lucene ...

with regards
Karthik

-----Original Message-----
From: Viparthi, Kiran (AFIS) [mailto:[EMAIL PROTECTED]
Sent: Monday, May 17, 2004 3:27 PM
To: 'Lucene Users List'
Subject: RE: SELECTIVE Indexing

Try using Tidy.
Creates a Document of the html and allows you to apply xpath.
Hope this helps.

Kiran.

-----Original Message-----
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: 17 May 2004 11:59
To: Lucene Users List
Subject: SELECTIVE Indexing

Hi all

Can Some Body tell me How to Index CERTAIN PORTION OF THE HTML FILE Only

ex:-

with regards
Karthik
RE: multivalue fields
Alex McManus writes:
>
> > Maybe your fields are too long so that only part of it gets indexed (look at IndexWriter.maxFieldLength).
>
> This is interesting, I've had a look at the JavaDoc and I think I
> understand. The maximum field length describes the maximum number of unique
> terms, not the maximum number of words/tokens. Therefore, even if I have a
> 4Gb field, I could quite safely have a maxFieldLength of, say, 100k words
> which should safely handle the maximum number of unique words, rather than
> 800 million which would be needed to handle every token.
>
> Is this correct?

A short look at the source says no. maxFieldLength is handed to DocumentWriter, where one finds

    TokenStream stream = analyzer.tokenStream(fieldName, reader);
    try {
      for (Token t = stream.next(); t != null; t = stream.next()) {
        position += (t.getPositionIncrement() - 1);
        addPosition(fieldName, t.termText(), position++);
        if (++length > maxFieldLength) break;
      }
    } finally {
      stream.close();
    }

so it's the total number of tokens in the field that is limited, not the number of distinct terms.

>
> Is 100k a worrying maxFieldLength, in terms of how much memory this would
> consume?
>

Depends on the size of your documents ;-)
I use 25 without problems, but my documents are not as big (<4 tokens). I just want to make sure not to lose any text for indexing.

> Does Lucene issue a warning if this limit is exceeded during indexing (it
> would be quite worrying if it was silently discarding terms)?
>

No. I guess the idea behind this limit is that the relevant terms should occur in the first n words, and indexing the rest just increases index size.

Morus
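To make the point about the quoted loop concrete, here is a toy model of it in plain Java. It is only a sketch: whitespace splitting stands in for a real analyzer, and the class and method names (MaxFieldLengthDemo, indexedTokens) are hypothetical, not part of Lucene. The counter is advanced and checked exactly as in the quoted snippet, so the check runs after the current token has already been added.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the truncation loop quoted above; not Lucene code. */
final class MaxFieldLengthDemo {

    /**
     * Returns the tokens that would be indexed for one field. Every token
     * counts towards the limit, whether or not the same term was seen before.
     */
    static List<String> indexedTokens(String text, int maxFieldLength) {
        List<String> indexed = new ArrayList<>();
        int length = 0;
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) continue;         // empty input splits into one empty string
            indexed.add(token);                    // corresponds to addPosition(...)
            if (++length > maxFieldLength) break;  // same test as in the quoted loop
        }
        return indexed;
    }

    public static void main(String[] args) {
        // Only one distinct term precedes "rare", yet "rare" is never reached.
        System.out.println(indexedTokens("the the the the the rare", 3));
    }
}
```

In this model a field made of one word repeated many times exhausts the limit just as quickly as a field of all-different words, which is why a limit sized for the number of unique terms would silently drop text from a very large field.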
Re: hierarchical search
Fredrik,

I would tackle your problem like this:

Say that the field you want to index is "path". I would turn this into *three* indexed fields:

1) multiple path prefixes ("pre-paths")
2) multiple path suffixes ("post-paths")
3) the number of "components" in the path ("path-size").

For example, for a "path" of "/foo/bar/dog/cat/fish" I would index it like this:

    doc.add(Field.Keyword("pre-paths", "/foo/bar/dog/cat/fish/"));
    doc.add(Field.Keyword("pre-paths", "/foo/bar/dog/cat/"));
    doc.add(Field.Keyword("pre-paths", "/foo/bar/dog/"));
    doc.add(Field.Keyword("pre-paths", "/foo/bar/"));
    doc.add(Field.Keyword("pre-paths", "/foo/"));
    doc.add(Field.Keyword("pre-paths", "/"));

    doc.add(Field.Keyword("post-paths", "/foo/bar/dog/cat/fish/"));
    doc.add(Field.Keyword("post-paths", "/bar/dog/cat/fish/"));
    doc.add(Field.Keyword("post-paths", "/dog/cat/fish/"));
    doc.add(Field.Keyword("post-paths", "/cat/fish/"));
    doc.add(Field.Keyword("post-paths", "/fish/"));
    doc.add(Field.Keyword("post-paths", "/"));

    doc.add(Field.Keyword("path-size", "5"));

And to do your "type 2" search for (prefix="/p1/p2/p3/" and suffix="/s1/s2/s3/") I would use a query like this:

    Query q = QueryParser.parse("pre-paths:'/p1/p2/p3/' AND post-paths:'/s1/s2/s3/' AND (path-size:7)");

The trick is to lock down the prefix and suffix, then define the amount of "slack" between the prefix and the suffix using the path-size. If you wanted the "slack" between either end to be zero or one segments, then change the size clause to something like

    (path-size:6 OR path-size:7)

I think that should work.

=Matt

Fredrik Lindner wrote:
> Hi all!
>
> I'm currently developing an application in which text searching is a
> main component. Among other things, a document will contain a field
> denoting hierarchical information. The information is stored as a string
> using the common path syntax, /x/y/z/etc/...
>
> I would like to be able to search documents based on the path field
> using two different selection criteria,
>
> 1. given a prefix path and a suffix path, select all documents for which
> the path starts with the supplied prefix, ends with the suffix and has
> "some path" in between.
>
> 2. like (1) but with the requirement that "some path" spans one and only
> one level, i.e. it defines a strict grandparent/grandchild relationship
> between the last path entry of the prefix and the first of the suffix.
>
> For example, with prefix /p1/p2/p3/ and suffix /s1/s2/s3/ and three
> documents with the path field values
>
> a) /p1/p2/p3/x/s1/s2/s3/
> b) /p1/p2/p3/y/s1/s2/s3/
> c) /p1/p2/p3/x/y/s1/s2/s3/
>
> case one should select them all whereas case two should select only a)
> and b).
>
> My problem is that I am uncertain on how to implement the second case. I
> guess I have to extend the Lucene internals somehow but I am too
> inexperienced regarding Lucene to do so directly. Any pointers, hints or
> comments are most welcome.
>
> Regards
> /Fredrik
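The pre-paths / post-paths / path-size scheme from this message can be sketched without any Lucene dependency, which makes it easy to check the expansion and the "type 2" matching rule before wiring it into an index. Everything below is an illustration under stated assumptions: the class PathFields and its methods are hypothetical names, and the in-memory match stands in for the three-clause query (prefix AND suffix AND path-size) rather than running a real search.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the pre-paths / post-paths / path-size expansion for one path. */
final class PathFields {
    final List<String> prePaths = new ArrayList<>();
    final List<String> postPaths = new ArrayList<>();
    final int pathSize;

    PathFields(String path) {
        List<String> parts = components(path);
        pathSize = parts.size();
        // From the full path down to "/", taking n leading and n trailing components.
        for (int n = parts.size(); n >= 0; n--) {
            prePaths.add(join(parts.subList(0, n)));
            postPaths.add(join(parts.subList(parts.size() - n, parts.size())));
        }
    }

    /** Splits "/a/b/c" or "/a/b/c/" into [a, b, c], ignoring empty segments. */
    static List<String> components(String path) {
        List<String> parts = new ArrayList<>();
        for (String p : path.split("/")) {
            if (!p.isEmpty()) parts.add(p);
        }
        return parts;
    }

    /** Joins components into "/a/b/"; an empty list gives "/". */
    static String join(List<String> parts) {
        StringBuilder sb = new StringBuilder("/");
        for (String p : parts) sb.append(p).append('/');
        return sb.toString();
    }

    /** "Type 2" match: given prefix and suffix with exactly one level in between. */
    boolean matchesOneLevelBetween(String prefix, String suffix) {
        int expectedSize = components(prefix).size() + components(suffix).size() + 1;
        return prePaths.contains(join(components(prefix)))
            && postPaths.contains(join(components(suffix)))
            && pathSize == expectedSize;
    }

    public static void main(String[] args) {
        PathFields doc = new PathFields("/p1/p2/p3/x/s1/s2/s3/");
        System.out.println(doc.prePaths);
        System.out.println(doc.matchesOneLevelBetween("/p1/p2/p3/", "/s1/s2/s3/"));
    }
}
```

Each entry of prePaths and postPaths would become one keyword field value on the document, and pathSize one more. Because the size clause fixes the total number of components, a prefix and suffix that overlap inside a shorter path cannot satisfy all three conditions at once.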
RE: SELECTIVE Indexing
Try using Tidy.
Creates a Document of the html and allows you to apply xpath.
Hope this helps.

Kiran.

-----Original Message-----
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: 17 May 2004 11:59
To: Lucene Users List
Subject: SELECTIVE Indexing

Hi all

Can Some Body tell me How to Index CERTAIN PORTION OF THE HTML FILE Only

ex:-

with regards
Karthik
SELECTIVE Indexing
Hi all

Can Some Body tell me How to Index CERTAIN PORTION OF THE HTML FILE Only

ex:-

with regards
Karthik
RE: multivalue fields
> Maybe your fields are too long so that only part of it gets indexed (look at IndexWriter.maxFieldLength).

This is interesting, I've had a look at the JavaDoc and I think I understand. The maximum field length describes the maximum number of unique terms, not the maximum number of words/tokens. Therefore, even if I have a 4Gb field, I could quite safely have a maxFieldLength of, say, 100k words which should safely handle the maximum number of unique words, rather than 800 million which would be needed to handle every token.

Is this correct?

Is 100k a worrying maxFieldLength, in terms of how much memory this would consume?

Does Lucene issue a warning if this limit is exceeded during indexing (it would be quite worrying if it was silently discarding terms)?

Thanks in advance,
Alex.
hierarchical search
Hi all!

I'm currently developing an application in which text searching is a main component. Among other things, a document will contain a field denoting hierarchical information. The information is stored as a string using the common path syntax, /x/y/z/etc/...

I would like to be able to search documents based on the path field using two different selection criteria,

1. given a prefix path and a suffix path, select all documents for which the path starts with the supplied prefix, ends with the suffix and has "some path" in between.

2. like (1) but with the requirement that "some path" spans one and only one level, i.e. it defines a strict grandparent/grandchild relationship between the last path entry of the prefix and the first of the suffix.

For example, with prefix /p1/p2/p3/ and suffix /s1/s2/s3/ and three documents with the path field values

a) /p1/p2/p3/x/s1/s2/s3/
b) /p1/p2/p3/y/s1/s2/s3/
c) /p1/p2/p3/x/y/s1/s2/s3/

case one should select them all whereas case two should select only a) and b).

My problem is that I am uncertain on how to implement the second case. I guess I have to extend the Lucene internals somehow but I am too inexperienced regarding Lucene to do so directly. Any pointers, hints or comments are most welcome.

Regards
/Fredrik