RE: Lucene 1.4 RC 3 issue with temp directory

2004-05-17 Thread Eric Isakson
Your catalina.bat script guesses a value for CATALINA_HOME when the environment variable 
is not set, and then sets java.io.tmpdir based on that guess.  You could work around this 
by setting a CATALINA_HOME environment variable or by setting the system property 
org.apache.lucene.lockdir.  That still doesn't solve the problem of Lucene locks 
when java.io.tmpdir is set to a relative path that does not exist, though.
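
A minimal sketch of that property-based workaround (the lock directory location here is 
hypothetical, and the property must be set before any index is opened; under Tomcat the 
same value can also be passed as -Dorg.apache.lucene.lockdir=... in JAVA_OPTS):

// imports assumed: java.io.File, org.apache.lucene.search.IndexSearcher
File lockDir = new File("C:\\lucene\\locks");   // hypothetical, writable absolute path
lockDir.mkdirs();                               // create it if it does not exist yet
System.setProperty("org.apache.lucene.lockdir", lockDir.getAbsolutePath());
IndexSearcher searcher = new IndexSearcher("C:\\ENG\\index\\LDC\\trec-ar-dar");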

Eric

-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED] 
Sent: Monday, May 17, 2004 2:15 PM
To: [EMAIL PROTECTED]
Subject: Lucene 1.4 RC 3 issue with temp directory


Hi All,

I just upgraded to 1.4 RC 3 and am now unable to open my index.

I am getting: 
java.io.IOException: The system cannot find the path specified
at java.io.WinNTFileSystem.createFileExclusively(Native Method)
at java.io.File.createNewFile(File.java:828)
at org.apache.lucene.store.FSDirectory$1.obtain(FSDirectory.java:297)
at org.apache.lucene.store.Lock.obtain(Lock.java:53)
at org.apache.lucene.store.Lock$With.run(Lock.java:108)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:95)
at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:38)


I _have_ reindexed using the new Lucene jar.  I am positive the path is correct, as I 
can open an index in the same directory with the old Lucene without problems.  I notice 
that the problem only occurs when the application is deployed inside Tomcat.  If I run 
searches on the command line or through JUnit, everything functions correctly.  

When I print out the lockDir location for the lock being obtained above, it looks 
like C:\ENG\index\LDC\trec-ar-dar\..\temp, that is, ..\temp relative to the directory my 
index resides in, except that ..\temp does not exist.  When I create the directory, it 
works.  I suppose I could create the temp directory for every index, but I didn't know 
that was a requirement.  I do notice that Tomcat has a temp directory at its top level, 
so it is probably setting the "java.io.tmpdir" system property to "..\temp", and that 
value is being picked up by Lucene.  The question is, what changed in RC 3 that would 
cause this to be used when it wasn't before? 

On a side note, would it be useful to create the lock directory if it doesn't exist?  
If the developers think so, I can submit the patch for it.

Thanks,
Grant





Can documents be appended to?

2004-05-17 Thread wallen
Is it possible to append to an existing document?

Judging by my own tests and this thread, NO.
http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]he.org&msgNo=3971

Wouldn't it be possible to look up an individual document (based upon a uid of sorts),
load the Fields off the old one, delete it, and then add the new document?  Is there
any hope of doing this efficiently?  This would run into problems when merging indexes:
you would get duplicates if the same document existed in more than one of your original
indexes.
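
A rough sketch of that lookup/delete/re-add approach against the Lucene 1.4 API (the
index path and the "uid" field are hypothetical; it can only carry over fields that were
stored, since unstored field text cannot be read back out of the index):

// imports assumed: org.apache.lucene.index.*, org.apache.lucene.document.*,
// org.apache.lucene.analysis.standard.StandardAnalyzer, java.util.Enumeration
Term uid = new Term("uid", "doc-42");                     // hypothetical unique id term
IndexReader reader = IndexReader.open("C:\\index");       // hypothetical index path
Document updated = new Document();
TermDocs td = reader.termDocs(uid);
if (td.next()) {
    Document old = reader.document(td.doc());
    for (Enumeration e = old.fields(); e.hasMoreElements();) {
        updated.add((Field) e.nextElement());             // carry over the stored fields
    }
}
td.close();
reader.delete(uid);                                       // remove the old copy
reader.close();                                           // flushes the deletion

IndexWriter writer = new IndexWriter("C:\\index", new StandardAnalyzer(), false);
updated.add(Field.Text("contents", "the text to append"));  // ends up as an extra value on that field
writer.addDocument(updated);
writer.close();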

Thank you,
Will




Lucene 1.4 RC 3 issue with temp directory

2004-05-17 Thread Grant Ingersoll
Hi All,

I just upgraded to 1.4 RC 3 and am now unable to open my index.

I am getting: 
java.io.IOException: The system cannot find the path specified
at java.io.WinNTFileSystem.createFileExclusively(Native Method)
at java.io.File.createNewFile(File.java:828)
at org.apache.lucene.store.FSDirectory$1.obtain(FSDirectory.java:297)
at org.apache.lucene.store.Lock.obtain(Lock.java:53)
at org.apache.lucene.store.Lock$With.run(Lock.java:108)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:95)
at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:38)


I _have_ reindexed using the new Lucene jar.  I am positive the path is correct, as I 
can open an index in the same directory with the old Lucene without problems.  I notice 
that the problem only occurs when the application is deployed inside Tomcat.  If I run 
searches on the command line or through JUnit, everything functions correctly.  

When I print out the lockDir location for the lock being obtained above, it looks 
like C:\ENG\index\LDC\trec-ar-dar\..\temp, that is, ..\temp relative to the directory my 
index resides in, except that ..\temp does not exist.  When I create the directory, it 
works.  I suppose I could create the temp directory for every index, but I didn't know 
that was a requirement.  I do notice that Tomcat has a temp directory at its top level, 
so it is probably setting the "java.io.tmpdir" system property to "..\temp", and that 
value is being picked up by Lucene.  The question is, what changed in RC 3 that would 
cause this to be used when it wasn't before? 

On a side note, would it be useful to create the lock directory if it doesn't exist?  
If the developers think so, I can submit the patch for it.
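
As a stopgap, a hedged sketch of checking up front where Lucene will look for its lock
directory (the org.apache.lucene.lockdir system property if set, otherwise
java.io.tmpdir) and creating it when it is missing:

// imports assumed: java.io.File
String lockPath = System.getProperty("org.apache.lucene.lockdir",
                                     System.getProperty("java.io.tmpdir"));
// note: a relative value such as ..\temp resolves against the JVM working
// directory here, which may not be exactly what FSDirectory ends up using
File lockDir = new File(lockPath);
System.out.println("lock dir = " + lockDir.getAbsolutePath()
                   + ", exists = " + lockDir.exists());
if (!lockDir.exists() && !lockDir.mkdirs())
    System.err.println("could not create " + lockDir.getAbsolutePath());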

Thanks,
Grant





RE: SELECTIVE Indexing

2004-05-17 Thread Otis Gospodnetic
Lucene has no plug-in architecture, and does not assume you are
indexing web pages, so your use of JTidy is all up to you, and
independent of Lucene.  Just feed Lucene the resulting text that you
want to index and search.
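
A rough sketch of that flow, using JTidy's DOM parser together with the Lucene 1.4 field
API (the file name, the field names and the choice of the body element are only examples):

import java.io.FileInputStream;
import org.w3c.dom.Node;
import org.w3c.tidy.Tidy;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class SelectiveHtmlIndexer {
  public static void main(String[] args) throws Exception {
    Tidy tidy = new Tidy();
    tidy.setQuiet(true);
    tidy.setShowWarnings(false);
    // tidy the HTML into a DOM tree
    org.w3c.dom.Document dom = tidy.parseDOM(new FileInputStream("page.html"), null);
    // pick the portion to index; any other element could be selected the same way
    Node body = dom.getElementsByTagName("body").item(0);
    if (body == null) body = dom.getDocumentElement();
    Document doc = new Document();
    doc.add(Field.Keyword("filename", "page.html"));
    doc.add(Field.Text("contents", textOf(body)));
    IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
    writer.addDocument(doc);
    writer.close();
  }

  // concatenate the text nodes below the given node
  static String textOf(Node node) {
    StringBuffer sb = new StringBuffer();
    if (node.getNodeType() == Node.TEXT_NODE)
      sb.append(node.getNodeValue()).append(' ');
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling())
      sb.append(textOf(c));
    return sb.toString();
  }
}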

Otis

--- Karthik N S <[EMAIL PROTECTED]> wrote:
> Hi
> 
> Can I Use TIDY [as plug in ] with Lucene ...
> 
> 
> with regards
> Karthik
> 
> -Original Message-
> From: Viparthi, Kiran (AFIS) [mailto:[EMAIL PROTECTED]
> Sent: Monday, May 17, 2004 3:27 PM
> To: 'Lucene Users List'
> Subject: RE: SELECTIVE Indexing
> 
> 
> 
> Try using Tidy.
> It creates a DOM Document from the HTML and allows you to apply XPath.
> Hope this helps.
> 
> Kiran.
> 
> -Original Message-
> From: Karthik N S [mailto:[EMAIL PROTECTED]
> Sent: 17 May 2004 11:59
> To: Lucene Users List
> Subject: SELECTIVE Indexing
> 
> 
> 
> Hi all
> 
>    Can somebody tell me how to index only a CERTAIN PORTION of an HTML file?
> 
>ex:-
> 
>
> 
>  
> 
> 
> with regards
> Karthik
> 
> 
> 
> 



SearchBlox J2EE Search Component Version 1.3 released

2004-05-17 Thread Robert Selvaraj
SearchBlox is a J2EE Search Component that delivers out-of-the-box 
search functionality for quick integration with your websites, 
applications, intranets and portals. SearchBlox uses the Lucene Search 
API and incorporates integrated HTTP/HTTPS and File System crawlers, 
support for various document formats including HTML, Word, PDF, 
PowerPoint and Excel, support for indexing and searching content in 17 
languages and customizable search results, all controlled from a 
browser-based Admin Console. SearchBlox is available as a Web Archive 
(WAR) and is deployable on any Servlet 2.3/JSP 1.2 compliant server.

Main features in this release:
==
- Support for HTTPS: SearchBlox can index HTTPS content without any 
special configuration
- Support for Form-Based Authentication: SearchBlox spiders can index 
restricted content protected with form-based authentication
- Performance enhancements

SearchBlox Getting-Started Guides are available for the following servers:
JBoss - http://www.searchblox.com/gettingstarted_jboss.html
Jetty - http://www.searchblox.com/gettingstarted_jetty.html
JRun - http://www.searchblox.com/gettingstarted_jrun.html
Pramati - http://www.searchblox.com/gettingstarted_pramati.html
Resin - http://www.searchblox.com/gettingstarted_resin.html
Sun - http://www.searchblox.com/gettingstarted_sun.html
Tomcat - http://www.searchblox.com/gettingstarted_tomcat.html
Weblogic - http://www.searchblox.com/gettingstarted_weblogic.html
Websphere - http://www.searchblox.com/gettingstarted_websphere.html
The SearchBlox FREE Edition is available free of charge and can index up 
to 1000 documents.

The software can be downloaded from http://www.searchblox.com




RE: hierarchical search

2004-05-17 Thread Fredrik Lindner
Hi Matt, thanks for your reply!

Indeed, your proposed solution would work for the simple case I
described; however, I failed to mention that I must be able to combine
the described queries into a more complex one and therefore can't make
any assumptions based on the size attribute. I'm sorry if I wasted your time.

On the bright side, though, I found that creating a specialized query
wasn't that difficult at all. A quick scan through the WildcardQuery
class and the related WildcardTermEnum gave me some valuable hints
regarding this. 

If there is any interest I would happily contribute the code back to the
community.

Regards
/Fredrik



-Original Message-
From: Matt Quail [mailto:[EMAIL PROTECTED] 
Sent: den 17 maj 2004 12:19
To: Lucene Users List
Subject: Re: hierarchical search

Fredrik,

I would tackle your problem like this:

Say that the field you want to index is "path". I would turn this into
*three* indexed fields:
1) multiple path prefixes ("pre-paths")
2) multiple path suffixes ("post-paths")
3) the number of "components" in the path ("path-size").

For example, for a "path" of "/foo/bar/dog/cat/fish" I would index it
like this:

doc.add(Field.Keyword("pre-paths", "/foo/bar/dog/cat/fish/"));
doc.add(Field.Keyword("pre-paths", "/foo/bar/dog/cat/"));
doc.add(Field.Keyword("pre-paths", "/foo/bar/dog/"));
doc.add(Field.Keyword("pre-paths", "/foo/bar/"));
doc.add(Field.Keyword("pre-paths", "/foo/"));
doc.add(Field.Keyword("pre-paths", "/"));
doc.add(Field.Keyword("post-paths", "/foo/bar/dog/cat/fish/"));
doc.add(Field.Keyword("post-paths", "/bar/dog/cat/fish/"));
doc.add(Field.Keyword("post-paths", "/dog/cat/fish/"));
doc.add(Field.Keyword("post-paths", "/cat/fish/"));
doc.add(Field.Keyword("post-paths", "/fish/"));
doc.add(Field.Keyword("post-paths", "/"));
doc.add(Field.Keyword("path-size", "5"));

And to do your "type 2" search for (prefix="/p1/p2/p3/" and
suffix="/s1/s2/s3/") I would use a query like this:

Query q = QueryParser.parse("pre-paths:'/p1/p2/p3/' AND
post-paths:'/s1/s2/s3/' AND path-size:7");

The trick is to lock down the prefix and suffix, then define the amount
of "slack" between the prefix and the suffix using the path-size. If you
wanted the "slack" between either end to be zero or one segments, then
change the size clause to something like (path-size:6 OR path-size:7)


I think that should work.

=Matt


Fredrik Lindner wrote:

> Hi all!
> 
> I'm currently developing an application in which text searching is a
> main component. Among other things, a document will contain a field
> denoting hierarchical information. The information is stored as a string
> using the common path syntax, /x/y/z/etc/...
> 
> I would like to be able to search documents based on the path field
> using two different selection criteria:
> 
> 1. given a prefix path and a suffix path, select all documents for which
> the path starts with the supplied prefix, ends with the suffix and has
> "some path" in between.
> 
> 2. like (1), but with the requirement that "some path" spans one and
> only one level, i.e. it defines a strict grandparent/grandchild
> relationship between the last path entry of the prefix and the first of
> the suffix.
> 
> For example, with prefix /p1/p2/p3/ and suffix /s1/s2/s3/ and three
> documents with the path field values
> 
> a) /p1/p2/p3/x/s1/s2/s3/
> b) /p1/p2/p3/y/s1/s2/s3/
> c) /p1/p2/p3/x/y/s1/s2/s3/
> 
> case one should select them all whereas case two should select only a)
> and b).
> 
> My problem is that I am uncertain how to implement the second case. I
> guess I have to extend the Lucene internals somehow, but I am too
> inexperienced with Lucene to do so directly. Any pointers, hints or
> comments are most welcome.
> 
> Regards
> /Fredrik
> 
> 



RE: SELECTIVE Indexing

2004-05-17 Thread Karthik N S
Hi

Can I Use TIDY [as plug in ] with Lucene ...


with regards
Karthik

-Original Message-
From: Viparthi, Kiran (AFIS) [mailto:[EMAIL PROTECTED]
Sent: Monday, May 17, 2004 3:27 PM
To: 'Lucene Users List'
Subject: RE: SELECTIVE Indexing



Try using Tidy.
> It creates a DOM Document from the HTML and allows you to apply XPath.
Hope this helps.

Kiran.

-Original Message-
From: Karthik N S [mailto:[EMAIL PROTECTED]
Sent: 17 May 2004 11:59
To: Lucene Users List
Subject: SELECTIVE Indexing



Hi all

   Can somebody tell me how to index only a CERTAIN PORTION of an HTML file?

   ex:-

   

 


with regards
Karthik







RE: multivalue fields

2004-05-17 Thread Morus Walter
Alex McManus writes:
> 
> > Maybe your fields are too long so that only part of it gets indexed (look
> > at IndexWriter.maxFieldLength).
> 
> This is interesting, I've had a look at the JavaDoc and I think I
> understand. The maximum field length describes the maximum number of unique
> terms, not the maximum number of words/tokens. Therefore, even if I have a
> 4Gb field, I could quite safely have a maxFieldLength of, say, 100k words
> which should safely handle the maximum number of unique words, rather than
> 800 million which would be needed to handle every token.
> 
> Is this correct? 

A short look at the source says no.

maxFieldLength is handed to DocumentWriter where one finds

  TokenStream stream = analyzer.tokenStream(fieldName, reader);
  try {
for (Token t = stream.next(); t != null; t = stream.next()) {
  position += (t.getPositionIncrement() - 1);
  addPosition(fieldName, t.termText(), position++);
  if (++length > maxFieldLength) break;
}
  } finally {
stream.close();
  }

so it's the total number of tokens added, not the number of unique terms.

> 
> Is 100k a worrying maxFieldLength, in terms of how much memory this would
> consume?
> 
Depends on the size of your documents ;-)
I use 25 without problems, but my documents are not as big (<4
tokens). I just want to make sure not to lose any text for indexing.

> Does Lucene issue a warning if this limit is exceeded during indexing (it
> would be quite worrying if it was silently discarding terms)?
> 
No.
I guess the idea behind this limit is that the relevant terms should occur
in the first n words, and indexing the rest just increases the index size.
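
A small sketch of raising the limit (the value here is arbitrary; maxFieldLength is the
public IndexWriter field the JavaDoc refers to, and in 1.4 it defaults to 10,000 terms
per field):

// imports assumed: org.apache.lucene.index.IndexWriter, org.apache.lucene.document.*,
// org.apache.lucene.analysis.standard.StandardAnalyzer, java.io.FileReader
IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
writer.maxFieldLength = 100000;                 // e.g. 100k, up from the default 10,000
Document doc = new Document();
doc.add(Field.Text("contents", new FileReader("bigdoc.txt")));  // unstored, tokenized field
writer.addDocument(doc);
writer.close();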

Morus




Re: hierarchical search

2004-05-17 Thread Matt Quail
Fredrik,
I would tackle your problem like this:
Say that the field you want to index is "path". I would turn this into
*three* indexed fields:
1) multiple path prefixes ("pre-paths")
2) multiple path suffixes ("post-paths")
3) the number of "components" in the path ("path-size").
For example, for a "path" of "/foo/bar/dog/cat/fish" I would index it
like this:
doc.add(Field.Keyword("pre-paths", "/foo/bar/dog/cat/fish/"));
doc.add(Field.Keyword("pre-paths", "/foo/bar/dog/cat/"));
doc.add(Field.Keyword("pre-paths", "/foo/bar/dog/"));
doc.add(Field.Keyword("pre-paths", "/foo/bar/"));
doc.add(Field.Keyword("pre-paths", "/foo/"));
doc.add(Field.Keyword("pre-paths", "/"));
doc.add(Field.Keyword("post-paths", "/foo/bar/dog/cat/fish/"));
doc.add(Field.Keyword("post-paths", "/bar/dog/cat/fish/"));
doc.add(Field.Keyword("post-paths", "/dog/cat/fish/"));
doc.add(Field.Keyword("post-paths", "/cat/fish/"));
doc.add(Field.Keyword("post-paths", "/fish/"));
doc.add(Field.Keyword("post-paths", "/"));
doc.add(Field.Keyword("path-size", "5"));
And to do your "type 2" search for (prefix="/p1/p2/p3/" and
suffix="/s1/s2/s3/") I would use a query like this:
Query q = QueryParser.parse("pre-paths:'/p1/p2/p3/' AND
post-paths:'/s1/s2/s3/' AND path-size:7");
The trick is to lock down the prefix and suffix, then define the amount
of "slack" between the prefix and the suffix using the path-size. If you
wanted the "slack" between either end to be zero or one segments, then
change the size clause to something like (path-size:6 OR path-size:7)
I think that should work.
=Matt
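
For reference, a hypothetical helper that builds the three fields described above from a
path string (a sketch against the Lucene 1.4 Field.Keyword API, assuming paths in the
/a/b/c form used in the examples):

import java.util.StringTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class PathFields {
  // add the pre-paths, post-paths and path-size fields for one document
  public static void addPathFields(Document doc, String path) {
    StringTokenizer st = new StringTokenizer(path, "/");
    int n = st.countTokens();
    String[] parts = new String[n];
    for (int i = 0; i < n; i++) parts[i] = st.nextToken();

    // pre-paths: "/", "/a/", "/a/b/", ... up to the full path
    String pre = "/";
    doc.add(Field.Keyword("pre-paths", pre));
    for (int i = 0; i < n; i++) {
      pre = pre + parts[i] + "/";
      doc.add(Field.Keyword("pre-paths", pre));
    }
    // post-paths: "/", "/c/", "/b/c/", ... up to the full path
    String post = "/";
    doc.add(Field.Keyword("post-paths", post));
    for (int i = n - 1; i >= 0; i--) {
      post = "/" + parts[i] + post;
      doc.add(Field.Keyword("post-paths", post));
    }
    doc.add(Field.Keyword("path-size", Integer.toString(n)));
  }
}

Calling addPathFields(doc, "/p1/p2/p3/x/s1/s2/s3") then produces exactly the pre-paths,
post-paths and path-size values that the type-2 query above matches with path-size:7.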
Fredrik Lindner wrote:
Hi all!
I'm currently developing an application in which text searching is a
main component. Among other things, a document will contain a field
denoting hierarchical information. The information is stored as a string
using the common path syntax, /x/y/z/etc/...
I would like to be able to search documents based on the path field
using two different selection criteria:
1. given a prefix path and a suffix path, select all documents for which
the path starts with the supplied prefix, ends with the suffix and has
"some path" in between.
2. like (1), but with the requirement that "some path" spans one and
only one level, i.e. it defines a strict grandparent/grandchild relationship
between the last path entry of the prefix and the first of the suffix.
For example, with prefix /p1/p2/p3/ and suffix /s1/s2/s3/ and three
documents with the path field values
a) /p1/p2/p3/x/s1/s2/s3/
b) /p1/p2/p3/y/s1/s2/s3/
c) /p1/p2/p3/x/y/s1/s2/s3/
case one should select them all whereas case two should select only a)
and b).
My problem is that I am uncertain how to implement the second case. I
guess I have to extend the Lucene internals somehow, but I am too
inexperienced with Lucene to do so directly. Any pointers, hints or
comments are most welcome.
Regards
/Fredrik


RE: SELECTIVE Indexing

2004-05-17 Thread Viparthi, Kiran (AFIS)

Try using Tidy.
It creates a DOM Document from the HTML and allows you to apply XPath.
Hope this helps.

Kiran.

-Original Message-
From: Karthik N S [mailto:[EMAIL PROTECTED] 
Sent: 17 May 2004 11:59
To: Lucene Users List
Subject: SELECTIVE Indexing



Hi all

   Can somebody tell me how to index only a CERTAIN PORTION of an HTML file?

   ex:-

   

 


with regards
Karthik







SELECTIVE Indexing

2004-05-17 Thread Karthik N S

Hi all

   Can somebody tell me how to index only a CERTAIN PORTION of an HTML file?

   ex:-

   

 


with regards
Karthik







RE: multivalue fields

2004-05-17 Thread Alex McManus

> Maybe your fields are too long so that only part of it gets indexed (look
> at IndexWriter.maxFieldLength).

This is interesting, I've had a look at the JavaDoc and I think I
understand. The maximum field length describes the maximum number of unique
terms, not the maximum number of words/tokens. Therefore, even if I have a
4Gb field, I could quite safely have a maxFieldLength of, say, 100k words
which should safely handle the maximum number of unique words, rather than
800 million which would be needed to handle every token.

Is this correct? 

Is 100k a worrying maxFieldLength, in terms of how much memory this would
consume?

Does Lucene issue a warning if this limit is exceeded during indexing (it
would be quite worrying if it was silently discarding terms)?

Thanks in advance,

Alex.





hierarchical search

2004-05-17 Thread Fredrik Lindner
Hi all!

I'm currently developing an application in which text searching is a
main component. Among other things, a document will contain a field
denoting hierarchical information. The information is stored as a string
using the common path syntax, /x/y/z/etc/...

I would like to be able to search documents based on the path field
using two different selection criteria:

1. given a prefix path and a suffix path, select all documents for which
the path starts with the supplied prefix, ends with the suffix and has
"some path" in between.

2. like (1), but with the requirement that "some path" spans one and
only one level, i.e. it defines a strict grandparent/grandchild relationship
between the last path entry of the prefix and the first of the suffix.

For example, with prefix /p1/p2/p3/ and suffix /s1/s2/s3/ and three
documents with the path field values

a) /p1/p2/p3/x/s1/s2/s3/
b) /p1/p2/p3/y/s1/s2/s3/
c) /p1/p2/p3/x/y/s1/s2/s3/

case one should select them all whereas case two should select only a)
and b).

My problem is that I am uncertain how to implement the second case. I
guess I have to extend the Lucene internals somehow, but I am too
inexperienced with Lucene to do so directly. Any pointers, hints or
comments are most welcome.

Regards
/Fredrik

