Re: Search Performance

2005-02-19 Thread sergiu gordea
Michael Celona wrote:
My index is changing in real time constantly... in this case I guess this
will not work for me. Any suggestions?
 

Using a singleton pattern for your index searcher makes sense anyway
... I don't think that you change
the index after each search. The computing effort is insignificant, but
the gain is not.

How often do you optimize your index?
Run your jmeter tests before and after optimization!
What is the value of your merge factor?
Try to use 2 or 3 and run the tests again.
I think it will be useful for the Lucene community if you provide the results
of your tests.

Best,
 Sergiu
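
A minimal sketch of the tuning being discussed, assuming the Lucene 1.4-era
API (mergeFactor was still a public field on IndexWriter back then); the
index path is a placeholder:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TuneIndexSketch {
    public static void main(String[] args) throws Exception {
        // false = open the existing index instead of creating a new one
        IndexWriter writer = new IndexWriter("/path/to/index",
                new StandardAnalyzer(), false);
        writer.mergeFactor = 2; // low merge factor: slower indexing, fewer segments to search
        writer.optimize();      // collapse the index to a single segment before re-running jmeter
        writer.close();
    }
}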
Michael
-Original Message-
From: David Townsend [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 18, 2005 11:50 AM
To: Lucene Users List
Subject: RE: Search Performance

IndexSearchers are thread safe, so you can use the same object on multiple
requests.  If the index is static and not constantly updating, just keep one
IndexSearcher for the life of the app.  If the index changes and you need
that instantly reflected in the results, you need to check if the index has
changed; if it has, create a new cached IndexSearcher.  To check for changes
you'll need to monitor the version number of the index, obtained via
IndexReader.getCurrentVersion(indexName).
David
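
A sketch of the cached searcher David describes, assuming the Lucene 1.4-era
API; the index path is a placeholder:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class SearcherCache {
    private static final String INDEX_PATH = "/path/to/index"; // placeholder
    private static long version = -1;
    private static IndexSearcher searcher;

    // Returns a shared IndexSearcher, reopened only when the index version changes.
    public static synchronized IndexSearcher getSearcher() throws java.io.IOException {
        long current = IndexReader.getCurrentVersion(INDEX_PATH);
        if (searcher == null || current != version) {
            if (searcher != null) {
                searcher.close(); // note: may break searches still running on the old searcher
            }
            searcher = new IndexSearcher(INDEX_PATH);
            version = current;
        }
        return searcher;
    }
}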
-Original Message-
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Sent: 18 February 2005 16:15
To: Lucene Users List
Subject: Re: Search Performance
Try a singleton pattern or a static field.
Stefan
Michael Celona wrote:
 

I am creating new IndexSearchers... how do I cache my IndexSearcher...
Michael
-Original Message-
From: David Townsend [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 18, 2005 11:00 AM
To: Lucene Users List
Subject: RE: Search Performance

Are you creating new IndexSearchers or IndexReaders on each search?  Caching
your IndexSearchers has a dramatic effect on speed.
David Townsend
-Original Message-
From: Michael Celona [mailto:[EMAIL PROTECTED]
Sent: 18 February 2005 15:55
To: Lucene Users List
Subject: Search Performance
What is single-handedly the best way to improve search performance?  I have
an index in the 2G range stored on the local file system of the searcher.
Under a load test of 5 simultaneous users my average search time is ~4700
ms.  Under a load test of 10 simultaneous users my average search time is
~1 ms.  I have given the JVM 2G of memory and am using dual 3GHz
Xeons.  Any ideas?


Michael


Re: Starts With x and Ends With x Queries

2005-02-08 Thread sergiu gordea
Hi Erik,
I'm not changing any functionality.  WildcardQuery will still support 
leading wildcard characters, QueryParser will still disallow them.  
All I'm going to change is the javadoc that makes it sound like 
WildcardQuery does not support leading wildcard characters.

Erik
From what I was reading in the mailing list, there are more Lucene users
who would like to be able to construct suffix queries.
They are very useful for the German language, because it has many long
compound words created by concatenating other simple words.
This is one of the requirements of our system. Therefore I needed to
patch Lucene to make QueryParser allow suffix queries.

Now I will need to update the Lucene library to the latest version, and I
will need to patch it again.
Do you think it would be possible in the future to have a field in
QueryParser, boolean ALLOW_SUFFIX_QUERIES?

Thanks for understanding,
 Sergiu
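
In the meantime a suffix query can at least be built programmatically, since
WildcardQuery itself accepts a leading wildcard and only QueryParser rejects
it. A sketch (the field name and suffix are made up):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class SuffixQuerySketch {
    // Builds a query for terms ending in the given suffix, e.g. "ung" for
    // German compounds. A leading wildcard scans the whole term dictionary,
    // so expect this to be slow on large indexes.
    public static Query suffixQuery(String field, String suffix) {
        return new WildcardQuery(new Term(field, "*" + suffix));
    }
}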



Re: Starts With x and Ends With x Queries

2005-02-08 Thread sergiu gordea
Chong, Herb wrote:
commercial text analytics tools including search engines usually
tokenize with splitting of compound words for German.
Herb
That might be true ... but our application is not a text analysis
application,
and it is also not intended to be a search engine. We use Lucene just to
index our pages.

 Best,
 Sergiu

-Original Message-
From: sergiu gordea [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 08, 2005 10:38 AM
To: Lucene Users List
Subject: Re: Starts With x and Ends With x Queries

From what I was reading in the mailing list, there are more Lucene users
who would like to be able to construct suffix queries.
They are very useful for the German language, because it has many long
compound words created by concatenating other simple words.
This is one of the requirements of our system. Therefore I needed to
patch Lucene to make QueryParser allow suffix queries.



Re: Starts With x and Ends With x Queries

2005-02-08 Thread sergiu gordea
Erik Hatcher wrote:
On Feb 8, 2005, at 10:37 AM, sergiu gordea wrote:
Hi Erik,
I'm not changing any functionality.  WildcardQuery will still 
support leading wildcard characters, QueryParser will still disallow 
them.  All I'm going to change is the javadoc that makes it sound 
like WildcardQuery does not support leading wildcard characters.

Erik

From what I was reading in the mailing list, there are more Lucene
users who would like to be able to construct suffix queries.
They are very useful for the German language, because it has many long
compound words created by concatenating other simple words.
This is one of the requirements of our system. Therefore I needed to
patch Lucene to make QueryParser allow suffix queries.

Now I will need to update the Lucene library to the latest version, and I
will need to patch it again.
Do you think it would be possible in the future to have a field in
QueryParser, boolean ALLOW_SUFFIX_QUERIES?

I have no objections to that type of switch.  Please submit a patch to
QueryParser.jj that implements this as an option with the default to
disallow suffix queries, along with a test case, and I'd be happy to
apply it.
I'm pleased to hear that. I'm not very skilled in writing .jj files but
I will try to do it in the next days.

Sergiu
Erik


Re: HELP! JIT error when searching... Lucene 1.3 on Java 1.1

2005-02-08 Thread sergiu gordea
Karl Koch wrote:
When I switch to Java 1.2, I can also not run it. Also I cannot index
anything. I have no idea why...
Can somebody help me?
 

I think you are a pioneer in this domain :). I'm not very familiar with
the Lucene source code, but I think it uses the
advantages of Java 1.3 and 1.4.
Probably the best thing you can do is to get the sources of the old
versions of Lucene and to try to compile them with
a Java 1.2 compiler.

Best,
Sergiu
Karl
 

Hello all,
I have heard that Lucene 1.3 Final should run under Java 1.1 (I need that
because I want to run a search with a PDA using Java 1.1).
However, when I run my code, I get the following error:
--
A nonfatal internal JIT (3.10.107(x)) error 'chgTarg: Conditional' has
occurred in : 
 'org/apache/lucene/store/FSDirectory.getDirectory
(Ljava/io/File;Z)Lorg/apache/lucene/store/FSDirectory;': Interpreting
method.
 Please report this error in detail to
http://java.sun.com/cgi-bin/bugreport.cgi

Exception occured in StandardSearch:search(String, String[], String)!
java.lang.IllegalMonitorStateException: current thread not owner
at org.apache.lucene.store.FSDirectory.makeLock(FSDirectory.java:312)
at org.apache.lucene.index.IndexReader.open(IndexReader.java, Compiled
Code)
--
The error does not occur when I run it under Java 1.4.
What do I do wrong and what do I need to change in order to make it work?
It must be my code. Here is the code relevant to this error (the search
method):
public static Result search(String queryString, String[] searchFields, 
 String indexDirectory) {
 // create access to index
 StandardAnalyzer analyser = new StandardAnalyzer();
 Hits hits = null;
 Result result = null;
 try {
 fsDirectory = 
FSDirectory.getDirectory(StandardSearcher.indexDirectory, false);
 IndexSearcher searcher = new IndexSearcher(fsDirectory);
 ...
}

What is wrong here?
Best Regards,
Karl


Re: Starts With x and Ends With x Queries

2005-02-06 Thread sergiu gordea
Hi Erik,

In order to prevent extremely slow WildcardQueries, a Wildcard term
must not start with one of the wildcards '*' or '?'.

I don't read that as saying you cannot use an initial wildcard
character, but rather as: if you use a leading wildcard character you
risk performance issues.  I'm going to change "must" to "should".
Will this change be available in the next release of Lucene? How do you
plan to implement this? Will it be available as an attribute of
QueryParser?

 Best,
 Sergiu


Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea
Karl Koch wrote:
Hello Sergiu,
thank you for your help so far. I appreciate it.
I am working with Java 1.1 which does not include regular expressions.
 

Why are you using Java 1.1? Are you so limited in resources?
What operating system do you use?
I assume that you just need to index the HTML files, and you need an
html2txt conversion.
If an external converter is a solution for you, you can use
Runtime.getRuntime().exec(...) to run the converter that will extract the
information from your HTML files
and generate a .txt file. Then you can use a reader to index the txt.

As I told you before, the best solution depends on your constraints 
(time, effort, hardware, performance) and requirements :)

 Best,
 Sergiu
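
A sketch of that external-converter approach; the html2txt command and the
file names are placeholders, and Runtime.exec dates back to Java 1.0, so it
fits the Java 1.1 constraint:

import java.io.BufferedReader;
import java.io.FileReader;

public class ExternalConvertSketch {
    public static void main(String[] args) throws Exception {
        // Run a hypothetical external converter: html2txt page.html page.txt
        Process p = Runtime.getRuntime().exec(
                new String[] { "html2txt", "page.html", "page.txt" });
        p.waitFor(); // wait until the conversion is done

        // Read the generated text file; each line can be fed to the indexer
        BufferedReader reader = new BufferedReader(new FileReader("page.txt"));
        String line;
        while ((line = reader.readLine()) != null) {
            // ... add the line to a Lucene Document here
        }
        reader.close();
    }
}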
Your turn ;-)
Karl

Karl Koch wrote:
I am in control of the html, which means it is well formatted HTML. I use
only HTML files which I have transformed from XML. No external HTML (e.g.
the web).
Are there any very-short solutions for that?

if you are using only correctly formatted HTML pages and you are in control
of these pages,
you can use a regular expression to remove the tags,

something like
replaceAll("<[^>]*>", "");
This is the idea behind the operation. If you search on Google you
will find a more robust
regular expression.

Using a simple regular expression will be a very cheap solution, but one
that can cause you a lot of problems in the future.

It's up to you to use it.
Best,
Sergiu

Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea
Karl Koch wrote:
Unfortunately I am faithful ;-). Just for practical reasons I want to do that
in a single class or even method called by another part in my Java
application. It should also run on Java 1.1 and it should be small and
simple. As I said before, I am in control of the HTML and it will be well
formatted, because I generate it from XML using XSLT.
 

Why don't you get the data directly from the XML files?
You can use a SAX parser, ... but I think it will require Java 1.3 or at
least 1.2.2.

Best,
 Sergiu
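
A minimal sketch of that SAX suggestion, assuming a runtime with JAXP and an
XML parser on the classpath (hence the Java 1.2+ caveat); the file name is a
placeholder:

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class XmlTextExtractor extends DefaultHandler {
    private final StringBuffer text = new StringBuffer();

    // Collect all character data, ignoring the element structure entirely.
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    public static void main(String[] args) throws Exception {
        XmlTextExtractor handler = new XmlTextExtractor();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new java.io.File("page.xml"), handler); // placeholder file
        System.out.println(handler.text.toString()); // plain text, ready to index
    }
}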
Karl

If you are not married to Java:
http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm
Otis

--- sergiu gordea [EMAIL PROTECTED] wrote:
Karl Koch wrote:
I am in control of the html, which means it is well formatted HTML. I use
only HTML files which I have transformed from XML. No external HTML (e.g.
the web).
Are there any very-short solutions for that?

if you are using only correctly formatted HTML pages and you are in control
of these pages,
you can use a regular expression to remove the tags,

something like
replaceAll("<[^>]*>", "");
This is the idea behind the operation. If you search on Google you
will find a more robust
regular expression.

Using a simple regular expression will be a very cheap solution, but one
that can cause you a lot of problems in the future.
It's up to you to use it.
Best,
Sergiu
Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea
Karl Koch wrote:
I apologise in advance if some of my writing here has been said before.
The last three answers to my question have been suggesting pattern matching
solutions and Swing. Pattern matching was introduced in Java 1.4 and Swing
is something I cannot use since I work with Java 1.1 on a PDA.
 

I see,
in this case you can read your HTML file line by line and then write
something like this:

String line;
int startPos, endPos;
StringBuffer text = new StringBuffer();
while ((line = reader.readLine()) != null) {
    startPos = line.indexOf(">");          // end of a tag
    endPos = line.indexOf("<", startPos);  // start of the next tag
    if (startPos >= 0 && endPos > startPos)
        text.append(line.substring(startPos + 1, endPos));
}
This is just sample code that should work if you have just one tag per
line in the HTML file.
It can be a starting point for you.

 Hope it helps,
Best,
Sergiu
I am wondering if somebody knows a piece of simple sourcecode with low
requirements which runs under this tight specification.
Thank you all,
Karl
 

No one has yet mentioned using ParserDelegator and ParserCallback that 
are part of HTMLEditorKit in Swing.  I have been successfully using 
these classes to parse out the text of an HTML file.  You just need to 
extend HTMLEditorKit.ParserCallback and override the various methods 
that are called when different tags are encountered.

On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
   

Three HTML parsers (Lucene web application
demo, CyberNeko HTML Parser, JTidy) are mentioned in
Lucene FAQ
1.3.27. Which is the best? Can it filter tags that are
auto-created by MS Word's 'Save As HTML files' function?

--
Bill Tschumy
Otherwise -- Austin, TX
http://www.otherwise.com


Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
 Hi Karl,
I already submitted a piece of code that removes the html tags.
Search for my previous answer in this thread.
 Best,
  Sergiu
Karl Koch wrote:
Hello,
I have been following this thread and have another question.

Is there a piece of sourcecode (which is preferably very short and simple
(KISS)) which allows to remove all HTML tags from HTML content? HTML 3.2
would be enough... also no frames, CSS, etc.

I do not need to have the HTML structure tree or any other structure but
need a facility to clean up HTML into its normal underlying content before
indexing that content as a whole.
Karl
 

I think that depends on what you want to do.  The Lucene demo parser does
simple mapping of HTML files into Lucene Documents; it does not give you a
parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses the
same API; will likely become part of Xerces), and so maps an HTML document
into a full DOM that you can manipulate easily for a wide range of
purposes.  I haven't used JTidy at an API level and so don't know it as
well -- based on its UI, it appears to be focused primarily on HTML
validation and error detection/correction.
I use CyberNeko for a range of operations on HTML documents that go beyond
indexing them in Lucene, and really like it.  It has been robust for me so
far.
Chuck
  -Original Message-
  From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
  Sent: Tuesday, February 01, 2005 1:15 AM
  To: lucene-user@jakarta.apache.org
  Subject: which HTML parser is better?

  Three HTML parsers (Lucene web application
  demo, CyberNeko HTML Parser, JTidy) are mentioned in
  Lucene FAQ
  1.3.27. Which is the best? Can it filter tags that are
  auto-created by MS Word's 'Save As HTML files' function?


Re: *term

2005-02-02 Thread sergiu gordea
Tim Lebedkov (UPK) wrote:
Hi,
is there a way to make QueryParser accept *term?
 

yes, if you apply a patch to the Lucene sources.
Search for "*term search" in the Lucene archive.
Best,
 Sergiu
thank you
--Tim


Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
Karl Koch wrote:
Hi,
yes, but the library you are using is quite big. I was thinking that 5kB
of code could actually do that. That sourceforge project is doing much more
than that, but I do not need it.
 

you need just the htmlparser.jar 200k.
... you know ... the functionality is strongly correlated with the size.
 You can use 3 lines of code with a good regular expression to eliminate
the html tags,
but this won't give you any guarantee that the text from badly
formatted html files will be
correctly extracted...

 Best,
 Sergiu
Karl

 Hi Karl,
I already submitted a piece of code that removes the html tags.
Search for my previous answer in this thread.
 Best,
  Sergiu
Karl Koch wrote:
Hello,
I have been following this thread and have another question.

Is there a piece of sourcecode (which is preferably very short and simple
(KISS)) which allows to remove all HTML tags from HTML content? HTML 3.2
would be enough... also no frames, CSS, etc.

I do not need to have the HTML structure tree or any other structure but
need a facility to clean up HTML into its normal underlying content before
indexing that content as a whole.
Karl

I think that depends on what you want to do.  The Lucene demo parser does
simple mapping of HTML files into Lucene Documents; it does not give you a
parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses the
same API; will likely become part of Xerces), and so maps an HTML document
into a full DOM that you can manipulate easily for a wide range of
purposes.  I haven't used JTidy at an API level and so don't know it as
well -- based on its UI, it appears to be focused primarily on HTML
validation and error detection/correction.
I use CyberNeko for a range of operations on HTML documents that go beyond
indexing them in Lucene, and really like it.  It has been robust for me so
far.
Chuck

 -Original Message-
 From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, February 01, 2005 1:15 AM
 To: lucene-user@jakarta.apache.org
 Subject: which HTML parser is better?

 Three HTML parsers (Lucene web application
 demo, CyberNeko HTML Parser, JTidy) are mentioned in
 Lucene FAQ
 1.3.27. Which is the best? Can it filter tags that are
 auto-created by MS Word's 'Save As HTML files' function?


Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
Karl Koch wrote:
I am in control of the html, which means it is well formatted HTML. I use
only HTML files which I have transformed from XML. No external HTML (e.g.
the web).
Are there any very-short solutions for that?
 

if you are using only correctly formatted HTML pages and you are in control
of these pages,
you can use a regular expression to remove the tags.

something like
replaceAll("<[^>]*>", "");
This is the idea behind the operation. If you search on Google you
will find a more robust
regular expression.

Using a simple regular expression will be a very cheap solution, but one
that can cause you a lot of problems in the future.

It's up to you to use it.
Best,
Sergiu
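
For completeness, the "3 lines of code" variant would look roughly like this;
note that String.replaceAll needs Java 1.4, which is exactly Karl's problem
on a 1.1 PDA:

public class TagStripSketch {
    // Strips anything that looks like a tag. Cheap, and fragile on malformed
    // HTML (comments, scripts, '<' inside attribute values...).
    public static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}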
Karl

Karl Koch wrote:
Hi,
yes, but the library you are using is quite big. I was thinking that 5kB
of code could actually do that. That sourceforge project is doing much more
than that, but I do not need it.

you need just the htmlparser.jar 200k.
... you know ... the functionality is strongly correlated with the size.
 You can use 3 lines of code with a good regular expression to eliminate
the html tags,
but this won't give you any guarantee that the text from badly
formatted html files will be
correctly extracted...

 Best,
 Sergiu


Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
Kauler, Leto S wrote:
Another very cheap but robust solution, in case you use Linux, is to make
lynx parse your pages:

lynx -dump page.html > page.txt

This will strip out all HTML and script, style, csimport tags. And you
will have a .txt file ready for indexing.

 Best,
 Sergiu
We index the content from HTML files and because we only want the good
text and do not care about the structure, well-formedness, etc we went
with regular expressions similar to what Luke Shannon offered.
Only real difference being that we firstly remove entire blocks of
(script|style|csimport) and similar since the contents of those are not
useful for keyword searching, and afterward just remove all leftover
HTML tags.  I have been meaning to add an expression to extract things
like alt attribute text from img though.
--Leto

 

-Original Message-
From: Karl Koch [mailto:[EMAIL PROTECTED] 

I have been following this thread and have another question.

Is there a piece of sourcecode (which is preferably very
short and simple
(KISS)) which allows to remove all HTML tags from HTML
content? HTML 3.2 would be enough... also no frames, CSS, etc.

I do not need to have the HTML structure tree or any other
structure but need a facility to clean up HTML into its
normal underlying content before indexing that content as a whole.

Karl
   

  -Original Message-
  From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
  Sent: Tuesday, February 01, 2005 1:15 AM
  To: lucene-user@jakarta.apache.org
  Subject: which HTML parser is better?

  Three HTML parsers (Lucene web application
  demo, CyberNeko HTML Parser, JTidy) are mentioned in
  Lucene FAQ
  1.3.27. Which is the best? Can it filter tags that are
  auto-created by MS Word's 'Save As HTML files' function?
 

CONFIDENTIALITY NOTICE AND DISCLAIMER
Information in this transmission is intended only for the person(s) to whom it 
is addressed and may contain privileged and/or confidential information. If you 
are not the intended recipient, any disclosure, copying or dissemination of the 
information is unauthorised and you should delete/destroy all copies and notify 
the sender. No liability is accepted for any unauthorised use of the 
information contained in this transmission.
This disclaimer has been automatically added.


Re: which HTML parser is better?

2005-02-01 Thread sergiu gordea
Jingkang Zhang wrote:

Three HTML parsers (Lucene web application
demo, CyberNeko HTML Parser, JTidy) are mentioned in
Lucene FAQ
1.3.27. Which is the best? Can it filter tags that are
auto-created by MS Word's 'Save As HTML files' function?
  


maybe you can try this library...

http://htmlparser.sourceforge.net/

I use the following code to get the text from HTML files,
it was not intensively tested, but it works.

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.Translate;

Parser parser = new Parser(source.getAbsolutePath());
NodeIterator iter = parser.elements();
while (iter.hasMoreNodes()) {
    Node element = (Node) iter.nextNode();
    //System.out.println("1: " + element.getText());
    String text = Translate.decode(element.toPlainTextString());
    if (Utils.notEmptyString(text))  // Utils is a helper class of our own
        writer.write(text);
}

Sergiu







Re: Duplicate Hits

2005-02-01 Thread sergiu gordea
Erik Hatcher wrote:
On Feb 1, 2005, at 10:51 AM, Jerry Jalenak wrote:
OK - but I'm dealing with indexing between 1.5 and 2 million
documents, so I
really don't want to 'batch' them up if I can avoid it.  And I also don't
think I can keep an IndexReader open to the index at the same time I have an
IndexWriter open.  I may have to try and deal with this issue through some
sort of filter on the query side, provided it doesn't impact performance
too much.

You can use an IndexReader and IndexWriter at the same time (the 
caveat is that you cannot delete with the IndexReader at the same time 
you're writing with an IndexWriter).  Is there no other identifying 
information, though, on the incoming documents with a date stamp?  
Identifier?  Or something unique you can go on?

Erik
As Erik suggested earlier, I think that keeping the information in the
database and identifying the new entries at database level is a better
approach.
Indexing documents and optimizing an index that big will be a
very time consuming operation.
Also .. consider that in the future you may want to modify the
structure of your index.

Think how much effort it will be to split some fields in a few smaller
parts, or just to change the format of a field;
let's say you have a date in DDMMYY format and you need to change to
MMDD.

And consider how much effort is needed to rebuild a completely new index
from the database.

Of course, your requirements may not ask to have the information stored
in the database, and ... it is up to you to use a DB + Lucene index,
or just a Lucene index.
Best,
Sergiu



Re: reading fields selectively

2005-01-25 Thread sergiu gordea
Hi to all Lucene developers,
The "read fields selectively" feature would be very useful for me.
Do you plan to include it in the next Lucene releases?
I can patch Lucene, but I would need to do it each time I upgrade my version,
and probably I would need to run the unit tests, and this is just
duplicated effort.

 I'm working on an application that uses Lucene only to index
information that we store in
the database and in external files. We perform the search with Lucene to
get the IDs of our
database records. The ID keyword field is the only one that we need to
read from the index.
Each document may index a few txt, pdf, doc, html, ppt, or xls files,
and some other database fields,
so .. the size of the Lucene documents may be quite big.

 Writing the ID as the first field in the index, and having the
possibility to read only the ID from the index,
would be a great performance improvement in our case (speed and memory
usage).

 Another frequently met situation is to have an index with an ALL
field, in order to perform the search easily,
and a few other separate fields, needed to get information from the
index and to apply special constraints
(i.e. for extended search functionality).  Also in this case, the
information from the ALL field won't be read, but
Lucene will load it into memory, and the memory usage will be at least
twice as big.

Thanks for understanding,
Sergiu
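
For context, the API that later Lucene versions eventually added for this
(FieldSelector, introduced well after the 1.4 series) looks roughly like
this sketch; the field name and index path are placeholders:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.FieldSelectorResult;
import org.apache.lucene.index.IndexReader;

public class IdOnlySketch {
    public static void main(String[] args) throws Exception {
        // Load only the ID field of a document, skipping the big ALL field.
        FieldSelector idOnly = new FieldSelector() {
            public FieldSelectorResult accept(String fieldName) {
                return "id".equals(fieldName)
                        ? FieldSelectorResult.LOAD_AND_BREAK // stop after the field we want
                        : FieldSelectorResult.NO_LOAD;
            }
        };
        IndexReader reader = IndexReader.open("/path/to/index");
        Document doc = reader.document(0, idOnly); // doc 0, "id" field only
        System.out.println(doc.get("id"));
        reader.close();
    }
}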
mark harwood wrote:
There is no API for this, but I recall somebody
talking about adding support for this a few months
back
 

See
http://marc.theaimsgroup.com/?l=lucene-devm=109485996612177w=2
This implementation was working on a version of Lucene
before compression was introduced so things may have
changed a little.
Cheers,
Mark
	
	
		


Re: addIndexes() Question

2004-12-23 Thread Sergiu Gordea
I think you should change your plans a little bit: your goal is to
create a fast search engine, not a fast indexing engine.
When you plan to index a lot of documents it is possible to create
a lot of segments (if you don't optimize the index),
and the search will be very slow compared with the search on an
optimized index.
The problem is that the optimization of big indexes is a time consuming
operation, and
addIndexes(Directory[] dirs) is, I think, also a time consuming operation.
Therefore I suggest thinking about how you can design the indices to have
a fast search, and then
designing an offline indexing process.

That is my suggestion ... maybe it doesn't fit your requirements, maybe
it does ...
 All the best,
 Sergiu
Ryan Aslett wrote:
Hi there, I'm about to embark on a Lucene project of massive scale
(between 500 million and 2 billion documents).  I am currently working
on parallelizing the construction of the Index(es).

Rough summary of my plan:
I have many, many physical machines, each with multiple processors that
I wish to dedicate to the construction of a single index. 
I plan on having each machine gather its documents from a central
sychronized source (network, JMS, whatever). 
Within each machine I will have multiple threads each responsible for
constructing an index slice.

When all machines and all threads are finished, I should have a slew of
index slices that I want to combine together to create one index.
My question is this:  Will it be more efficient to call
addIndexes(Directory[] dirs) on all the slices all at once? 

Or might it be better to continually merge small indexes into a larger
index, i.e. once an index slice reaches a particular size, merge it into
the main index and start building a new slice...
Any help would be appreciated.. 

Ryan Aslett
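
For reference, merging finished slices with the 1.4-era API looks roughly
like this sketch (paths are placeholders); note that addIndexes(Directory[])
optimizes the index before and after the merge, which is part of why it is
expensive:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeSlicesSketch {
    public static void main(String[] args) throws Exception {
        // Each slice was built independently by one machine/thread.
        Directory[] slices = new Directory[] {
            FSDirectory.getDirectory("/indexes/slice0", false),
            FSDirectory.getDirectory("/indexes/slice1", false)
        };
        // true = create the merged index from scratch
        IndexWriter writer = new IndexWriter("/indexes/merged",
                new StandardAnalyzer(), true);
        writer.addIndexes(slices); // merges, optimizing before and after
        writer.close();
    }
}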


Re: Lucene index files from two different applications.

2004-12-21 Thread Sergiu Gordea
Gururaja H wrote:
Hi !
Have two applications.  Both are supposed
to write Lucene index files and the WebApplication is supposed to read
these index files.
Here are the questions:
1.  Can two applications write index files, in the same directory, at the same time ?
 

if you implement the synchronisation between these 2 applications, yes
2.  If two applications cannot write index files, in the same directory, at the same time,
how should we resolve this?  Would appreciate any solutions to this...
 

... see 1. and 3.
3.  My thought is to write the index files in two different directories and read both the indexes
(as though it forms a single index, search results should consider the documents in both the indexes) from the WebApplication.  How to go about implementing this, using Lucene API ?  Need inputs on which of the Lucene API's to use ?
 

If your requirements allow you to create two independent indices, then you
can use the MultiSearcher to search in both indices.
Maybe this will be the most cost-effective solution in your case.

Best,
 Sergiu
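
A sketch of the MultiSearcher approach against the 1.4-era API; the index
paths, field, and term are placeholders:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TermQuery;

public class TwoIndexSearchSketch {
    public static void main(String[] args) throws Exception {
        Searchable[] searchers = new Searchable[] {
            new IndexSearcher("/indexes/app1"),
            new IndexSearcher("/indexes/app2")
        };
        // Results are merged across both indices transparently.
        MultiSearcher searcher = new MultiSearcher(searchers);
        Hits hits = searcher.search(new TermQuery(new Term("contents", "lucene")));
        System.out.println(hits.length() + " hits");
        searcher.close();
    }
}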
 
Thanks,
Gururaja



Re: restricting search result

2004-12-06 Thread Sergiu Gordea
Paul wrote:
Hi,
how would you restrict the search results for a certain user? I'm
indexing all the existing data in my application but there are certain
access levels, so some users should see more results than others.
Each Lucene document has a field with an internal id and I want to
restrict on that basis. I tried adding a long concatenation of
my ids (+locationId:1 +locationId:3 + ...) but this throws a "More
than 32 required/prohibited clauses in query." exception.
Any suggestions?
thx!
Paul
 

What about indexing security levels with the documents, as a numeric value,
and adding the constraints you need on this field?

I assume that the search will be faster in this case.
 Sergiu
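
A sketch of that idea against the 1.4-era API; the accessLevel field and its
zero-padded encoding are assumptions, not something Paul's index already has:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RangeQuery;

public class AccessLevelSketch {
    // Wraps a user query so that only documents at or below the user's level
    // match. Levels are indexed as zero-padded keywords ("00".."99") so the
    // range comparison works lexicographically.
    public static Query restrict(Query userQuery, int maxLevel) {
        String upper = String.valueOf(100 + maxLevel).substring(1); // e.g. 5 -> "05"
        RangeQuery levels = new RangeQuery(
                new Term("accessLevel", "00"),
                new Term("accessLevel", upper),
                true); // inclusive bounds
        BooleanQuery q = new BooleanQuery();
        q.add(userQuery, true, false); // required clause (1.4-style add)
        q.add(levels, true, false);    // required clause
        return q;
    }
}

This keeps the clause count down to the number of distinct levels instead of
the number of location ids.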



Re: Thread safety

2004-12-02 Thread sergiu gordea
Otis Gospodnetic wrote:
1. yes
2. yes error, meaningful, it depends what you find meaningful :)
3. searcher will still find the document, unless you close it and
reopen it (searcher)
 

... What about LockException? I tried to index objects in a thread and
to use an IndexSearcher
to search objects, but I have had problems with this.
I tried to create a new IndexSearcher object if the index version was
changed, but unfortunately
I got some Lock Exceptions and FileNotFound Exceptions.

If answer number 3 is correct, then why did I get these exceptions?
Sergiu
Otis
--- Zhang, Lisheng [EMAIL PROTECTED] wrote:
 

Hi,
I have an urgent question about thread safety in Lucene;
from the Lucene doc and code I could not get a clear answer.
1. is Searcher (IndexSearcher, MultiSearcher ..) thread
   safe, can multi-users call search(..) method on the
   same object at the same time?
2. if on the same object, one user calls close( ) and
   another calls search(..), I assume we should have a
   meaningful error message?
3. what would happen if one user calls Searcher.search(..),
   but at the same time another user tries to delete that
   document from index files by calling IndexReader.delete(..)
   (either through two threads or two separate processes)?
A brief answer would be good enough for me now, thanks
very much in advance!
Lisheng


Re: Lucene : avoiding locking (incremental indexing)

2004-11-16 Thread Sergiu Gordea
[EMAIL PROTECTED] wrote:
I am interested in pursuing experienced peoples' understanding as I have half the queue approach developed already.
 

well, I think that experienced people developed Lucene :)  they offered us
the possibility to use multithreading and concurrent searching.
Of course .. it depends on the requirements whether to use them or not. I
chose to use them ... because I'm developing a web application.

I am not following why you don't like the queue approach Sergiu.  From what I gathered from this board, if you do lots of updates, the opening of the IndexWriter is very intensive and should be used in a batch orientation rather than on a one-at-a-time incremental approach.

That's not my case .. I have to reindex the information that is changed 
in our system. We are developing a knowledge management platform and
reindex the objects each time they are changed.

In some cases on this board they talk about it being so overwhelming that people are putting forced delays so the Java engine can catch up.  

I haven't had this kind of problem, and I use multithreading when I
reindex the whole index ... and the searches still work correctly
without any locking
problems. I think that the locking problems come from outside .. and
these locking sources should be identified.
But again .. this is just my case ...

Using a queueing approach, you may get a hit every 30 seconds or minute 
or...whatever you choose as your timeframe, but it should be enough of a delay 
to allow the java engine to not be overwhelmed.
No .. I cannot accept this, because our users should be able to change
information in the system and to make searches at the same time, without
having to wait
too much for server response ...

 I would like this not to happen with Lucene and would like to be able to update every time an update occurs, but this does not seem the right approach right now.  As I said before, this seems like a wish item for Lucene.  I don't really know if the wish is feasible.
 

I agree that maybe a built-in function for identifying false locking
would be very useful ... but it might also be a little bit bad for the
users, because they
will be tempted to unlock the index ... instead of closing readers/writers
correctly.

So far the biggest problem I was facing with this approach, however, was getting feedback from the archiving process to the main database that the archiving change actually happened, and correctly, even if the server goes down.
 

... so .. it may work correctly if we use Lucene (and the servers and
the OS) correctly :)

 Maybe it would be a good idea to create some JUnit/JMeter tests to
identify the source of unexpected locks.
This also depends on your availability, but I think it will be worth
the effort.

Sergiu
JohnE


 

Personally I don't like the Queue approach... because I already
implemented multithreading in our application
to improve its performance. In our application indexing is not a high
priority, but it's happening quite often.
Search is a priority.

Lucene allows to have more searches at one time. When you have a big
index and many users then ...
the Queue approach can slow down your application too much. I think it
will be a bottleneck.

I know that the lock problem is annoying, but I also think that the
right way is to identify the source of locking.
Our application is a web-based application based on Turbine, and when we
want to restart Tomcat, we just kill
the process (otherwise we need to restart 2 times because of some log4j
initialization problem), so ...
the index is locked after the Tomcat restart. In my case it makes sense
to check if the index is locked one time at
startup. I'm also logging all errors that I get in the systems; this is
helping me to find their source easier.



Re: Lucene : avoiding locking (incremental indexing)

2004-11-15 Thread sergiu gordea
Luke Shannon wrote:
I like the sound of the Queue approach.  I also don't like that I have to
focefully unlock the index.
 

Personally I don't like the Queue approach... because I already
implemented multithreading in our application
to improve its performance. In our application indexing is not a high
priority, but it's happening quite often.
Search is a priority.

Lucene allows to have more searches at one time. When you have a big
index and many users then ...
the Queue approach can slow down your application too much. I think it
will be a bottleneck.

I know that the lock problem is annoying, but I also think that the
right way is to identify the source of locking.
Our application is a web-based application based on Turbine, and when we
want to restart Tomcat, we just kill
the process (otherwise we need to restart 2 times because of some log4j
initialization problem), so ...
the index is locked after the Tomcat restart. In my case it makes sense
to check if the index is locked one time at
startup. I'm also logging all errors that I get in the systems; this is
helping me to find their source easier.

All the best,
Sergiu

All the best,
Sergiu
I'm not the most experienced programmer and am on a tight deadline. The
approach I ended up with was the best I could do with the experience I've
got and the time I had.
My indexer works so far and doesn't have to forcefully release the lock on
the Index too often (the case is most likely to occur when someone removes
content file(s) and the reader needs to delete from the existing index for
the first time). We will see what happens as more people use the system with
large content directories.
As I learn more I plan to expand the functionality of my class.
Luke S
- Original Message - 
From: [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, November 15, 2004 5:50 PM
Subject: Re: Lucene : avoiding locking (incremental indexing)

 

It really seems like I am not the only person having this issue.
So far I am seeing 2 solutions and honestly I don't love either totally.

I am thinking that without changes to Lucene itself, the best general way
to implement this might be to have a queue of changes and have Lucene work
off this queue in a single thread using a time-settable batch method.   This
is similar to what you are using below, but I don't like that you forcibly
unlock Lucene if it shows itself locked.   Using the Queue approach, only
that one thread could be accessing Lucene for writes/deletes anyway so there
should be no unknown locking.

I can imagine this being a very good addition to Lucene - creating a high
level interface to Lucene that manages incremental updates in such a manner.
If anybody has such a general piece of code, please post it!!!   I would use
it tonight rather than create my own.

I am not sure if there is anything that can be done to Lucene itself to
help with this need people seem to be having.  I realize the likely reasons
why Lucene might need to only have one Index writer and the additional load
that might be caused by locking off pieces of the database rather than the
whole database.  I think I need to look in the developer archives.

JohnE

- Original Message -
From: Luke Shannon [EMAIL PROTECTED]
Date: Monday, November 15, 2004 5:14 pm
Subject: Re: Lucene : avoiding locking (incremental indexing)
   

Hi Luke;
I have a similar system (except people don't need to see results
immediately). The approach I took is a little different.
I made my Indexer a thread with the indexing operations occurring
in the run
method. When the IndexWriter is to be created or the IndexReader
needs to
execute a delete I call the following method:
private void manageIndexLock() {
    try {
        // check if the index is locked and deal with it if it is
        if (index.exists() && IndexReader.isLocked(indexFileLocation)) {
            System.out.println("INDEXING INFO: There is more than one process"
                + " trying to write to the index folder. Will wait for index"
                + " to become available.");
            // perform this loop until the lock is released or 3 mins
            // has expired
            int indexChecks = 0;
            while (IndexReader.isLocked(indexFileLocation)
                    && indexChecks < 6) {
                // increment the number of times we check the index files
                indexChecks++;
                try {
                    // sleep for 30 seconds
                    Thread.sleep(30000L);
                } catch (InterruptedException e2) {
                    System.out.println("INDEX ERROR: There was a problem"
                        + " waiting for the lock to release. "
                        + e2.getMessage());
                }
            } // closes the while loop for checking on the index directory
            // if we are still locked we need to do something about it
            if (IndexReader.isLocked(indexFileLocation)) {
                System.out.println("INDEXING INFO: Index Locked After 3"
                    + " minutes of waiting. Forcefully releasing lock.");
                IndexReader.unlock(FSDirectory.getDirectory(index, false));
                System.out.println("INDEXING INFO: Index lock released");
            } // closes the if that actually releases the lock
        } // closes the if ensuring the file exists
    } // closes the

Re: one huge index or many small ones?

2004-11-05 Thread sergiu gordea
javier muguruza wrote:
Sergiu,
A month could have tens of millions of emails in the worst case, but
maybe I could discard such a bad assumption for our current project.
Let's say 10000 emails per day max; that makes 300k emails a month.
Either I would choose one index per day or per month (or week or
whatever).
Your suggestion about an index per user is not valid; my searches do not
require a user, unfortunately. They can maybe say 'all email from
department C from last week' etc. So, if I choose one index per day (or
month) I already know that I will have to search in many indexes
depending on the timeframe (the time frame is the only required value
for the search).
thanks for the suggestions!
 

If you have time frame and department constraints, and this partitioning
will result in 10-30k emails, probably it will
be a good idea to create 1 index/dep/month.

Anyway .. changing the structure of the index is no problem in your case,
since you display the information from external sources
(database, filesystem...).
My advice is to create a function that will build/rebuild the indexes
to begin with, work with that index, test if it fulfils your
requirements,
and then you can refactor your code in the future. So ... choose one
solution and implement the first prototype, and keep in mind that your
information is managed by the database, and Lucene is just your search
module.

Sergiu
On Thu, 04 Nov 2004 19:01:53 +0100, Sergiu Gordea
[EMAIL PROTECTED] wrote:
 

javier muguruza wrote:

Hi Javier,
I think your optimization should take care of the response time of
search queries. I assume that this is the
variable you need to optimize. Probably it will be a good thing to read
first the Lucene benchmarks:
http://jakarta.apache.org/lucene/docs/benchmarks.html
If you have a mandatory date constraint for each of your indexes you
can split the index on a time basis. I assume that
one index per month will be enough I think ... 10,000 emails I think it
will be fast enough if you search in only one index afterwards.
But I think this is not such a good idea?
What about creating one index per user? If your search requires a user or
a sender, and you can get its name from the database, and apply only
the other constraints on an index dedicated to that user .. I think the
Lucene search will be much faster.
Also the database search will be fast .. I don't think you will have
more than 1,000-10,000 user names.
or maybe 1 index/user/year
or 1 index/receiver/year + 1 index/sender/year
What about this solution, is it feasible for your system?
All the best,
Sergiu

   

Thanks Erik and Giulio for the fast reply.
I am just starting to look at lucene so forgive me if I got some ideas
wrong. I understand your concerns about one index per email. But
having one index only is also (I guess) out of question.
I am building an email archive. Email will be kept indefinitely
available for search, adding new email every day. Imagine a company
with millions of emails per day (been there), keep it growing for
years, adding stuff to the index while using it for searches
continuously...
That's why my idea is to decide on a time frame (a day, a month...an
extreme would be an instant, that is a single email, my original idea)
and build the index for all the email in that timeframe. After the
timeframe is finished no more stuff will be ever added.
Before the lucene search emails are selected based on other conditions
(we store the from, to, date etc in database as well, and these
conditions are enforced with a sql query first, so I would not need to
enforce them in the lucene search again, also that query can be quite
sophisticated and I guess would not be easyly possible to do it in
lucene by itself). That first db step gives me a group of emails that
maybe I have to further narrow down based on a lucene search (of body
and attachment contents). Having an index for more than one emails
means that after the search I would have to get only the overlaping
emails from the two searches...Maybe this is better than keeping the
same info I have in the db in lucene fields as well.
An example: I want all the email from [EMAIL PROTECTED] from Jan
to Dec containing the word 'money'. I run the db query that returns a
list with john's email for that period of time, then (lets assume I
have one index per day) I iterate on every day, looking for emails
that contain 'money', from the results returned by lucene I keep only
these that are also in the first list.
Does that sound better?
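For illustration, a small sketch of that intersection step (the "id" field name and the set of ids coming back from the SQL query are assumptions):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.search.Hits;

public class ResultIntersector {
    // keeps only the lucene hits whose id also came back from the database
    public static Set intersect(Hits hits, Set idsFromDb) throws IOException {
        Set result = new HashSet();
        for (int i = 0; i < hits.length(); i++) {
            String id = hits.doc(i).get("id");
            if (idsFromDb.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }
}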
On Thu, 4 Nov 2004 17:26:21 +0100, Giulio Cesare Solaroli
[EMAIL PROTECTED] wrote:
 

Hi Javier,
I suggest you to build a single index, with all the information you
need to find the right mail you are looking for. You then can use
Lucene alone to find your messages.
Giulio Cesare

On Thu, 4 Nov 2004 17:00:35 +0100, javier muguruza [EMAIL PROTECTED] wrote:
   

Hi,
We are going to move from a just-in-time perl based search to using
lucene in our project.
Re: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory?

2004-11-05 Thread sergiu gordea
Chuck Williams wrote:
Otis, thanks for looking at this.  The stack trace of the exception is
below.  I looked at the code.  It wants to delete every file in the
index directory, but fails to delete the CVS subdirectory entry
(presumably because it is marked read-only; the specific exception is
swallowed).  Even if it could delete the CVS subdirectory, this would
just cause another problem with Netbeans/CVS, since it wouldn't know how
to fix up the pointers in the parent CVS subdirectory.  Is there a
change I could make that would cause it to safely leave this alone?
 

Why do you have the lucene index in CVS? From what I know, the lucene
index folder shouldn't contain any other folder,
just the lucene files. I think it won't be any problem to delete the CVS
folder from the lucene index and to remove the index from CVS.
If you are afraid to do that, you can move the CVS subfolder from the
lucene index into another folder ... and restore it if you have any
problems. I'm sure you will have no problem ... but this is just for
your peace of mind...

Sergiu
This problem only arises on a full index (incremental == false =>
create == true).  Incremental indexes work fine in my app.
Chuck
java.io.IOException: Cannot delete CVS
   at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:144)
   at org.apache.lucene.store.FSDirectory.<init>(FSDirectory.java:128)
   at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:102)
   at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:83)
   at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:173)
   at [my app]...
  -Original Message-
  From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
  Sent: Thursday, November 04, 2004 1:54 PM
  To: Lucene Users List
  Subject: Re: Is there an easy way to have indexing ignore a CVS
  subdirectory in the index directory?
  
  Hm, as far as I know, a CVS sub-directory in an index directory
should
  not bother Lucene.  As a matter of fact, I tested this (I used a
file,
  not a directory) for Lucene in Action.  What error are you getting?
  
  I know there is -I CVS option for ignoring files; perhaps it works
with
  directories, too.
  
  Otis
  
  
  --- Chuck Williams [EMAIL PROTECTED] wrote:
  
   I have a Tomcat web module being developed with Netbeans 4.0 ide
   using
   CVS.  One CVS repository holds the sources of my various web files
in
   a
   directory structure that directly parallels the standard Tomcat
   webapp
   directory structure.  This is well supported in a fully automated
way
   within Netbeans.  I have my search index directory as a
subdirectory
   of
   WEB-INF, which seemed the natural place to put it.  The index
files
   themselves are not in the repository.  I want to be able to do CVS
   Update for the web module directory tree as a whole.  However,
this
   places a CVS subdirectory within the index directory, which in
turn
   causes Lucene indexing to blow up the next time I run it since
this
   is
   an unexpected entry in the index directory.  To work around the
   problem I both need to delete the CVS subdirectory
   and
   find and delete the pointers to it in the Entries file and
Netbeans
   cache file within the CVS subdirectory of the parent directory.
This
   is
   annoying to say the least.
  
  
  
   I've asked the Netbeans users if there is a way to avoid creation
of
   the
   index's CVS subdirectory, but the same thing happened using WinCVS
   and I
   so I expect this is not a Netbeans issue.  It could be my relative
   ignorance of CVS.
  
  
  
   How do others avoid this problem?
  
  
  
   Any advice or suggestions would be appreciated.
  
  
  
   Thanks,
  
  
  
   Chuck
  
  
  
  
  
  
 


Re: one huge index or many small ones?

2004-11-04 Thread Sergiu Gordea
javier muguruza wrote:
Hi Javier,
I think your optimization should target the response time of
search queries; I assume that this is the
variable you need to optimize. It is probably a good idea to read
the lucene benchmarks first:
http://jakarta.apache.org/lucene/docs/benchmarks.html

If you have a mandatory date constraint for each of your queries you
can split the index on a time basis; I assume that
one index per month will be enough ... with 10.000 emails I think it
will be fast enough if you search in only one index afterwards.
But maybe this is not such a good idea?

What about creating one index per user? If your searches require a user or
a sender, and you can get its name from the database, and apply only
the other constraints on an index dedicated to that user, I think the
lucene search will be much faster.

Also the database search will be fast; I don't think you will have
more than 1.000-10.000 user names.

Or maybe 1 index/user/year,
or 1 index/receiver/year + 1 index/sender/year.
What about this solution, is it feasible for your system?
All the best,
 Sergiu
Thanks Erik and Giulio for the fast reply.
I am just starting to look at lucene so forgive me if I got some ideas
wrong. I understand your concerns about one index per email. But
having one index only is also (I guess) out of the question.
I am building an email archive. Email will be kept indefinitely
available for search, adding new email every day. Imagine a company
with millions of emails per day (been there), keep it growing for
years, adding stuff to the index while using it for searches
continuously...
That's why my idea is to decide on a time frame (a day, a month...an
extreme would be an instant, that is a single email, my original idea)
and build the index for all the email in that timeframe. After the
timeframe is finished no more stuff will be ever added.
Before the lucene search emails are selected based on other conditions
(we store the from, to, date etc in database as well, and these
conditions are enforced with a sql query first, so I would not need to
enforce them in the lucene search again, also that query can be quite
sophisticated and I guess would not be easily possible to do in
lucene by itself). That first db step gives me a group of emails that
maybe I have to further narrow down based on a lucene search (of body
and attachment contents). Having an index for more than one email
means that after the search I would have to get only the overlapping
emails from the two searches... Maybe this is better than keeping the
same info I have in the db in lucene fields as well.
An example: I want all the email from [EMAIL PROTECTED] from Jan
to Dec containing the word 'money'. I run the db query that returns a
list with john's email for that period of time, then (lets assume I
have one index per day) I iterate on every day, looking for emails
that contain 'money', from the results returned by lucene I keep only
these that are also in the first list.
Does that sound better? 

On Thu, 4 Nov 2004 17:26:21 +0100, Giulio Cesare Solaroli
[EMAIL PROTECTED] wrote:
 

Hi Javier,
I suggest you to build a single index, with all the information you
need to find the right mail you are looking for. You then can use
Lucene alone to find your messages.
Giulio Cesare

On Thu, 4 Nov 2004 17:00:35 +0100, javier muguruza [EMAIL PROTECTED] wrote:
   

Hi,
We are going to move from a just-in-time perl based search to using
lucene in our project. I have to index emails (bodies and also
attachments). I keep in the filesystem all the bodies and attachments
for a long period of time. I have to find emails that fulfill certain
conditions; some of the conditions are taken care of at a different
level, so in the end I have a SUBSET of emails I have to run through
lucene.
I was assuming that the best way would be to create an index for each
email. Having a unique index for a group of emails (say a day's worth
of email) seems too coarse grained; imagine a day has 10000 emails,
and some queries will want to look in only a handful of those
emails... But the problem with having one index per email is the
massive number of indexes... imagine having 100000 indexes.
Anyway, any idea about that? I just wanted to check whether someone
feels I am wrong.
Thanks


Re: How do Lucene applications deal with API changes?

2004-11-03 Thread sergiu gordea
Bill Janssen wrote:
Thanks to Bill Tschumy, who points out that Lucene 1.4.2 *breaks* the
API exported by 1.4 by removing a parameter from
QueryParser.getFieldQuery().  That means that my
NewMultiFieldQueryParser also breaks, since it overrides that method.
To fix, just remove the Analyzer parameter from the getFieldQuery()
method in NewMultiFieldQueryParser.
More generally, how is an application developer that wants to use
Lucene supposed to deal with these kinds of things?  It's a micro
release, the change isn't noted in the CHANGES.txt file, and as far as
I can see, there are no version numbers in the jar file you could look
at during an application configure.
Does anyone have any successful ways of dealing with these kinds of
things?  The only thing I can think of is to put a specific Lucene jar
in my app source code.
 

what about writing a JUnit test?
It can show when the code is broken. It's not much, but
it is an improvement.
I have a testcase for my parser ... maybe I can adapt it and share the code.

Sergiu
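For illustration, a minimal sketch of such a JUnit test (the expected expansion is an assumption based on this thread): it stops compiling, or fails, as soon as an upgrade changes the QueryParser methods the subclass overrides.

import junit.framework.TestCase;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.Query;

public class NewMultiFieldQueryParserTest extends TestCase {
    public void testSingleTermExpandsToAllFields() throws Exception {
        String[] fields = { "title", "authors", "contents" };
        NewMultiFieldQueryParser parser =
            new NewMultiFieldQueryParser(fields, new StandardAnalyzer());
        Query query = parser.parse("foobar");
        String s = query.toString();
        // the single term should be expanded over every configured field
        assertTrue(s, s.indexOf("title:foobar") >= 0);
        assertTrue(s, s.indexOf("authors:foobar") >= 0);
        assertTrue(s, s.indexOf("contents:foobar") >= 0);
    }
}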
Bill


Re: jaspq: dashed numerical values tokenized differently

2004-11-01 Thread sergiu gordea
Daniel Taurat wrote:
Hi,
I have just another stupid parser question:
There seems to be a special handling of the dash sign - different from
Lucene 1.2 at least in Lucene 1.4.RC3
StandardAnalyzer.
 

From the behaviour you describe I think that the dash sign is removed
from the text by the analyzer.
This is quite correct, because the dash is used to separate two words.
Without its elimination you wouldn't be able to
get "dash-test" in the results if you search for: dash or/and test.

I suggest you use LUKE (see the contributors page) in order to see what
exactly you have in the index; then you will understand
why the search works like that.

Sergiu
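For illustration, a small sketch that prints the tokens StandardAnalyzer produces for the two strings from this thread; it shows that "dash-test" is split while "dash-123" stays one token:

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class ShowTokens {
    public static void main(String[] args) throws Exception {
        String[] samples = { "dash-test", "dash-123" };
        StandardAnalyzer analyzer = new StandardAnalyzer();
        for (int i = 0; i < samples.length; i++) {
            System.out.println(samples[i] + ":");
            TokenStream stream =
                analyzer.tokenStream("field", new StringReader(samples[i]));
            for (Token t = stream.next(); t != null; t = stream.next()) {
                System.out.println("  [" + t.termText() + "]");
            }
        }
    }
}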
Examples (1.4RC3):
A document containing the string "dash-test" is matched by the following
search expressions:
dash
test
dash*
dash-test
It is _not_ matched by the following search expressions:
dash-*
dash-t*
If the string after the dash consists of digits, the behavior is
different.
E.g., a document containing the string "dash-123" is matched by:
dash*
dash-*
dash-123
It is not matched by:
dash
123
Question:
Is this, esp. the different behavior when parsing digits and characters,
intentional and how can it be explained?
Regards,
Daniel




Re: Searching for a path

2004-10-29 Thread sergiu gordea
Bill Tschumy wrote:
I have a need to search an index for documents that were taken from
particular files in the filesystem.

Each document in the index has a field named "url" that is created using:
doc.add(Field.Text("url", urlStr));
I understand this is both stored and indexed.
My search works if I do something like:
String queryStr = "\"file:///someDir/someOtherDir/File.txt\"";
 query = MultiFieldQueryParser.parse("url:" + queryStr,
searchedFields, new StandardAnalyzer());
 hits = searcher.search(query);

It is important for me to quote the path for the search to succeed.
I was hoping to speed the search up a bit by bypassing the
QueryParser.  However, if I do something like

String queryStr = "\"file:///someDir/someOtherDir/File.txt\"";
Query query = new TermQuery(new Term("url", queryStr));
hits = searcher.search(query);
To begin with, I suggest you add a System.out.println(query);
and see what the difference between the 2 queries is.
 Sergiu
ahh, I see now ...
you must construct a PhraseQuery instead of a TermQuery ...
The first one is a PhraseQuery; the second one, that you construct with the
term, is a TermQuery.
I suggest you use the QueryParser; the difference in performance against
your constructed query is just
the interpretation of the regular expression to find the type of the query.
Using the QueryParser will ensure that you won't
face problems like this one anymore.

   All the best,
 Sergiu
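For illustration, a hedged sketch that prints both queries side by side; the parsed one becomes a PhraseQuery over the analyzed terms, while the hand-built one is a single literal term that never made it into a Field.Text (analyzed) field:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class CompareQueries {
    public static void main(String[] args) throws Exception {
        String queryStr = "\"file:///someDir/someOtherDir/File.txt\"";
        // the parser analyzes the quoted string -> PhraseQuery
        Query parsed = QueryParser.parse(queryStr, "url", new StandardAnalyzer());
        // the hand-built query looks for the whole string as one term
        Query literal = new TermQuery(new Term("url", queryStr));
        System.out.println(parsed);
        System.out.println(literal);
    }
}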
I get zero hits.  Why are these not equivalent?  I think it has 
something to do with the fact that the url needs to be quoted so I 
search for an exact match.  It does work if I have stored the url as a 
Field.Keyword rather than as Field.Text and then don't need to 
quote the string.  However I would prefer not to have to change the 
format of the index.

Thanks for any help.



Re: new version of NewMultiFieldQueryParser

2004-10-29 Thread sergiu gordea
Bill Janssen wrote:
Try to see the behavior if you want to have a single term query,
just something like: robust ... and print out the query string ...
   

Sure, that works fine.  For instance, if you have the three default
fields title, authors, and contents, the one-word search
"foobar" expands to
  title:foobar authors:foobar contents:foobar
just as it should.
 

Strange ... on my computer just something like
default:foobar was created
... and I think it should work like that on your computer too ... I've
taken a look at the lucene code ... and I understood why ...
all the best ... Sergiu

 

Try to see what is happening with Prefix, Wild, and Fuzzy searches ...
   

Good point.  My older version (see below) found these, but the new one
doesn't.  Oh, well, back to the working version.  I knew there was some
reason getFieldQuery wasn't sufficient.
The working version is in the file SearchTest.java, which you can find
at ftp://ftp.parc.xerox.com/transient/janssen/SearchTest.java.  It's a
test program which runs the query through the NewMultiFieldQueryParser,
and then prints it out, so that you can see what the expansion is.
Bill


Re: new version of NewMultiFieldQueryParser

2004-10-29 Thread sergiu gordea
Morus Walter wrote:
Bill Janssen writes:
 

Try to see the behavior if you want to have a single term query,
just something like: robust ... and print out the query string ...
 

Sure, that works fine.  For instance, if you have the three default
fields title, authors, and contents, the one-word search
"foobar" expands to
  title:foobar authors:foobar contents:foobar
just as it should.
   

Try to see what is happening with Prefix, Wild, and Fuzzy searches ...
 

Good point.  My older version (see below) found these, but the new one
doesn't.  Oh, well, back to the working version.  I knew there was some
reason getFieldQuery wasn't sufficient.
   

wouldn't it be better to go on and overwrite the methods creating these 
types of queries too?

Morus
 

Yes, that's what I wanted to suggest ...
The query parser works fine if you add all types of queries ... but
it was not working correctly in the case of a single term.
Therefore I test this first and create a Query by using the normal
MultifieldQueryParser.
Maybe it is not the best solution but it works perfectly ... and I had to
write just a few lines of code ...

Sergiu
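Following Morus' suggestion, a hedged sketch of such an override inside NewMultiFieldQueryParser (the getPrefixQuery signature below is the one from Lucene 1.4 without the Analyzer parameter; getWildcardQuery and getFuzzyQuery could be treated the same way):

// expands a prefix query over all configured fields, mirroring what
// getFieldQuery already does for term and phrase queries
protected Query getPrefixQuery(String field, String termStr)
        throws ParseException {
    if (field == DEFAULT_FIELD && (fieldnames != null)) {
        BooleanQuery q = new BooleanQuery();
        for (int i = 0; i < fieldnames.length; i++) {
            // optional (OR) clause per field, like the original parser
            q.add(super.getPrefixQuery(fieldnames[i], termStr), false, false);
        }
        return q;
    }
    return super.getPrefixQuery(field, termStr);
}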


Re: new version of NewMultiFieldQueryParser

2004-10-28 Thread sergiu gordea
Bill Janssen wrote:
I'm not sure this solution is very robust 
   

Thanks, but I'm pretty sure it *is* robust.  Can you please offer a
specific critique?  Always happy to learn and improve :-).
 

Try to see the behavior if you want to have a single term query,
just something like: robust ... and print out the query string ...
Try to see what is happening with Prefix, Wild, and Fuzzy searches ...
 :)
  Sergiu
 

I think I already sent an email with a better code...
   

Pretty vague.  Can you send a URL for that message in the archive?
Bill


Re: new version of NewMultiFieldQueryParser

2004-10-27 Thread sergiu gordea
Bill Janssen wrote:
I'm not sure this solution is very robust 
I think I already sent an email with a better code...
 Sergiu
Thanks to something Doug said when I first opened this discussion, I
went back and looked at my implementation.  He said, Can't we just do
this in getFieldQuery?.  Figuring that he probably knew what he was
talking about, I looked a bit harder, and it turns out he was right.
Here's a much simpler version of NewMultiFieldQueryParser that seems
to work.
[For those just tuning in, this is a version of MultiFieldQueryParser
that will work with a default query operator of AND, as well as with
OR.]
Enjoy!
Bill
class NewMultiFieldQueryParser extends QueryParser {
    static private final String DEFAULT_FIELD = "%%";
    protected String[] fieldnames = null;
    private Analyzer analyzer = null;
    public NewMultiFieldQueryParser (Analyzer a) {
        super(DEFAULT_FIELD, a);
    }
    public NewMultiFieldQueryParser (String[] f, Analyzer a) {
        super(DEFAULT_FIELD, a);
        fieldnames = f;
        analyzer = a;
    }
    public void setFieldNames (String[] f) {
        fieldnames = f;
    }
    protected Query getFieldQuery (String field,
                                   Analyzer a,
                                   String queryText)
            throws ParseException {
        Query x = super.getFieldQuery(field, a, queryText);
        if (field == DEFAULT_FIELD && (fieldnames != null)) {
            BooleanQuery q2 = new BooleanQuery();
            if (x instanceof PhraseQuery) {
                Term[] terms = ((PhraseQuery)x).getTerms();
                for (int i = 0;  i < fieldnames.length;  i++) {
                    PhraseQuery q3 = new PhraseQuery();
                    q3.setSlop(((PhraseQuery)x).getSlop());
                    for (int j = 0;  j < terms.length;  j++) {
                        q3.add(new Term(fieldnames[i], terms[j].text()));
                    }
                    q2.add(q3, false, false);
                }
            } else if (x instanceof TermQuery) {
                String text = ((TermQuery)x).getTerm().text();
                for (int i = 0;  i < fieldnames.length;  i++) {
                    q2.add(new TermQuery(new Term(fieldnames[i], text)), false, false);
                }
            }
            return q2;
        }
        return x;
    }
}


Re: Need advice: what pdf lib to use?

2004-10-25 Thread sergiu gordea
[EMAIL PROTECTED] wrote:
Hi Iouli,
 If you don't think it is illegal, you can hack the pdfbox code to remove
the protection ...

   Sergiu
PDFbox stumbles also with class java.io.IOException with message: "You
do not have permission to extract text" in case the doc is copy/print
protected.
I tested now the snowtide commercial product and it looks like it could
process these files as well. Performance was also not so bad. Unfortunately
the test result could not be considered as 100%, because the free version
processed just the first 8 pages. After all, this product costs a fortune
(as long as the company is ready to pay I don't really mind :))

J.


Robert Newson [EMAIL PROTECTED]
Sent by: news [EMAIL PROTECTED]
24.10.2004 17:44
Please respond to Lucene Users List
   To: [EMAIL PROTECTED]
   cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
   Subject:Re: Need advice: what pdf lib to use?
   Category: 


[EMAIL PROTECTED] wrote:
 

Hello all,
I need a piece of advice/experience..
What pdf parser (written in java) u'd recommend?
I played now with PDFBox-0.6.7a and would not say I was satisfied too much
with it.
On certain pdf's (not well formated but anyway readable with acrobate) it
ran into a dead loop (this I could fix in code),
and on one file it produced an out of memory error and killed the jvm :( (this
problem I could not identify yet).
After all, the performance was not too great as well: it took c. 19 h. to
index 13000 files (c. 3.5Gb).
Regards,
J.

   

On the specific problem of the dead loop, I reported an instance of 
this to Ben a week or so ago and he has fixed it in the latest 
nightlies.  I expect an official release will include this bugfix soon. 
The file in question was unreadable with any PDF software I have, but 
someone managed to create it somehow...

http://sourceforge.net/tracker/index.php?func=detailaid=1037145group_id=78314atid=552832
I've found pdfbox to be pretty good. The only time I get problems is 
with corrupted or egregiously bad PDF files.

B.


Re: Need advice: what Word/Excel/PowerPoint lib to use?

2004-10-25 Thread sergiu gordea
Of course POI, for open source.
There are some commercial products based on POI also.
For WORD, consider textmining.org.
For XLS, POI does anything you need.
For PowerPoint there is one commercial product (it's about $1000), but you can
also find some source code in the archives.

All the best,
 Sergiu
[EMAIL PROTECTED] wrote:
Hello all,
I need a piece of advice/experience again..
What ms Word/Excel/PowerPoint parsers (written in java) u'd recommend?
Thanks in advance
J.



Re: Need advice: what pdf lib to use?

2004-10-25 Thread sergiu gordea
Ben Litchfield wrote:
In order to write software that consumes PDF documents you must agree to a
list of conditions.  One of those conditions is that permissions specified
by the author of the PDF document are respected.
PDFBox complies with this statement, if there is software that does not
then they are in violation of copyright law.
 

I wanted to say something like this in one of my previous emails, when I
said that anyone can modify the code of
PDFBox to remove the restrictions.

That being said, PDFBox is open source so a user could make modifications
to the source code, or as a PDF library could change permissions on a
document.
 

This seems to me to be a business decision.
Iouli, if your boss tells you that PDFBox is useless because it
prevents you from getting the text from protected pdfs,
then you should tell him: I can fix it, but it is not legal. You can
hack PDFbox, but before doing this you should
ensure that the authors let you do it.

All the best,
 Sergiu

Ben
On Mon, 25 Oct 2004 [EMAIL PROTECTED] wrote:
 

Yes Ben, you are right.
This would be correct functionality from a technical perspective. But look
at it my way, with application programmer eyes, reporting to the big boss that c.
30% of the docs we cope with could not be indexed because of this stupid
limitation. Neither he nor I have any influence on the pdf owners, nor any
idea about what made them create files with document security set.
In short, if you could also implement this incorrect functionality like the
closed source guys did, it would be really great!
As far as sponsoring is concerned, I would be ready to hack (or at least to
try) it even for 1/3 of that fortune :)))
J.


Ben Litchfield [EMAIL PROTECTED]
25.10.2004 14:02
Please respond to Lucene Users List
   To: Lucene Users List [EMAIL PROTECTED]
   cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
   Subject:Re: Need advice: what pdf lib to use?
   Category:

PDFBox does not 'stumble' when it gives that message, that is correct
functionality if that permission is not allowed.
If your company is willing to pay a 'fortune' why not sponsor a change to
an open source project for half a fortune.
Ben
http://www.pdfbox.org
On Mon, 25 Oct 2004 [EMAIL PROTECTED] wrote:
   

PDFbox stumbles also with class java.io.IOException with message: "You
do not have permission to extract text" in case the doc is copy/print
protected.
I tested now the snowtide commercial product and it looks like it could
process these files as well. Performance was also not so bad. Unfortunately
the test result could not be considered as 100%, because the free version
processed just the first 8 pages. After all, this product costs a fortune
(as long as the company is ready to pay I don't really mind :))
J.


Robert Newson [EMAIL PROTECTED]
Sent by: news [EMAIL PROTECTED]
24.10.2004 17:44
Please respond to Lucene Users List
   To: [EMAIL PROTECTED]
   cc: (bcc: Iouli Golovatyi/X/GP/Novartis)
   Subject:Re: Need advice: what pdf lib to use?
   Category:

[EMAIL PROTECTED] wrote:
 

Hello all,
I need a piece of advice/experience..
What pdf parser (written in java) u'd recommend?
I played now with PDFBox-0.6.7a and would not say I was satisfied too much
with it.
On certain pdf's (not well formated but anyway readable with acrobate) it
ran into a dead loop (this I could fix in code),
and on one file it produced an out of memory error and killed the jvm :( (this
problem I could not identify yet).
After all, the performance was not too great as well: it took c. 19 h. to
index 13000 files (c. 3.5Gb)
Regards,
J.

   

On the specific problem of the dead loop, I reported an instance of
this to Ben a week or so ago and he has fixed it in the latest
nightlies.  I expect an official release will include this bugfix soon.
The file in question was unreadable with any PDF software I have, but
someone managed to create it somehow...
http://sourceforge.net/tracker/index.php?func=detailaid=1037145group_id=78314atid=552832
I've found pdfbox to be pretty good. The only time I get problems is
with corrupted or egregiously bad PDF files.
B.


Re: Need advice: what Word/Excel/PowerPoint lib to use?

2004-10-25 Thread Sergiu Gordea
Genty Jean-Paul wrote:
At 17:05 25/10/2004, you wrote:
Of course POI, for open source.
There are some commercial products based on POI also.
For WORD, consider textmining.org.
For XLS, POI does anything you need.
For PowerPoint there is one commercial product (it's about $1000), but you
can also find some source code in the archives.

 And what do you think about using Open Office's UNO APIs  ?
I didn't know about them. Are they implemented in Java?
Do they support all MSOffice formats (97/2000/XP)?
Sergiu
 If someone did, does it scale well? (I just did some unit testing.)
Jean-Paul 



Re: Null or no analyzer

2004-10-21 Thread sergiu gordea
Erik Hatcher wrote:
I don't like the idea of users having to know how a field was indexed 
though.  That seems to defeat the purpose of a general-purpose 
QueryParser.

Erik
I agree with that, but maybe lucene should provide some subclasses of
QueryParser that deal with these problems.
I'm just a lucene user, not a lucene developer, but I have had to
implement an extension for MultifieldQueryParser
to fix some unwanted behaviour that I already discussed in the mailing
list.
These problems that users face with creating the right query strings
(with the special case of untokenized fields), together
with the MultifieldQueryParser problems, MultiSearcher problems ... I think
that all together they suggest the idea of creating a
QueryParser class hierarchy.

 What do you think about that?
 All the best,
Sergiu

On Oct 21, 2004, at 2:38 AM, Morus Walter wrote:
Erik Hatcher writes:
however perhaps it should be.  Or perhaps there are other options to
solve this recurring dilemma folks have with Field.Keyword indexed
fields and QueryParser?
I think one could introduce a special syntax in query parser for
keyword fields. Query parser wouldn't analyze them at all in this case.
Something like
field#Keyword
or
field#"keyword containing blanks"
I haven't thought through all consequences for
field#(keywordA keywordB otherfield:noKeyword)
but I think it should be doable.
Doesn't make query parser simpler, on the other hand.
Morus


Re: Null or no analyzer

2004-10-21 Thread sergiu gordea
Erik Hatcher wrote:
On Oct 21, 2004, at 5:38 AM, sergiu gordea wrote:
Erik Hatcher wrote:
I don't like the idea of users having to know how a field was 
indexed though.  That seems to defeat the purpose of a 
general-purpose QueryParser.

Erik

I agree with that, but maybe lucene should provide some subclasses of
QueryParser that deal with these problems.
I'm just a lucene user, not a lucene developer, but I have had to
implement an extension for MultifieldQueryParser
to fix some unwanted behaviour that I already discussed in the
mailing list. These problems that users face with creating the right
query strings (with the special case of untokenized fields), together
with the MultifieldQueryParser problems, MultiSearcher problems ... I
think that all together they suggest the idea of creating a
QueryParser class hierarchy.

 What do you think about that?

Query parsing/expansion is the holy grail.  There are so many ways to 
do this sort of thing that I'm mostly of the opinion it is a 
per-project customization to get it tuned for the needs of that project.

Nutch has done some nice things with query parsing/expansion and 
extensibility.

I'm all for a more extensible base to work from, no question.
I'm personally not fond of MultiFieldQueryParser - I much prefer 
aggregate fields that are indexed (not stored) to be used for 
queries.  Blindly expanding queries across fields doesn't seem that 
useful to me.
In my case it is very useful, because my search has constraints like:
1) has xxx file format attachment
2) has xxx type
3) was created by xxx
4) search in attachments or not
so ... I cannot make this customization without indexing more fields
and searching in more fields.
Creating the query string by hand-appending field:keyword pairs is just a hardly
maintainable way of reinventing the
wheel. So, in my case, MultifieldQueryParser is very useful, because
I have to add some boolean clauses
after I create the base query.
Last month I refactored the method that created the search query
for our extended search functionality.
It was a method with 200 lines of structural code (no query parser used).
Using boolean clauses and MultifieldQueryParser helped me a lot ... and
the result was a method with fewer, easily maintainable
lines of code.

Of course ... this is what my project needs, but I think that almost all
lucene indexes contain more than 2-3 fields.

So ... once again, MultifieldQueryParser is an elegant solution.
 Sergiu
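For illustration, a hedged sketch of that pattern: parse the user input over several fields, then add the extra constraints as boolean clauses (the field names and values are assumptions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ExtendedSearch {
    public static Query build(String userInput) throws Exception {
        String[] fields = { "title", "body", "attachment" };
        Query base = MultiFieldQueryParser.parse(
            userInput, fields, new StandardAnalyzer());
        BooleanQuery full = new BooleanQuery();
        full.add(base, true, false); // the user's query is required
        // extra constraint, e.g. "has pdf attachment"
        full.add(new TermQuery(new Term("fileformat", "pdf")), true, false);
        return full;
    }
}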
Erik


Re: Null or no analyzer

2004-10-20 Thread sergiu gordea
Erik Hatcher wrote:
On Oct 20, 2004, at 9:55 AM, Aviran wrote:
AFAIK if the term "Election 2004" is between quotation marks this
should
work fine.

No, it won't.  The Analyzer will analyze it, and the 
WhitespaceAnalyzer would split it into two tokens [Election] and [2004].

This is a tricky situation with no clear *best* way to do this sort of 
thing.  However, given what I've seen of this thread so far I'd 
recommend using the PerFieldAnalyzerWrapper and associate the fields 
indexed as Field.Keyword with a KeywordAnalyzer.  There have been some 
variants of this posted on the list - it is not included in the API, 
however perhaps it should be.  Or perhaps there are other options to 
solve this recurring dilemma folks have with Field.Keyword indexed 
fields and QueryParser?

Erik
I still don't understand what is wrong with the idea of indexing the
title in a separate field and searching with a phrase query
+title:"Elections 2004" ?
I think that the real problem is that the title is not tokenized and the
title contains more than "Elections 2004".

I think it is worth giving this solution a try.
Or maybe I don't understand the problem correctly ...
All the best,
Sergiu
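For reference, a minimal sketch of the KeywordAnalyzer idea Erik mentions (it is not part of the Lucene 1.4 API, so this class is an assumption): it returns the whole field value as a single token, so QueryParser leaves Field.Keyword values intact.

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class KeywordAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, final Reader reader) {
        return new TokenStream() {
            private boolean done = false;
            public Token next() throws IOException {
                if (done) return null;
                done = true;
                // read the whole field value and emit it as one token
                StringBuffer sb = new StringBuffer();
                char[] buffer = new char[256];
                int len;
                while ((len = reader.read(buffer)) > 0) {
                    sb.append(buffer, 0, len);
                }
                return new Token(sb.toString(), 0, sb.length());
            }
        };
    }
}

It would then be wired up per field, e.g. new PerFieldAnalyzerWrapper(new StandardAnalyzer()) plus addAnalyzer("subject", new KeywordAnalyzer()), so only the keyword fields bypass tokenization.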



Aviran
http://aviran.mordos.com
-Original Message-
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 2:25 AM
To: Lucene Users List
Subject: RE: Null or no analyzer
Aviran writes:
You can use WhiteSpaceAnalyzer
Can he? If "Elections 2004" is one token in the subject field (keyword),
this will fail, since WhitespaceAnalyzer will tokenize it to
`Elections'
and `2004'.
So I guess he has to write an identity analyzer himself unless there 
is one
provided (which doesn't seem to be the case). The only alternatives 
are not
using query parser or extending query parser for a key word syntax, 
as far
as I can see.




Re: Null or no analyzer

2004-10-20 Thread Sergiu Gordea
Rupinder Singh Mazara wrote:
hi
the basic problem here is that there are data sources which contain
a) id, b) text, c) title, d) authors AND e) subject heading
 

text, title and authors need to be tokenized
the subject heading can be one or more words,
 

the subject must also be tokenized, otherwise you cannot get any
results that don't match the term exactly.

so ... for example, let's assume you have the following titles:
"George Trash Elections"
"George Trash"
if you search for "George Trash" and your title is not tokenized you
will get just the second document (I hope I'm
not making any mistake when I say that, anyway it can be easily tested).

anyone searching such a datasource is expected to know the subject headings;
if the user is trying to find all articles that have the phrases
"John Kerry" and "George Bush" as well as ones that are classified as "Election
2004",
it is possible that there are other documents that are classified as "National
Service Records"
or "Tax Returns" etc...
 

how is it represented in the GUI: as a select box? or an input field?
if it is a select box, and you have the concept of a unique domain concept,
you can use a non-tokenized string, or even a numerical
representation, but I think that is not your case.
In the case of input fields .. again, I suggest you tokenize the string

so the objective is to find documents that have the above mentioned phrases as
well as one
of the subject classifiers, so as to pull out the most meaningful
documents
 

no problem ... once again ... use
+subject:"my searched subject"
the subject classifiers pretain to domain knowledge, and it is possible that
2 or more
subject classification headings are composed of the same set of words, but
the sequence
in which they appear can drastically alter the meaning; hence tokenizing the
subject field
is not exactly a healthy solution.
 

the tokenization doesn't change the word order; in case you use a
PhraseQuery you will get the correct results:

+title:"George Bush"
doesn't return documents with the title
"Bush George"
also such search tools are meant for people who know / understand  this
classification system
 

:)) This is a general truth: the results are better when the people
know what they are searching for :)

Taxonomy of animals can be taken as one such example,
hope this helps define the problem
 

I cannot see anything special in your problem.
Before starting to implement a complex solution, it is probably better
to give the simple one a chance ...
I assure you that you won't lose anything, and even if you decide to
implement complex solutions you will have
a lot of reusable code.

so ... Have fun,
 Sergiu
PS: if you can provide an example with a false positive please ... 
provide us the case
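For illustration, a quick self-test of that claim using an in-memory index (the titles are the examples from above):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;

public class PhraseOrderTest {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        Document d1 = new Document();
        d1.add(Field.Text("title", "George Bush"));
        writer.addDocument(d1);
        Document d2 = new Document();
        d2.add(Field.Text("title", "Bush George"));
        writer.addDocument(d2);
        writer.close();
        IndexSearcher searcher = new IndexSearcher(dir);
        Hits hits = searcher.search(QueryParser.parse(
            "\"George Bush\"", "title", new StandardAnalyzer()));
        // prints 1: the phrase query respects word order
        System.out.println(hits.length());
        searcher.close();
    }
}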



 

I still don't understand what is wrong with the idea of indexing the
title in a separate field and searching with a phrase query
+title:"Elections 2004" ?
I think that the real problem is that the title is not tokenized and the
title contains more than "Elections 2004".
I think it is worth giving this solution a try.
Or maybe I don't understand the problem correctly ...
All the best,
Sergiu


   

 

Aviran
http://aviran.mordos.com
-Original Message-
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 2:25 AM
To: Lucene Users List
Subject: RE: Null or no analyzer
Aviran writes:
   

You can use WhiteSpaceAnalyzer
 

Can he? If "Elections 2004" is one token in the subject field (keyword),
this will fail, since WhitespaceAnalyzer will tokenize it to
`Elections'
and `2004'.
So I guess he has to write an identity analyzer himself unless there
is one
provided (which doesn't seem to be the case). The only alternatives
are not
using query parser or extending query parser for a key word syntax,
as far
as I can see.



Re: *term search

2004-10-07 Thread sergiu gordea
[EMAIL PROTECTED] wrote:
.. and here is the way to do it:
(See attached file: SUPPOR~1.RAR)
 

Hi all,
I got from Iouli the solution to enable prefix queries (*term). In fact
you can find the solution in
the lucene source: a comment in QueryParser.jj says how to enable
prefix queries.

I did so ... but I found a lot of bugs. If you define WILDTERM as
| <WILDTERM: (<_TERM_CHAR> | ( [ "*", "?" ] ))* >
a lot of constructions will be validated, and you will get a lot of
errors ...
for example "-" and "+" are considered valid, "*" is considered
valid, and they generate TooManyBooleanClausesExceptions.

I'm not very good at creating regular expressions but I successfully use
the following construction ..

| <WILDTERM: (<_TERM_START_CHAR> (<_TERM_CHAR> | ( [ "*", "?" ] ))* )
   | ( [ "*", "?" ] <_TERM_START_CHAR> (<_TERM_CHAR> |
( [ "*", "?" ] ) )* ) >

Can anyone improve the construction and update the comment in 
QueryParser.jj?

 Thanks a lot,
 Sergiu

 
Erik Hatcher wrote on 08.09.2004 12:46, to the Lucene Users List:
 
 


On Sep 8, 2004, at 6:26 AM, sergiu gordea wrote:
 

I want to discuss a little problem, lucene doesn't support *Term like
queries.
   

First of all, this is untrue.  WildcardQuery itself most definitely
supports wildcards at the beginning.
 

I would like to use *schreiben.
   

The dilemma you've encountered is that QueryParser prevents queries
that begin with a wildcard.
 

So my question is if there is a simple solution for implementing the
functionality mentioned above.
Maybe subclassing one class and overwriting some methods will sufice.
   

It will require more than that in this case.  You will need to create a
custom parser that allows the grammar you'd like.  Feel free to use the
JavaCC source code to QueryParser as a basis of your customizations.
Erik


Re: different analyzer all produce the same index?

2004-10-04 Thread sergiu gordea
Daan Hoogland wrote:
Hi all,
I try to create different indices using different Analyzer-classes. I 
tried standard, german, russian, and cjk. They all produce exactly the 
same index file (md5-wise). There are over 280 pages so I expected at 
least some differences.

 

Take a look at the lucene source code... Maybe you will find the answer ...
I assume that all the pages you indexed were written in English;
therefore it is normal that the german, russian and cjk analyzers
create identical indexes, but they should be different from the english one
(StandardAnalyzer).

All the best,
Sergiu
Any ideas anyone?
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: BooleanQuery - Too Many Clases on date range.

2004-10-04 Thread Sergiu Gordea
Chris Fraschetti wrote:
absolutely, limiting the user's query is no problem here. I've
currently implemented the lucene javascript to catch a lot of user
queries that could cause issues.. blank queries, ? or * at the
beginning of the query, etc etc... but I couldn't think of a way to
prevent the user from doing a* but still allow comment* for users wanting comments
or commentary...  any suggestions would be warmly welcomed.
 

One cheap solution is to ask the user to enter at least 3 alphanumerical
chars.
What do you say about that?

 All the best,
 Sergiu
On Mon, 4 Oct 2004 14:08:00 -0400 (EDT), Stephane James Vaucher
[EMAIL PROTECTED] wrote:
 

Ok, got it, got a small comment though.
For large wildcard queries, please note that google does not support wild
cards. Search hell*, and there will be no correct matches with hello.
Is there a reason why you wish to allow such large queries? We might
be able to find alternative ways of helping you out. No one will use a
query a*. If someone does, the results would be completely meaningless
(many false positives for a user). However a query like program* might be
interesting to a user.
The problem with hacking term expansion is that the rules of this
expansion might be hard to define (that is, maybe one should use the
first terms, the most frequent terms, or even the least frequent, depending
on your app).
sv
On Mon, 4 Oct 2004, Chris Fraschetti wrote:
   

The date portion of my code works great now.. no problems there, so
let me thank you now for your date filter solution... but my current
problem is in regard to a standalone "a*" query giving me
the too many clauses exception.
On Mon, 4 Oct 2004 12:47:24 -0400 (EDT), Stephane James Vaucher
[EMAIL PROTECTED] wrote:
 

BTW, what's wrong with the DateFilter solution, I mentionned earlier?
I've used it before (before lucene-1.4 though) without memory problems,
thus I always assumed that it avoided the allocation problems with prefix
queries.
sv

On Mon, 4 Oct 2004, Chris Fraschetti wrote:
   

Surely some folks out there have used lucene on a large scale and have
had to compensate for this somehow, any other solutions? Morus, thank
you very more for your imput, and I am looking into your solution,
just putting my feelers out there once more.
The lucene API is very limited as to its descriptions of its
components; short of digging into the code, is there a good doc
somewhere out there that explains the workings of lucene?
On Mon, 4 Oct 2004 01:57:06 -0700, Chris Fraschetti
[EMAIL PROTECTED] wrote:
 

So before I spend a significant amount of time digging into the lucene
code, how does your experience with lucene shed light on my
situation?  Our current index is pretty huge, and with each
increase in size, I've experienced a problem like this...
Without taking up too much of your time.. because obviously this is my
task, I thought I'd ask you if you'd had any experience with this
boolean clause nonsense...  of course it can be overcome, but if you
know a quick hack, awesome, otherwise.. no big, but off to work i go
:)
-Fraschetti
-- Forwarded message --
From: Morus Walter [EMAIL PROTECTED]
Date: Mon, 4 Oct 2004 09:01:50 +0200
Subject: Re: BooleanQuery - Too Many Clases on date range.
To: Lucene Users List [EMAIL PROTECTED], Chris
Fraschetti [EMAIL PROTECTED]
Chris Fraschetti writes:
   

So I decided to move my epoch date to the 20040608 date, which fixed
my boolean query problem in regards to my current data size (approx
600,000) 
but now as soon as I do a query like ...  a*
I get the boolean error again. Google obviously can handle this query,
and I'm pretty sure lucene can handle it.. any ideas? With or
without a date range specified I still get the TooManyClauses error.
 

   

I tried cranking maxClauses up to Integer.MAX_VALUE, but java gave me
an out of memory error. Is this b/c the boolean search tried to
allocate that many clauses by default or because my query actually
needed that many clauses?
 

boolean search allocates clauses for all tokens having the prefix or
matching the wildcard expression.
   

Why does it work on small indexes but not
large?
 

Because there are fewer tokens starting with a.
   

Is there any way to have the parser create as many clauses as
it can and then search with what it has? w/o recompiling the source?
 

You need to create your own version of Wildcard- and Prefix-Query
that takes a maximum term number and ignores further clauses.
And you need a variant of the query parser that uses these queries.
This can be done, even without recompiling lucene, but you will have to
do some programming at the level of lucene queries.
Shouldn't be hard, since you can use the sources as a starting point.
I guess this does not exist because the lucene developers decided to prefer
a query error rather than incomplete results.
Morus
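For illustration, a hedged sketch of the expansion Morus describes (the class name is mine and this is not part of Lucene; a real implementation would subclass PrefixQuery and override rewrite(IndexReader)):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class LimitedPrefixExpander {
    // expands a prefix into at most maxTerms optional term clauses
    // instead of throwing TooManyClauses
    public static Query expand(IndexReader reader, Term prefix, int maxTerms)
            throws IOException {
        BooleanQuery query = new BooleanQuery();
        TermEnum terms = reader.terms(prefix);
        try {
            int n = 0;
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(prefix.field())
                        || !t.text().startsWith(prefix.text())) {
                    break; // past the last term with this prefix
                }
                query.add(new TermQuery(t), false, false);
                n++;
            } while (n < maxTerms && terms.next());
        } finally {
            terms.close();
        }
        return query;
    }
}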
--

Re: Using lucene in Tomcat

2004-09-28 Thread sergiu gordea
mahaveer jain wrote:
Hi all,
I have implemented lucene search for my documents and db successfully.
Now my problem is: the index I created is written to my local disk; I want the index
to be created relative to my server.
Right now I index C:/tomcat/webapps/jetspeed/document, but I want to index wrt /jetspeed/document.
 

maybe you can create a small method to convert
/jetspeed/document to a File object, then call its toString() method and pass that to the indexer.
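For illustration, a hedged sketch using the Servlet API to resolve the webapp-relative path before indexing, assuming the indexer runs inside the jetspeed webapp ("/document" and the index location are assumptions):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class IndexerServlet extends HttpServlet {
    public void init() throws ServletException {
        try {
            // resolves "/document" against wherever the webapp is deployed,
            // so the code no longer hardcodes C:/tomcat/webapps/...
            String docDir = getServletContext().getRealPath("/document");
            IndexWriter writer =
                new IndexWriter(docDir + "/index", new StandardAnalyzer(), true);
            writer.close();
        } catch (IOException e) {
            throw new ServletException(e);
        }
    }
}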


Let me know if someone ...

		
 




Re: indexes won't close on windows

2004-09-20 Thread sergiu gordea
 Hi Fred,
I think that we can help you if you provide us with your code, and the
context in which it is used.
We need to see how you open and close the searcher and the reader, and
what operations you are doing on the index.

 All the best,
 Sergiu

Fred Toth wrote:
Hi,
I have built a nice lucene application on linux with no problems,
but when I ported to windows for the customer, I started experiencing
problems with the index not closing. This prevents re-indexing.
I'm using lucene 1.4.1 under tomcat 5.0.28.
My search operation is very simple and works great:
create reader
create searcher
do search
extract N docs from hits
close searcher
close reader
However, on several occasions, when trying to re-index, I get
can't delete file errors from the indexer. I discovered that restarting
tomcat clears the problem. (Note that I'm recreating the index
completely, not updating.)
I've spent the last couple of hours trolling the archives and I've
found numerous references to windows problems with open files.
Is there a fix for this? How can I force the files to close? What's
the best work-around?
Many thanks,
Fred


Re: indexes won't close on windows

2004-09-20 Thread sergiu gordea
Hi Fred,
That's right, there are many references to this kind of problem in the
lucene-user list.
These suggestions were already made, but I'll list them once again:

1. One way to use the IndexSearcher is to use your code, but I don't
encourage users to do that:
   IndexReader reader = null;
   IndexSearcher searcher = null;
   reader = IndexReader.open(indexName);
   searcher = new IndexSearcher(reader);

   It's better to use the constructor that takes a String to create an
IndexSearcher: IndexSearcher(String path). I
even suggest that the path be obtained as

File indexFolder = new File("luceneIndex");
IndexSearcher searcher = new IndexSearcher(indexFolder.toString());
2. I can imagine situations when the lucene index must be created at
each startup, but I think that this is very rare,
so I suggest using code like

if (indexExists(indexFolder))
   writer = new IndexWriter(index, new StandardAnalyzer(), false);
else
   writer = new IndexWriter(index, new StandardAnalyzer(), true);
// don't forget to close the indexWriter when you create the index and to
open it again

I use an indexExists function like

boolean indexExists(File indexFolder) {
   return indexFolder.exists();
}

and it works properly ... even if that's not the best example of
testing the existence of the index.

3. 'It is here that I get a failure, can't delete _b9.cfs'
that's probably because of the way you use the searcher, and probably
because you don't close the readers, writers and searchers properly.
4. Be sure that all close() methods are guarded with
   catch (Exception e) {
       logger.log(e);
   } blocks.

5. Pay attention if you use a multithreading environment; in this case
you have to make indexing, deletion and search synchronized.

  So ...
 Have fun,
   Sergiu
PS: I think that I'll submit some code with synchronized
index/delete/search operations and tell why I need to use it.
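For illustration, a hedged sketch of point 5: one object through which all index operations pass, so indexing, deletion and search never overlap (paths and error handling simplified):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SynchronizedIndex {
    private final String path;

    public SynchronizedIndex(String path) {
        this.path = path;
    }

    public synchronized void add(Document doc) throws Exception {
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), false);
        try {
            writer.addDocument(doc);
        } finally {
            writer.close();
        }
    }

    public synchronized void delete(Term term) throws Exception {
        IndexReader reader = IndexReader.open(path);
        try {
            reader.delete(term);
        } finally {
            reader.close();
        }
    }

    public synchronized Hits search(Query query) throws Exception {
        IndexSearcher searcher = new IndexSearcher(path);
        Hits hits = searcher.search(query);
        // note: Hits loads documents lazily, so a production version should
        // copy the needed fields out before closing the searcher
        searcher.close();
        return hits;
    }
}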

Fred Toth wrote:
Hi Sergiu,
My searches take place in tomcat, in a struts action, in a single method
Abbreviated code:
IndexReader reader = null;
IndexSearcher searcher = null;
reader = IndexReader.open(indexName);
  searcher = new IndexSearcher(reader);
// code to do a search and extract hits, works fine.
searcher.close();
  reader.close();
I have a command-line indexer that is a minor modification of the
IndexHTML.java that comes with Lucene. It does this:
writer = new IndexWriter(index, new StandardAnalyzer(), create);
// add docs
(with the create flag set true). It is here that I get a failure, 
can't delete _b9.cfs
or similar. This happens when tomcat is completely idle (we're still 
testing and
not live), so all readers and searchers should be closed, at least as
far as
java is concerned. But windows will not allow the indexer to delete 
the old index.

I restarted tomcat and the problem cleared. It's as if the JVM on 
windows doesn't
get the file closes quite right.

I've seen numerous references on this list to similar behavior, but 
it's not clear
what the fix might be.

Many thanks,
Fred
At 02:32 AM 9/20/2004, you wrote:
 Hi Fred,
I think that we can help you if you provide us your code, and the 
context in which it is used.
we need to see how you open and close the searcher and the reader, 
and what operations are you doing on index.

 All the best,
 Sergiu

Fred Toth wrote:
Hi,
I have built a nice lucene application on linux with no problems,
but when I ported to windows for the customer, I started experiencing
problems with the index not closing. This prevents re-indexing.
I'm using lucene 1.4.1 under tomcat 5.0.28.
My search operation is very simple and works great:
create reader
create searcher
do search
extract N docs from hits
close searcher
close reader
However, on several occasions, when trying to re-index, I get
can't delete file errors from the indexer. I discovered that 
restarting
tomcat clears the problem. (Note that I'm recreating the index
completely, not updating.)

I've spent the last couple of hours trolling the archives and I've
found numerous references to windows problems with open files.
Is there a fix for this? How can I force the files to close? What's
the best work-around?
Many thanks,
Fred


Re: indexes won't close on windows

2004-09-20 Thread sergiu gordea
Fred Toth wrote:
Hi Sergiu,
Thanks for your suggestions. I will try using just the 
IndexSearcher(String...)
and see if that makes a difference in the problem. I can confirm that
I am doing a proper close() and that I'm checking for exceptions. Again,
the problem is not with the search function, but with the command-line
indexer. It is not run at startup, but on demand when the index needs
to be recreated.

Thanks,
Fred
I remember there was one case where the searcher was used in the way you
use it, but without keeping the
named reference to the index reader. This is not your case.

Why do you get "It is here that I get a failure, can't delete _b9.cfs"?
Are you trying to delete the index folder sometimes, or ... why?
Maybe one object is still using the index when you try to delete it.
Do you write your errors in log files?
It would be very helpful to have a stack trace.
All the best,
Sergiu


At 08:50 AM 9/20/2004, you wrote:
Hi Fred,
That's right, there are many references to this kind of problem in
the lucene-user list.
These suggestions were already made, but I'll list them once again:

1. One way to use the IndexSearcher is to use your code, but I don't
encourage users to do that:
   IndexReader reader = null;
   IndexSearcher searcher = null;
   reader = IndexReader.open(indexName);
   searcher = new IndexSearcher(reader);

   It's better to use the constructor that takes a String to create an
IndexSearcher: IndexSearcher(String path).
I even suggest that the path be obtained as

File indexFolder = new File("luceneIndex");
IndexSearcher searcher = new IndexSearcher(indexFolder.toString());
2. I can imagine situations when the Lucene index must be created at 
each startup, but I think that this is very rare,
so I suggest using code like

if (indexExists(indexFolder))
   writer = new IndexWriter(index, new StandardAnalyzer(), false);
else
   writer = new IndexWriter(index, new StandardAnalyzer(), true);
// don't forget to close the IndexWriter when you create the index and 
to open it again

I use an indexExists function like
boolean indexExists(File indexFolder) {
   return indexFolder.exists();
}
and it works properly, even if that's not the best example of 
testing the existence of the index

3. 'It is here that I get a failure, can't delete _b9.cfs'
that's probably because of the way you use the searcher, and probably 
because you don't close the readers, writers and searchers properly.
4. be sure that all close() calls are guarded with
   catch (Exception e) {
     logger.log(e);
   } blocks (see the sketch after this list)

5. Pay attention if you use a multithreading environment; in this 
case you have to make indexing, deletion and search synchronized

  So ...
 Have fun,
   Sergiu
PS: I think that I'll submit some code with synchronized 
index/delete/search operations and explain why I need to use it.
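For point 4, a minimal sketch of what I mean (the index path and the logger are just illustrative):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

IndexReader reader = null;
IndexSearcher searcher = null;
try {
    reader = IndexReader.open("/path/to/index");
    searcher = new IndexSearcher(reader);
    // ... run the search and read the hits here ...
} finally {
    // guard each close() so one failure cannot leak the other handle
    if (searcher != null) {
        try { searcher.close(); } catch (Exception e) { logger.log(e); }
    }
    if (reader != null) {
        try { reader.close(); } catch (Exception e) { logger.log(e); }
    }
}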

Fred Toth wrote:
Hi Sergiu,
My searches take place in tomcat, in a struts action, in a single 
method
Abbreviated code:

IndexReader reader = null;
IndexSearcher searcher = null;
reader = IndexReader.open(indexName);
  searcher = new IndexSearcher(reader);
// code to do a search and extract hits, works fine.
searcher.close();
  reader.close();
I have a command-line indexer that is a minor modification of the
IndexHTML.java that comes with Lucene. It does this:
writer = new IndexWriter(index, new StandardAnalyzer(), 
create);
// add docs

(with the create flag set true). It is here that I get a failure, 
"can't delete _b9.cfs"
or similar. This happens when tomcat is completely idle (we're still 
testing and
not live), so all readers and searchers should be closed, at least 
as far as
Java is concerned. But Windows will not allow the indexer to delete 
the old index.

I restarted tomcat and the problem cleared. It's as if the JVM on 
Windows doesn't
get the file closes quite right.

I've seen numerous references on this list to similar behavior, but 
it's not clear
what the fix might be.

Many thanks,
Fred
At 02:32 AM 9/20/2004, you wrote:
 Hi Fred,
I think that we can help you if you provide us your code and the 
context in which it is used.
We need to see how you open and close the searcher and the reader, 
and what operations you are doing on the index.

 All the best,
 Sergiu

Fred Toth wrote:
Hi,
I have built a nice lucene application on linux with no problems,
but when I ported to windows for the customer, I started experiencing
problems with the index not closing. This prevents re-indexing.
I'm using lucene 1.4.1 under tomcat 5.0.28.
My search operation is very simple and works great:
create reader
create searcher
do search
extract N docs from hits
close searcher
close reader
However, on several occasions, when trying to re-index, I get
can't delete 

Re: QueryParser.parse() and Lucene1.4.1

2004-09-17 Thread sergiu gordea
Hi Polina,
It seems to me that your query string is not correct ...
(A AND -(B))
AND = +
NOT = -
In Lucene, the AND and NOT operators are mapped internally to +/-
(AND and NOT are supported only because they come from natural language),
so ...
A + - (B) makes no sense ...
Sergiu
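You can check what the parser produces by printing the parsed query; a quick sketch (the terms foo/bar are just placeholders, and parse() throws ParseException):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

Query q = QueryParser.parse("Field:(foo AND -(bar))", "Field", new StandardAnalyzer());
// print the parsed form so you can compare what 1.3 and 1.4.1 produce
System.out.println(q.toString("Field"));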


Polina Litvak wrote:
I have a question regarding QueryParser and lucene-1.4.1.jar:
When using lucene-1.3-final.jar, a query of the form: Field:(A AND -(B))
was parsed into +Field:A -Field:B (using QueryParser.parse()). 
After making the switch to lucene-1.4.1.jar, the same query is being
parsed into Field:A Field:- Field:B which is not the desired outcome.

Does anyone know how to work around this new feature ?
 



Thanks,
Polina
 




Re: Search PharseQuery

2004-09-14 Thread sergiu gordea
String queryString = "\"what is java\"";
Query q = QueryParser.parse(queryString, "field", new StandardAnalyzer());
System.out.println(q.toString());
This is enough to get started; consult the Lucene API for more information
  Sergiu
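If you prefer to build the query programmatically instead of parsing it, a minimal sketch (the field name and index path are just examples; the terms must match what your analyzer produced at index time):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;

IndexSearcher searcher = new IndexSearcher("/path/to/index");
PhraseQuery pq = new PhraseQuery();
pq.add(new Term("contents", "what")); // terms are added in phrase order
pq.add(new Term("contents", "is"));
pq.add(new Term("contents", "java"));
Hits hits = searcher.search(pq);
System.out.println(hits.length() + " hits");
searcher.close();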
Natarajan.T wrote:
Hi,
Thanks for your mail, but that link only explains the theory; I need some
sample code.
Regards,
Natarajan.
-Original Message-
From: Cocula Remi [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 14, 2004 2:58 PM
To: Lucene Users List
Subject: RE: Search PharseQuery

Use QueryParser. 
please take a look at
http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html
It's pretty clear.

-Message d'origine-
De : Natarajan.T [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 septembre 2004 11:26
À : 'Lucene Users List'
Objet : Search PharseQuery
Hi All,

How do I implement the PhraseQuery API? Pls send me some sample code. (How
can I handle "java is platform" as a single word?)
 

Regards,
Natarajan.


 




Re: Search PharseQuery

2004-09-14 Thread sergiu gordea
Natarajan.T wrote:
Hi,
Thanks for your response.
For example the search keyword is like below...
Language "what is java"
Token 1:  language
Token 2: "what is java" (like Google)
Regards,
Natarajan.
 

Lucene works exactly as you describe above, with a small correction ...
The analyzer has a list of stop words, and I bet "is" is one of 
them for your analyzer.
I don't mind about this right now, so I won't dig for a solution to 
this problem, but the resolution
should be searched for around the Analyzer classes.
All the best,

 Sergiu



-Original Message-
From: Aad Nales [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 14, 2004 5:19 PM
To: 'Lucene Users List'
Subject: RE: Search PharseQuery

Hi,
Not sure if this is what you need, but I created a lastname filter which
in Dutch means potential double last names like "van der Vaart". In
order to process these I created a finite state machine that queried
these last names. Since I only needed the filter at 'index' time and I
never use it for querying, this may not be what you are looking for.
It should be simple to index 'what is java' as a single token and to
search for that same token. However, you will need to create a list of
accepted 'tokens'. If this is what you need let me know, I will make the
code available...
cheers,
Aad Nales
-Original Message-
From: Honey George [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, 14 September, 2004 13:39
To: Lucene Users List
Subject: Re: Search PharseQuery

--- Natarajan.T [EMAIL PROTECTED]
wrote: 
 

Hi All,

How do I implement PharseQuery API?
   

What exactly do you mean by implement? Are you trying to
extend the current behavior or only trying to find out
the usage?
Thanks,
 George




 




Re: Search PharseQuery

2004-09-14 Thread sergiu gordea
Natarajan.T wrote:
Ok, you are correct ...
Suppose I type "what java" -- then how can I handle that...
 

You don't have to handle it, Lucene does it. If you don't like how 
Lucene handles it then you may extend
the functionality.

If you use the same analyzer for indexing and searching then you will 
find the results with both search strings:

"what java" and "what is java".
At least I obtain them in both cases. 
That's right, you will obtain 
"what java" if you search for "what is java"; in my case this is acceptable.

If it is not acceptable in your project, I suggest trying to create a new Analyzer.
 I wish you luck,
 Sergiu
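A quick way to see the stop word effect, assuming the default StandardAnalyzer stop list (a sketch; parse() throws ParseException):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

Query q = QueryParser.parse("what is java", "text", new StandardAnalyzer());
System.out.println(q.toString("text")); // prints: what java  ("is" was dropped as a stop word)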



Regards,
Natarajan.
-Original Message-
From: sergiu gordea [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 14, 2004 7:38 PM
To: Lucene Users List
Subject: Re: Search PharseQuery

Natarajan.T wrote:
 

Hi,
Thanks for your response.
For example the search keyword is like below...
Language "what is java"
Token 1:  language
Token 2: "what is java" (like Google)
Regards,
Natarajan.

   

Lucene works exactly as you describe above, with a small correction ...
The analyzer has a list of stop words, and I bet "is" is one of
them for your analyzer.
I don't mind about this right now, so I won't dig for a solution to
this problem, but the resolution
should be searched for around the Analyzer classes.
All the best,
 Sergiu

 

-Original Message-
From: Aad Nales [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 14, 2004 5:19 PM
To: 'Lucene Users List'
Subject: RE: Search PharseQuery

Hi,
Not sure if this is what you need, but I created a lastname filter which
in Dutch means potential double last names like "van der Vaart". In
order to process these I created a finite state machine that queried
these last names. Since I only needed the filter at 'index' time and I
never use it for querying, this may not be what you are looking for.
It should be simple to index 'what is java' as a single token and to
search for that same token. However, you will need to create a list of
accepted 'tokens'. If this is what you need let me know, I will make the
code available...
cheers,
Aad Nales
-Original Message-
From: Honey George [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, 14 September, 2004 13:39
To: Lucene Users List
Subject: Re: Search PharseQuery

--- Natarajan.T [EMAIL PROTECTED]
wrote: 

   

Hi All,

How do I implement PharseQuery API?
  

 

What exactly do you mean by implement? Are you trying to
extend the current behavior or only trying to find out
the usage?
Thanks,
George





   


 




Re: OutOfMemory example

2004-09-13 Thread sergiu gordea
I have a few comments regarding your code ...
1. Why do you use a RAMDirectory and not the hard disk?
2. as John said, you should reuse the index instead of creating it each 
time in the main function
   IndexWriter writer;
   if (!indexExists(indexFile))
       writer = new IndexWriter(directory, new StandardAnalyzer(), true);
   else
       writer = new IndexWriter(directory, new StandardAnalyzer(), false);
   (in some cases indexExists can be as simple as verifying if the file 
exists on the hard disk)

3. you iterate in a loop over 10,000 times and you create a lot of objects

for (int i = 0; i < 365 * 30; i++) {
   Document doc = new Document();
   doc.add(Field.Keyword("date", df.format(new Date(c.getTimeInMillis()))));
   doc.add(Field.Keyword("id", "AB" + String.valueOf(i)));
   doc.add(Field.Text("text", "Tohle je text " + i));
   writer.addDocument(doc);

   c.add(Calendar.DAY_OF_YEAR, 1);
   }
all the underlined lines of code create new objects, and all of them are 
kept in memory.
This is a lot of memory allocated only by this loop. I think that you 
create more than 100,000 objects in this loop ...
What do you think?
And none of them can be released (collected by the GC) until you close 
the index writer.

No one says that your code is complicated, but all programmers should 
understand that this is a poor design...
And ... more than that, your information is kept in a RAMDirectory ... 
when you close the writer you will still keep the information 
in memory ...

Sorry if I was too aggressive with my comments ... but ... I cannot see 
what you were thinking when you wrote that code ...

If you are trying to make a test ... then I suggest you replace the 
hard-coded 365 value with a variable, to iterate over it and to 
test the power of your machine
(PC + JVM) :))

I wish you luck,
Sergiu


Ji Kuhn wrote:
I disagree, or I don't understand. 

I can change the code as it is shown below. Now I must reopen the index to see the 
changes, but the memory problem remains. I really don't know what I'm doing wrong; the 
code is so simple.
Jiri.
...
   public static void main(String[] args) throws IOException
   {
       Directory directory = create_index();
       for (int i = 1; i < 100; i++) {
           System.err.println("loop " + i + ", index version: " + 
IndexReader.getCurrentVersion(directory));
           search_index(directory);
           add_to_index(directory, i);
       }
   }
   private static void add_to_index(Directory directory, int i) throws IOException
   {
       IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), false);
       SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
       Document doc = new Document();
       doc.add(Field.Keyword("date", df.format(new Date(System.currentTimeMillis()))));
       doc.add(Field.Keyword("id", "CD" + String.valueOf(i)));
       doc.add(Field.Text("text", "Tohle neni text " + i));
       writer.addDocument(doc);
       System.err.println("index size: " + writer.docCount());
       writer.close();
   }
...
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 3:25 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
You should reuse your old index (as e.g. an application variable) unless 
it has changed - use getCurrentVersion to check the index for updates. 
This has come up before.

John
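A minimal sketch of John's suggestion: keep one cached searcher and reopen it only when the index version changes (all names here are illustrative):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

class SearcherCache {
    private long version = -1;
    private IndexSearcher searcher;

    synchronized IndexSearcher get(Directory directory) throws IOException {
        long current = IndexReader.getCurrentVersion(directory);
        if (searcher == null || current != version) {
            if (searcher != null) {
                searcher.close(); // drop the stale searcher
            }
            searcher = new IndexSearcher(IndexReader.open(directory));
            version = current;
        }
        return searcher;
    }
}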
 




Re: OutOfMemory example

2004-09-13 Thread sergiu gordea
then probably it's my mistake ... I haven't read all the emails in the thread.
So ... your goal is to produce errors ... I try to avoid them :))
  All the best,
 Sergiu
 

Ji Kuhn wrote:
You don't see the point of my post. I sent an application which everyone can run with 
only the Lucene jar, and which deterministically produces an OutOfMemoryError.
That's all.
Jiri.
-Original Message-
From: sergiu gordea [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 5:16 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
I have a few comments regarding your code ...
1. Why do you use a RAMDirectory and not the hard disk?
2. as John said, you should reuse the index instead of creating it each 
time in the main function
   IndexWriter writer;
   if (!indexExists(indexFile))
       writer = new IndexWriter(directory, new StandardAnalyzer(), true);
   else
       writer = new IndexWriter(directory, new StandardAnalyzer(), false);
   (in some cases indexExists can be as simple as verifying if the file 
exists on the hard disk)

3. you iterate in a loop over 10,000 times and you create a lot of objects

for (int i = 0; i < 365 * 30; i++) {
   Document doc = new Document();
   doc.add(Field.Keyword("date", df.format(new Date(c.getTimeInMillis()))));
   doc.add(Field.Keyword("id", "AB" + String.valueOf(i)));
   doc.add(Field.Text("text", "Tohle je text " + i));
   writer.addDocument(doc);

   c.add(Calendar.DAY_OF_YEAR, 1);
   }
all the underlined lines of code create new objects, and all of them are 
kept in memory.
This is a lot of memory allocated only by this loop. I think that you 
create more than 100,000 objects in this loop ...
What do you think?
And none of them can be released (collected by the GC) until you close 
the index writer.

No one says that your code is complicated, but all programmers should 
understand that this is a poor design...
And ... more than that, your information is kept in a RAMDirectory ... 
when you close the writer you will still keep the information 
in memory ...

Sorry if I was too aggressive with my comments ... but ... I cannot see 
what you were thinking when you wrote that code ...

If you are trying to make a test ... then I suggest you replace the 
hard-coded 365 value with a variable, to iterate over it and to 
test the power of your machine
(PC + JVM) :))

I wish you luck,
Sergiu


Ji Kuhn wrote:
 

I disagree, or I don't understand. 

I can change the code as it is shown below. Now I must reopen the index to see the 
changes, but the memory problem remains. I really don't know what I'm doing wrong; the 
code is so simple.
Jiri.
...
  public static void main(String[] args) throws IOException
  {
      Directory directory = create_index();
      for (int i = 1; i < 100; i++) {
          System.err.println("loop " + i + ", index version: " + 
IndexReader.getCurrentVersion(directory));
          search_index(directory);
          add_to_index(directory, i);
      }
  }
  private static void add_to_index(Directory directory, int i) throws IOException
  {
      IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), false);
      SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
      Document doc = new Document();
      doc.add(Field.Keyword("date", df.format(new Date(System.currentTimeMillis()))));
      doc.add(Field.Keyword("id", "CD" + String.valueOf(i)));
      doc.add(Field.Text("text", "Tohle neni text " + i));
      writer.addDocument(doc);
      System.err.println("index size: " + writer.docCount());
      writer.close();
  }
...
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 3:25 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
You should reuse your old index (as e.g. an application variable) unless 
it has changed - use getCurrentVersion to check the index for updates. 
This has come up before.

John

   


 




Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-10 Thread sergiu gordea

I reckon there has been a discussion (and solution :-) on how to achieve the
functionality you've been
after:
http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1798116
I'm not sure if this would be the same though.
Best regards,
René
 

Hi all,
I took the code indicated by Rene, but I've seen that it's not completely 
fitting my requirements, because my application should
provide the facility to treat queries as fuzzy queries. So I 
modified the code to the following one, and I added a test main method.
Hope it helps someone.


package org.apache.lucene;
/* @(#) CWK 1.5 10.09.2004
*
* Copyright 2003-2005 ConfigWorks Informationssysteme & Consulting GmbH
* Universitätsstr. 94/7 9020 Klagenfurt Austria
* www.configworks.com
* All rights reserved.
*/
import java.util.Vector;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.Query;
/**
* @author sergiu
* this class is a patch for MultifieldQueryParser
* it's behaviour can be tested by running the main method
*
* 
Now:

String[] fields = new String[] { "title", "abstract", "content" };
QueryParser parser = new CustomQueryParser(fields, new SimpleAnalyzer());
parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
Query query = parser.parse("foo -bar (baz OR title:bla)");
System.out.println("? " + query);
Produces:
? +(title:foo abstract:foo content:foo) -(title:bar abstract:bar
content:bar) +((title:baz abstract:baz content:baz) title:bla)
Perfect!
* @version 1.0
* @since CWK 1.5
*/
public class CustomQueryParser extends QueryParser{
  private String[] fields;
  private boolean fuzzySearch = false;
  public CustomQueryParser(String[] fields, Analyzer analyzer){
super(null, analyzer);
this.fields = fields;
  }
  public CustomQueryParser(String[] fields, Analyzer analyzer, int 
defaultOperator){
  super(null, analyzer);
  this.fields = fields;
  setOperator(defaultOperator);
  }

  protected Query getFieldQuery(String field, Analyzer analyzer, String 
queryText)
throws ParseException{
   
Query query = null;
   
if (field == null){
  Vector clauses = new Vector();
  for (int i = 0; i < fields.length; i++){
  if(isFuzzySearch())
  clauses.add(new 
BooleanClause(super.getFuzzyQuery(fields[i], queryText), false, false));
  else
  clauses.add(new 
BooleanClause(super.getFieldQuery(fields[i], analyzer, queryText), 
false, false));
 
  }
  query = getBooleanQuery(clauses); 
}else{
if (isFuzzySearch())
query = super.getFuzzyQuery(field, queryText);
else
query = super.getFieldQuery(field, analyzer, 
queryText);

}
return query;
  }
 
  public boolean isFuzzySearch() {
  return fuzzySearch;
  }
 
  public void setFuzzySearch(boolean fuzzySearch) {
  this.fuzzySearch = fuzzySearch;
  }

  public static void main(String[] args) throws Exception{
  String[] fields = new String[] { "title", "abstract", "content" };
  CustomQueryParser parser = new CustomQueryParser(fields, new 
StandardAnalyzer());
  parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
  parser.setFuzzySearch(true);
 
  String queryString = "foo -bar (baz OR title:bla)";
  System.out.println(queryString);
  Query query = parser.parse(queryString);
  System.out.println("? " + query);

  }
}


Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-09 Thread sergiu gordea
Hi Bill,
 I think that more people are waiting for this patch to MultiFieldQueryParser.
 It would be nice if it were included in the next release candidate 


   All the best,
  Sergiu
Bill Janssen wrote:
René,
Thanks for your note.
I'd think that if a user specified a query "cutting lucene", with an
implicit AND and the default fields "title" and "author", they'd
expect to see a match in which both cutting and lucene appear.  That is,
(title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
Instead, what they'd get using the current (broken) strategy of outer
combination used by the current MultiFieldQueryParser would be
(title:cutting OR title:lucene) AND (author:cutting OR author:lucene)
Note that this would match even if only "lucene" occurred in the
document, as long as it occurred both in the title field and in the
author field.  Or, for that matter, it would also match "Cutting on
Cutting", by Doug Cutting :-).
 

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1798116
   

Yes, the approach there is similar.  I attempted to complete the
solution and provide a working replacement for MultiFieldQueryParser.
Bill
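For comparison, the shape of the inner-level combination built by hand for the two-field example above (just a sketch, not the parser code itself):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// one inner clause per search term, expanded over all default fields
BooleanQuery cutting = new BooleanQuery();
cutting.add(new TermQuery(new Term("title", "cutting")), false, false);
cutting.add(new TermQuery(new Term("author", "cutting")), false, false);

BooleanQuery lucene = new BooleanQuery();
lucene.add(new TermQuery(new Term("title", "lucene")), false, false);
lucene.add(new TermQuery(new Term("author", "lucene")), false, false);

BooleanQuery query = new BooleanQuery();
query.add(cutting, true, false); // required
query.add(lucene, true, false);  // required
// prints as: +(title:cutting author:cutting) +(title:lucene author:lucene)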
 




Re: Handling user queries (Was: Re: MultiFieldQueryParser seems broken... Fix attached.)

2004-09-09 Thread sergiu gordea
René Hackl wrote:
is it a problem if the users search "coffee OR tea" as a search 
string, in the case that MultiFieldQueryParser is
modified as Bill suggested and the default operator is set to AND?
   

No. There's not a problem with the proposed correction to MFQP. MFQP should
work the way Bill suggested.
My babbling about coffee or tea was aimed more at Bill's remark that the darn
users started demanding nifty features. So this is a totally different
matter. In my experience, many users fall into everyday language traps, like
in: "What do you want to drink, coffee or tea?" The answer normally isn't
'yes' to both, is it?  

 

this problem may be solved if the users know what the 
following signs mean:
- + * ~
this will improve the results better than our parsing does ...

I have an app where in some cases I make subqueries for an initial
user-stated query. The aim is to come up with pointers to partial matching
docs. The background is, one ill-advised NOT can ruin a query. But this has
nothing to do with MFQP. Just random thoughts about making users happy even
when they are new to formulating queries :-)
Cheers,
René
 






Re: lucene index parser problem

2004-09-08 Thread sergiu gordea
maybe you should encode the html code ...
Patrick Burleson wrote:
Why oh why did you send this to the tomcat lists?
Don't cross post! Especially when the question doesn't even apply to
one of the lists.
Patrick
On Tue, 7 Sep 2004 16:35:35 -0400, hui liu [EMAIL PROTECTED] wrote:
 

Hi,
I have such a problem when creating lucene index for many html files:
It shows aborted, expectedtagnametagend for those html files
which contain java scripts. It seems it cannot parse the tags  \.
   

?? is "\" valid in a tag? I think it should be "/"
Do you want to index the whole HTML file, or just the information in these 
files?
Maybe you should use an HTML-to-text converter, and then index the resulting 
text.

All the best,
 Sergiu
Does anyone have a solution?
Thank you very very much...!!!
Ivy.
   

 




*term search

2004-09-08 Thread sergiu gordea

Hi all,
I want to discuss a little problem: Lucene doesn't support *Term-like 
queries.
I know that this can bring a lot of results into memory and therefore 
it is restricted.

I think that allowing this kind of search and limiting the amount of 
returned results would be
a more useful approach, since the German language has a lot of words 
that are concatenated or
derived from other words by using a prefix.

I'm not a good German speaker, but I can say that maybe half of the 
German words fall into the
category described above.

for example:
Himbeer, Erdbeer, Johannisbeer -- all of them are fruits from a certain 
category, so it would make sense to search
for *beer. Also ... when I know that the word ends in beer but I 
don't know the exact word ...
*beer will help me a lot.

also:
schreiben = to write
beschreiben = to describe
verschreiben = to prescribe ..
I would like to use *schreiben.
So my question is whether there is a simple solution for implementing the 
functionality mentioned above.
Maybe subclassing one class and overriding some methods will suffice.

Thanks in advance,
Sergiu
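Note that it is only QueryParser that refuses the leading wildcard; WildcardQuery itself accepts one, so something like this sketch should work (the field name and index path are illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.WildcardQuery;

IndexSearcher searcher = new IndexSearcher("/path/to/index");
// a leading wildcard enumerates many terms, so expect it to be slow on big indexes
Hits hits = searcher.search(new WildcardQuery(new Term("text", "*schreiben")));
System.out.println(hits.length() + " hits");
searcher.close();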



Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-08 Thread sergiu gordea
The class is at the end of the message.
But I think that a better solution is the one suggested by Rene: 

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1798116
Wermus Fernando wrote:
Bill,
I don't receive any .java. Could you send it again?
Thanks.
-Mensaje original-
De: Bill Janssen [mailto:[EMAIL PROTECTED] 
Enviado el: Martes, 07 de Septiembre de 2004 10:06 p.m.
Para: Lucene Users List
CC: Ali Rouhi
Asunto: MultiFieldQueryParser seems broken... Fix attached.

Hi!
I'm using Lucene for an application which has lots of fields/document,
in which the users can specify in their config files what fields they
wish to be included by default in a search.  I'd been happily using
MultiFieldQueryParser to do the searches, but the darn users started
demanding more Google-like searches; that is, they want the search
terms to be implicitly AND-ed instead of implicitly OR-ed.  No
problem, thinks I, I'll just set the operator.
Only to find this has no effect on MultiFieldQueryParser.
Once I looked at the code, I find that MultiFieldQueryParser combines
the clauses at the wrong level -- it combines them at the outermost
level instead of the innermost level.  This means that if you have two
fields, author and title, and the search string cutting lucene,
you'll get the final query
  (title:cutting title:lucene) (author:cutting author:lucene)
If the search operator is OR, this isn't a problem.  But if it is AND,
you have two problems.  The first is that MultiFieldQueryParser seems
to ignore the operator entirely.  But even if it didn't, the second
problem is that the query formed would be
  +(title:cutting title:lucene) +(author:cutting author:lucene)
That is, if the word Lucene was in both the author field and the
title field, the match would fit.  This clearly isn't what the
searcher intended.
You can re-write MultiFieldQueryParser, as I've done in the example
code which I append here.  This little program allows you to run
either my parser (-DSearchTest.QueryParser=new) or the old parser
(-DSearchTest.QueryParser=old).  It allows you to use either OR
(-DSearchTest.QueryDefaultOperator=or) or AND
(-DSearchTest.QueryDefaultOperator=and) as the operator.  And it
allows you to pick your favorite set of default search terms
(-DSearchTest.QueryDefaultFields=author:title:body, for example).  It
takes one argument, a query string, and outputs the re-written query
after running it through the query parser.  So to evaluate the above
query:
% java -classpath /import/lucene/lucene-1.4.1.jar:. \
  -DSearchTest.QueryDefaultFields=title:author \
  -DSearchTest.QueryDefaultOperator=AND \
  -DSearchTest.QueryParser=old \
  SearchTest "cutting lucene"
query is (title:cutting title:lucene) (author:cutting author:lucene)
%
The class NewMultiFieldQueryParser does the combination at the inner
level, using an override of addClause, instead of the outer level.
Note that it can't cover all cases (notably PhrasePrefixQuery, because
that class has no access methods which allow one to introspect over
it, and SpanQueries, because I don't understand them well enough :-).
I post it here in advance of filing a formal bug report for early
feedback.  But it will show up in a bug report in the near future.
Running the above query with the new parser gives:
% java -classpath /import/lucene/lucene-1.4.1.jar:. \
  -DSearchTest.QueryDefaultFields=title:author \
  -DSearchTest.QueryDefaultOperator=AND \
  -DSearchTest.QueryParser=new \
  SearchTest "cutting lucene"
query is +(title:cutting author:cutting) +(title:lucene author:lucene)
%
which I claim is what the user is expecting.
In addition, the new class uses an API more similar to QueryParser, so
that the user has less to learn when using it.  The code in it could
probably just be folded into QueryParser, in fact.
Bill
the code for SearchTest:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.FastCharStream;
import org.apache.lucene.queryParser.TokenMgrError;
import org.apache.lucene.queryParser.ParseException;
import java.io.File;

I appologize for this email...

2004-09-01 Thread Sergiu Gordea
Sorry,
I sent this email to transfer my contacts between the Mozilla and 
Thunderbird email clients.

 Sergiu


Re: Having common word in the search

2004-08-02 Thread Sergiu Gordea
I have the same problem. Right now I think it is not possible to do what 
you want by using MultiFieldQueryParser.
Right now I implemented a query normalization for our product, but I 
consider that the best way is to take the source code
and to implement:

Query q = MultiFieldQueryParser.parse(line, fields, analyzer, operator);
where operator can be AND / OR
Another solution may be to forget about MultiFieldQueryParser, and to 
compose the correct
query with QueryParser and BooleanQuery.

MultiFieldQueryParser fails in the case where you search two fields for +a 
+b, because it requires both terms to be in one of the fields.
A bigger problem is the case where you want to search for a AND NOT b in 
two fields, because this error is instantly observed by users.
In the first case (+a +b) you will get fewer results than the correct ones.
In the second case you will get more results (correct ones + wrong ones).

There was a discussion on this topic. Please search the mail archive.
 All the best,
Sergiu
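A sketch of the second solution for your case: parse the user input and the appended word separately, then require both with a BooleanQuery (field names are taken from your example; parse() throws ParseException):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

String[] fields = { "title", "contents" };
StandardAnalyzer analyzer = new StandardAnalyzer();
Query userQuery = MultiFieldQueryParser.parse("curry asia", fields, analyzer);
Query recipeQuery = MultiFieldQueryParser.parse("recipe", fields, analyzer);

BooleanQuery combined = new BooleanQuery();
combined.add(userQuery, true, false);   // required: (curry OR asia) in some field
combined.add(recipeQuery, true, false); // required: recipe in some field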
lingaraju wrote:
Dear  All
Searcher searcher = new IndexSearcher("C:/index");
Analyzer analyzer = new StandardAnalyzer();
String line = "curry asia";  
line = line + " recipe";
String fields[] = new String[2];
fields[0] = "title";
fields[1] = "contents";

Query q = MultiFieldQueryParser.parse(line, fields, analyzer);
Hits hits1 = searcher.search(q);
In the above code Hits will return the documents that contain
the words 
1)Curry OR asia OR recipe
2)Curry OR asia AND recipe
3)Curry AND asia AND recipe
4)Curry AND asia OR recipe

But I want the result should be
Like this 
1)Curry AND asia AND recipe
2)(Curry OR asia) AND recipe

My question is how to give the condition
Actually my requirement is like this: 
the user will enter some text in a text box; it may be one word or two words or n words (e.g. curry asia),
but when I search I will append the word recipe to the search string, so the search must contain the word recipe.

Finally search should contains
1)Curry AND asia AND recipe
2)(Curry OR asia) AND recipe
search should not contains
1)Curry AND asia OR recipe
2)Curry OR asia OR recipe
Thanks and regards
Raju
 




Re: search exception in servlet!

2004-08-02 Thread Sergiu Gordea
Probably it would be a good idea to provide the stack trace of the error 
you get.
It's a little bit hard to guess the error in the code you provided.

 Sergiu
xuemei li wrote:
hi, all
I am using Lucene to search. It worked fine before I put the code into the
doPost of a servlet, but after that it throws an exception when I use the
servlet.
This is the statement that throws the exception:
   IndexSearcher searcher = new IndexSearcher(indexPath);
   hits = searcher.search(query); // <-- exception
What's the problem? What is the difference between a servlet Java program and
a non-servlet Java program?
Any reply will be appreciated.
Thanks,
Xuemei Li

 




Re: continous index update

2004-07-28 Thread Sergiu Gordea
you have to delete the documents using IndexReader and write the 
documents using IndexWriter;
both of them place a lock on the index, so ... you cannot work with 
both of them at the same time
(you get errors when you have an open IndexWriter and try to delete a 
document with an IndexReader).

 I'm using Lucene to index information I have in a database. Each time 
the records in the database are modified, I delete
the document and I add it again. If you are working with threads, make 
sure that the instantiation and closing of IndexWriters and
IndexReaders are done in synchronized code.

All the best,
 Sergiu
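A minimal sketch of that delete-then-add cycle (the "id" field, the path, the lock object and the rebuilt Document are all illustrative; IOException handling omitted):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

synchronized (indexLock) {
    IndexReader reader = IndexReader.open("/path/to/index");
    reader.delete(new Term("id", "42")); // mark the old version as deleted
    reader.close();                      // releases the lock for the writer

    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
    writer.addDocument(newVersionOfDocument); // the freshly built Document
    writer.close();
}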

jitender ahuja wrote:
Hi,
 I am working on the Windows platform and I think it wouldn't work there.
If it can, do please tell me.
Regards,
Jitender
- Original Message - 
From: Vladimir Yuryev [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, July 28, 2004 3:17 PM
Subject: Re: continous index update

 

Hi!
I do automatic index update by cron daemon.
Regards,
Vladimir.
On Wed, 28 Jul 2004 15:05:46 +0530
 jitender ahuja [EMAIL PROTECTED] wrote:
   

Hi all,
I am trying to make an automatic index update based on
a background thread, but it gives errors in deleting the existing
index if (and only if) the server accesses the index at the same time or
has once accessed it; even if a different request is posed, i.e.
for a different index directory or a different job, it makes no
difference.
Can anyone tell me how, in such a continuous update scenario, the old
index can be updated? I feel deletion of the earlier contents is a must
so as to get the new contents in place.
Regards,
Jitender
 

   


 




rebuild index

2004-07-22 Thread Sergiu Gordea
Hi all,
I have a question related to reindexing of documents with Lucene.
We want to implement the functionality of rebuilding the Lucene index.
That means I want to delete all documents in the index and to add newer 
versions.
All information I need to reindex is kept in the database, so I have 
a Term ID, which is unique.

My problem is that I don't have a deleteAll() method in IndexReader, and 
I don't have undelete(int) and undelete(Term)
methods. I have only delete(Term) and undeleteAll() methods that can be 
used for this action.

I would like to delete all documents (just mark them as deleted), add the new 
documents to the index, and create a list of documents that were not 
successfully indexed
(for different reasons that may depend on Lucene or on our code). At 
the end I would like to restore (mark as undeleted) the documents in the 
list and to optimize the
index, so that the changes are permanently committed to the index.

Is this possible without hacking Lucene code? Any ideas?
Thanks in advance,
Sergiu
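For the missing deleteAll(), a sketch that marks every document as deleted by document number could look like this (the path is illustrative); the catch is exactly what you describe: undeleteAll() afterwards restores all of them, not a chosen subset.

import org.apache.lucene.index.IndexReader;

IndexReader reader = IndexReader.open("/path/to/index");
for (int i = 0; i < reader.maxDoc(); i++) {
    if (!reader.isDeleted(i)) {
        reader.delete(i); // only marked as deleted; becomes permanent on optimize
    }
}
reader.close();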




Re: rebuild index

2004-07-22 Thread Sergiu Gordea
Because, on the other hand, I want to have a clean index, without any kind 
of garbage.

This is the requested functionality of the rebuild index function:
clean the index and don't lose data.
I was also thinking that I could delete the index location and create a 
new index; this may have the same effect as the missing
deleteAll() method. But in this case I lose all the data from the index 
forever, and if I get an error because of a write lock,
I may have no index at all, which is unacceptable for a productive system.

Anyway, thanks for the idea. It may work if I merge the indexes in my code, 
but I don't feel that this is the right way to solve the problem.

Sergiu

Aviran wrote:
Why don't you just build a new index in a different location and at the end
add the missing documents from the old index to the new one, and then delete
the old index.
Aviran
-Original Message-
From: Sergiu Gordea [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 22, 2004 10:49 AM
To: Lucene Users List
Subject: rebuild index


Hi all,
I have a question related to reindexing of documents with Lucene. We want
to implement the functionality of rebuilding the Lucene index. That means I want
to delete all documents in the index and to add newer 
versions.
All information I need to reindex is kept in the database, so I have 
a Term ID, which is unique.

My problem is that I don't have a deleteAll() method in IndexReader, and 
I don't have undelete(int) and undelete(Term)
methods. I have only delete(Term) and undeleteAll() methods that can be 
used for this action.

I would like to delete all documents (just mark them as deleted), add the new 
documents to the index, and create a list of documents that were not 
successfully indexed
(for different reasons that may depend on Lucene or on our code). At 
the end I would like to restore (mark as undeleted) the documents in the 
list and to optimize the
index, so that the changes are permanently committed to the index.

Is this possible without hacking Lucene code? Any ideas?
Thanks in advance,
Sergiu


 




Re: Searching against Database

2004-07-15 Thread Sergiu Gordea
Hi again,
I'm thinking of getting the list of IDs from the database and the list of 
hits from the Lucene index, and creating a comparator in order to eliminate the
hits that are not permitted from the list.

Which solution do you think is better?
Thanks,
Sergiu
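For the filter variant, a sketch of a custom Filter that only lets through the permitted IDs (the "ID" field name and the Set of String IDs are assumptions):

import java.io.IOException;
import java.util.BitSet;
import java.util.Iterator;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

public class PermissionFilter extends Filter {
    private final Set permittedIds; // Set of String IDs the user may see

    public PermissionFilter(Set permittedIds) {
        this.permittedIds = permittedIds;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        for (Iterator it = permittedIds.iterator(); it.hasNext();) {
            TermDocs docs = reader.termDocs(new Term("ID", (String) it.next()));
            while (docs.next()) {
                bits.set(docs.doc()); // allow every doc carrying this ID
            }
            docs.close();
        }
        return bits;
    }
}

Then pass it to the searcher: searcher.search(query, new PermissionFilter(ids)).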

Sergiu Gordea wrote:
Hi,
I have a similar problem. I'm working on a web application in which 
the users have different permissions.
Not all information stored in the index is public for all users.

The documents in the index are identified by the same ID that the rows 
have in the database tables.

I can get the IDs of the documents that are accessible by the user, 
but if there are 1000 of them, what will happen in Lucene?

Is this a valid solution? Can anyone provide a better idea?
Thanks,
Sergiu
lingaraju wrote:
Hello
Even i am searching the same code as all my web display information is
stored  in database.
Early response will be very much helpful
Thanks and regards
Raju
- Original Message - From: Hetan Shah [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, July 15, 2004 5:56 AM
Subject: Searching against Database
 

Hello All,
I have got all the answers from this fantastic mailing list. I have
another question ;)
What is the best way (Best Practices) to integrate Lucene with live
database, Oracle to be more specific. Any pointers are really very much
appreciated.
thanks guys.
-H
 




Re: Searching against Database

2004-07-15 Thread Sergiu Gordea
This is not a solution in my case,
because the permissions of the groups and the user groups can be 
changed, and it would make managing the index a nightmare.

 anyway,
 I appreciate the advice; maybe it will be useful for the other guys 
who asked this question.

  Sergiu
[EMAIL PROTECTED] wrote:
If you know ahead of time which documents are viewable by a certain user
group, you could add a field, such as "group", and then when you index the
document you put in the names of the user groups that are allowed to view that
document.  Then your query tool can append, for example, "AND
group:developers" to the user's query.  Then you will not have to merge
results.
-Will
-Original Message-
From: Sergiu Gordea [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 15, 2004 2:58 AM
To: Lucene Users List
Subject: Re: Searching against Database
Hi,
I have a similar problem. I'm working on a web application in which the 
users have different permissions.
Not all information stored in the index is public for all users.

The documents in the index are identified by the same ID that the rows 
have in the database tables.

I can get the IDs of the documents that are accessible by the user, 
but if there are 1000 of them, what will happen in Lucene?

Is this a valid solution? Can anyone provide a better idea?
Thanks,
Sergiu
lingaraju wrote:
 

Hello
Even I am searching for the same code, as all my web display information is
stored in a database.
Early response will be very much helpful
Thanks and regards
Raju
- Original Message - 
From: Hetan Shah [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, July 15, 2004 5:56 AM
Subject: Searching against Database


   

Hello All,
I have got all the answers from this fantastic mailing list. I have
another question ;)
What is the best way (Best Practices) to integrate Lucene with live
database, Oracle to be more specific. Any pointers are really very much
appreciated.
thanks guys.
-H
  

 


   


 




Re: Index MSOffice Documents

2004-06-28 Thread Sergiu Gordea

Ryan Ackley wrote:
Thanks Sergiu,
You should also post to the Lucene Users list.
-Ryan
 

I did it from the beginning. But I want to report a bug in this code. My 
colleagues reported
that it is possible to get an OutOfMemoryError for a PPT file they have. I 
will try to debug this in the next days.

 Sergiu

- Original Message - 
From: Sergiu Gordea [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED];
[EMAIL PROTECTED]
Cc: POI Users List [EMAIL PROTECTED]
Sent: Friday, June 25, 2004 8:42 AM
Subject: Index MSOffice Documents

 

Hi all,
I'm working on a project in which we are building a knowledge
management platform. We are using Turbine/Velocity
as the framework and we are using Lucene for search.
We want to make the search able to index MS Office documents;
therefore I was searching for some possibilities to extract the text
from these
documents. I found some examples based on the POI library
(http://jakarta.apache.org/poi) and I adapted them to our needs.
The extraction of the text elements from XLS files I think is trustworthy
(the POI development community did a great job with the package that
works with XLS files). The examples that extract the text from DOC and
PPT files are not very general; I think they have problems with
documents
written with special charsets, but they are working just fine on the
documents I use. I hope someone who has more experience than I have
will improve this
and provide better source code.
Congratulations to all people involved in the development of the Jakarta
project and its subprojects,
Sergiu Gordea
Ps: ExeConverteImpl uses an external stand-alone application (like
antiword or pdf2txt) to extract the text.
   




 

/* @(#) CWK 1.4 07.06.2004
*
* Copyright 2003-2005 ConfigWorks Informationssysteme & Consulting GmbH
* Universitätsstr. 94/7 9020 Klagenfurt Austria
* www.configworks.com
* All rights reserved.
*/
package com.configworks.cwk.be.search.converters;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.poi.hssf.usermodel.HSSFCell;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
/**
* Class description
*
* @author sergiu
* @version 1.0
* @since CWK 1.5
*/
public class XLSConverterImpl extends JavaDocumentConverter {
   private Log logger = null;
   File dest = null;

   public boolean extractText(InputStream reader, BufferedWriter writer)
           throws FileNotFoundException, IOException {
       HSSFWorkbook workbook = new HSSFWorkbook(reader);
       for (int k = 0; k < workbook.getNumberOfSheets(); k++) {
           HSSFSheet sheet = workbook.getSheetAt(k);
           if (sheet != null) {
               int rows = sheet.getLastRowNum();
               //I don't know why the last row = sheet.getRow(rows) and first row = sheet.getRow(0)
               for (int r = 0; r <= rows; r++) {
                   HSSFRow row = sheet.getRow(r);
                   if (row != null) {
                       int cells = row.getLastCellNum();
                       for (int c = 0; c <= cells; c++) {
                           HSSFCell cell = row.getCell((short) c);
                           String value = null;
                           if (cell != null) {
                               switch (cell.getCellType()) {
                                   case HSSFCell.CELL_TYPE_FORMULA:
                                       value = cell.getCellFormula();
                                       break;
                                   case HSSFCell.CELL_TYPE_STRING:
                                       value = cell.getStringCellValue();
                                       break;
                                   case HSSFCell.CELL_TYPE_NUMERIC:
                                       value = "" + cell.getNumericCellValue();
                                       break;
                                   default:
                                       value = cell.getStringCellValue();
                               }
                           }
                           if (value != null) {
                               writer.write(value + " ");
                           }
                       }//cells
                   }
               }//rows
           }
       }//sheets
       //if no Exception was thrown consider that the conversion was successful
       return true;
   }
   /**
* @return Returns the logger.
*/
   public Log getLogger() {
   if (logger == null)
   logger = LogFactory.getLog(XLSConverterImpl.class);
   return logger;
   }
}

   




 

package com.configworks.cwk.be.search.converters;
import com.configworks.cwk.share.Utils;
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import

Index MSOffice Documents

2004-06-25 Thread Sergiu Gordea
Hi all,
I'm working on a project in which we are building a knowledge 
management platform. We are using Turbine/Velocity
as the framework and we are using Lucene for search.

We want to make the search able to index MS Office documents; 
therefore I was searching for some possibilities to extract the text 
from these
documents. I found some examples based on the POI library 
(http://jakarta.apache.org/poi) and I adapted them to our needs.
The extraction of the text elements from XLS files I think is trustworthy 
(the POI development community did a great job with the package that
works with XLS files). The examples that extract the text from DOC and 
PPT files are not very general; I think they have problems with 
documents
written with special charsets, but they are working just fine on the 
documents I use. I hope someone who has more experience than I have 
will improve this
and provide better source code.

Congratulations to all people involved in the development of the Jakarta 
project and its subprojects,

Sergiu Gordea
Ps: ExeConverteImpl uses an external stand-alone application (like 
antiword or pdf2txt) to extract the text.
/* @(#) CWK 1.4 07.06.2004
 * 
 * Copyright 2003-2005 ConfigWorks Informationssysteme & Consulting GmbH
 * Universitätsstr. 94/7 9020 Klagenfurt Austria
 * www.configworks.com
 * All rights reserved.
 */

package com.configworks.cwk.be.search.converters;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.poi.hssf.usermodel.HSSFCell;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

/**
 * Class description
 *
 * @author sergiu
 * @version 1.0
 * @since CWK 1.5
 */
public class XLSConverterImpl extends JavaDocumentConverter {

private Log logger = null;
File dest = null;



public boolean extractText(InputStream reader, BufferedWriter writer) throws 
FileNotFoundException,
IOException {

HSSFWorkbook workbook = new HSSFWorkbook(reader);

for (int k = 0; k < workbook.getNumberOfSheets(); k++) {
HSSFSheet sheet = workbook.getSheetAt(k);

if (sheet != null) {
int rows = sheet.getLastRowNum();
//I don't know why the last row = sheet.getRow(rows) and first row = 
sheet.getRow(0) 
for (int r = 0; r <= rows; r++) {
HSSFRow row = sheet.getRow(r);
if (row != null) {
int cells = row.getLastCellNum();
for (int c = 0; c <= cells; c++) {
HSSFCell cell = row.getCell((short) c);
String value = null;
if (cell != null) {
switch (cell.getCellType()) {
case HSSFCell.CELL_TYPE_FORMULA:
value = cell.getCellFormula();
break;
case HSSFCell.CELL_TYPE_STRING:
value = cell.getStringCellValue();
break;
case HSSFCell.CELL_TYPE_NUMERIC:
value = "" + cell.getNumericCellValue();
break;
default:
value = cell.getStringCellValue();
}
}
if (value != null) {
writer.write(value + " ");
}
}//cells
}
}//rows
}
}//sheets

//if no Exception was thrown consider that the conversion was successful 
return true;
}

/**
 * @return Returns the logger.
 */
public Log getLogger() {
if (logger == null)
logger = LogFactory.getLog(XLSConverterImpl.class);
return logger;
}

}


package com.configworks.cwk.be.search.converters;

import com.configworks.cwk.share.Utils;
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;


/**
 * Created by IntelliJ IDEA.
 * User: Kostya
 * Date: 12.09.2003
 * Time: 11:39:25
 * To change this template use Options | File