how to define a pool for Searcher?

2007-02-22 Thread Mohammad Norouzi

Hi all,
I am going to build a Searcher pool. If anyone has experience with this, I
would be glad to hear his/her recommendations and suggestions. I want to know
what issues I should consider, given that I am going to use this in a web
application with many user sessions.

thank you very much in advance.
--
Regards,
Mohammad Norouzi


a question about indexing database tables

2007-02-22 Thread Mohammad Norouzi

Hello
In our application we have to index database tables. There are two ways to
do this:

1. Index each table in a separate directory and then keep all the relations in
order to get the right result. In this method we would have to use filters to
overcome the problem of searching within another search result.
2. Join two or more tables and index the result of the join query.

Which approach is better, more reliable, and has acceptable performance?

thanks
--
Regards,
Mohammad


Lucene 1.4.3 : IndexWriter.addDocument(doc) fails when run on OS requiring permissions

2007-02-22 Thread Ridzwan Aminuddin
Hi!

I'm writing a java program that uses Lucene 1.4.3 to index and create a vector 
file of words found in Text Files. The purpose is for text mining.

I created a Java .Jar file from my program and my python script calls the Java 
Jar executable. This is all triggered by my DTML code.

I'm running on Linux and I have no problem executing the script when I execute 
it via the command line. But once I trigger the script via the web (using Zope/Plone 
external methods) it doesn't work anymore. This is because of the strict 
permissions that Linux enforces on its files and folders. 

I've narrowed down the problem to the IndexWriter.addDocument(doc) method in 
Lucene 1.4.3 and as you can see below my code fails specifically when a new 
FieldsWriter object is being initialised.

I strongly suspect that it fails at this point but have no idea how to overcome 
this problem. I know that it has to do with the permissions because the program 
works like a miracle when it is called via the command line by the superuser 
(sudo).

Could anyone give me any pointers or ideas on how I could overcome this?

The final statement which is printed before the program hangs is:
"Entering DocumentWriter.AddDocument (4)"

Here are the relevant portions of my code:



//---
//  Indexer.Java // This is my own method and class
//---
// continued from some other code..

Document doc = new Document();

doc.add(Field.Text("articleTitle", articleTitle, true));
doc.add(Field.Text("articleURL", articleURL, true));
doc.add(Field.Text("articleSummary", articleSummary, true));
doc.add(Field.Text("articleDate", articleDate, true));
doc.add(Field.Text("articleSource", articleSource, true));
doc.add(Field.Text("articleBody", articleBody, true));
doc.add(Field.Keyword("filename", f.getCanonicalPath()));

try
{
writer.addDocument(doc); // indexing fails because this statement cannot be executed

}

catch (Exception e)

{
System.err.println ("Cannot add doc exception thrown!");

}


//---
//  IndexWriter.Java // Lucene 1.4.3
//---


public void addDocument(Document doc) throws IOException {

  addDocument(doc, analyzer);
  }


public void addDocument(Document doc, Analyzer analyzer) throws IOException {

DocumentWriter dw;  
  
dw = new DocumentWriter(ramDirectory, analyzer, similarity, maxFieldLength);

String segmentName = newSegmentName();
dw.addDocument(segmentName, doc);   // The program fails to execute this line onwards!

synchronized (this) {

  segmentInfos.addElement(new SegmentInfo(segmentName, 1, ramDirectory));
  maybeMergeSegments();
}

  }


//---
//  DocumentWriter.Java // Lucene 1.4.3
//---



final void addDocument(String segment, Document doc)
throws IOException {
   
  System.out.println("Entering DocumentWriter.AddDocument (1)");
   
// write field names
fieldInfos = new FieldInfos();
  System.out.println("Entering DocumentWriter.AddDocument (2)");

fieldInfos.add(doc);
  System.out.println("Entering DocumentWriter.AddDocument (3)");

fieldInfos.write(directory, segment + ".fnm");

  System.out.println("Entering DocumentWriter.AddDocument (4)");  
// The program fails after this 

// write field values
FieldsWriter fieldsWriter =
new FieldsWriter(directory, segment, fieldInfos);   
// Program fails to execute this statement

  System.out.println("Entering DocumentWriter.AddDocument (5)");

try {
  fieldsWriter.addDocument(doc);

  System.out.println("Entering DocumentWriter.AddDocument (6)");
  
} finally {
  fieldsWriter.close();
System.out.println("Entering DocumentWriter.AddDocument (7)");

}

  System.out.println("Entering DocumentWriter.AddDocument (8)");

// invert doc into postingTable
postingTable.clear(); // clear postingTable
fieldLengths = new int[fieldInfos.size()];// init fieldLengths
fieldPositions = new int[fieldInfos.size()];  // init fieldPositions

  System.out.println("Entering DocumentWriter.AddDocument (9)");

fieldBoosts = new 

Open & Close Reader

2007-02-22 Thread DECAFFMEYER MATHIEU
Hi,

I need to merge indexes,
if I want the user to see the changes (the merged indexes), I heard I
need to close the index reader and re-open it again.

But I will need to do this avery x minutes for some reasons,
So I wondered what could happen if user does a query just when a re-open
of the reader has been done.

Thank you.

__
   Matthew





Re: Lucene 1.4.3 : IndexWriter.addDocument(doc) fails when run on OS requiring permissions

2007-02-22 Thread Doron Cohen
This is a very common use case and Lucene is most likely not
the cause of the problem.

My guess is that (1) the first attempt to write anything to
disk failed. (2) opening the IndexWriter succeeded because
(a) the index exists already (from previous successful run) and
(b) locks are maintained in /tmp or so.

You can try running, in the same way, some simple code that just
writes something to a file in the same directory where the
index is maintained (also, check the access permissions on that folder).

Hope this helps,
Doron

(BTW, an exception stack trace would be far more informative than
the print statements here - best to catch the exception at the
top level and print it.)
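
For example, a tiny stand-alone check along those lines might look like this
(the path below is just a placeholder -- point it at the actual index directory):

import java.io.File;
import java.io.FileWriter;

public class WriteCheck {
  public static void main(String[] args) throws Exception {
    // point this at the directory that holds the Lucene index
    File dir = new File(args.length > 0 ? args[0] : "/path/to/index");
    File probe = new File(dir, "write-probe.tmp");
    FileWriter out = new FileWriter(probe);   // throws IOException if the
    out.write("hello world");                 // web-server user cannot write here
    out.close();
    System.out.println("OK, can write to " + dir.getAbsolutePath());
    probe.delete();
  }
}

If this fails when launched through Zope/Plone but succeeds from the shell,
the problem is purely one of filesystem permissions.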

Ridzwan Aminuddin <[EMAIL PROTECTED]> wrote on 22/02/2007 00:20:12:

> Hi!
>
> I'm writing a java program that uses Lucene 1.4.3 to index and
> create a vector file of words found in Text Files. The purpose is
> for text mining.
>
> I created a Java .Jar file from my program and my python script
> calls the Java Jar executable. This is all triggered by my DTML code.
>
> I'm running on Linux and i have no problem executing the script when
> i execute via command line. But once i trigger the script via the
> web (using Zope/Plone external methods ) it doesn't work anymore.
> This is because of the strict permissions that LInux has over its
> files and folders.
>
> I've narrowed down the problem to the IndexWriter.addDocument(doc)
> method in Lucene 1.4.3 and as you can see below my code fails
> specifically when a new FieldsWriter object is being initialised.
>
> I strongly suspect that it fails at this point but have no idea how
> to overcome this problem. I know that it has to do with the
> permissions because th eprogram works like a miracle when it is
> called via command line by the super user (sudo).
>
> Could anyone give me any pointers or ideas of how i could overcome this.
>
> The final statement which is printed before the program hangs is:
> "Entering DocumentWriter.AddDocument (4)"
>
> Here is the portions of my relevant code :
>
>
>
>
//---

> // Indexer.Java // This is my own method and class
>
//---

> // continued from some other code..
>
>Document doc = new Document();
>
> doc.add(Field.Text("articleTitle", articleTitle, true));
> doc.add(Field.Text("articleURL", articleURL, true));
> doc.add(Field.Text("articleSummary", articleSummary, true));
> doc.add(Field.Text("articleDate", articleDate, true));
> doc.add(Field.Text("articleSource", articleSource, true));
> doc.add(Field.Text("articleBody", articleBody, true));
> doc.add(Field.Keyword("filename", f.getCanonicalPath()));
>
>try
> {
> writer.addDocument(doc); // indexing fails
> because this statement cannot be executed
>
> }
>
> catch (Exception e)
>
> {
> System.err.println ("Cannot add doc exception
thrown!");
>
> }
>
>
>
//---

> // IndexWriter.Java // Lucene 1.4.3
>
//---

>
>
> public void addDocument(Document doc) throws IOException {
>
>   addDocument(doc, analyzer);
>   }
>
>
> public void addDocument(Document doc, Analyzer analyzer) throws
IOException {
>
> DocumentWriter dw;
>
> dw = new DocumentWriter(ramDirectory, analyzer, similarity,
> maxFieldLength);
>
> String segmentName = newSegmentName();
> dw.addDocument(segmentName, doc);// The program
> fails to execute this line onwards!
>
> synchronized (this) {
>
>   segmentInfos.addElement(new SegmentInfo(segmentName, 1,
ramDirectory));
>   maybeMergeSegments();
> }
>
>   }
>
>
>
//---

> // DocumentWriter.Java // Lucene 1.4.3
>
//---

>
>
>
> final void addDocument(String segment, Document doc)
> throws IOException {
>
>   System.out.println("Entering DocumentWriter.AddDocument (1)");
>
> // write field names
> fieldInfos = new FieldInfos();
>   System.out.println("Entering DocumentWriter.AddDocument (2)");
>
> fieldInfos.add(doc);
>   System.out.println("Entering DocumentWriter.AddDocument
(3)");
>
> fieldInfos.write(directory, segment + ".fnm");
>
>   System.out.println("Entering DocumentWriter.
> AddDocument (4)");  // The program fails after this
>
> // write field values
> FieldsWriter fieldsWriter =
> new FieldsWriter(directory, segment, field

autocomplete with multiple terms

2007-02-22 Thread Martin Braun
Hello All,

I am implementing a query auto-complete function à la google. Right now
I am using a TermEnum enumerator on a specific field and list the Terms
found.
That works well for searches with only one term, but when the user is
typing two or three words the function will autocomplete each term
individually - and the problem is that the combination of the terms
might then return no results.
An autocomplete function should be really fast, so a search for all
possible combinations of the terms wouldn't be a good solution.
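
For reference, the single-term lookup I am doing now is roughly the following
(the "contents" field, the prefix and the limit of 10 are just examples):

IndexReader reader = IndexReader.open("/path/to/index");
String prefix = "luc";                        // what the user has typed so far
TermEnum terms = reader.terms(new Term("contents", prefix));
try {
    int shown = 0;
    do {
        Term t = terms.term();
        if (t == null || !t.field().equals("contents")
                      || !t.text().startsWith(prefix)) {
            break;                            // left the matching prefix range
        }
        System.out.println(t.text());         // candidate completion
        shown++;
    } while (shown < 10 && terms.next());
} finally {
    terms.close();
    reader.close();
}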

So my strategy is at a dead end.

Does anybody know a better way?

I am not sure whether we get enough queries to build an index based
on the user queries.

the only thing I have found in the list before concerning this subject
is http://issues.apache.org/jira/browse/LUCENE-625, but I'm not sure if
it does the things I want.

tia,
martin






Registering a local dtd file for use with Digester

2007-02-22 Thread Mike O'Leary
I have a collection of XML files that I would like to parse using Digester
in order to index them for Lucene. A DTD file has been supplied for the XML
files, but none of those files has a  line associating them
with the DTD file. Can the Digester's register function be used to tell it
to use that DTD file for such things as entity resolution? If so, how do I
do it? I don't understand how to specify a pathname for a local file in
terms of a publicId and an entityURL. If register can't be used for this
purpose, is there another way to do it? Thanks.

Mike



RE: Returning only a small set of results

2007-02-22 Thread Kainth, Sachin
What can you use in place of Hits and how do they differ? 

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: 21 February 2007 22:43
To: java-user@lucene.apache.org
Subject: Re: Returning only a small set of results

: A question about efficiency and the internal workings of the Hits
class.
: When we make a call to IndexSearcher's search method thus:
:
: Hits hits = searcher.Search(query);
:
: Do we actually, physically get back all the results of the query even
if
: there are 20 million results or for efficiency do we physically get
back

the Hits class fetches back the first N result documents (where N is 100
i
think) and then it fetches more and more as needed if you ask for more.
generally speaking Hits works fine for simple pagination applications,
but if you are intented on walking deep down the list of ordred results,
i would avoid it.


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






Re: ANN: Luke 0.7 released

2007-02-22 Thread Supriya Kumar Shyamal

It's really great to have the tool compatible with Lucene 2.1.

It saves a lot of effort.

Thanks once again.
supriya

Andrzej Bialecki wrote:

Hi all,

I'm happy to announce that a new version of Luke - the Lucene Index 
Toolbox - is now available. As usual, you can get it from:


   http://www.getopt.org/luke

Highlights of this release:

* support for Lucene 2.1.0 release and earlier
* pagination of search results
* support for many new Field flags
* new plugin for term analysis (contributed by Mark Harwood)
* many other usability and functionality improvements.

Have fun!






Re: autocomplete with multiple terms

2007-02-22 Thread karl wettin


22 feb 2007 kl. 10.09 skrev Martin Braun:


the only thing I have found in the list before concerning this subject
is http://issues.apache.org/jira/browse/LUCENE-625, but I'm not  
sure if

it does the things I want.




I am not sure if we get enough queries for a search over an index base
on the user-queries.


If the content of your corpus is static enough, then time is the  
friend that will enable you to gather enough user queries to build the  
suggestion data set.


Otherwise you have to produce simulated user queries by reducing your  
data set to the most common information. Perhaps using Markov chains,  
top n paths of terms with Dijkstra or so could be an easy way out.  
You can also start looking at the documents people choose to inspect,  
and use these as the base for phrase training.


I think you will get further by considering this from a behavioral  
psychology angle rather than as a corpus-access problem. Also,  
navigating a reduced data set (such as the trie in LUCENE-625,  
compared to the corpus it suggests from) will save you a lot  
of system resources.


Hope this helps some.

--
karl








Re: Searching eats lots of memory?

2007-02-22 Thread karl wettin


22 feb 2007 kl. 05.21 skrev maureen tanuwidjaja:

I also would like to know whether searching in the index file eats  
lots of memory... I always run out of memory when doing  
searching, i.e. it gives the exception java heap space (although I  
have put -Xmx768 in the VM arguments)... Is there any way to solve it?


Are you sure it's Lucene that consumes the memory? And if so, do you  
really close and decouple all resources when they are not used any  
more? Profiling the application will probably show you what's up.


--
karl




Re: Optimizing Index

2007-02-22 Thread Michael McCandless

"maureen tanuwidjaja" wrote:

>   I had an exsisting index file with the size 20.6 GB...I havent done any
>   optimization in this index yet.Now I had a HDD of 100 GB,but apparently
>   when I create program to optimize(which simply calls writer.optimize()
>   to this indexfile),it gives the error that there is not enough space on
>   the disk.
>
>   I read that the size needed to optimize the index is twice as the
>   original index size...then it should be around 40 GB instead...I
>   confuse why the size of 100 GB is insufficient to do the
>   optimization...

Does your disk only have this index?  Ie 100 GB - 20.6 = 79.4 GB of
free space?

Do you have reader(s) open on the index when you kick off the
optimize?  If so then the temporary free space required is 2X the size
of the index (41.2 GB in your case).

Worse, if your readers are refreshing during the optimize (e.g. because
IndexReader.isCurrent() returned false) even more temporary disk
space can be tied up.  This is a case LUCENE-710 aims to fix (adding
"commit only on close" to IndexWriter).  The workaround for now is to
create app level logic to ensure readers never refresh during
optimize.

If you don't have readers then you really should only need 1X
additional free space at the start of optimize (20.6 GB free in your
case) so I'm baffled why 79.4 GB would not be enough.

Mike




Re: Open & Close Reader

2007-02-22 Thread Michael McCandless

<[EMAIL PROTECTED]> wrote:

> I need to merge indexes,
> if I want the user to see the changes (the merged indexes), I heard I
> need to close the index reader and re-open it again.

Yes.  More generally, whenever there have been changes to an index
that you want your readers/searchers to see, you need to re-open the
reader/searcher.  A reader keeps a "point in time" view of the index
as of when it was open, and will not show any changes until it is
re-opened.
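
A minimal sketch of that (variable names here are just illustrative):

// before searching (or on a timer), check whether the index has changed
if (!reader.isCurrent()) {
    IndexReader newReader = IndexReader.open(indexDirectory);
    IndexSearcher newSearcher = new IndexSearcher(newReader);
    IndexReader oldReader = reader;
    reader = newReader;          // swap the references your search code uses
    searcher = newSearcher;
    oldReader.close();           // only safe once no search is still using it
}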

> But I will need to do this avery x minutes for some reasons,
> So I wondered what could happen if user does a query just when a re-open
> of the reader has been done.

I don't really understand this question -- could you provide more
detail here?

Mike




RE: Open & Close Reader

2007-02-22 Thread DECAFFMEYER MATHIEU
My question is: what happens when a re-opening of the reader occurs and at
the same time a user does a query on the index? And are there solutions
for this?

__
   Matt



-Original Message-
From: Michael McCandless [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 22, 2007 12:48 PM
To: java-user@lucene.apache.org
Subject: Re: Open & Close Reader

*  This message comes from the Internet Network *


<[EMAIL PROTECTED]> wrote:

> I need to merge indexes,
> if I want the user to see the changes (the merged indexes), I heard I
> need to close the index reader and re-open it again.

Yes.  More generally, whenever there have been changes to an index
that you want your readers/searchers to see, you need to re-open the
reader/searcher.  A reader keeps a "point in time" view of the index
as of when it was open, and will not show any changes until it is
re-opened.

> But I will need to do this avery x minutes for some reasons,
> So I wondered what could happen if user does a query just when a
re-open
> of the reader has been done.

I don't really understand this question -- could you provide more
detail here?

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







Re: Lucene 1.4.3 : IndexWriter.addDocument(doc) fails when run on OS requiring permissions

2007-02-22 Thread Michael D. Curtin
Is your disk almost full?  Under Linux, when you reach about 90% used on 
a file system, only the superuser can allocate more space (e.g. create 
files, add data to files, etc.).


--MDC




Re: a question about indexing database tables

2007-02-22 Thread Erick Erickson

Don't do either one. Search this mail archive for discussions of
databases; there are several long threads discussing this along with various
options on how to make this work. See particularly a mail entitled
*Oracle/Lucene integration -status-* and any discussions participated in by Marcelo Ochoa.

But, in general, Lucene is a text search engine, NOT an RDBMS. When you start
saying "keep all relation in order to get right result", it sounds like
you're trying to use Lucene as an RDBMS. It doesn't do this very well; that's
not what it was built for. There are several options...

- Get clever with your index such that you don't do anything like joining
tables. This implies that you re-design your data layout, probably
de-normalizing lots of data, etc.

- Use a hybrid solution. That is, use Lucene to search text and then do
whatever further relational processing you need in the database. You need to
store enough information in the Lucene documents to be able to query the
database.

- Stick with a database if it works for you already.


In general, it's a misuse of Lucene to try to get RDBMS behavior out of it.
When you find yourself trying to do this, take a few minutes and ask
yourself if this design is appropriate, and continue only if you can answer
in the affirmative...

Best
Erick

On 2/22/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:


Hello
In our application we have to index the database tables, there is two way
to
make this

1- index each table in a separate directory and then keep all relation in
order to get right result. in this method, we should use filters to
overcome
the problem of searching on another search result.
2. joining two or more tables and index the result of join query.

which approach is better, reliable, has acceptable performance.

thanks
--
Regards,
Mohammad



Re: Returning only a small set of results

2007-02-22 Thread Erick Erickson

See TopDocs, HitCollector, etc. You'll have to dig through the documentation
and try a few experiments to make sense of it all; one-sentence explanations
aren't much help.

But think of Hits as a convenience class for getting the best-scoring 100
documents and use the other classes if you want to get *all* the documents.
Don't go to the other classes unless you start getting performance problems
with Hits. The main take-away from Hits is that it'll re-execute the query
every 100 documents you read from it or so, so the only time you care is
when you find yourself assembling large numbers of documents...
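
For instance, something along these lines pulls back only the top 10 hits
(the "title" field is just an example):

TopDocs topDocs = searcher.search(query, null, 10);   // no Hits involved
System.out.println("total matches: " + topDocs.totalHits);
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
    ScoreDoc sd = topDocs.scoreDocs[i];
    Document d = searcher.doc(sd.doc);                // load stored fields
    System.out.println(sd.score + "  " + d.get("title"));
}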

Erick

On 2/22/07, Kainth, Sachin <[EMAIL PROTECTED]> wrote:


What can you use in place of Hits and how do they differ?

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: 21 February 2007 22:43
To: java-user@lucene.apache.org
Subject: Re: Returning only a small set of results

: A question about efficiency and the internal workings of the Hits
class.
: When we make a call to IndexSearcher's search method thus:
:
: Hits hits = searcher.Search(query);
:
: Do we actually, physically get back all the results of the query even
if
: there are 20 million results or for efficiency do we physically get
back

the Hits class fetches back the first N result documents (where N is 100
i
think) and then it fetches more and more as needed if you ask for more.
generally speaking Hits works fine for simple pagination applications,
but if you are intented on walking deep down the list of ordred results,
i would avoid it.


-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



This message has been scanned for viruses by MailControl - (see
http://bluepages.wsatkins.co.uk/?6875772)


This email and any attached files are confidential and copyright
protected. If you are not the addressee, any dissemination of this
communication is strictly prohibited. Unless otherwise expressly agreed in
writing, nothing stated in this communication shall be legally binding.

The ultimate parent company of the Atkins Group is WS Atkins
plc.  Registered in England No. 1885586.  Registered Office Woodcote Grove,
Ashley Road, Epsom, Surrey KT18 5BW.

Consider the environment. Please don't print this e-mail unless you really
need to.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Multy Language documents indexing

2007-02-22 Thread Ivan Vasilev

Hi All,

Our application, which uses Lucene for indexing, will be used to index 
documents each of which contains parts written in different 
languages. For example, a document could contain English, Chinese and 
Brazilian text. So how should we index such a document? Is there some best 
practice for doing this?


What comes to my mind is to index 3 different Lucene Documents for the 
real document and keep in a database the meta information that these 3 
Documents are related to our real doc. For example, for myDoc.doc we 
will have in the index myDocEn.doc, myDocCn.doc and myDocBr.doc, and when 
searching, if the searched word is found in myDocCn.doc we will 
show the user myDoc.doc. The disadvantage here is that in this case the 
occurrences of the searched item will have to be recalculated, which is 
important for queries like “Red NEAR/10 fox”. So if someone knows a better 
practice than this, please help me.


Thanks in advance,
Ivan





Re: Open & Close Reader

2007-02-22 Thread Erick Erickson

Well, it's your logic that takes the request from the user and executes the
search. So it's your logic that has to take care of any coordination between
threads that use the same reader. This is a standard multi-threading
resource-sharing issue.

If your application is not multi-threaded, I don't see how you can "close
the reader while the user is executing a query"...

Erick

On 2/22/07, DECAFFMEYER MATHIEU <[EMAIL PROTECTED]> wrote:


My question is what happen when a re-opening of the reader occurs and in
the same time a user does a query on the index ? And are there solutions
for this.

__
   Matt



-Original Message-
From: Michael McCandless [mailto:[EMAIL PROTECTED]
Sent: Thursday, February 22, 2007 12:48 PM
To: java-user@lucene.apache.org
Subject: Re: Open & Close Reader

*  This message comes from the Internet Network *


<[EMAIL PROTECTED]> wrote:

> I need to merge indexes,
> if I want the user to see the changes (the merged indexes), I heard I
> need to close the index reader and re-open it again.

Yes.  More generally, whenever there have been changes to an index
that you want your readers/searchers to see, you need to re-open the
reader/searcher.  A reader keeps a "point in time" view of the index
as of when it was open, and will not show any changes until it is
re-opened.

> But I will need to do this avery x minutes for some reasons,
> So I wondered what could happen if user does a query just when a
re-open
> of the reader has been done.

I don't really understand this question -- could you provide more
detail here?

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Internet communications are not secure and therefore Fortis Banque
Luxembourg S.A. does not accept legal responsibility for the contents of
this message. The information contained in this e-mail is confidential and
may be legally privileged. It is intended solely for the addressee. If you
are not the intended recipient, any disclosure, copying, distribution or any
action taken or omitted to be taken in reliance on it, is prohibited and may
be unlawful. Nothing in the message is capable or intended to create any
legally binding obligations on either party and it is not intended to
provide legal advice.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Multy Language documents indexing

2007-02-22 Thread Erick Erickson

I know this has been discussed several times, but I sure don't remember the
answers. Search the mail archive for "multiple languages" and you'll find
some good suggestions. But as I remember, it's not a trivial issue.

But I don't see why the "three different documents" approach wouldn't work.
You could also index the same text in three different fields in a single
document, using different language analyzers for each (See
PerFieldAnalyzerWrapper).
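
A rough sketch of that second idea (the field names, and the contrib
analyzers chosen here, are only examples):

PerFieldAnalyzerWrapper analyzer =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer());
analyzer.addAnalyzer("body_en", new StandardAnalyzer());
analyzer.addAnalyzer("body_cn", new CJKAnalyzer());        // contrib
analyzer.addAnalyzer("body_br", new BrazilianAnalyzer());  // contrib

Document doc = new Document();
doc.add(new Field("body_en", englishText, Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("body_cn", chineseText, Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("body_br", portugueseText, Field.Store.NO, Field.Index.TOKENIZED));
writer.addDocument(doc);   // IndexWriter created with the wrapper analyzer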

Erick

On 2/22/07, Ivan Vasilev <[EMAIL PROTECTED]> wrote:


Hi All,

Our application that uses Lucene for indexing will be used to index
documents that each of which contains parts written in different
languages. For example some document could contain English, Chinese and
Brazilian text. So how to index such document? Is there some best
practice to do this?

What comes in my mind is to index 3 different Lucene Documents for the
real document and keep in a database the meta info that these 3
Documents are related to our real doc. For example for the myDoc.doc we
will have in the index myDocEn.doc, myDocCn.doc and myDocBr.doc and when
making search when the searched word is found in myDocCn.doc we will
visualize to user myDoc.doc. Disadvantage here is that in this case the
occurrences of the searched item will have to be recalculated. It is
important for queries like "Red NEAR/10 fox". So if someone knows better
practice than this, please let me help.

Tanks in advance,
Ivan


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: a question about indexing database tables

2007-02-22 Thread Mohammad Norouzi

Thanks Erick
but we have to, because we need to execute very big queries that create
a lot of network traffic and are very, very slow, whereas with Lucene we do it
in a few milliseconds. And now we have indexed the information we need by
joining tables. It works fine; besides, it returns the exact result that we
can get from the database. We indexed about one million records.
But let me say, we are not using it instead of the database; we use it to
generate some dynamic reports that, if we did them with SQL queries, would
take about 15 minutes.

On 2/22/07, Erick Erickson <[EMAIL PROTECTED]> wrote:


don't do either one  Search this mail archive for discussions of
databases, there are several long threads discussing this along with
various
options on how to make this work. See particularly a mail entitled
*Oracle/Lucene
integration -status- *and any discussions participated in by Marcelo
Ochoa.

But, in general, Lucene is a text search engine, NOT a RDBMS. When you
start
saying "keep all relation in order to get right result", it sounds like
you're trying to use Lucene as a RDBMS. It doesn't do this very well,
that's
not what it was built for. There are several options...
> get clever with your index such that you don't do anything like join
tables. This implies that you re-design your data layout, probably
de-normalizing lots of data, etc.
> Use a hybrid solution. That is, use Lucene to search text and then do
whatever further relational processing you need in the database. You need
to
store enough information in the Lucene documents to be able to query the
database.
> stick with a database if it works for you already.

In general, it's a mis-use of lucene to try to get RDBMS behavior out of
it.
When you find yourself trying to do this, take a few minutes and ask
yourself if this design is appropriate, and continue only if you can
answer
in the affirmative...

Best
Erick

On 2/22/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:
>
> Hello
> In our application we have to index the database tables, there is two
way
> to
> make this
>
> 1- index each table in a separate directory and then keep all relation
in
> order to get right result. in this method, we should use filters to
> overcome
> the problem of searching on another search result.
> 2. joining two or more tables and index the result of join query.
>
> which approach is better, reliable, has acceptable performance.
>
> thanks
> --
> Regards,
> Mohammad
>





--
Regards,
Mohammad


RE: Returning only a small set of results

2007-02-22 Thread Kainth, Sachin
Thanks Erick, you've helped a lot, and so has everyone else. 

-Original Message-
From: Erick Erickson [mailto:[EMAIL PROTECTED] 
Sent: 22 February 2007 13:00
To: java-user@lucene.apache.org
Subject: Re: Returning only a small set of results

See TopDocs, HitCollector, etc. You'll have to dig through the
documentation and try a few experiments to make sense of it all, one
sentence explanations aren't much help.

But think of Hits as a convenience class for getting the best-scoring
100 documents and use the other classes if you want to get *all* the
documents.
Don't go to the other classes unless you start getting performance
problems with Hits. The main take-away from Hits is that it'll
re-execute the query every 100 documents you read from it or so, so the
only time you care is when you find yourself assembling large numbers of
documents...

Erick

On 2/22/07, Kainth, Sachin <[EMAIL PROTECTED]> wrote:
>
> What can you use in place of Hits and how do they differ?
>
> -Original Message-
> From: Chris Hostetter [mailto:[EMAIL PROTECTED]
> Sent: 21 February 2007 22:43
> To: java-user@lucene.apache.org
> Subject: Re: Returning only a small set of results
>
> : A question about efficiency and the internal workings of the Hits 
> class.
> : When we make a call to IndexSearcher's search method thus:
> :
> : Hits hits = searcher.Search(query);
> :
> : Do we actually, physically get back all the results of the query 
> even if
> : there are 20 million results or for efficiency do we physically get 
> back
>
> the Hits class fetches back the first N result documents (where N is 
> 100 i
> think) and then it fetches more and more as needed if you ask for
more.
> generally speaking Hits works fine for simple pagination applications,

> but if you are intented on walking deep down the list of ordred 
> results, i would avoid it.
>
>
> -Hoss
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>
> This message has been scanned for viruses by MailControl - (see
> http://bluepages.wsatkins.co.uk/?6875772)
>
>
> This email and any attached files are confidential and copyright 
> protected. If you are not the addressee, any dissemination of this 
> communication is strictly prohibited. Unless otherwise expressly 
> agreed in writing, nothing stated in this communication shall be
legally binding.
>
> The ultimate parent company of the Atkins Group is WS Atkins plc.  
> Registered in England No. 1885586.  Registered Office Woodcote Grove, 
> Ashley Road, Epsom, Surrey KT18 5BW.
>
> Consider the environment. Please don't print this e-mail unless you 
> really need to.
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>




RE: Open & Close Reader

2007-02-22 Thread DECAFFMEYER MATHIEU
Actually I don't see how it could not be multi-threaded,
since it seems normal to me that I run it in a web application, which is
multi-threaded for each user request?

Erick, could you please explain your comment to me?

Thank you.

__
   Matt



-Original Message-
From: Erick Erickson [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 22, 2007 2:06 PM
To: java-user@lucene.apache.org
Subject: Re: Open & Close Reader

*  This message comes from the Internet Network *

Well, it's your logic that takes the request from the user and executes
the
search. So it's your logic that has to take care of any coordination
between
threads that use the same reader. This is a standard multi-threading
resource-sharing issue.

If your application is not multi-threaded, I don't see how you can
"close
the reader while the user is executing a query"...

Erick

On 2/22/07, DECAFFMEYER MATHIEU <[EMAIL PROTECTED]> wrote:
>
> My question is what happen when a re-opening of the reader occurs and
in
> the same time a user does a query on the index ? And are there
solutions
> for this.
>
> __
>Matt
>
>
>
> -Original Message-
> From: Michael McCandless [mailto:[EMAIL PROTECTED]
> Sent: Thursday, February 22, 2007 12:48 PM
> To: java-user@lucene.apache.org
> Subject: Re: Open & Close Reader
>
> *  This message comes from the Internet Network *
>
>
> <[EMAIL PROTECTED]> wrote:
>
> > I need to merge indexes,
> > if I want the user to see the changes (the merged indexes), I heard
I
> > need to close the index reader and re-open it again.
>
> Yes.  More generally, whenever there have been changes to an index
> that you want your readers/searchers to see, you need to re-open the
> reader/searcher.  A reader keeps a "point in time" view of the index
> as of when it was open, and will not show any changes until it is
> re-opened.
>
> > But I will need to do this avery x minutes for some reasons,
> > So I wondered what could happen if user does a query just when a
> re-open
> > of the reader has been done.
>
> I don't really understand this question -- could you provide more
> detail here?
>
> Mike
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>
> 
> Internet communications are not secure and therefore Fortis Banque
> Luxembourg S.A. does not accept legal responsibility for the contents
of
> this message. The information contained in this e-mail is confidential
and
> may be legally privileged. It is intended solely for the addressee. If
you
> are not the intended recipient, any disclosure, copying, distribution
or any
> action taken or omitted to be taken in reliance on it, is prohibited
and may
> be unlawful. Nothing in the message is capable or intended to create
any
> legally binding obligations on either party and it is not intended to
> provide legal advice.
> 
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>






Re: a question about indexing database tables

2007-02-22 Thread Erick Erickson

OK, I was off on a tangent. We've had several discussions where people were
effectively trying to replace an RDBMS with Lucene and finding out that
RDBMSs are very good at what they do ...

But in general, I'd probably approach it by doing the RDBMS work first and
indexing the result. I think this is your option (2). Yes, this will
de-normalize a bunch of your data and you'll chew up some space, but disk
space is cheap. Very cheap.

One thing to remember, though, that took me a while to get used to,
especially when I had my database hat on. There's no requirement that every
document in a Lucene index have the same fields. Conceptually, you can store
*all* your tables in the same index. So a document for table one has fields
table_1_field1 table_1_field2 table_1_field3. "documents" for table two have
fields table_2_field1 table_2_field2 etc.

These documents will never interfere with each other during searches because
they share no fields (and each query goes against a particular field).

I mention this because your maintenance will be much easier if you only have
one index 
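
For illustration, the "different fields per table" idea might look something
like this (table and field names are made up):

// a row from a "books" table
Document bookDoc = new Document();
bookDoc.add(new Field("book_title", title, Field.Store.YES, Field.Index.TOKENIZED));
bookDoc.add(new Field("book_author", author, Field.Store.YES, Field.Index.TOKENIZED));

// a row from an "orders" table, written to the very same index
Document orderDoc = new Document();
orderDoc.add(new Field("order_id", orderId, Field.Store.YES, Field.Index.UN_TOKENIZED));
orderDoc.add(new Field("order_comment", comment, Field.Store.NO, Field.Index.TOKENIZED));

writer.addDocument(bookDoc);
writer.addDocument(orderDoc);
// a query against "book_title" can never match an "orders" document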

Best
Erick

On 2/22/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:


Thanks Erick
but we have to because we need to execute very big queries that create
traffik network and are very very slow. but with lucene we do it in some
milliseconds. and now we indexed our needed information by joining tables.
it works fine, besides, it returns the exact result as we can get from
database. we indexed about one million records.
but let me say, we are not using it instead of database, we use it to
generate some dynamic reports that if we did it by sql queries, it would
take about 15 minutes.

On 2/22/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
>
> don't do either one  Search this mail archive for discussions of
> databases, there are several long threads discussing this along with
> various
> options on how to make this work. See particularly a mail entitled
> *Oracle/Lucene
> integration -status- *and any discussions participated in by Marcelo
> Ochoa.
>
> But, in general, Lucene is a text search engine, NOT a RDBMS. When you
> start
> saying "keep all relation in order to get right result", it sounds like
> you're trying to use Lucene as a RDBMS. It doesn't do this very well,
> that's
> not what it was built for. There are several options...
> > get clever with your index such that you don't do anything like join
> tables. This implies that you re-design your data layout, probably
> de-normalizing lots of data, etc.
> > Use a hybrid solution. That is, use Lucene to search text and then do
> whatever further relational processing you need in the database. You
need
> to
> store enough information in the Lucene documents to be able to query the
> database.
> > stick with a database if it works for you already.
>
> In general, it's a mis-use of lucene to try to get RDBMS behavior out of
> it.
> When you find yourself trying to do this, take a few minutes and ask
> yourself if this design is appropriate, and continue only if you can
> answer
> in the affirmative...
>
> Best
> Erick
>
> On 2/22/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:
> >
> > Hello
> > In our application we have to index the database tables, there is two
> way
> > to
> > make this
> >
> > 1- index each table in a separate directory and then keep all relation
> in
> > order to get right result. in this method, we should use filters to
> > overcome
> > the problem of searching on another search result.
> > 2. joining two or more tables and index the result of join query.
> >
> > which approach is better, reliable, has acceptable performance.
> >
> > thanks
> > --
> > Regards,
> > Mohammad
> >
>



--
Regards,
Mohammad



Re: Scoring while sorting

2007-02-22 Thread Otis Gospodnetic
- Original Message -
From: dmitri <[EMAIL PROTECTED]>

> What is the point to calculate score if the result set is going to be sorted
> by some field?

No point, I believe, unless your sort includes relevance score.  I believe 
there is a Lucene patch that involves a Matcher (a new concept for Lucene) 
which does only matching without scoring.  If you try that patch, please let us 
know how it works.

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share







Re: Registering a local dtd file for use with Digester

2007-02-22 Thread Steven Rowe
Hi Mike,

> I have a collection of XML files that I would like to parse using Digester
> in order to index them for Lucene. A DTD file has been supplied for the XML
> files, but none of those files has a  line associating them
> with the DTD file. Can the Digester's register function be used to tell it
> to use that DTD file for such things as entity resolution? If so, how do I
> do it? I don't understand how to specify a pathname for a local file in
> terms of a publicId and an entityURL. If register can't be used for this
> purpose, is there another way to do it? Thanks.

Your issue will almost certainly be better addressed in a Digester forum
- your problem has nothing to do with Lucene.

A hint: it looks like you can create a Digester instance with an
externally created SAX parser[1], on which you can set the entity
resolver to an extended DefaultHandler2[2] (Java 1.5) which overrides
the getExternalSubset() method (specified by the EntityResolver2
interface[3]) to return an InputSource containing your desired DTD.

Something like (warning - untested; stolen in part from the Digester
FAQ[1]):

  SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
  parser.getXMLReader().setEntityResolver(new DefaultHandler2() {
    public InputSource getExternalSubset(String name, String baseURI) {
      return new InputSource(/* put your DTD here */);
    }
  });
  Digester digester = new Digester(parser);
  // add digester rules here
  parser.getXMLReader().setContentHandler(digester);
  parser.getXMLReader().parse(/* put your input document here */);

Hope it helps,
Steve

[1] Digester FAQ (instantiating Digester with an external SAX parser):


[2] DefaultHandler2 (enables external DTD resolution with no DOCTYPE in
the XML document):


[3] EntityResolver2 (implemented by DefaultHandler2):






Re: search on colon ":" ending words

2007-02-22 Thread Felix Litman
Yes, thank you.  How did you make that modification not to treat ":" as a 
field-name terminator?

Is it using this, or some other way?

String newquery = query.replace(":", " ");

Thank you,
Felix
Antony Bowesman <[EMAIL PROTECTED]> wrote:

Not sure if you're still after a solution, but I had a similar issue and I 
modified QueryParser.jj to not treat : as a field name terminator, so work: 
would then just be given as work: to the analyzer and treated as a search term.

Antony


Felix Litman wrote:
> We want to be able to return a result regardless if users use a colon or not 
> in the query.  So 'work:' and 'work' query should still return same result.
> 
> With the current parser if a user enters 'work:'  with a ":" , Lucene does 
> not return anything :-(.   It seems to me the Lucene parser issue we are 
> wondering if there is any simple way to make the Lucene parser ignore the ":" 
> in the query?
> 
> any thoughts?
> 
> Erick Erickson  wrote: I've got to ask why you'd want to search on colons. 
> Why not just index the
> words without colons and search without them too? Let's say you index the
> word "work:" Do you really want to have a search on "work" fail?
> 
> By and large, you're better off indexing and searching without
> punctuation
> 
> Best
> Erick
> 
> On 1/28/07, Felix Litman  wrote:
>> Is there a simple way to turn off field-search syntax in the Lucene
>> parser, and have Lucene recognize words ending in a colon ":" as search
>> terms instead?
>>
>> Such words are very common occurrences for our documents (or any plain
>> text), but Lucene does not seem to find them. :-(
>>
>> Thank you,
>> Felix
>>
>>
> 
> 






"did you mean" for multi-word queries implementation

2007-02-22 Thread Felix Litman
Did anyone have success implementing a "did you mean" feature for multi-word 
queries as described in Tom White's excellent "Did you Mean Lucene?" article?

 http://today.java.net/pub/a/today/2005/08/09/didyoumean.html

...and more specifically, using the CompositeDidYouMeanParser implementation as 
described in "Supporting Composite Queries" section of the article?

We are not able so far to get good "suggestions" to multi-word queries using 
this approach, so we are trying to determine if it is a Lucene issue, or our 
implementation...

Thank you,
Felix


Re: "did you mean" for multi-word queries implementation

2007-02-22 Thread Otis Gospodnetic
I believe it's a SpellChecker implementation deficiency, and Karl will probably 
suggest looking at LUCENE-626 as an alternative.  And I'll ask you to please 
report back how much better than the contrib SpellChecker Karl's solution is.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Felix Litman <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, February 22, 2007 1:19:26 PM
Subject: "did you mean" for multi-word queries implementation

Did any one have success implementing "did you mean" feature for multi-word 
queries as described in Tom White's excellent "Did you Mean Lucene?" article?

 http://today.java.net/pub/a/today/2005/08/09/didyoumean.html

...and more specifically, using the CompositeDidYouMeanParser implementation as 
described in "Supporting Composite Queries" section of the article?

We are not able so far to get good "suggestions" to multi-word queries using 
this approach, so we are trying to determine if it is a Lucene issue, or our 
implementation...

Thank you,
Felix







Re: "did you mean" for multi-word queries implementation

2007-02-22 Thread karl wettin

22 feb 2007 kl. 19.22 skrev Otis Gospodnetic:

I believe it's a SpellChecker implementation deficiency, and Karl  
will probably suggest looking at LUCENE-626 as an alternative.  And  
I'll ask you to please report back how much better than the contrib  
SpellChecker Karl's solution is.


:)

The package-level documentation for the refactor of LUCENE-626  
available in (and dependent on) LUCENE-550 might be helpful even  
though the API looks a bit different. At least it describes a bit more  
how it works.


It is available as HTML at this location:

http://ginandtonique.org/~kalle/javadocs/didyoumean/org/apache/lucene/search/didyoumean/package-summary.html


--
karl




Re: how to define a pool for Searcher?

2007-02-22 Thread Mark Miller
I would not do this from scratch... if you are interested in Solr, go that 
route; otherwise I would build off http://issues.apache.org/jira/browse/LUCENE-390


- Mark

Mohammad Norouzi wrote:

Hi all,
I am going to build a Searcher pooling. if any one has experience on 
this, I
would be glad to hear his/her recommendation and suggestion. I want to 
know
what issues I should be apply. considering I am going to use this on a 
web

application with many user sessions.

thank you very much in advance.





Re: Scoring while sorting

2007-02-22 Thread Chris Hostetter

: > What is the point to calculate score if the result set is going to be sorted
: > by some field?

: No point, I believe, unless your sort includes relevance score.  I

...which is non trivial information to deduce, since a SortField can
contain a SortComparatorSource which uses a ScoreDocComparator which can
do anything it wants with the ScoreDoc.

If you know that you really don't care about score, you can use Filters
instead of Queries and then sort the docs represented by bits() yourself
... this is an approach Solr takes if a DocSet (a Solr concept roughly equal
to a BitSet of documents) is already in the cache and you want the first N
sorted by a field.
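
A bare-bones sketch of that idea (the "date" field is arbitrary, and it
assumes every document has a value for it):

BitSet bits = filter.bits(reader);                       // which docs match, no scoring
final String[] sortVals = FieldCache.DEFAULT.getStrings(reader, "date");
List matches = new ArrayList();
for (int doc = bits.nextSetBit(0); doc >= 0; doc = bits.nextSetBit(doc + 1)) {
    matches.add(new Integer(doc));
}
Collections.sort(matches, new Comparator() {
    public int compare(Object a, Object b) {
        String va = sortVals[((Integer) a).intValue()];
        String vb = sortVals[((Integer) b).intValue()];
        return va.compareTo(vb);                         // order by field only
    }
});
// matches now holds the matching doc ids ordered by "date"; take the first N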

(the heart of the jira issue Otis referred to is unifying the concepts of a
Query Scorer and a Filter into a common base class: Matcher)



-Hoss





Re: QueryParser bug?

2007-02-22 Thread Chris Hostetter

I'm not very familiar with this issue, but are you using
setAllowLeadingWildcard(true)? ... if not, it definitely won't work.
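
i.e. something like (field name and analyzer are placeholders):

QueryParser qp = new QueryParser("contents", new StandardAnalyzer());
qp.setAllowLeadingWildcard(true);     // permit the leading '*' in *tex*
Query q = qp.parse("*tex*");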


: Date: Thu, 22 Feb 2007 15:36:43 +1100
: From: Antony Bowesman <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: QueryParser bug?
:
: Using QueryParser to parse *tex* seems to create a PrefixQuery rather than
: WildcardQuery due to the trailing *, rather than Wildcard because of the other
: leading *.
:
: As a result, this does not match, for example "context".  I've swapped the 
order
: of WILDTERM and PREFIXTERM in queryparsr.jj but that just prevents PrefixQuery
: from ever being generated.
:
: Is this a known problem and is there any way around it?
: Antony
:
:
: -
: To unsubscribe, e-mail: [EMAIL PROTECTED]
: For additional commands, e-mail: [EMAIL PROTECTED]
:



-Hoss





Re: Lucene 1.4.3 : IndexWriter.addDocument(doc) fails when run on OS requiring permissions

2007-02-22 Thread Chris Hostetter

This sounds like it has absolutely nothing to do with Lucene, and
everything to do with good security permissions -- your Zope/Python front
end is most likely running as a user that does not have write permissions
to the directory where your index lives.  You'll need to remedy that.

You can write a simple Java app that doesn't use Lucene at all -- just
creates a file and writes "hello world" to it -- and you will most
likely see this exact same behavior; dealing with the file permissions is
totally outside the scope of Lucene.


: Date: Thu, 22 Feb 2007 00:20:12 -0800
: From: Ridzwan Aminuddin <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Lucene 1.4.3 : IndexWriter.addDocument(doc) fails when run on OS
:  requiring permissions
:
: Hi!
:
: I'm writing a java program that uses Lucene 1.4.3 to index and create a 
vector file of words found in Text Files. The purpose is for text mining.
:
: I created a Java .Jar file from my program and my python script calls the 
Java Jar executable. This is all triggered by my DTML code.
:
: I'm running on Linux and i have no problem executing the script when i 
execute via command line. But once i trigger the script via the web (using 
Zope/Plone external methods ) it doesn't work anymore. This is because of the 
strict permissions that LInux has over its files and folders.
:
: I've narrowed down the problem to the IndexWriter.addDocument(doc) method in 
Lucene 1.4.3 and as you can see below my code fails specifically when a new 
FieldsWriter object is being initialised.
:
: I strongly suspect that it fails at this point but have no idea how to 
overcome this problem. I know that it has to do with the permissions because th 
eprogram works like a miracle when it is called via command line by the super 
user (sudo).
:
: Could anyone give me any pointers or ideas of how i could overcome this.
:
: The final statement which is printed before the program hangs is:
: "Entering DocumentWriter.AddDocument (4)"
:
: Here is the portions of my relevant code :
:
:
:
: 
//---
: //Indexer.Java // This is my own method and class
: 
//---
: // continued from some other code..
:
:   Document doc = new Document();
:
: doc.add(Field.Text("articleTitle", articleTitle, true));
: doc.add(Field.Text("articleURL", articleURL, true));
: doc.add(Field.Text("articleSummary", articleSummary, true));
: doc.add(Field.Text("articleDate", articleDate, true));
: doc.add(Field.Text("articleSource", articleSource, true));
: doc.add(Field.Text("articleBody", articleBody, true));
: doc.add(Field.Keyword("filename", f.getCanonicalPath()));
:
:   try
: {
: writer.addDocument(doc); // indexing fails because this 
statement cannot be executed
:
: }
:
: catch (Exception e)
:
: {
: System.err.println ("Cannot add doc exception thrown!");
:
: }
:
:
: 
//---
: //IndexWriter.Java // Lucene 1.4.3
: 
//---
:
:
: public void addDocument(Document doc) throws IOException {
:
:   addDocument(doc, analyzer);
:   }
:
:
: public void addDocument(Document doc, Analyzer analyzer) throws IOException {
:
: DocumentWriter dw;
:
: dw = new DocumentWriter(ramDirectory, analyzer, similarity, 
maxFieldLength);
:
: String segmentName = newSegmentName();
: dw.addDocument(segmentName, doc); // The program fails to 
execute this line onwards!
:
: synchronized (this) {
:
:   segmentInfos.addElement(new SegmentInfo(segmentName, 1, ramDirectory));
:   maybeMergeSegments();
: }
:
:   }
:
:
: 
//---
: //DocumentWriter.Java // Lucene 1.4.3
: 
//---
:
:
:
: final void addDocument(String segment, Document doc)
: throws IOException {
:
:   System.out.println("Entering DocumentWriter.AddDocument (1)");
:
: // write field names
: fieldInfos = new FieldInfos();
:   System.out.println("Entering DocumentWriter.AddDocument (2)");
:
: fieldInfos.add(doc);
:   System.out.println("Entering DocumentWriter.AddDocument (3)");
:
: fieldInfos.write(directory, segment + ".fnm");
:
:   System.out.println("Entering DocumentWriter.AddDocument (4)");  
// The program fails after this
:
: // write field valu

RE: Open & Close Reader

2007-02-22 Thread Chris Hostetter
: Actually I don't see how it could not be multi-threaded,
: since it seems normal to me that I run it in a web application which is
: multi-threaded for each user request ?

not every application in the world is a web application.

if you are dealing with multiple threads, you will need to do something to
ensure that threads don't try to use an IndexReader reference while you are
in the middle of closing it and assigning a new IndexReader to it ... you
can do this with synchronization, for example.
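
A minimal sketch of that kind of guard, assuming one shared searcher held by the application (class, method, and path names are illustrative only; long-running searches may additionally need reference counting so a reader is never closed mid-search):

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;

// All access to the shared searcher goes through synchronized methods, so no
// thread can grab it while another thread is closing and replacing it.
public class SearcherHolder {
    private final String indexPath;
    private IndexSearcher searcher;

    public SearcherHolder(String indexPath) {
        this.indexPath = indexPath;   // e.g. "/path/to/index"
    }

    public synchronized IndexSearcher getSearcher() throws IOException {
        if (searcher == null) {
            searcher = new IndexSearcher(indexPath);
        }
        return searcher;
    }

    public synchronized void reopen() throws IOException {
        if (searcher != null) {
            searcher.close();   // also closes the reader it opened from the path
        }
        searcher = new IndexSearcher(indexPath);
    }
}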



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Efficient count of documents by type?

2007-02-22 Thread Phillip Rhodes
I have a query that can return documents that represent different types of 
things (e.g. books, movies, coupons, etc)

There is an "object_type" keyword on each document, so I can tell that a 
document is a coupon or a book etc...

The problem is that I need to display a count of each item type that was found. 
 
For example, your search returned: 67 coupons, 54 movies, 28 books...

While I can loop through each document and increment some sort of counter by 
document type, sometimes I have over 2000 documents, and this would mean that 
the query would be executed internally by Lucene 20 times (once for every 100 
records).

I am looking at the HitCollector, but since I would still need to get each and 
every document (to figure out if it's a coupon vs. a book), I am not sure if 
there would be any benefits.  Would using a HitCollector cause the query to be 
run only 1x vs. 20 for 2000 documents?  Would that be the only benefit?

I would be interested in hearing what others think about this problem and how 
to best implement this functionality with lucene.

Thank you.
Phillip









-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Efficient count of documents by type?

2007-02-22 Thread Erick Erickson

You might have some luck searching the mailing list for "faceted search", as
I remember there's been quite a discussion on that topic and I *think* it
applies...

Even if you use a HitCollector, you still have to categorize your document,
and all you have is the doc id to work with. But I think you'll be able to
combine a HitCollector with TermDocs and be pretty quick about getting your
results.

Something like this in your HitCollector (TermDocs come from the IndexReader):

TermDocs td = reader.termDocs(new Term("object_type", "coupons"));
if (td.skipTo(docId) && td.doc() == docId) {
    // increment coupons counter
}

And repeat ad nauseum.

But what I'd really do is just collect an ordered list of all the doc IDs in
my hitcollector and *then* do something like...

TermDocs tdCoupons = reader.termDocs(new Term("object_type", "coupons"));
TermDocs tdMovies = reader.termDocs(new Term("object_type", "movies"));

for (int docId : setOfAllDocIds) {
   if (tdCoupons.skipTo(docId) && tdCoupons.doc() == docId) {
      // increment coupon counter
   }
   if (tdMovies.skipTo(docId) && tdMovies.doc() == docId) {
      // increment movie counter
   }
}

That way, you're progressing through all of the TermDocs in order and not
skipping around so much. I really have no clue how much more efficient this
would be..
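
A self-contained sketch of that approach: run the query once, remember the matching doc ids in a bit set, then walk the posting list of each type term and count the overlaps. Field and type names are only examples, and this variant walks each posting list with next() instead of skipTo(), which amounts to the same thing:

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class TypeCounter {

    // Runs the query once, then counts how many hits carry each object_type value.
    public static int[] countTypes(IndexReader reader, Query query, String[] types)
            throws IOException {
        IndexSearcher searcher = new IndexSearcher(reader);
        final BitSet hits = new BitSet(reader.maxDoc());

        // Single pass over the result set: only record which docs matched.
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                hits.set(doc);
            }
        });

        int[] counts = new int[types.length];
        for (int i = 0; i < types.length; i++) {
            // Walk the posting list for this type term in doc-id order and
            // count the docs that also appear in the hit set.
            TermDocs td = reader.termDocs(new Term("object_type", types[i]));
            try {
                while (td.next()) {
                    if (hits.get(td.doc())) {
                        counts[i]++;
                    }
                }
            } finally {
                td.close();
            }
        }
        return counts;
    }
}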

Best
Erick

On 2/22/07, Phillip Rhodes <[EMAIL PROTECTED]> wrote:


I have a query that can return documents that represent different types of
things (e.g. books, movies, coupons, etc)

There is a  "object_type" keyword on each document, so I can tell that a
document is a coupon or a book etc...

The problem is that I need to display a count of each item type that was
found.
For example,  your searched returned: 67 coupons, 54 movies, 28 books...

While I can loop through each document and increment some sort of counter
by document type, sometimes I have over a 2000 documents, and this would
mean that the query would be executed internally by lucene 20 times (for
every 100 records).

I am looking at the HitCollector, but since I would need to still get each
and every document (to figure out if it's a coupon vs. a book), I am not
sure if there would be any benefits.  Would using a HitCollector cause the
query to be run only 1x vs. 20 for 2000 documents?  Would that be the only
benefit?

I would be interested in hearing what others think about this problem and
how to best implement this functionality with lucene.

Thank you.
Phillip









-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: pagination

2007-02-22 Thread Peter W.

Hello,

This snippet may help to understand TopDocs:

http://mail-archives.apache.org/mod_mbox/lucene-general/200508.mbox/% 
[EMAIL PROTECTED]


Also, paging through Lucene results is a 'do-it-yourself' exercise using
hits.length() until someone contributes a good implementation.

Oversimplifying, if you want 10 hits per page:

hitsperpage equals ten;

-if hits length is less than ten, you have one page
-else if hits length modulo hitsperpage is 0, hits length divided by
hitsperpage is your page count
-else your page count is hits length divided by hitsperpage plus one, and
the remainder is what ends up on your last page


You will also need a variable to keep track of which page you are on
and a static method which returns min/max values to be included in your
iteration loop.
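
A minimal sketch of that arithmetic (names are illustrative, and pages are 1-based):

// Computes the page count and the hit range for one page.
public class Pager {
    public static int pageCount(int totalHits, int hitsPerPage) {
        if (totalHits <= 0) {
            return 1;                                   // one empty page
        }
        // Integer division rounds down, so any remainder adds one more page.
        return (totalHits + hitsPerPage - 1) / hitsPerPage;
    }

    // First hit index (inclusive) on the given page.
    public static int firstHit(int page, int hitsPerPage) {
        return (page - 1) * hitsPerPage;
    }

    // Last hit index (exclusive) on the given page, capped at the total.
    public static int lastHit(int page, int hitsPerPage, int totalHits) {
        return Math.min(page * hitsPerPage, totalHits);
    }
}

Looping from firstHit to lastHit and calling hits.doc(i) for each index gives you one page of results.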


You can also see my previous attempt at solving this:

http://www.gossamer-threads.com/lists/lucene/java-user/43595

Regards,

Peter W.

On Feb 21, 2007, at 6:32 AM, Kainth, Sachin wrote:


I might be missing something because TopDocs seems to only be about
finding the relevancy of documents and HitCollector doesn't seem to be
relevant either.

-Original Message-
From: Erick Erickson [mailto:[EMAIL PROTECTED]
Sent: 21 February 2007 13:08
To: java-user@lucene.apache.org
Subject: Re: pagination

See TopDocs, HitCollector, etc. Don't iterate through a Hits object to
get docs beyond, say, 100, since Hits is designed to efficiently return
the first 100 documents but re-executes the query every 100 or so
documents as you advance through the results.

Erick

On 2/21/07, Kainth, Sachin <[EMAIL PROTECTED]> wrote:


Hello,

I was wondering if Lucene provides any mechanism which helps in
pagination.  In other words is there a way to return the first 10 of
500 results and then the  next 10 and so on.




RE: Running Lucene as a stateless session bean

2007-02-22 Thread Walker, Keith 1
Thanks for the suggestions. 

I'm using the Lucene packaged with Gate, which is lucene-1.3-final.jar
(ancient I suppose).

I am now seeing the threading problems with GATE. Although I was hoping to
stay with Gate in case we need some of its capabilities, with the current
design we could go with something like Lius.

Regards,
Keith

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Using Lucene - Design Question

2007-02-22 Thread Peter W.

Hello,

If you have experience using XML and doing web services requests
Solr is what you need. It's production quality code and evolving
quickly. It has a remarkable amount of extra functionality.

For CORBA-type programmers, go with Terracotta. It looks to go a
step further, beyond sharing objects to sharing/clustering JVMs.

The RMI capabilities of RemoteSearchable within Lucene seem to
have been developed before Solr gained traction. I tried taking
some working RMI code and writing an inner class with Lucene but
it didn't feel robust.

Research on the mailing lists brings up older file copying
techniques based on synching the indexes with rsync. Probably
still in use, it looks to be an old-school solution better
addressed by Solr.

If you are mirroring your index in a database, there are some
combined Lucene/db update methods available:

1. mysql replication - data on the master is continuously
updated and replicates behind the scenes to remote slaves.
Lucene/db indexing code on each remote slave is a cron job.

2. Lucene indexing application on remote boxes makes network
call to central database, getting/indexing new data and reloading
its own local ramdir.

For someone trying to get work done, use incremental updates to
one local index first. Then explore writing to multiple indexes and
reading them using MultiSearcher.
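
A minimal sketch of that second step, reading two local indexes through MultiSearcher (paths, field name, and query text are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

// Search two local indexes as if they were one.
public class MultiIndexSearch {
    public static void main(String[] args) throws Exception {
        Searchable[] searchables = new Searchable[] {
            new IndexSearcher("/path/to/index1"),
            new IndexSearcher("/path/to/index2")
        };
        MultiSearcher searcher = new MultiSearcher(searchables);

        QueryParser parser = new QueryParser("body", new StandardAnalyzer());
        Query query = parser.parse("lucene");
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " hits across both indexes");

        searcher.close();
    }
}

ParallelMultiSearcher takes the same Searchable[] if you want the sub-searches run concurrently.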

Afterward, use HTTP-based updates/requests with Solr to scale out.

Hope that helps.

Peter W.


On Feb 20, 2007, at 5:29 PM, orion wrote:



If you'd like to try using Terracotta, we (Terracotta) would be  
glad to help

you out.  If you want more info, you can email me directly (orion at
terracotta.org) or you can use our web forums (http://forums.terracotta.org)

or our user mailing list (http://lists.terracotta.org/)

Cheers,
Orion



shai deljo wrote:


I considered getting  Lucene in action but figured I'll wait for the
DVD to come out ;).
Seriously though, they write about RemoteSearchable and use RMI, Is
this the recommended solution? does it scale well?
Thanks

On 2/20/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
Well, there is also a Remote cousin there.  That will let you  
distribute

your indices over N severs (sounds like you'll need multiple).  You
should really take a stroll through Lucene's javadoc, it's  
incredibly
nice now in winter time.  Or ... clears throat you could get  
a book

;)

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: shai deljo <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, February 20, 2007 2:05:25 PM
Subject: Re: Using Lucene - Design Question

Hi,
Thanks for the reply.
* Regarding hardware I'll use something similar to: Core 2 Duo -
2.66GHz, 2x300 GB disk drives, 4 GB RAM running on one of the Linux
distributions.
* Regarding response time I'm looking to be ~300 milliseconds for at
least 80% of queries and ~500 milliseconds for 95% of queries.
* Will MultiSearcher (and it's parallel cosine :) ) allow me to  
search

indices cross multiple servers or is the assumption is that all
indices are on 1 server?
Thanks


On 2/20/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

Hi Shi,

Nobody will be able to give you the precise answer, obviously.  The

best way is to try.

You didn't say what response time is desirable nor what kind of

hardware you will be using.


I wouldn't bother with the Berkeley DB-backed Lucene index for now,

just use the regular one (maybe use non-compound format).
If you need to partition your index, MultiSearcher will help you  
search
all your indices, and its Parallel cousin will let you  
parallelize those

searches.
It sounds like rsync will work, but you'll have to make sure  
that the

segments file gets rsynced last.


Otis

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: shai deljo <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, February 20, 2007 5:51:13 AM
Subject: Using Lucene - Design Question

Hi,
I have no experience with Lucene and I'm trying to collect some
information in order to determine what solution is best for me.
I need to index ~50M documents (starting with 10M), the size of  
each
document is ~2k-~5k and I'll index a couple of fields per  
document. I
expect ~20 queries per seconds and each query is ~4 terms.  
Update rate
- not sure what is best and/or possible strategy based on  
performance,
i.e. incremental indexing vs. pushing a full index but as far as  
the
product is concerned most data can be updated daily, the head  
(let's

say 20%) needs hourly (or at least on the order of hours) update.
I also need to be able to override the scoring/ranking and  
inject my
own logic and of course  my main concern is response time,  
especially
since i have additional computation on the hits before returning  
the

results.

BTW, for the add

Re: QueryParser bug?

2007-02-22 Thread Antony Bowesman

Chris Hostetter wrote:

i'm not very familiar with this issue, but are you using
setAllowLeadingWildcard(true) ? ... if not it definitely won't work.


That's not the issue.  (I've modified QP to allow "minWildcardPrefix" rather 
than just on/off), but the original QP shows the problem with 
setAllowLeadingWildcard(true).  The compiled JavaCC code will always create a 
PrefixQuery if the last character is *, regardless of any other wildcard 
characters before it.  Therefore the query is based on the Term:


Term(field, "*abc")

The decision is made in the JavaCC compiled code and I'm not familiar enough 
with JavaCC high level stuff to know how to make it choose based on an existing 
condition.
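
Not a fix to the grammar, but for this particular kind of input the parser can be sidestepped by building the query programmatically (field name and pattern are just examples):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class DoubleWildcard {
    // Constructing the WildcardQuery directly means the PREFIXTERM/WILDTERM
    // choice in the grammar never comes into play.  Note that a leading *
    // is still expensive, since it forces enumeration of the whole term list.
    public static Query build(String field, String pattern) {
        return new WildcardQuery(new Term(field, pattern));   // e.g. "*tex*"
    }
}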


Regards
Antony


: Date: Thu, 22 Feb 2007 15:36:43 +1100
: From: Antony Bowesman <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: QueryParser bug?
:
: Using QueryParser to parse *tex* seems to create a PrefixQuery rather than
: WildcardQuery due to the trailing *, rather than Wildcard because of the other
: leading *.
:
: As a result, this does not match, for example "context".  I've swapped the 
order
: of WILDTERM and PREFIXTERM in queryparser.jj but that just prevents PrefixQuery
: from ever being generated.
:
: Is this a known problem and is there any way around it?
: Antony




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Positions in SpanFirst

2007-02-22 Thread Antony Bowesman

Chris Hostetter wrote:

: So I don't see why using a SpanNear that respects order and a large
: IncrementGap won't solve your problem.. Although it would return "odd"

i think the use case he's worried about is that he needs to be able to
find matches just on the "start" of a person's name, ie...

Email#1 To: Jim Bob; Sue Anne-Marie Brown; John Doe
Email#2 To: Tom Smith; Bob Jones; John Doe

...he needs to support existing semantics that let him say "find emails
where the start of a person's name is 'bob'" and it returns Email#2, but not
email#1 .. hence his interest in SpanFirst -- he wants to match the
"first" few tokens of a "value" added to a field (which isn't what
SpanFirst does)


Correct.  That's one of the current search mechanisms.  It's not a major issue 
and I think I'll probably end up ducking it on the basis that the system 
directory defaults to a surname/firstname name order, but of course there's no 
guarantee that mail from other systems will have those names in that order, e.g.


#1 To: Bowesman Antony
#2 To: Antony Bowesman

makes this 'starts with' feature less useful.

Thanks again Hoss and Erick for the suggestions!  This list is excellent and I 
do wonder if you guys actually have day jobs that you are able to do as well as 
give the amount of time you seem to on this list!


Antony




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Optimizing Index

2007-02-22 Thread maureen tanuwidjaja

Yes, I do have around 75 GB of free space on that HDD... I do not invoke any 
index reader... hence the program only calls IndexWriter to optimize the 
index, and that's it.
   
  I am also perplexed why it says there is not enough disk space to do the 
optimization...
  
Michael McCandless <[EMAIL PROTECTED]> wrote:
  
"maureen tanuwidjaja" wrote:

> I had an exsisting index file with the size 20.6 GB...I havent done any
> optimization in this index yet.Now I had a HDD of 100 GB,but apparently
> when I create program to optimize(which simply calls writer.optimize()
> to this indexfile),it gives the error that there is not enough space on
> the disk.
> 
> I read that the size needed to optimize the index is twice as the
> original index size...then it should be around 40 GB instead...I
> confuse why the size of 100 GB is insufficient to do the
> optimization...

Does your disk only have this index? Ie 100 GB - 20.6 = 79.4 GB of
free space?

Do you have reader(s) open on the index when you kick off the
optimize? If so then the temporary free space required is 2X the size
of the index (41.2 GB in your case).

Worse, if your readers are refreshing during the optimize (eg because
IndexReader.isCurrent() returned false) even more temporary disk
space can be tied up. This is a case LUCENE-710 aims to fix (adding
"commit only on close" to IndexWriter). The workaround for now is to
create app level logic to ensure readers never refresh during
optimize.

If you don't have readers then you really should only need 1X
additional free space at the start of optimize (20.6 GB free in your
case) so I'm baffled why 79.4 GB would not be enough.

Mike
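
A minimal sketch of the no-readers-open case described above, assuming the index can be optimized offline (path and analyzer are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

// With no IndexReader open on the index, optimize should need roughly one
// extra copy of the index in free disk space while it runs.
public class OptimizeIndex {
    public static void main(String[] args) throws Exception {
        // false = open the existing index rather than creating a new one
        IndexWriter writer =
            new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
        writer.optimize();
        writer.close();
    }
}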

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



 

RE: Optimizing Index

2007-02-22 Thread Damien McCarthy
What file system is the hard disk using? If it is FAT32, one of your index
files is probably getting bigger than 4 GB, the maximum file size on FAT32.

Damien

-Original Message-
From: maureen tanuwidjaja [mailto:[EMAIL PROTECTED] 
Sent: 23 February 2007 02:07
To: java-user@lucene.apache.org
Subject: Re: Optimizing Index


yes I do have around 75 GB of free space on that HDD...I do not invoke any
index reader...hence the program only calls indexwriter to optimize the
index,and that's it..
   
  I am also perplexed why it tells that it have not enough disk space to do
optimization...
  
Michael McCandless <[EMAIL PROTECTED]> wrote:
  
"maureen tanuwidjaja" wrote:

> I had an exsisting index file with the size 20.6 GB...I havent done any
> optimization in this index yet.Now I had a HDD of 100 GB,but apparently
> when I create program to optimize(which simply calls writer.optimize()
> to this indexfile),it gives the error that there is not enough space on
> the disk.
> 
> I read that the size needed to optimize the index is twice as the
> original index size...then it should be around 40 GB instead...I
> confuse why the size of 100 GB is insufficient to do the
> optimization...

Does your disk only have this index? Ie 100 GB - 20.6 = 79.4 GB of
free space?

Do you have reader(s) open on the index when you kick off the
optimize? If so then the temporary free space required is 2X the size
of the index (41.2 GB in your case).

Worse, if your readers are refreshing during the optimize (eg because
on IndexReader.isCurrent() returned false) even more temporary disk
space can be tied up. This is a case LUCENE-710 aims to fix (adding
"commit only on close" to IndexWriter). The workaround for now is to
create app level logic to ensure readers never refresh during
optimize.

If you don't have readers then you really should only need 1X
additional free space at the start of optimize (20.6 GB free in your
case) so I'm baffled why 79.4 GB would not be enough.

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: search on colon ":" ending words

2007-02-22 Thread Antony Bowesman

Felix Litman wrote:

Yes. thank you.  How did you make that modification not to treat ":" as a 
field-name terminator?

Is it using this  Or some other way?


I removed the : handling stuff from QueryParser.jj in the method:

Query Clause(String field) :

I removed this section
---
  [
LOOKAHEAD(2)
(
fieldToken=  {field=discardEscapeChar(fieldToken.image);}
|   {field="*";}
)
  ]
---

and you can also remove the COLON and : related bits to do with start terms and 
escaped chars if you want to exclude treating : as a separator, but from memory, 
it's the above section that does the field recognition.


Antony


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: search on colon ":" ending words

2007-02-22 Thread Felix Litman
OK. Thank you.  We'll have to consider using this approach.
   
  I guess the drawback here is that ":" will no longer work as a field 
operator. :-(
   
  We were also considering using the following approach.
   
  String newquery = query.replace(": ", " ");
   
  It seems this way a colon should still work as a field operator if followed 
by a query term with no space in between
   
  Thanks,
  Felix.
  
Antony Bowesman <[EMAIL PROTECTED]> wrote:
  Felix Litman wrote:
> Yes. thank you. How did you make that modification not to treat ":" as a 
> field-name terminator?
> 
> Is it using this Or some other way?

I removed the : handling stuff from QueryParser.jj in the method:

Query Clause(String field) :

I removed this section
---
[
LOOKAHEAD(2)
(
fieldToken=<TERM> <COLON> {field=discardEscapeChar(fieldToken.image);}
| <STAR> <COLON> {field="*";}
)
]
---

and you can also remove the COLON and : related bits to do with start terms and 
escaped chars if you want to exclude treating : as a separator, but from 
memory, 
it's the above section that does the field recognition.

Antony


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Lucene 1.4.3 : IndexWriter.addDocument(doc) fails when run on OS requiring permissions

2007-02-22 Thread Ridzwan Aminuddin
Hi Guys.

Ok, thanks for the replies. You guys are right that it is to do with the system 
and not with Lucene. However, what I'm trying to do is to pinpoint and narrow 
down the exact place that causes the system to fail, and then from there try 
to remedy the problem.

The odd thing is that the program is still able to write other files to the 
subdirectories that the program itself creates. It is only when it goes through 
this indexing process that the program halts due to insufficient 
permissions. But the directory I provided to store the target index files has 
been set to read/write/execute (drwxrwxrwx) all permissions.

In any case, i suspect that it is due to this portion of code.


:   FieldsWriter(Directory d, String segment, FieldInfos fn)
:throws IOException {
: fieldInfos = fn;
: fieldsStream = d.createFile(segment + ".fdt");
: indexStream = d.createFile(segment + ".fdx");
:   }


Are these two files created in the Directory d?
And is this Directory d the ramDirectory that I have provided when I called the 
method: dw = new DocumentWriter(ramDirectory, analyzer, similarity,
 maxFieldLength);
in IndexWriter.java?

If yes, then where exactly does this ramDirectory point to? Because, as far as I can 
see from the code, ramDirectory is initialised as

RAMDirectory ramDirectory = new RAMDirectory();

I think the program fails when it tries to create these two files somehow.

Please help to enlighten me... Maybe knowing the exact path to this 
ramDirectory would help me in finding out which folder I need to provide access 
to.

Also, does Lucene ever write any temp data to /tmp ?



> -Original Message-
> From: [EMAIL PROTECTED]
> Sent: Thu, 22 Feb 2007 11:43:38 -0800 (PST)
> To: java-user@lucene.apache.org
> Subject: Re: Lucene 1.4.3 : IndexWriter.addDocument(doc) fails when run
> on OS requiring permissions
> 
> 
> This sounds like it has absolutely nothing to do with Lucene, and
> everything to do with good security permissions -- your Zope/python front
> end is most likely running as a user thta does not have write permissions
> to the directory where your index lives.  you'll need to remedy that.
> 
> you can write a simple java app that doens't use lucene at all -- just
> creates a file and writes  "hellow world" to it -- and you will most
> likely see this exact same behavior, dealing with teh file permissions is
> totally out side the scope of Lucene.
> 
> 
> : Date: Thu, 22 Feb 2007 00:20:12 -0800
> : From: Ridzwan Aminuddin <[EMAIL PROTECTED]>
> : Reply-To: java-user@lucene.apache.org
> : To: java-user@lucene.apache.org
> : Subject: Lucene 1.4.3 : IndexWriter.addDocument(doc) fails when run on
> OS
> :  requiring permissions
> :
> : Hi!
> :
> : I'm writing a java program that uses Lucene 1.4.3 to index and create a
> vector file of words found in Text Files. The purpose is for text mining.
> :
> : I created a Java .Jar file from my program and my python script calls
> the Java Jar executable. This is all triggered by my DTML code.
> :
> : I'm running on Linux and i have no problem executing the script when i
> execute via command line. But once i trigger the script via the web
> (using Zope/Plone external methods ) it doesn't work anymore. This is
> because of the strict permissions that LInux has over its files and
> folders.
> :
> : I've narrowed down the problem to the IndexWriter.addDocument(doc)
> method in Lucene 1.4.3 and as you can see below my code fails
> specifically when a new FieldsWriter object is being initialised.
> :
> : I strongly suspect that it fails at this point but have no idea how to
> overcome this problem. I know that it has to do with the permissions
> because th eprogram works like a miracle when it is called via command
> line by the super user (sudo).
> :
> : Could anyone give me any pointers or ideas of how i could overcome
> this.
> :
> : The final statement which is printed before the program hangs is:
> : "Entering DocumentWriter.AddDocument (4)"
> :
> : Here is the portions of my relevant code :
> :
> :
> :
> :
> //---
> : //  Indexer.Java // This is my own method and class
> :
> //---
> : // continued from some other code..
> :
> : Document doc = new Document();
> :
> : doc.add(Field.Text("articleTitle", articleTitle, true));
> : doc.add(Field.Text("articleURL", articleURL, true));
> : doc.add(Field.Text("articleSummary", articleSummary, true));
> : doc.add(Field.Text("articleDate", articleDate, true));
> : doc.add(Field.Text("articleSource", articleSource, true));
> : doc.add(Field.Text("articleBody", articleBody, true));
> : doc.add(Field.Keyword("filename", f.getCanonicalPath()));
> :
> : try
> : {
> : write

TextMining.org Word extractor

2007-02-22 Thread Antony Bowesman
I'm extracting text from Word using TextMining.org extractors - it works better 
than POI because it extracts Word 6/95 as well as 97-2002, which POI cannot do. 
 However, I'm trying to find out about licence issues with the TM jar. The TM 
website seems to be permanently hacked these days.


Anyone know?

Also, has anyone come up with a good solution for extracting data from 
fast-saved files, something that neither TM nor POI can do.


Antony



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: QueryParser bug?

2007-02-22 Thread Chris Hostetter

: than just on/off), but the original QP shows the problem with
: setAllowLeadingWildcard(true).  The compiled JavaCC code will always create a
: PrefixQuery if the last character is *, regardless of any other wildcard
: characters before it.  Therefore the query is based on the Term:

Yep, definitely a bug...

https://issues.apache.org/jira/browse/LUCENE-813

...I'm afraid I don't have any suggested fix or workaround.




-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]