Size limit for indexing ?

2002-10-09 Thread Christophe GOGUYER DESSAGNES

Hi,

I use Lucene 1.2 and I am indexing a text document whose size is about 500 KB.
(I use the Field.UnStored method.)
It seems that only the beginning of this document is indexed!
If I search for a term that is at the end of this document, I don't find it
(but I do find terms at the beginning).
So I split my document into 2 parts and indexed them, and now it works fine.

Is there a size limit for indexing a document?

Thx.
-
Christophe


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: Size limit for indexing ?

2002-10-09 Thread Nader S. Henein

The size of a document is limited only by OS constraints, and 500 KB is
really small. I have documents in the hundreds of megabytes and it's fine.
Check your indexing and searching; you might find the problem there. Also,
are you using wildcard searches? They don't work from both sides.


Nader Henein





RE: Size limit for indexing ?

2002-10-09 Thread Materna, Wolf-Dietrich (empolis B)

Hello,
> I use Lucene 1.2 and I am indexing a text document whose size is about 500 KB.
> (I use the Field.UnStored method.)
> It seems that only the beginning of this document is indexed!
> If I search for a term that is at the end of this document, I don't find it
> (but I do find terms at the beginning).
> So I split my document into 2 parts and indexed them, and now it works fine.
>
> Is there a size limit for indexing a document?
You are right. There is a limit on the number of terms for each field, but
you can change it. Look at maxFieldLength in
org.apache.lucene.index.IndexWriter. The default limit is 10,000 terms. A
500 KB document contains more terms than that, depending on stopwords and
the amount of white space. That is why the end of your document was ignored.
Regards,

-- 
Wolf-Dietrich Materna
Development
 
empolis GmbH -  arvato knowledge management 
Kekuléstr. 7 
12489 Berlin, Germany
 
phone :  +49-30-6780-6510
fax :+49-30-6780-6549
 
 mailto:[EMAIL PROTECTED]  http://www.empolis.com
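
The truncation described in this reply can be illustrated without Lucene
itself. The following is a minimal sketch, assuming a whitespace tokenizer
and a hypothetical indexedTerms helper that keeps only the first
maxFieldLength tokens of a field. It is not Lucene's code, only a model of
the behaviour. In Lucene 1.2 the corresponding setting is the public
maxFieldLength field of IndexWriter, which can be raised before adding large
documents.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

// Toy model of a per-field term limit (not Lucene code).
class FieldLimitSketch {

    // Returns the distinct terms among the first maxFieldLength tokens of text.
    static Set<String> indexedTerms(String text, int maxFieldLength) {
        Set<String> terms = new HashSet<String>();
        StringTokenizer tokens = new StringTokenizer(text);
        int count = 0;
        while (tokens.hasMoreTokens() && count < maxFieldLength) {
            terms.add(tokens.nextToken().toLowerCase());
            count++;
        }
        return terms;
    }

    public static void main(String[] args) {
        // Build a document of 10,001 tokens whose last token is unique.
        StringBuilder doc = new StringBuilder("start");
        for (int i = 0; i < 9999; i++) {
            doc.append(" filler");
        }
        doc.append(" end");

        // With the smaller limit the final token falls outside the indexed range.
        Set<String> limited = indexedTerms(doc.toString(), 10000);
        Set<String> raised = indexedTerms(doc.toString(), 100000);

        System.out.println("limit 10000 finds 'end': " + limited.contains("end"));
        System.out.println("limit 100000 finds 'end': " + raised.contains("end"));
    }
}
```

Splitting the document in two, as the original poster did, works for the same
reason: each half stays under the limit.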





Re: Size limit for indexing ?

2002-10-09 Thread Christophe GOGUYER DESSAGNES

Thank you for your help, it solved my problem.

-
Christophe





Deleting a document found in a search

2002-10-09 Thread lucene . user

I am just getting started with Lucene and I think I have a problem
understanding  some basic concepts.

I am using two-part identifiers to uniquely identify a document in the
index.  So whenever I want to index a document, I first want to find and
delete the old form.

To find it, I intend to use:

BooleanQuery findOurs = new BooleanQuery();
findOurs.add(new TermQuery(new Term("Id", id)), true, false);
findOurs.add(new TermQuery(new Term("Domain", domain)), true, false);

System.out.println("Deleting document matching: \"" +
   findOurs.toString() + '"');

Searcher searcher = new IndexSearcher(directory);
Hits hits = searcher.search(findOurs);

// Assert: hits.length() <= 1

for (int i = 0; i < hits.length() && i < 10; i++) {
  Document d = hits.doc(i);

  // Now what can I do to find document id?

  int id = ??
  searcher.delete(id);
}

But I can't discover how to convert a search result into a document id.  It
is recorded in the private HitDoc class, but since it is not publicly
accessible, there must be a reason why it would not work to add a public
getter for it.

Is there an alternative way that I can do this?  My first thought is to
define a Field.Keyword("composite-key", domain + \u + id).  This
would allow me to use the delete(Term) interface to delete the key.

-- 
Thanks, Adrian.





Enumerating all Terms

2002-10-09 Thread lucene . user

Is there a way of getting a list of all Terms that have been indexed?  I
guess it would approximate a wildcard query of the form *:* if that were
valid, and instead of returning matching documents, just returning the
fields and values.
-- 
Thanks, Adrian.





Re: Deleting a document found in a search

2002-10-09 Thread Otis Gospodnetic

You mean d.get("Id"); ?

Otis





Re: Deleting a document found in a search

2002-10-09 Thread lucene . user

No, I mean HitDoc.id, the document number field stored in
the HitDoc class.  This number is needed when calling
IndexReader.delete(int docnum) but it is not publicly
accessible.

-- 
Adrian


At 06:32 09/10/2002 -0700, Otis Gospodnetic wrote:
> You mean d.get("Id"); ?





RE : Enumerating all Terms

2002-10-09 Thread Laurent Trillaud

Yes, you can.

IQ-Computing, one of the contributors, has already done the job for you:
they implemented highlighting for Lucene.
http://www.iq-computing.de/lucene/highlight.htm
Follow their instructions and you will be able to use getTerms().

Laurent Trillaud





Re: Deleting a document found in a search

2002-10-09 Thread Doug Cutting

[EMAIL PROTECTED] wrote:
> My first thought is to
> define a Field.Keyword("composite-key", domain + \u + id).  This
> would allow me to use the delete(Term) interface to delete the key.

That sounds like a good way to solve this.

You could also use a HitCollector with a Query, but I think the 
composite key is a better approach.

Doug
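
The composite-key idea can be sketched in plain Java. This is a minimal
model, assuming a hypothetical compositeKey helper and an in-memory map
standing in for the index; the separator character is an assumption, chosen
because it is unlikely to occur in either part. In Lucene itself the key
would be stored in a Field.Keyword and the old document removed through the
delete(Term) call mentioned in this thread before the new version is added.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for an index keyed by a composite identifier (not Lucene code).
class CompositeKeySketch {

    // Assumed separator: any character that cannot occur in domain or id will do.
    static final char SEPARATOR = '\u0000';

    private final Map<String, String> docsByKey = new HashMap<String, String>();

    // Joins the two identifier parts into one unambiguous key.
    static String compositeKey(String domain, String id) {
        return domain + SEPARATOR + id;
    }

    // Delete-then-add: removes any earlier version stored under the same key,
    // then stores the new contents. Returns true if an old version was removed.
    boolean reindex(String domain, String id, String contents) {
        String key = compositeKey(domain, id);
        boolean replaced = docsByKey.remove(key) != null;
        docsByKey.put(key, contents);
        return replaced;
    }

    String get(String domain, String id) {
        return docsByKey.get(compositeKey(domain, id));
    }

    int size() {
        return docsByKey.size();
    }

    public static void main(String[] args) {
        CompositeKeySketch index = new CompositeKeySketch();
        index.reindex("example.org", "42", "first version");
        boolean replaced = index.reindex("example.org", "42", "second version");
        System.out.println("old version replaced: " + replaced);
        System.out.println("documents stored: " + index.size());
    }
}
```

The separator matters: without it, the pairs ("ab", "c") and ("a", "bc")
would produce the same key.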







RE: IndexSearcher on JAR resources?

2002-10-09 Thread Tim Dawson

I wrote:
> I need to do almost exactly the same thing as Erik - create a read-only
> index on our help webapp that will be packaged inside an ear file.

I figured out a way around the lack of a Jar index searcher. Basically I
created the jar file from the index dir and added a bean for my search
page with scope=application that locates the jar file as a resource in
my war and extracts the files from the jar into a temp dir. Not pretty,
but it works.

Tim
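
The extraction step described in this message might look roughly like the
following. It is a minimal sketch, assuming the index jar is reachable as an
InputStream (for example from a webapp resource) and that the hypothetical
extractTo helper is called once at application start-up. The class and
method names are illustrative and are not taken from the original code.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;

// Unpacks a jar of index files into a directory on the filesystem.
class IndexJarExtractor {

    // Copies every file entry of the jar stream into targetDir and returns targetDir.
    static File extractTo(InputStream jarStream, File targetDir) throws IOException {
        JarInputStream jar = new JarInputStream(jarStream);
        try {
            byte[] buffer = new byte[8192];
            JarEntry entry;
            while ((entry = jar.getNextJarEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Keep only the last path component: the index is treated as a flat
                // directory, and this also keeps entries from escaping targetDir.
                File out = new File(targetDir, new File(entry.getName()).getName());
                OutputStream os = new FileOutputStream(out);
                try {
                    int n;
                    while ((n = jar.read(buffer)) != -1) {
                        os.write(buffer, 0, n);
                    }
                } finally {
                    os.close();
                }
            }
        } finally {
            jar.close();
        }
        return targetDir;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a jar built from the index directory (assumed to exist).
        File dir = Files.createTempDirectory("lucene-index").toFile();
        extractTo(new FileInputStream(args[0]), dir);
        System.out.println("Index extracted to " + dir);
    }
}
```

The resulting directory can then be opened like any other filesystem index.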

> -----Original Message-----
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, September 12, 2002 4:24 AM
> To: Lucene Users List
> Subject: Re: IndexSearcher on JAR resources?
>
> Tim Dawson wrote:
> > I need to do almost exactly the same thing as Erik - create a read-only
> > index on our help webapp that will be packaged inside an ear file.
>
> Eventually I'll have a look at implementing this (and of course
> contributing it back to Lucene's codebase) - it's on my to-do list. But
> if you want to beat me to it, even better!  It could be a few months
> before I actually get to it, since the filesystem works fine for my
> demonstration environment.
>
> > I'll probably end up creating an ant task to do the actual indexing.
>
> Save yourself a bit of leg-work - and reuse what I've already done. It's
> in the Lucene sandbox CVS area already.  It could use a little work, but
> it does work nicely for what I've pushed through it to index text and
> HTML files.  It also has quite speedy dependency checking, so if you
> index the same files a second time, it's much faster as it just
> compares dates and ignores them.  If you aren't indexing filesystem
> files then this won't work out of the box for you, but might serve as a
> starting point.
>
> > Has anybody packaged indexes into a jar before? Why is the API so
> > restrictive as to require an open filesystem?
>
> I suspect that leveraging the read-only FSDirectory would work, although
> I have not looked at the code to see how tough or easy that might be.
>
>   Erik
 




Lucene and Geographic Searching

2002-10-09 Thread David Kendig

Hi,

I'm very interested in migrating our current search engine to use Lucene.  
After evaluating Lucene, I have become very impressed and have been telling 
lots of people about it.  One requirement that we have is to be able to 
search our documents by specifying a geographical boundary.  I searched 
everything I could find on Lucene but I barely found any mention of anyone 
using it for such a purpose.  My XML documents contain both temporal and 
spatial information that I would like my users to be able to search on.  Does 
such a thing exist for Lucene?  Is there an easy way to do this with Lucene?  
Is there interest in adding this type of functionality to Lucene if it
doesn't exist?  Could something like GeoTools or some other Java toolkit be
integrated into Lucene?  I would even offer my help to make it so, if there
is a need.

David Kendig
Global Change Master Directory
GSFC/NASA
http://globalchange.nasa.gov





Whats the type of inverted file Lucene is using??

2002-10-09 Thread Jacob Gutierrez

Hi everybody

I was just wondering what type of implementation is used for the inverted
file that Lucene uses in the index.
Is it using a sorted array?



Jacob Gutiérrez R.
Cochabamba - Bolivia







Web search engine size optimisation problems..

2002-10-09 Thread Kyriakos Ktorides

Hello, 

I've been trying for a while to create a web search engine to spider a
small number of websites (around 1000 of them). Before even considering
Lucene I used a dbms and tried crawling a site while taking in all
keywords from the html files (filtering out stopwords etc).
Unfortunately this simplistic approach resulted in huge amounts of
data, which made the whole project impractical. Then I looked into Lucene
as a friend suggested because it's more efficient in storing indexes of
this kind. Since most websites nowadays are dynamically produced based
on templates much of the web page content remains the same over and over
again meaning that the same words are re-added to the index making it
larger without adding any useful information to it. I came up with the
idea to approximately find which keywords remain the same over the site
and index them only once in a document calling it the base. Now every
page from the same website gets compared to the base document and only
the differences are stored as a separate document with a field
containing the link to the base document. This works as expected i.e.
it substantially decreases the index size but introduces another
problem; how do I search?

Say I want to run a query with two terms being searched using the AND
operator. For example, search for "home" and "test". Suppose that "home"
is in the base document and "test" appears in a couple of documents of
the same website but does not exist in the base document. The correct
result is those two documents. How do I get Lucene to do this for me?

I have no previous experience with search engine programming, so I
might be doing it all wrong; I'd be glad if anyone could point me in the
right direction. I look forward to your suggestions or comments.

Thanks in advance,

Kyriakos Ktorides
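
To make the scheme and the AND problem concrete, here is a minimal sketch in
plain Java. It assumes each page is stored as the set of terms not already in
its site's base set, and that a hypothetical matchesAll helper treats a
page's effective terms as the union of the two. None of this is Lucene API;
it only illustrates the semantics a search over such an index would need.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model of the "base document plus per-page differences" scheme (not Lucene code).
class BaseDeltaSketch {

    // Terms to store for a page: everything not already in the site's base set.
    static Set<String> delta(Set<String> pageTerms, Set<String> baseTerms) {
        Set<String> d = new HashSet<String>(pageTerms);
        d.removeAll(baseTerms);
        return d;
    }

    // AND semantics: every query term must occur in the base or in the page's delta.
    static boolean matchesAll(Set<String> baseTerms, Set<String> deltaTerms, String... query) {
        for (String term : query) {
            if (!baseTerms.contains(term) && !deltaTerms.contains(term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> base = new HashSet<String>(Arrays.asList("home", "contact"));
        Set<String> page1 = delta(new HashSet<String>(Arrays.asList("home", "contact", "test")), base);
        Set<String> page2 = delta(new HashSet<String>(Arrays.asList("home", "contact", "news")), base);

        // "home" comes from the base, "test" only from page1's own terms.
        System.out.println("page1 matches home AND test: " + matchesAll(base, page1, "home", "test"));
        System.out.println("page2 matches home AND test: " + matchesAll(base, page2, "home", "test"));
    }
}
```

Evaluated this way, a query for "home" AND "test" matches only the pages
whose own terms contain "test", which is the result the message asks for.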

