Size limit for indexing ?
Hi, I use Lucene 1.2 and I am indexing a text document whose size is about 500 KB (I use the Field.UnStored method). It seems that only the beginning of this document is indexed! If I search for a term that is at the end of the document, I don't find it (but I do find terms at the beginning). So I split my document into 2 parts and indexed them, and now it works fine. Is there a size limit for indexing a document? Thx. - Christophe
RE: Size limit for indexing ?
The size of the document is limited only by OS constraints, and 500 KB is really small; I have documents in the hundreds of megabytes and they index fine. Check your indexing and searching code; you might find the problem there. Also, are you using wildcard searches? They don't work from both sides. Nader Henein -----Original Message----- From: Christophe GOGUYER DESSAGNES [mailto:[EMAIL PROTECTED]] Sent: Wednesday, October 09, 2002 12:08 PM To: [EMAIL PROTECTED] Subject: Size limit for indexing ?
RE: Size limit for indexing ?
Hello,
> Is there a size limit for indexing a document?
You are right. There is a limit on the number of terms for each field, but you can change it. Look at maxFieldLength in org.apache.lucene.index.IndexWriter. The default limit is 10000 terms. A 500 KB document can contain more terms than that, depending on stop words and the amount of white space. That is why the end of your document was ignored.
Regards, -- Wolf-Dietrich Materna Development empolis GmbH - arvato knowledge management Kekuléstr. 7 12489 Berlin, Germany phone : +49-30-6780-6510 fax : +49-30-6780-6549 http://www.empolis.com
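To illustrate why a per-field term cap hides the end of a long document, here is a minimal sketch in plain Java. It is not Lucene code: `FieldLengthDemo` and `indexedTerms` are hypothetical names, the whitespace tokenizer is a simplification, and the cap of 10000 used in the note below is assumed to match the default mentioned for `maxFieldLength`.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a per-field term limit. This only mimics the effect
// described in the thread; it is not how Lucene is implemented.
final class FieldLengthDemo {

    /** Returns the terms that would survive a cap of maxFieldLength terms. */
    static List<String> indexedTerms(String text, int maxFieldLength) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (kept.size() >= maxFieldLength) {
                break; // every term after the cap is silently dropped
            }
            if (!token.isEmpty()) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

With a 12000-word text and a cap of 10000, the first word is still found while the last one is not; passing `Integer.MAX_VALUE` as the cap keeps every term. In Lucene itself the corresponding knob is the `maxFieldLength` setting on `IndexWriter` that the reply points to; check the API documentation of your version for how to change it.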
Re: Size limit for indexing ?
Thank you for your help, it solved my problem. - Christophe ----- Original Message ----- From: Materna, Wolf-Dietrich (empolis B) [EMAIL PROTECTED] To: 'Lucene Users List' [EMAIL PROTECTED] Sent: Wednesday, October 9, 2002 10:33 Subject: RE: Size limit for indexing ?
Deleting a document found in a search
I am just getting started with Lucene and I think I have a problem understanding some basic concepts. I am using two-part identifiers to uniquely identify a document in the index. So whenever I want to index a document, I first want to find and delete the old version. To find it, I intend to use:
    BooleanQuery findOurs = new BooleanQuery();
    findOurs.add(new TermQuery(new Term("Id", id)), true, false);
    findOurs.add(new TermQuery(new Term("Domain", domain)), true, false);
    System.out.println("Deleting document matching: \"" + findOurs.toString() + "\"");
    Searcher searcher = new IndexSearcher(directory);
    Hits hits = searcher.search(findOurs);
    // Assert: hits.length() <= 1
    for (int i = 0; i < hits.length() && i < 10; i++) {
        Document d = hits.doc(i);
        // Now what can I do to find the document id?
        int docNum = ??;
        searcher.delete(docNum);
    }
But I can't discover how to convert a search result into a document id. It is recorded in the private HitDoc class, but since it is not publicly accessible, there must be a reason why it would not work to add a public getter for it. Is there an alternative way that I can do this? My first thought is to define a Field.Keyword("composite-key", domain + "\u" + id). This would allow me to use the delete(Term) interface to delete the key. -- Thanks, Adrian.
Enumerating all Terms
Is there a way of getting a list of all Terms that have been indexed? I guess it would approximate a wildcard query of the form *:* if that were valid, and instead of returning matching documents, just returning the fields and values. -- Thanks, Adrian.
Re: Deleting a document found in a search
You mean d.get("Id"); ? Otis
Re: Deleting a document found in a search
No, I mean HitDoc.id, the document number field stored in the HitDoc class. This number is needed when calling IndexReader.delete(int docnum), but it is not publicly accessible. -- Adrian
At 06:32 09/10/2002 -0700, Otis Gospodnetic wrote:
> You mean d.get("Id"); ?
RE : Enumerating all Terms
Yes, you can. IQ-Computing, one of the contributors, already did the job for you when they implemented highlighting for Lucene: http://www.iq-computing.de/lucene/highlight.htm Follow their instructions and you will be able to use a getTerms() method. Laurent Trillaud -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] Sent: Wednesday, October 9, 2002 13:51 To: [EMAIL PROTECTED] Subject: Enumerating all Terms
Re: Deleting a document found in a search
[EMAIL PROTECTED] wrote:
> My first thought is to define a Field.Keyword("composite-key", domain + "\u" + id). This would allow me to use the delete(Term) interface to delete the key.
That sounds like a good way to solve this. You could also use a HitCollector with a Query, but I think the composite key is a better approach.
Doug
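The composite-key idea discussed in this thread can be sketched in plain Java, as a rough illustration only. The class below is a toy stand-in for an index, not Lucene code: `CompositeKeyIndex`, `compositeKey`, and `reindex` are hypothetical names, and the separator character is my own assumption, since the escape in the original message is truncated. The point it shows is that a single key built from both parts lets you delete the old version with one exact-match lookup before adding the new one.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for "delete by key term, then add" with a two-part identifier.
final class CompositeKeyIndex {

    // Assumed separator: any character that can never occur in a domain or an id.
    static final char SEPARATOR = '\u0000';

    /** Joins both identifier parts into one unambiguous key. */
    static String compositeKey(String domain, String id) {
        return domain + SEPARATOR + id;
    }

    private final Map<String, String> documents = new HashMap<>();

    /** Replaces any earlier version of the document with the new body. */
    void reindex(String domain, String id, String body) {
        String key = compositeKey(domain, id);
        documents.remove(key);    // stands in for deleting by the key term
        documents.put(key, body); // stands in for adding the new document
    }

    String get(String domain, String id) {
        return documents.get(compositeKey(domain, id));
    }

    int size() {
        return documents.size();
    }
}
```

The separator matters: without it, the pairs ("a", "bc") and ("ab", "c") would produce the same key and overwrite each other.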
RE: IndexSearcher on JAR resources?
I wrote: I need to do almost exactly the same thing as Erik - create a read-only index on our help webapp that will be packaged inside an ear file. I figured out a way around the lack of a Jar index searcher. Basically I created the jar file from the index dir and added a bean for my search page with scope=application that locates the jar file as a resource in my war and extracts the files from the jar into a temp dir. Not pretty, but it works. Tim
-----Original Message----- From: Erik Hatcher [mailto:[EMAIL PROTECTED]] Sent: Thursday, September 12, 2002 4:24 AM To: Lucene Users List Subject: Re: IndexSearcher on JAR resources?
Tim Dawson wrote: I need to do almost exactly the same thing as Erik - create a read-only index on our help webapp that will be packaged inside an ear file. Eventually I'll have a look at implementing this (and of course contributing it back to Lucene's codebase) - it's on my to-do list. But if you want to beat me to it, even better! It could be a few months before I actually get to it, since the filesystem works fine for my demonstration environment. I'll probably end up creating an ant task to do the actual indexing. Save yourself a bit of legwork - and reuse what I've already done. It's in the Lucene sandbox CVS area already. It could use a little work, but it does work nicely for what I've pushed through it to index text and HTML files. It also has quite speedy dependency checking, so if you index the same files a second time, it's much, much faster as it just compares dates and ignores them. If you aren't indexing filesystem files then this won't work out of the box for you, but might serve as a starting point. Has anybody packaged indexes into a jar before? Why is the API so restrictive as to require an open filesystem? I suspect that leveraging the read-only FSDirectory would work, although I have not looked at the code to see how tough or easy that might be.
Erik
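The workaround Tim describes (unpack the index jar into a temp dir, then search it as a normal filesystem index) might look roughly like the sketch below. It uses only the Java standard library; `IndexJarExtractor` and `extractToTempDir` are hypothetical names, and skipping `META-INF/` entries is my assumption about what a jar built from an index directory would contain besides the index files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Unpacks an index that was packaged as a jar so it can be opened from disk.
final class IndexJarExtractor {

    /** Copies every file entry of the jar stream into a fresh temp directory. */
    static Path extractToTempDir(InputStream jarStream) throws IOException {
        Path target = Files.createTempDirectory("lucene-index-")
                .toAbsolutePath().normalize();
        try (ZipInputStream zip = new ZipInputStream(jarStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumption: directories and jar metadata are not index files.
                if (entry.isDirectory() || entry.getName().startsWith("META-INF/")) {
                    continue;
                }
                Path out = target.resolve(entry.getName()).normalize();
                if (!out.startsWith(target)) {
                    throw new IOException("Entry escapes target dir: " + entry.getName());
                }
                Files.createDirectories(out.getParent());
                // Reads up to the end of the current entry; does not close the stream.
                Files.copy(zip, out, StandardCopyOption.REPLACE_EXISTING);
            }
        }
        return target;
    }
}
```

In a webapp, the input stream would typically come from something like `getClass().getResourceAsStream("/help-index.jar")` (a hypothetical resource name), and the returned directory would then be handed to the searcher as an ordinary index path. Doing this once in an application-scoped bean, as described above, avoids unpacking the jar on every request.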
Lucene and Geographic Searching
Hi, I'm very interested in migrating our current search engine to Lucene. After evaluating Lucene, I have become very impressed and have been telling lots of people about it. One requirement that we have is to be able to search our documents by specifying a geographical boundary. I searched everything I could find on Lucene, but I found barely any mention of anyone using it for such a purpose. My XML documents contain both temporal and spatial information that I would like my users to be able to search on. Does such a thing exist for Lucene? Is there an easy way to do this with Lucene? Is there interest in adding this type of functionality to Lucene if it doesn't exist? Could something like GeoTools or some other Java toolkit be integrated into Lucene? I would even offer my help to make it so, if there is a need. David Kendig Global Change Master Directory GSFC/NASA http://globalchange.nasa.gov
Whats the type of inverted file Lucene is using??
Hi everybody, I was just wondering what type of implementation Lucene uses for the inverted file in its index. Is it using a sorted array? Jacob Gutiérrez R. Cochabamba - Bolivia
Web search engine size optimisation problems..
Hello, I've been trying for a while to create a web search engine that spiders a small number of websites (around 1000 of them). Before even considering Lucene I used a DBMS and tried crawling a site while taking in all keywords from the HTML files (filtering out stop words etc.). Unfortunately this simplistic approach resulted in huge amounts of data, which made the whole project impractical. Then I looked into Lucene, as a friend suggested, because it is more efficient at storing indexes of this kind. Since most websites nowadays are produced dynamically from templates, much of the page content stays the same from page to page, so the same words are re-added to the index, making it larger without adding any useful information. I came up with the idea of finding, approximately, which keywords stay the same across the site and indexing them only once in a document I call the base. Every page from the same website is then compared to the base document, and only the differences are stored as a separate document with a field containing the link to the base document. This works as expected, i.e. it substantially decreases the index size, but it introduces another problem: how do I search? Say I want to run a query with two terms combined with the AND operator, for example home and test. Suppose that home is in the base document and test appears in a couple of documents of the same website but not in the base document. The correct result is those two documents. How do I get Lucene to do this for me? I have no previous experience with search engine programming, so I'd be glad if anyone could point me in the right direction if I am doing it all wrong. I look forward to your suggestions or comments. Thanks in advance, Kyriakos Ktorides
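To make the scenario concrete, here is a toy in-memory model of the base-plus-differences scheme described above. It is plain Java, not Lucene, and every name in it (`BaseDeltaIndex`, `addBase`, `addPage`, `searchAnd`) is hypothetical. It only spells out the semantics the question asks for: a page matches an AND query when each query term appears either in the page's own terms or in the base document of its site.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model: shared terms live once per site, pages store only their differences.
final class BaseDeltaIndex {

    private final Map<String, Set<String>> baseTerms = new HashMap<>();       // site -> shared terms
    private final Map<String, Set<String>> pageTerms = new LinkedHashMap<>(); // page URL -> own terms
    private final Map<String, String> pageSite = new HashMap<>();             // page URL -> site

    void addBase(String site, String... terms) {
        baseTerms.put(site, new HashSet<>(Arrays.asList(terms)));
    }

    void addPage(String site, String url, String... terms) {
        pageTerms.put(url, new HashSet<>(Arrays.asList(terms)));
        pageSite.put(url, site);
    }

    /** Returns pages where every query term is in the page or in its site's base. */
    List<String> searchAnd(String... queryTerms) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> page : pageTerms.entrySet()) {
            Set<String> own = page.getValue();
            Set<String> base = baseTerms.getOrDefault(
                    pageSite.get(page.getKey()), Collections.<String>emptySet());
            boolean all = true;
            for (String term : queryTerms) {
                if (!own.contains(term) && !base.contains(term)) {
                    all = false;
                    break;
                }
            }
            if (all) {
                matches.add(page.getKey());
            }
        }
        return matches;
    }
}
```

In this model, a site whose base contains home and whose pages a and b both contain test returns exactly a and b for the query home AND test. Note that the check has to look at two documents per page, which is the part a plain per-document AND query does not do on its own.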