Using Lucene to store documents
Hi all, I'm using Lucene to index XML documents/files (possibly millions of documents in the future, each about 5-10 KB). Besides the index used for searching, I want to use Lucene to store the whole document content in UnIndexed fields (a content field), instead of storing each document in an XML file. All the document content would be kept in a separate index, and each time I want to access a document, I would let Lucene retrieve it. I am weighing this against another option: use the file system to store the document content in separate XML documents, meaning 400K documents would be stored as 400K XML files in the file system. The goal is to be able to access each document rapidly. Can anybody who has experience with this problem give me advice on which method is more suitable? Is it better to collect all documents into a Lucene index, or to store them separately in the file system? Thanks, Dang Nhan
Re: Need Help
Hi, Thank you for the help. I found a solution for this: a Lucene 1.3 index works with CLucene 0.8.12. ----- Original Message ----- From: Otis Gospodnetic [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Monday, November 08, 2004 11:01 PM Subject: Re: Need Help Hello, You should double-check with the CLucene community, but from my research for Lucene in Action, CLucene's index is not compatible with that of Lucene 1.4, so you will not be able to use the same index with both Lucene and CLucene. Otis --- Chandrashekhar [EMAIL PROTECTED] wrote: Hi, I have a query regarding index file portability between Lucene 1.4 and CLucene 0.8.12. I have created an index file in Java (Lucene 1.4) and now want to search for a term in the same index file using CLucene. I am not getting results when I do that. So I just wanted to make sure: does it support this kind of interoperability? With Regards, Chandrashekhar V Deshmukh, Sr. System Analyst, Cybage Software Pvt. Ltd. (a CMM Level 3 company) www.cybage.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Searching in a keyword field?
Hi All, Can I search for a single word in a keyword field that contains a few words? I know a keyword field isn't tokenized. After many tests, I think it is impossible. Can someone confirm this? Why don't I use a text field? Because the users pick the category from a list (e.g. category ABC, category DEF GHI, category JKL ...) and the keyword field 'category' can contain several terms (ABC, DEF GHI, OPQ RST). I use a SnowballAnalyzer for text fields during indexing. Perhaps the best way for me is to use a text field with the value ABC DEF_GHI JKL_NOPQ, where multi-word categories are concatenated with a _. Thanks for your reply! Thierry.
Re: Searching in a keyword field?
You can add the category keyword to a document multiple times. Instead of separating your categories with a delimiter, just add the keyword once per category: doc.add(Field.Keyword("category", "ABC")); doc.add(Field.Keyword("category", "DEF GHI")); On Tue, 9 Nov 2004 17:18:19 +0100, Thierry Ferrero (Itldev.info) [EMAIL PROTECTED] wrote: [quoted text omitted]
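A keyword field added several times behaves like a set of whole, untokenized values: an exact query matches if it equals any one value, but never a word inside a value. Here is a toy sketch of that matching behaviour in plain Java (this is not the Lucene API; the class and method names are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;

// Toy model of a multi-valued keyword field: each value is one untokenized
// term, so an exact-match query succeeds only if it equals an entire value.
public class KeywordFieldDemo {
    // One document's "category" field, added twice as in the reply above.
    static final List<String> CATEGORY = Arrays.asList("ABC", "DEF GHI");

    // Exact-match semantics of an untokenized (keyword) field.
    static boolean matchesKeyword(List<String> fieldValues, String query) {
        return fieldValues.contains(query);
    }

    public static void main(String[] args) {
        System.out.println(matchesKeyword(CATEGORY, "DEF GHI")); // whole value: true
        System.out.println(matchesKeyword(CATEGORY, "DEF"));     // word inside a value: false
    }
}
```

This is why adding the field multiple times avoids the delimiter workaround: each value stays searchable as an exact term on its own.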
Re: Using Lucene to store document
It is difficult to give a general answer. You can certainly store the whole XML in the Lucene index; just don't tokenize it. The HEAD version of Lucene even has some compression that you may find handy. On the other hand, storing the XML in the file system would allow you to keep the XML files wherever you want, even on separate disk(s). If there are lots of parallel searches/reads, this can be handy. If you want to be able to look at the XML files without going through the index, this can also be handy. So it depends on your needs, but both approaches are doable. Otis --- Nhan Nguyen Dang [EMAIL PROTECTED] wrote: [quoted text omitted]
What is the difference between these searches?
Hi, I've implemented a converter to translate our system's internal Query objects to Lucene's Query model. I recently realized that my implementation of OR NOT was not working as I would expect, and I was wondering if anyone on this list could give me some advice. I am converting a query that means "foo or not bar" into the following: +item_type:xyz +(field_name:foo -field_name:bar) This returns only Documents where field_name contains foo. I would expect it to return all the Documents where field_name contains foo or field_name doesn't contain bar. Fiddling around with the Lucene Index Toolbox, I think that this query does what I want: +item_type:xyz field_name:foo -field_name:bar Can someone explain to me why these queries return different results? Thanks, Luke Francl
Re: What is the difference between these searches?
On Nov 9, 2004, at 2:58 PM, Luke Francl wrote: I recently realized that my implementation of OR NOT was not working as I would expect and I was wondering if anyone on this list could give me some advice. Lucene's BooleanQuery does not really have the concept of OR NOT; it's really an AND NOT. I am converting a query that means foo or not bar into the following: +item_type:xyz +(field_name:foo -field_name:bar) This returns only Documents where field_name contains foo. I would expect it to return all the Documents where field_name contains foo or field_name doesn't contain bar. What you experienced is how Lucene operates. It's more of a fail-safe mode, because pure NOT queries are more likely to get out of control. Fiddling around with the Lucene Index Toolbox, I think that this query does what I want: +item_type:xyz field_name:foo -field_name:bar Can someone explain to me why these queries return different results? This last query has a required clause, which is what BooleanQuery requires when there is a NOT clause. You're getting what you want here because you've got the item_type:xyz clause as required. In your first example you're requiring field_name:foo, whereas in this one it is not mandatory. Erik
Can Lucene be hacked to have an update field?
Is it possible to modify the Lucene source to create an updateDocument(doc#, FIELD, value) function? I know there's a lot of work that goes on behind the scenes when add(doc) is called, but can some of that functionality be adapted to make the update a reality? -Chris
Re: What is the difference between these searches?
Luke, On Tuesday 09 November 2004 20:58, you wrote: Hi, I've implemented a converter to translate our system's internal Query objects to Lucene's Query model. I recently realized that my implementation of OR NOT was not working as I would expect and I was wondering if anyone on this list could give me some advice. Could you explain "OR NOT"? Lucene has no provision for matching by being prohibited only. This can be achieved by indexing something for each document that can be used in queries to always match, combined with something prohibited in the query, but doing this is bad for performance when querying larger numbers of docs. Lucene's - prefix in queries means AND NOT, i.e. the term with the - prefix prohibits the matching of a document. I am converting a query that means foo or not bar into the following: +item_type:xyz +(field_name:foo -field_name:bar) ... Can someone explain to me why these queries return different results? A bit dense, but anyway: anything prefixed with + is required; anything without a + or - prefix is optional and only influences the score. In case nothing is required by a + prefix, at least one of the clauses without a prefix is required to match. Regards, Paul Elschot.
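Paul's rules (+ is required, - is prohibited, unprefixed is optional, and with no required clause at least one optional clause must match) can be simulated in a few lines of plain Java. This is a toy evaluator mimicking BooleanQuery semantics, not Lucene code, and it shows why "+xyz +(foo -bar)" and "+xyz foo -bar" match different documents:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy evaluator for the +/optional/- clause rules described above, to show
// why "+xyz +(foo -bar)" and "+xyz foo -bar" match different documents.
public class BooleanSemanticsDemo {
    interface Query { boolean matches(Set<String> doc); }

    static Query term(String t) { return doc -> doc.contains(t); }

    // Each clause is a modifier ('+', '-', or ' ') paired with a sub-query.
    static Query bool(Object[][] clauses) {
        return doc -> {
            boolean anyRequired = false, anyOptional = false, optionalHit = false;
            for (Object[] c : clauses) {
                char mod = (Character) c[0];
                boolean hit = ((Query) c[1]).matches(doc);
                if (mod == '+') { anyRequired = true; if (!hit) return false; }
                else if (mod == '-') { if (hit) return false; }
                else { anyOptional = true; if (hit) optionalHit = true; }
            }
            // With no '+' clause, at least one optional clause must match.
            return anyRequired || !anyOptional || optionalHit;
        };
    }

    // +item_type:xyz +(field_name:foo -field_name:bar)
    static final Query STRICT = bool(new Object[][] {
        {'+', term("xyz")},
        {'+', bool(new Object[][] {{' ', term("foo")}, {'-', term("bar")}})}
    });

    // +item_type:xyz field_name:foo -field_name:bar
    static final Query LENIENT = bool(new Object[][] {
        {'+', term("xyz")}, {' ', term("foo")}, {'-', term("bar")}
    });

    public static void main(String[] args) {
        // A document of type xyz containing neither foo nor bar.
        Set<String> doc = new HashSet<>(Arrays.asList("xyz"));
        System.out.println(STRICT.matches(doc));  // false: the nested clause needs foo
        System.out.println(LENIENT.matches(doc)); // true: only xyz is required
    }
}
```

The nested clause in the strict query has no required part of its own, so foo effectively becomes mandatory inside it; in the lenient query, foo only influences the score.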
Re: What is the difference between these searches?
On Tue, 2004-11-09 at 15:48, Erik Hatcher wrote: This last query has a required clause, which is what BooleanQuery requires when there is a NOT clause. You're getting what you want here because you've got an item_type:xyz clause as required. In your first example, you're requiring field_name:foo, whereas in this one it is not mandatory. So, essentially, my query: +item_type:xyz +(field_name:foo -field_name:bar) gets translated to: +item_type:xyz +field_name:foo -field_name:bar whereas the more lenient one does not require field_name:foo and returns what I expect. Is that right? Now, to decide whether to try to make this work the way I thought it would, or just document that it doesn't. ;)
Re: What is the difference between these searches?
On Tue, 2004-11-09 at 16:00, Paul Elschot wrote: Lucene has no provision for matching by being prohibited only. ... But doing this is bad for performance when querying larger numbers of docs. I'm familiar with Lucene's restrictions on prohibited queries, and I have a required clause for a field that will always be part of the query (it's not a nonsense value; it's the item type of the object in a CMS). My problem is that I had been considering only the whole query object that I generate: every BooleanQuery that is part of my finished query must also have a required clause if it has a prohibited clause. I'm thinking of refactoring my code so that instead of joining Query objects together into a large BooleanQuery, it passes around BooleanClauses and assembles them into a single BooleanQuery. Thanks for your help, Luke
Re: Can Lucene be hacked to have an update field?
Chris, On Tuesday 09 November 2004 22:54, Chris Fraschetti wrote: Is it possible to modify the lucene source to create an updateDocument(doc#, FIELD, value) function? It's possible, but an implementation would not be efficient when the field is indexed. The current index structure has no room to spare for insertions, and no provision for deleted terms. Some time ago an extra level was added in the index for skipping ahead more efficiently. Perhaps that could be combined with a gap for insertions, but when such a gap filled up there would again be no choice but to delete and re-add the changed document. Also, adding a document without optimizing is quite efficient already, so there is probably not much interest in adding such gaps. In case the field is stored only, and the new value has the same length as the currently stored value, it would be possible to replace the value efficiently. The only updates currently available are on the field norms. Regards, Paul Elschot
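Paul's point about stored-only fields can be sketched in plain Java (nothing here is Lucene code; the segment layout and all names are invented for illustration): a value packed at a fixed offset can be overwritten in place only by a value of identical byte length, otherwise neighbouring data would be clobbered, which is why a general update has to fall back to delete-and-re-add.

```java
import java.nio.charset.StandardCharsets;

// Toy model of a segment of densely packed stored values. An in-place
// update is only safe when the replacement has the same byte length.
public class InPlaceUpdateDemo {
    // Three 4-byte stored values packed back to back with ':' separators.
    static byte[] segment = "AAAA:BBBB:CCCC".getBytes(StandardCharsets.US_ASCII);

    // Overwrite the stored value at [offset, offset+length) in place.
    static void replaceInPlace(int offset, int length, String newValue) {
        byte[] b = newValue.getBytes(StandardCharsets.US_ASCII);
        if (b.length != length)
            throw new IllegalArgumentException("length changed: delete and re-add");
        System.arraycopy(b, 0, segment, offset, length);
    }

    public static void main(String[] args) {
        replaceInPlace(5, 4, "XXXX");        // same length: fine
        System.out.println(new String(segment, StandardCharsets.US_ASCII)); // AAAA:XXXX:CCCC
        try {
            replaceInPlace(5, 4, "TOOLONG"); // different length: rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```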
Re: What is the difference between these searches?
On Tuesday 09 November 2004 23:14, Luke Francl wrote: I'm familiar with Lucene's restrictions on prohibited queries, and I have a required clause for a field that will always be part of the query (it's not a nonsense value, it's the item type of the object in a CMS). That might also be mapped to a filter. I'm thinking of refactoring my code so that instead of joining together Query objects into a large BooleanQuery, it passes around BooleanClauses and assembles them into a single BooleanQuery. It may not be possible to flatten a boolean query to a single level, e.g.: (+aa +bb) (+cc +dd) +(a1 a2) +(b1 b2) These will generate nested BooleanQuerys, IIRC. Regards, Paul
Re: Lucene external field storage contribution.
Kevin, Sorry for the delay in replying. I think your idea for an external field storage mechanism is excellent. I'd love to see it and, if I can, will be willing to help make that happen. Regards, Terry ----- Original Message ----- From: Kevin A. Burton To: Lucene Users List Sent: Sunday, November 07, 2004 4:47 PM Subject: Lucene external field storage contribution. About 3 months ago I developed an external storage engine which ties into Lucene. I'd like to discuss making a contribution so that this is integrated into a future version of Lucene. I'm going to paste my original PROPOSAL in this email. There wasn't a ton of feedback the first time around, but I figure the squeaky wheel gets the grease... I created this proposal because we need this fixed at work. I want to go ahead and work on a vertical fix for our version of Lucene and then submit it back to Jakarta. There seems to be a lot of interest here and I wanted to get feedback from the list before moving forward... Should I put this in the wiki?! Kevin

** OVERVIEW **

Currently Lucene supports 'stored fields', where the content of these fields is kept within the Lucene index for use in the future. While acceptable for small indexes, large amounts of stored fields prevent:

- Fast index merges, since the full content has to be continually merged.
- Storing the indexes in memory (since a LOT of memory would be required, and this is cost-prohibitive).
- Fast queries, since block caching can't be used on the index data.

For example, in our current setup our index size is 20G. Nearly 90% of this is content. If we could store the content outside of Lucene, our merges and searches would be MUCH faster. If we could store the index in MEMORY, this could be orders of magnitude faster.

** PROPOSAL **

Provide an external field storage mechanism which supports legacy indexes without modification. Content is stored in a content segment. The only changes would be a field with 3 (or 4, if checksumming is enabled) values.
- CS_SEGMENT: logical ID of the content segment. This is an integer value. There is a global Lucene property named CS_ROOT which stores all the content. The segments are just flat files with pointers. Segments are broken into logical pieces by time and size; usually 100M of content would be in one segment.
- CS_OFFSET: the byte offset of the field.
- CS_LENGTH: the length of the field.
- CS_CHECKSUM: optional checksum to verify that the content is correct when fetched from the index.

The field value here would be exactly 'N:O:L', where N is the segment number, O is the offset, and L is the length. O and L are 64-bit values; N is a 32-bit value (though 64-bit wouldn't really hurt). This mechanism allows for the external storage of any named field. CS_OFFSET and CS_LENGTH allow use with RandomAccessFile and the new NIO code for efficient content lookup (though file-handle caching should probably be used). Since content is broken into logical 100M segments, the underlying filesystem can organize each file into contiguous blocks for efficient, non-fragmented lookup. File manipulation is easy, and indexes can be merged by simply concatenating the second file to the end of the first (though the segment, offset, and length need to be updated). (FIXME: I think I need to think about this more since I will have 100M per syncs) Supporting full Unicode is important. Full java.lang.String storage is used with String.getBytes(), so we should be able to avoid Unicode issues. If Java has a correct java.lang.String representation, it's possible to easily add Unicode support just by serializing the byte representation. (Note that the JDK says that the DEFAULT system char encoding is used, so if this is ever changed it might break the index.) While Linux and modern versions of Windows (not sure about OS X) support 64-bit filesystems, the 4G storage boundary of 32-bit filesystems (ext2 is an example) is an issue.
Using smaller indexes can prevent this, but eventually segment lookup in the filesystem will be slow. This will only happen within terabyte storage systems, so hopefully the developer has migrated to another (modern) filesystem such as XFS.

** FEATURES **

- Must be able to replicate indexes easily to other hosts.
- Adding content to the index must be CHEAP.
- Deletes need to be cheap (these are cheap for older content: just discard older indexes).
- Filesystem needs to be able to optimize storage.
- Must support UNICODE and binary content (images,
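The 'N:O:L' pointer format the proposal describes is easy to round-trip as a stored field value. A small self-contained sketch (the class and method names here are mine, not part of any Lucene release):

```java
// Sketch of the proposed external-storage pointer: a stored field whose
// value is "N:O:L", where N is a 32-bit content-segment number and
// O (offset) and L (length) are 64-bit values.
public class ContentPointer {
    final int segment;
    final long offset;
    final long length;

    ContentPointer(int segment, long offset, long length) {
        this.segment = segment;
        this.offset = offset;
        this.length = length;
    }

    // Encode as the field value the proposal describes.
    String encode() {
        return segment + ":" + offset + ":" + length;
    }

    // Parse a stored field value back into a pointer.
    static ContentPointer parse(String fieldValue) {
        String[] parts = fieldValue.split(":");
        if (parts.length != 3)
            throw new IllegalArgumentException("expected N:O:L, got " + fieldValue);
        return new ContentPointer(Integer.parseInt(parts[0]),
                                  Long.parseLong(parts[1]),
                                  Long.parseLong(parts[2]));
    }

    public static void main(String[] args) {
        ContentPointer p = ContentPointer.parse("3:1048576:8192");
        // The offset/length pair is what a RandomAccessFile seek/read would use.
        System.out.println(p.encode());
    }
}
```

The offset and length are exactly what a RandomAccessFile-based reader would need to fetch the content from the segment file without touching the Lucene index itself.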
LUCENE + DATA RETRIEVAL
Hi guys, Apologies... Has anyone on the forum attempted to retrieve data from and index Macromedia Flash-based files? If there is an example, please share it; it may be useful for developers. Thx in advance WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK]
Lucene 1.4.1 + OutOfMemory
Hi Guys Apologies.. History:
1st setup: 4 subindexes + MultiSearcher + search on the Content field only, for 2000 hits = Exception [ Too many files open ]
2nd setup: 40 merged indexes [1000 subindexes each] + MultiSearcher/ParallelSearcher + search on the Content field only, for 2 hits = Exception [ OutOfMemory ]
System config [same for both setups]: AMD processor [high-end, single], RAM 1 GB, OS Linux ( jantoo type ), app server Tomcat 5.05, JDK [ IBM Blackdown-1.4.1-01 ( == JDK 1.4.1 ) ]
The index contains 15 fields; the search is done on only 1 field; 11 corresponding fields are retrieved; 3 fields are for debug details. I switched from the 1st setup to the 2nd. Can somebody suggest why this is happening? Thx in advance WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK]
Re: Lucene 1.4.1 + OutOfMemory
There is a memory leak in the sorting code of Lucene 1.4.1. 1.4.2 has the fix! --- Karthik N S [EMAIL PROTECTED] wrote: [quoted text omitted]
Re: Locking issue
Otis or Erik, do you know if a Reader continuously opening should cause the Writer to fail with a "Lock obtain timed out" error? --- Lucene Users List [EMAIL PROTECTED] wrote: The attached Java file shows a locking issue that occurs with Lucene. One thread opens and closes an IndexReader. The other thread opens an IndexWriter, adds a document, and then closes the IndexWriter. I would expect that this app should be able to run happily without any issues. It fails with: java.io.IOException: Lock obtain timed out Is this expected? I thought a Reader could be opened while a Writer is adding a document. Any help is appreciated.