Re: Index-Format difference between 1.4.3 and 2.0
Hi Andrzej, a month ago you mentioned a new Lucene 2.0 compatible version of Luke. Does it exist somewhere?

Thanks,
lude

On 7/20/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> lude wrote:
> >> As Luke was released with Lucene 1.9
> >
> > Where did you get this information? From all I know, Luke is based on
> > Lucene version 1.4.3.
>
> The latest version of Luke was released with an early snapshot of 1.9. I
> plan to release a 2.0-based version in a few days.
>
> --
> Best regards,
> Andrzej Bialecki <><
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Index-Format difference between 1.4.3 and 2.0
Not sure if it helps, but I have been using Luke (the webstart version) from its website for quite some time now for inspecting and manipulating my indexes built with Lucene 2.0. I may not be a power user of Luke in that sense, but I haven't found any issues using the basic features.

Gopi

On 8/25/06, lude <[EMAIL PROTECTED]> wrote:
> Hi Andrzej, a month ago you mentioned a new Lucene 2.0 compatible version
> of Luke. Does it exist somewhere?
>
> Thanks,
> lude
Re: Upgrade from 1.4.3 to 1.9.1. Any problems with using existing index files?
> We are upgrading from Lucene 1.4.3 to 1.9.1, and have many customers with
> large existing index files. In our testing we have reused large indexes
> created in 1.4.3 in 1.9.1 without incident. We have looked through the
> changelog and the code and can't see any reason there should be any
> problems doing so. So, we're just wondering: has anyone had any problems,
> or is there anything we need to look out for?

Looking at the code and also at the file formats specification:

  http://lucene.apache.org/java/docs/fileformats.html

I believe this is completely fine. Meaning, the 1.9.x code can open the older index format for both searching and writing (either deletes or added docs), without issue.

Mike
Index Stat Functions
Hi All, I am trying to get some stats on my index, such as:

1) When it was created
2) The size in MB of the index
3) The size and date of each file in the index

For example: if I index 100 files, is it possible for me to get each file's name, size, and the date of its last modification (similar to a Unix "ls -la /path/to/file")?

tia
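One caveat first: the Lucene index stores terms, not the original source files, so per-source-file metadata has to be kept by your own application (e.g. as stored fields). The index directory itself, however, can be inspected with plain java.io. A minimal sketch, assuming the index lives in an ordinary filesystem directory (the path is a placeholder):

```java
import java.io.File;
import java.util.Date;

public class IndexStats {
    // Print name, size, and last-modified date for each file in a
    // directory, plus the total size in MB -- roughly what "ls -la"
    // shows for the index directory. Returns the total size in bytes.
    public static long printStats(File dir) {
        long total = 0;
        File[] files = dir.listFiles();
        if (files == null) return 0;  // not a directory
        for (File f : files) {
            System.out.println(f.getName() + "\t" + f.length() + " bytes\t"
                    + new Date(f.lastModified()));
            total += f.length();
        }
        System.out.println("total: " + (total / (1024.0 * 1024.0)) + " MB");
        return total;
    }

    public static void main(String[] args) {
        printStats(new File(args.length > 0 ? args[0] : "."));
    }
}
```

The creation date of the index can be approximated by the oldest file's lastModified(), though that is only as reliable as the filesystem timestamps.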
what do i get with FieldCache.DEFAULT.getStrings(...);
hello,

I am using FieldCache.DEFAULT.getStrings in combination with my own HitCollector (I loop through all results and count the number of occurrences of a field value in the results). My problem is that I have field values like dt.|lat or ger.|eng, and it seems that only the last token of the field's value is stored in the returned array of FieldCache.DEFAULT.getStrings(is.getIndexReader(), category). But both values are stored in the index (I can find dt. and lat.). The same issue occurs with another field which contains the word cd-rom; I get only "rom" back. Is this an analyzer problem? How do I get all tokens?

tia,
martin
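A likely explanation (my reading, not confirmed in the thread): FieldCache.getStrings caches exactly one term per document, so when the analyzer splits "dt.|lat" into several tokens, only one of them can survive in the cache. A common workaround is to keep the raw value in an untokenized or stored field and split it yourself at collection time. The splitting half of that is plain Java; the "|" delimiter matches the values above:

```java
public class MultiValueSplit {
    // Split a raw multi-value field string such as "dt.|lat" back into
    // its individual values. String.split takes a regex, so the pipe
    // must be escaped.
    public static String[] split(String raw) {
        return raw.split("\\|");
    }

    public static void main(String[] args) {
        for (String value : split("dt.|lat")) {
            System.out.println(value);
        }
    }
}
```

For the cd-rom case the analyzer (e.g. StandardAnalyzer) is splitting on the hyphen; indexing the field untokenized avoids that, at the cost of losing partial-word matching.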
Re: controlled vocabulary
Hi,

Thank you for your reply. I had thought about the first two solutions before. If we apply one doc for each MeSH term, it would be 26 docs for each item digested (we actually need the top 25 MeSH terms generated); would it be a problem if there are too many documents? If we apply field names like "mesh_1", "mesh_2", ..., when it comes to search we will have to generate a loop for every single one of the query terms (there will be more than 20-30 terms on average, since we are using the semantic web to implement concept search); do you think it would affect the performance in a very bad way?

Regards,
Xin

----- Original Message -----
From: "Dedian Guo" <[EMAIL PROTECTED]>
To: ; "Zhao, Xin" <[EMAIL PROTECTED]>
Sent: Thursday, August 24, 2006 4:22 PM
Subject: Re: controlled library

> in my solution, you can apply one doc for each mesh term, or apply
> different keywords such as "mesh_1"..."mesh_10" for your top 10 terms...
> or u can group your mesh terms as one string then add into a field, which
> requires a simple string parser for the group string when you wanna read
> the terms... not sure if that works or answers your question...
>
> On 8/24/06, Zhao, Xin <[EMAIL PROTECTED]> wrote:
> > Hi, I have a design question. Here is what we try to do for indexing:
> > We designed an indexing tool to generate standard MeSH terms from
> > medical citations, and then use Lucene to save the terms and citations
> > for future search. The information we need to save is:
> > a) the exact mesh terms (top 10)
> > b) the score for each term
> > so the code is like:
> >
> >   // for the top 10 MeSH terms
> >   myField = Field.Keyword("mesh", mesh.toLowerCase());
> >   myField.setBoost(score);
> >   doc.add(myField);
> >   // end for
> >
> > As you can see, we generate all the terms under the named field "mesh".
> > If I understand correctly, all the fields under the same name would
> > eventually be saved into one field, with all the scores normalized into
> > the field boost. In this case, we wouldn't be able to save a separate
> > score, so the information is lost. Am I correct? Is there any way we
> > could change it? I understand Lucene is for keyword search, and what we
> > try to do is controlled vocabulary search; is there any other tool we
> > could use?
> > Thank you, Xin
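Dedian's third suggestion — grouping the terms into one string plus a simple parser — can also carry the per-term score exactly, which the field boost cannot (boosts on same-named fields are combined). A sketch under an assumed "term=score" pairs-joined-by-";" format; the class and method names are hypothetical, and the encoded string would go into a single stored Lucene field:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MeshCodec {
    // Encode mesh terms and their scores into one string suitable for a
    // single stored field, e.g. "neoplasms=0.91;liver=0.45".
    public static String encode(Map<String, Float> termScores) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Float> e : termScores.entrySet()) {
            if (sb.length() > 0) sb.append(';');
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    // Parse the stored string back into term -> score pairs, preserving
    // the original order of the terms.
    public static Map<String, Float> decode(String stored) {
        Map<String, Float> out = new LinkedHashMap<String, Float>();
        if (stored.length() == 0) return out;
        for (String pair : stored.split(";")) {
            int eq = pair.indexOf('=');
            out.put(pair.substring(0, eq),
                    Float.valueOf(pair.substring(eq + 1)));
        }
        return out;
    }
}
```

The terms themselves can still be indexed separately (without scores) in a second field for searching; the encoded field only serves retrieval of the exact scores.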
Re: Lucene vs Database Search
Performance-wise, Lucene search is much faster for full-text search. If you only do "Employee ID" search, or exact matching of names, the database's search can already do a good job.

If it's regarding index maintenance, you should have an updated_at column for each record, select the latest records out, and do an update to the index periodically.

Chris Lu
-
Lucene Search for Any Databases/Applications
http://www.dbsight.net

On 8/24/06, kalpesh patel <[EMAIL PROTECTED]> wrote:
> Hi, I have an application. It has a large number of records (around 1.2
> million) with a possibility of doubling every year. The average number of
> records added per day is around 3000, distributed over the day. An
> inserted record has to be searchable immediately once it is entered into
> the database and the index updated. I have created a Lucene index, and the
> size is around 0.5 GB. The search does NOT require text search; it just
> includes search by First Name, Last Name, and Employee ID. What would be
> the better solution in the existing situation and in the long run:
> keeping all the searchable records in one database table (issuing a
> select query against one table), or using a Lucene index?
> Thanks in advance. -Kalpesh
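Chris's updated_at approach, sketched without the database or Lucene specifics (the Record class and method names are hypothetical stand-ins for the employee table and for IndexWriter calls): remember the timestamp of the last index pass and re-index only records modified since.

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementalIndexer {
    // A record as it would come from the employee table; updatedAt stands
    // in for the updated_at column (a plain long timestamp here).
    static class Record {
        final String employeeId;
        final long updatedAt;
        Record(String employeeId, long updatedAt) {
            this.employeeId = employeeId;
            this.updatedAt = updatedAt;
        }
    }

    private long lastIndexedAt = 0;

    // Stand-in for "SELECT ... WHERE updated_at > ?" followed by adding
    // or updating the documents in the Lucene index; returns the records
    // that would be (re)indexed on this pass.
    public List<Record> indexNewRecords(List<Record> table) {
        List<Record> toIndex = new ArrayList<Record>();
        long newest = lastIndexedAt;
        for (Record r : table) {
            if (r.updatedAt > lastIndexedAt) {
                toIndex.add(r);  // IndexWriter addDocument would go here
                if (r.updatedAt > newest) newest = r.updatedAt;
            }
        }
        lastIndexedAt = newest;
        return toIndex;
    }
}
```

The trade-off for "searchable immediately" is how often this runs and how often the searcher is reopened; at ~3000 inserts a day a frequent small update pass is cheap.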
Re: what do i get with FieldCache.DEFAULT.getStrings(...);
Not sure of the solution, though. But FieldCache.DEFAULT.getStrings() returns a String[] with one String for each document. It seems your field is analyzed into multiple String values.

Chris Lu
---
Lucene Search on Any Databases/Applications
http://www.dbsight.net
Re: Will storing docs affect Lucene's search performance?
Where can I find information on which version / tag to check out so as to get the lazy-loading variety of Lucene?

Grant Ingersoll wrote:
> Large stored fields can affect performance when you are iterating over
> your hits (assuming you are not interested in the value of the stored
> field at that point in time) for a results display, since all Fields are
> loaded when getting the Document. The SVN trunk has a version of lazy
> loading that allows you to specify which fields are loaded and which ones
> are lazy, so you can avoid loading fields that a user will never view.
> -Grant
>
> On Aug 11, 2006, at 9:07 AM, Prasenjit Mukherjee wrote:
> > I have a requirement (use highlighter) to store the doc content
> > somewhere, and I am not allowed to use an RDBMS. I am thinking of using
> > Lucene's Field with (Field.Store.YES and Field.Index.NO) to store the
> > doc content. Will it have any negative effect on my search performance?
> > I think I have read somewhere that Lucene shouldn't be used (or misused)
> > to provide RDBMS-like storage. --prasen
>
> --
> Grant Ingersoll
> Sr. Software Engineer
> Center for Natural Language Processing
> Syracuse University, 335 Hinds Hall, Syracuse, NY 13244
> http://www.cnlp.org
Re: controlled vocabulary
hi Xin

take a look at this: you can add multiple fields with the name mesh

  for (int i = 0; i < meshList.size(); i++) {
      meshTerm = meshList.get(i);
      document.add(new Field("mesh", meshTerm.semanticWebConceptId,
                             Field.Store.YES, Field.Index.NO_NORMS));
  }

when querying this index, create an analyzer that infers the text string and generates ids that correspond to the mesh term in the semantic web
Re: controlled vocabulary
Hi, Rupinder,

My understanding is that Field.Index.NO_NORMS disables index-time boosting and field-length normalization at the same time. But I do need index-time boosting to store the scoring of each mesh term. Have I missed anything?

Thank you very much for your help,
Xin
Re: controlled vocabulary
Hi Xin

then perhaps you can change it to Field.Index.TOKENIZED. but i was not aware that pubmed boosts mesh terms; they broadly classify terms as major and minor. if you plan to use this simple system of classification, consider adding the major terms twice to the document?
Re: controlled vocabulary
Hi, Rupinder,

Our algorithm is a little different from what PubMed does. We have a score for each mesh term, which will affect the search result. What do you think the difference would be between these two:

  doc.add(Field.Keyword("mesh", ""));

and

  doc.add(new Field("mesh", "", Field.Store.YES, Field.Index.TOKENIZED));

Thank you,
Xin
Re: Test new query parser?
I have received a few inquiries about my new query parser. I apologize for making that announcement a little prematurely. My current implementation only allows simple mixing of proximity queries with boolean queries; complex mixing would result in an incorrect search. A reply to my first email made me consider this more (I had done that part a while ago), and I came to the conclusion that it was obviously unacceptable to release the parser to anyone in this hobbled form. The parser must support arbitrary mixing of boolean and proximity searches. I think I have cracked this. I would say I am 90% of the way to a solution and can see the light at the end of the tunnel. When I have resolved this issue, I will contact those who have expressed interest and provide them with the parser. With some feedback and improvements, I will think about how to release it generally.

- Mark
Re: controlled vocabulary
now i have a second thought about one mesh term per document. the scoring formula (hits too) is based on documents, right? does it mean that we shouldn't have more than one document for each object indexed? for example, if i try to index a publication, for some of the information, like title and abstract, i would like to store and index it using the default similarity, while for the other information i would like to use a customized similarity. i probably should use a different indexing directory and writer instead of two documents in the same index, right? thank you for helping me. you could see that i am in the early learning stage now.

xin
Re: Will storing docs affect Lucene's search performance?
It is on the HEAD version in SVN. See http://wiki.apache.org/jakarta-lucene/SourceRepository for info on checking out from SVN.

-Grant

On Aug 25, 2006, at 10:44 AM, Rupinder Singh Mazara wrote:
> Where can I find information on which version / tag to check out so as to
> get the lazy-loading variety of Lucene?
Sharing Documents between Lucene and DotLucene
Hello-

I am just wondering if anyone has encountered any good strategies for sharing search records between a Linux-based server using Lucene and a Windows-based client using DotLucene. I am doing all the indexing on the server (i.e. the master index is contained on the server) and I would like to transfer parts of that index across the wire to a client. Presently I am creating a temporary sub-index on the server, adding the appropriate Document objects to that index, then transferring the entire index to the client, which then merges the index into any existing index it may already have. However, I would like to avoid building/transferring a sub-index. I would like to know if anyone has attempted to directly marshal Document objects from Java to C#, or if there are any other good approaches to sharing individual Document objects between Lucene and DotLucene.

Thanks.
-drj
Re: controlled vocabulary
Hi Xin,

In my understanding, a document in Lucene is a collection of fields, while a field is a pair of keyword and value, though it can be indexed or stored or both. That is a flat structure. If you want to index a deep tree structure, such as complex objects, and keep those relationships inside, I guess we need to do something tricky. So in the solution I mentioned, I would do something with the keyword of a document (here, a document represents an object...). The score problem you mentioned in your question is similar; I mean, the score is actually an attribute of the MeSH object, so you want to index information which has a tree-like structure (I met a similar problem when indexing XML-based pages, especially those with lots of deep element nodes, where a deep index is needed for deep searching).

Correct me if I am wrong, or if there are better solutions...

On 8/25/06, Zhao, Xin <[EMAIL PROTECTED]> wrote:

> Now I have second thoughts about one MeSH term per document. The scoring formula (and Hits too) is based on documents, right? Does that mean we shouldn't have more than one document for each object indexed? For example, when I index a publication, for some of the information, like title and abstract, I would like to store and index it using the default similarity, while for the other information I would like to use a customized similarity. I should probably use a different index directory and writer instead of two documents in the same index, right? Thank you for helping me; as you can see, I am in the early learning stage now.
>
> Xin
>
> ----- Original Message -----
> From: "Zhao, Xin" <[EMAIL PROTECTED]>
> Sent: Friday, August 25, 2006 10:21 AM
> Subject: Re: controlled vocabulary
>
>> Hi,
>> Thank you for your reply. I had thought about the first two solutions before. If we apply one doc per MeSH term, it would be 26 docs for each item digested (we actually need the top 25 MeSH terms generated); would it be a problem if there are too many documents? If we apply field names like "mesh_1", "mesh_2", ..., then when it comes to searching we will have to generate a loop for every single one of the query terms (there will be more than 20-30 terms on average, since we are using the semantic web to implement concept search). Do you think that would affect the performance in a very bad way?
>> Regards,
>> Xin
>>
>> ----- Original Message -----
>> From: "Dedian Guo" <[EMAIL PROTECTED]>
>> To: ; "Zhao, Xin" <[EMAIL PROTECTED]>
>> Sent: Thursday, August 24, 2006 4:22 PM
>> Subject: Re: controlled vocabulary
>>
>>> In my solution, you can apply one doc for each MeSH term, or apply different keywords such as "mesh_1"..."mesh_10" for your top 10 terms, or you can group your MeSH terms into one string and add it to a field, which requires a simple string parser for the grouped string when you want to read the terms back...
>>>
>>> Not sure if that works or answers your question...
>>>
>>> On 8/24/06, Zhao, Xin <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Hi,
>>>> I have a design question. Here is what we are trying to do for indexing: we designed an indexing tool to generate standard MeSH terms from medical citations, and then use Lucene to save the terms and citations for future search. The information we need to save is:
>>>> a) the exact MeSH terms (top 10)
>>>> b) the score for each term
>>>> so the code looks like:
>>>> ---
>>>> for the top 10 MeSH terms
>>>>     myField = Field.Keyword("mesh", mesh.toLowerCase());
>>>>     myField.setBoost(score);
>>>>     doc.add(myField);
>>>> end for
>>>> ---
>>>> As you can see, we generate all the terms under the named field "mesh". If I understand correctly, all the fields under the same name eventually get saved into one field, with all the scores normalized into the field boost. In this case we wouldn't be able to save separate scores, so the information is lost. Am I correct? Is there any way we could change it? I understand Lucene is for keyword search, and what we are trying to do is controlled-vocabulary search. Is there any other tool we could use?
>>>>
>>>> Thank you,
>>>> Xin
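One possible workaround for the lost-score problem discussed above: since all per-field boosts on a multi-valued field get folded into a single norm, the individual scores can instead be carried in the stored value itself, e.g. as "term|score" strings in a stored-only companion field such as Field.UnIndexed("mesh_score", ...). This field layout is an assumption, not something from the original thread; the sketch below only shows the plain-Java encoding.

```java
public class MeshScore {
    // Encode a MeSH term plus its score into one stored string, e.g.
    // "neoplasms|0.92", suitable for a stored-only field:
    //   doc.add(Field.UnIndexed("mesh_score", MeshScore.encode(term, score)));
    public static String encode(String term, float score) {
        return term.toLowerCase() + "|" + score;
    }

    // Recover the term from a stored "term|score" string.
    public static String term(String stored) {
        return stored.substring(0, stored.lastIndexOf('|'));
    }

    // Recover the score from a stored "term|score" string.
    public static float score(String stored) {
        return Float.parseFloat(stored.substring(stored.lastIndexOf('|') + 1));
    }
}
```

The term stays separately searchable through the ordinary "mesh" keyword field; the companion field exists only so the exact per-term score survives retrieval.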
RE: Sharing Documents between Lucene and DotLucene
Hi,

I am the developer and maintainer of Lucene.Net. DotLucene is the old name; Lucene.Net is the official name. You can find out more about Lucene.Net by visiting this link: http://incubator.apache.org/lucene.net/

I am not sure what you mean by "marshall Document objects from Java to C#". However, if you mean sharing an index created by Jakarta Lucene so that it can be searched/updated by Lucene.Net, and vice versa, then the answer is yes. In fact, if you share the lock file, you can have concurrent access to and updates of the Lucene index from both Jakarta Lucene and Lucene.Net. As part of each Lucene.Net release, I always test and validate this test case.

Regards,

-- George Aroush

-----Original Message-----
From: d rj [mailto:[EMAIL PROTECTED]
Sent: Friday, August 25, 2006 5:33 PM
To: java-user@lucene.apache.org
Subject: Sharing Documents between Lucene and DotLucene
Re: what do i get with FieldCache.DEFAULT.getStrings(...);
FieldCache was designed with sorting in mind, where there can only be a single indexed Term for each doc (otherwise, how would you sort a doc that had two Terms, "a" and "z"?). I'm actually surprised you are getting any values out instead of an exception.

If you index your field as UN_TOKENIZED you should get the results you expect -- but then searching on individual words may not work the way you expect. Adding the data to two different fields (one TOKENIZED for search and one UN_TOKENIZED for sorting/FieldCache) is the typical solution.

You may also want to look at the lazy field loading using the Fieldable APIs; they are for accessing the STORED fields of a Document, and are apparently much faster than the old method of pulling out the whole Document... whether they are as fast as FieldCache or not, I don't know.

: Date: Fri, 25 Aug 2006 15:26:38 +0200
: From: Martin Braun <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org, [EMAIL PROTECTED]
: To: java-user@lucene.apache.org
: Subject: what do i get with FieldCache.DEFAULT.getStrings(...);
:
: hello,
: I am using FieldCache.DEFAULT.getStrings in combination with my own
: HitCollector (I loop through all results and count the number of
: occurrences of a field value in the results).
:
: My problem is that I have field values like dt.|lat or ger.|eng., and it
: seems that only the last token of the field's value is stored in the
: array returned by FieldCache.DEFAULT.getStrings(is.getIndexReader(),
: category).
:
: But both values are stored in the index (I can find dt. and lat.)
:
: The same issue occurs with another field which contains the word cd-rom; I
: get only "rom" back.
:
: Is this an analyzer problem? How do I get all tokens?
:
: tia,
: martin

-Hoss
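Hoss's suggested two-field layout might look like the following sketch. The field names are made up for illustration; the UN_TOKENIZED copy is the one FieldCache should read.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DualFieldSketch {
    // Index the same value twice: once analyzed for searching,
    // once as a single term for sorting/FieldCache.
    public static Document makeDoc(String value) {
        Document doc = new Document();
        // Tokenized copy: split into words by the analyzer, used for search.
        doc.add(new Field("category", value, Field.Store.YES, Field.Index.TOKENIZED));
        // Untokenized copy: exactly one indexed term per document, so
        // FieldCache.DEFAULT.getStrings(reader, "category_raw") returns it whole.
        doc.add(new Field("category_raw", value, Field.Store.NO, Field.Index.UN_TOKENIZED));
        return doc;
    }
}
```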
A problem on performance
I have nearly 4 million Chinese documents, each ranging in size from 1k to 300k, so I use org.apache.lucene.analysis.cn.ChineseAnalyzer as the analyzer for the text. The index has four fields:

- content: tokenized, not stored
- title: tokenized and stored
- path: stored only
- date: stored only

For some reason, I divide these documents into 12 sets and use an IndexSearcher over a MultiReader for search. For all English queries the speed is very fast, costing only about 10-100ms. But when I use Chinese words in the query, the situation is a bit confusing: if the word is only one character, so that the query is actually a TermQuery, the speed is very fast. However, if the word is more than one character, the query is actually a PhraseQuery with slop 0, and the IndexSearcher usually takes 3000-5000ms to return the Hits. I have also tested with the QueryParser and get the same results.

My environment is a Dell PE2600, 2G*2 Xeon, 2G RAM, 1R/s SCSI, Debian/sarge, Sun JDK 1.5 + Lucene 2.0.0.

thanks.
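For context on why the multi-character case turns into a PhraseQuery: ChineseAnalyzer emits one token per Chinese character, so QueryParser builds a phrase (which must consult term positions, and is therefore slower) from any multi-character word. A sketch, using the "content" field name from the post; the query strings are arbitrary examples.

```java
import org.apache.lucene.analysis.cn.ChineseAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class ChineseQuerySketch {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("content", new ChineseAnalyzer());
        // One character -> one token -> a cheap TermQuery.
        Query single = parser.parse("\u4e2d");
        // Two characters -> two tokens -> a PhraseQuery with slop 0,
        // which has to read position data for both terms.
        Query phrase = parser.parse("\u4e2d\u6587");
        System.out.println(single + " vs " + phrase);
    }
}
```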
Incremental Updating
Hi,

I have two applications on a Windows machine. One is the search engine, where the index can be searched. The second application runs once a day and updates the index (deletions/additions).

My question: the index is already opened (via an IndexReader) by the first application. Is there a problem when the second application accesses the same index files for updating at the same time? I tried it and got no exception, but when I search for the documents whose values were changed (first delete, then add the new document), I can only find them with the old values, not with the new ones.

Thanks a lot for your help :-))
--
View this message in context: http://www.nabble.com/Incemental-Updating-tf2168389.html#a5995363
Sent from the Lucene - Java Users forum at Nabble.com.
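What the poster observes is expected behavior: an IndexReader sees a point-in-time snapshot of the index, so the searching application has to reopen its reader/searcher after the nightly update before the changed documents become visible. A minimal sketch, assuming the Lucene 2.0 API (class name and directory path are hypothetical):

```java
import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;

public class SearcherHolder {
    private final String indexDir;
    private IndexSearcher searcher;

    public SearcherHolder(String indexDir) throws IOException {
        this.indexDir = indexDir;
        this.searcher = new IndexSearcher(indexDir);
    }

    // Call this after the nightly update finishes: a searcher only sees
    // the index as of the moment its reader was opened, so old values
    // keep showing up until it is reopened.
    public synchronized void refresh() throws IOException {
        searcher.close();
        searcher = new IndexSearcher(indexDir);
    }

    public synchronized IndexSearcher get() {
        return searcher;
    }
}
```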