Problem in unicode field value retrieval
Hi,

I am trying to index and search Unicode (UTF-8) text. The code I am using to index the documents is as follows:

    import java.io.*;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Create a new index (third argument true = overwrite any existing index)
    IndexWriter iw = new IndexWriter("d:\\jakarta-tomcat3.2.3\\webapps\\lucene\\index", new SimpleAnalyzer(), true);
    String dirBase = "d:\\jakarta-tomcat3.2.3\\webapps\\lucene\\docs";
    File docDir = new File(dirBase);
    String[] docFiles = docDir.list();
    InputStreamReader isr;
    InputStream is;
    Document doc;
    for (int i = 0; i < docFiles.length; i++) {
        File tempFile = new File(dirBase + "\\" + docFiles[i]);
        if (tempFile.isFile()) {
            System.out.println("Indexing File : " + docFiles[i]);
            is = new FileInputStream(tempFile);
            isr = new InputStreamReader(is, "utf-8");   // decode the file as UTF-8
            doc = new Document();
            doc.add(Field.UnIndexed("path", tempFile.toString()));
            doc.add(Field.Text("abc", (Reader) isr));   // field content comes from a Reader
            doc.add(Field.Text("all", "sansui"));
            iw.addDocument(doc);
            is.close();
            isr.close();
            doc = null;
        }
    }
    iw.close();
    is = null; isr = null; iw = null; docDir = null;
    System.out.println("Indexing Complete");

Now when I search the contents and try to get the field called abc using doc.get("abc"), I get null as the output. Can anyone please tell me where I am going wrong?

Thanks and regards,
Harpreet
Re: Problem in unicode field value retrieval
I don't think you can retrieve the contents of Fields that have been loaded by a Reader. From the javadoc for Field:

    Text(String name, Reader value)
    Constructs a Reader-valued Field that is tokenized and indexed, but is not stored in the index verbatim.

--
Ian.
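One way to act on this, sketched under assumptions: read the whole file into a String first and hand that String to the String-valued overload, Field.Text(String name, String value), which in the Lucene 1.x javadoc is described as stored in the index. The helper below is hypothetical (the class name Utf8Text and the method readAll are not part of Lucene) and uses only the Java standard library; the Lucene call itself appears only as a comment. Keep in mind that storing full document text makes the index larger.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Lucene: reads a whole UTF-8 stream into a String. */
final class Utf8Text {

    /** Decodes everything in the stream as UTF-8 and returns it as one String. */
    static String readAll(InputStream in) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a FileInputStream over one of the documents.
        byte[] bytes = "caf\u00e9 \u0939\u093f\u0928\u094d\u0926\u0940".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(bytes));
        // With Lucene on the classpath, the String-valued overload could then be used:
        //   doc.add(Field.Text("abc", text));
        System.out.println(text);
    }
}
```

In the indexing loop from the original post, readAll(new FileInputStream(tempFile)) would take the place of the InputStreamReader passed to Field.Text.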
Re: Problem in unicode field value retrieval
Hi,

That was the problem, thanks :-). I am still struggling to get Lucene to search non-English Unicode content. It works partially with SimpleAnalyzer but doesn't return any results with StandardAnalyzer. Is there a way to output the exact contents that are going into the index?

Thanks and regards,
Harpreet

----- Original Message -----
From: Ian Lea
To: Harpreet S Walia
Cc: Lucene Users List
Sent: Monday, June 10, 2002 5:15 PM
Subject: Re: Problem in unicode field value retrieval

[...]
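Before looking at the index itself, it can help to check what the decoded text looks like to a letter-based tokenizer. The sketch below is a diagnostic in plain Java, not a Lucene call: the class CharDump and its method describe are hypothetical names. It assumes the tokenizer in use decides token boundaries with Character.isLetter, and it lists each char of a string with that classification. Combining marks and vowel signs, which are common in Indic scripts, are not letters by that test, so a word containing them may be split into fragments.

```java
import java.util.Locale;

/** Hypothetical diagnostic, not part of Lucene: shows how Character.isLetter classifies each char. */
final class CharDump {

    /** Returns one line per UTF-16 char: its code unit in hex and whether it counts as a letter. */
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            sb.append(String.format(Locale.ROOT, "U+%04X letter=%b", (int) c, Character.isLetter(c)));
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A Devanagari word: consonants are letters, vowel signs and the virama are not.
        System.out.print(describe("\u0939\u093f\u0928\u094d\u0926\u0940"));
    }
}
```

If the dump shows unexpected code units (for example U+FFFD), the file was probably not decoded with the right charset; if it shows non-letter marks inside words, the analyzer's splitting rule is the more likely culprit.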
Re: Problem in unicode field value retrieval
Hello,

> That was the problem, thanks :-). I am still struggling to get Lucene to search non-English Unicode content. It works partially with SimpleAnalyzer but doesn't return any results with StandardAnalyzer. Is there a way to output the exact contents that are going into the index?

Perhaps something like this will help. This is a very recent post from the searchable mailing list archives at http://nagoya.apache.org/:
http://nagoya.apache.org/eyebrowse/ReadMsg?[EMAIL PROTECTED]msgId=352570

Otis
Re: Within Search
Hello,

I'm sending this to the lucene-user list, as that seems more appropriate.

I haven't used Lucene's slop feature, but it looks like both QueryParser and PhraseQuery support slop. I am not sure what the syntax for it is, but if nothing else you should be able to call the setSlop(int) method on an instance of PhraseQuery. Oh, it looks like you missed it in the Query Parser Syntax document: http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

Otis

--- none none wrote:
> Hi,
> I asked for help with this feature some time ago, but got no answer. What I need is a "within" phrase search (WithinPhraseSearch). For example, search for: car w/10 rent. This means: look for documents that contain 'car' and, within 10 words of it, 'rent'.
> So, what I think I need is:
> 1. Change QueryParser.jj to recognize the operator w/xx as the within operator.
> 2. The QueryParser should return a PhraseQuery with a slop factor equal to 10 for the example above. It should also ignore w/xx if xx is not numeric.
> Another question: what should I do if I want the query operators (AND, OR, NOT, etc.) to be case insensitive? What should I change inside QueryParser.jj?
> Please help, because I really don't know how to use the JavaCC utility.
> Thanks, bye.
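An alternative to editing QueryParser.jj, sketched under assumptions: rewrite the query string before it reaches the parser, turning term w/N term into the proximity form described in the Query Parser Syntax document, for example "car rent"~10. The class WithinRewriter and its method are hypothetical names, only the standard library is used, and only plain single-word ASCII terms on both sides of the operator are handled. A non-numeric w/xx is left untouched rather than dropped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical preprocessor, not part of Lucene: maps "a w/N b" to the proximity syntax "\"a b\"~N". */
final class WithinRewriter {

    // word, whitespace, w/<digits>, whitespace, word. \w is ASCII-only with default flags.
    private static final Pattern WITHIN =
            Pattern.compile("(\\w+)\\s+[wW]/(\\d+)\\s+(\\w+)");

    /** Rewrites every non-overlapping "term w/N term" occurrence; everything else is left as is. */
    static String rewrite(String query) {
        Matcher m = WITHIN.matcher(query);
        return m.replaceAll("\"$1 $3\"~$2");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("car w/10 rent"));   // prints "car rent"~10
        System.out.println(rewrite("car w/xx rent"));   // unchanged: xx is not numeric
    }
}
```

The rewritten string would then be passed to QueryParser as usual. Operands that are wildcards, phrases, or parenthesized subqueries are not handled by this sketch.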
Re: Within Search
Thanks, I saw the QueryParser documentation and it works fine. Now, how can I make the query operators such as 'AND', 'OR', etc. case insensitive? Also, how can I change the '~' to 'w/'? I really don't know how to use JavaCC, but it may be easy for someone else. Can someone help me?

Thank you.

On Mon, 10 Jun 2002 09:01:29 none none wrote:
> [...]
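For the case-insensitive operators, one option that avoids JavaCC is to normalize the query string before parsing. The sketch below is an assumption-laden illustration, not a change to QueryParser: the class OperatorCase and its method normalize are hypothetical names. It upper-cases the standalone words and, or, and not when they appear outside double-quoted phrases, and leaves everything else alone. Escaped quotes are not handled, and a user who wants to search for the literal word "or" would need to quote it.

```java
import java.util.Locale;

/** Hypothetical preprocessor, not part of Lucene: upper-cases and/or/not outside quoted phrases. */
final class OperatorCase {

    static String normalize(String query) {
        StringBuilder out = new StringBuilder();
        StringBuilder word = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '"') {
                flush(word, out, inQuotes);   // finish the word using the state before the quote
                inQuotes = !inQuotes;
                out.append(c);
            } else if (Character.isWhitespace(c)) {
                flush(word, out, inQuotes);
                out.append(c);
            } else {
                word.append(c);
            }
        }
        flush(word, out, inQuotes);
        return out.toString();
    }

    /** Appends the pending word, upper-cased only if it is an operator outside quotes. */
    private static void flush(StringBuilder word, StringBuilder out, boolean inQuotes) {
        String w = word.toString();
        String upper = w.toUpperCase(Locale.ROOT);
        boolean isOperator = upper.equals("AND") || upper.equals("OR") || upper.equals("NOT");
        out.append(!inQuotes && isOperator ? upper : w);
        word.setLength(0);
    }

    public static void main(String[] args) {
        System.out.println(normalize("car and rent"));               // prints car AND rent
        System.out.println(normalize("\"rock and roll\" or jazz"));  // prints "rock and roll" OR jazz
    }
}
```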
Re: Within Search
This is a bit more complicated. We had this discussion a while ago about having a NEAR operator. A queryParser.jj showing how to do this is in the developer mailing list. The problem is that the solution is not generic: what if a term is a wildcard or a more complicated subquery (a query in parentheses)? For example:

    (a AND b) w/ c

This type of query is not supported by the Lucene slop factor. That's why it is not supported in Lucene as part of the general QueryParser. If you are willing to live with these limitations, the queryParser.jj with the NEAR operator should work.

--Peter

On 6/10/02 1:19 PM, none none wrote:
> [...]
How does simple analyser work
Hi,

Are there any resources available that explain how SimpleAnalyzer processes the data given to it? What I want to know is: given a set of words, what exact rules are applied to tokenize and index them, and how can I customize those rules? My requirement is that the words be broken only at spaces and not at any other character. I understand that this can be done by writing a parser in JavaCC, but is there a simpler way of achieving this?

I would really appreciate the help.

Thanks and regards,
Harpreet
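To make the two rules concrete, here is a plain-Java sketch that does not call Lucene. It assumes SimpleAnalyzer's rule is "split at every character that is not a letter (per Character.isLetter), then lower-case", and places next to it the rule asked for here, "split at whitespace only". The class TokenizeSketch and its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration, not part of Lucene: two tokenization rules side by side. */
final class TokenizeSketch {

    /** Letter-based rule: tokens are maximal runs of letters, lower-cased. */
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());   // a non-letter ends the current token
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Whitespace-only rule: tokens are maximal runs of non-whitespace, case preserved. */
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "Wi-Fi 802.11b";
        System.out.println(letterTokens(sample));      // prints [wi, fi, b]
        System.out.println(whitespaceTokens(sample));  // prints [Wi-Fi, 802.11b]
    }
}
```

Within Lucene itself, the analysis package in the releases I am aware of includes a WhitespaceAnalyzer that applies the second rule; whether it is present in the version in use here is an assumption worth checking against that version's javadoc before reaching for JavaCC.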