RE: searching using the CJKAnalyzer
I didn't need to make any changes to Entities to get Japanese searches working. Are you using the CJKAnalyzer when you perform the search, and not only when building the index?

-----Original Message-----
From: Daan Hoogland [mailto:[EMAIL PROTECTED]
Sent: Sunday, October 10, 2004 10:48 PM
To: Lucene Users List
Subject: Re: searching using the CJKAnalyzer
Importance: Low

Che Dong wrote:
It seems to be not an Analyzer problem but an HTML parser charset detection error. Could you show me the details of the problem?

Thanks Che,
I got it working by making decode() from the Entities class in the demo public. I wrote a scanner to translate any entities in the query. I want to translate back to entities in the results, but I'm not sure what the criteria should be. It seems to be just binary data. How to conclude that 04?03?04 means ?

Thanks
Che Dong

Daan Hoogland wrote:
LS, in http://issues.apache.org/eyebrowse/ReadMsg?listId=30msgNo=8980 Jon Schuster explains how to get a Japanese search system working. I followed his advice and got an index that Luke shows as what I expected it to be. I don't know how to enter a search so that it gets passed to the engine properly. It works in Luke but not in WebLucene or in my own app.
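To illustrate why the question about using the same analyzer at search time matters, here is a minimal sketch in plain Java. It does not call Lucene at all: bigrams() is a hypothetical stand-in for a CJK-style analyzer that emits overlapping two-character tokens, and wholeString() stands in for an analyzer that leaves the query unsplit. The class and method names are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BigramMismatchDemo {
    // Stand-in for a CJK-style analyzer: overlapping two-character tokens.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        if (text.length() == 1) {
            tokens.add(text); // a single character is kept as its own token
        }
        return tokens;
    }

    // Stand-in for an analyzer that does not split the text at all.
    static List<String> wholeString(String text) {
        return Collections.singletonList(text);
    }

    // A query matches only if every one of its tokens exists in the index.
    static boolean matches(Set<String> indexedTokens, List<String> queryTokens) {
        return !queryTokens.isEmpty() && indexedTokens.containsAll(queryTokens);
    }

    public static void main(String[] args) {
        // Five-character Japanese phrase, written with escapes to stay encoding-safe.
        String document = "\u65e5\u672c\u8a9e\u691c\u7d22";
        String query = "\u65e5\u672c\u8a9e";
        Set<String> index = new HashSet<>(bigrams(document));

        System.out.println(matches(index, bigrams(query)));     // same tokenization on both sides
        System.out.println(matches(index, wholeString(query))); // mismatched tokenization
    }
}
```

The point of the sketch is only that the tokens produced from the query have to look like the tokens stored in the index; with a real Lucene setup that means handing the same analyzer to the query parser that was used by the index writer.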
RE: Lucene Search Applet
Hi all,

The changes I made to get past the System.getProperty issues are essentially the same in the three files org.apache.lucene.index.IndexWriter, org.apache.lucene.store.FSDirectory, and org.apache.lucene.search.BooleanQuery. Change the static initializations from a form like this:

    public static long WRITE_LOCK_TIMEOUT = Integer.parseInt(System.getProperty("org.apache.lucene.writeLockTimeout", "1000"));

to a separate declaration and static initializer block like this:

    public static long WRITE_LOCK_TIMEOUT;
    static {
        try {
            WRITE_LOCK_TIMEOUT = Integer.parseInt(System.getProperty("org.apache.lucene.writeLockTimeout", "1000"));
        } catch (Exception e) {
            // An unsigned applet may be denied access to the property; use the default.
            WRITE_LOCK_TIMEOUT = 1000;
        }
    }

As before, the variables are initialized when the class is loaded, but if the System.getProperty call fails, the variable still gets initialized to its default value in the catch block. You can use a separate static block for each variable, or put them all into a single static block. You could also add a setter for each variable if you want the ability to set the value separately from the class init.

In the FSDirectory class, the variables DISABLE_LOCKS and LOCK_DIR are marked final; I had to remove the final modifier to do the initialization as described.

I've also attached the three modified files if you want to just copy and paste.

--Jon

-----Original Message-----
From: Simon mcIlwaine [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 7:37 AM
To: Lucene Users List
Subject: Re: Lucene Search Applet

Hi,

I just used the RODirectory and I'm now getting the following error:

    java.security.AccessControlException: access denied (java.util.PropertyPermission user.dir read)

I reckon this is what Jon was on about with System.getProperty() within certain files, because I'm using an applet. Is this correct, and if so, can someone show me one of the hacked files so that I know what I need to modify?

Many Thanks
Simon
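For anyone who wants to try the static-initializer change from Jon's message outside the Lucene source tree, here is a minimal self-contained sketch. The class AppletSafeDefaults and the helper readLong are hypothetical names invented for this example; only the property name and the default of 1000 come from the message. Unlike the original, which catches Exception, this version catches just SecurityException (what a sandbox raises on a denied read) and NumberFormatException.

```java
class AppletSafeDefaults {
    // Default value taken from the message above.
    static final long DEFAULT_WRITE_LOCK_TIMEOUT = 1000L;

    // Assigned in a static block so that a denied property read
    // cannot make class initialization fail.
    static long WRITE_LOCK_TIMEOUT;

    static {
        WRITE_LOCK_TIMEOUT =
            readLong("org.apache.lucene.writeLockTimeout", DEFAULT_WRITE_LOCK_TIMEOUT);
    }

    // Hypothetical helper: returns the fallback when the property is unset,
    // unreadable under the security policy, or not a valid number.
    static long readLong(String key, long fallback) {
        try {
            String raw = System.getProperty(key);
            return raw == null ? fallback : Long.parseLong(raw.trim());
        } catch (SecurityException | NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println("write lock timeout = " + WRITE_LOCK_TIMEOUT);
    }
}
```

Putting the try/catch into one helper keeps each static block to a single line, which may be convenient when several variables (for example DISABLE_LOCKS and LOCK_DIR) need the same treatment.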
----- Original Message -----
From: Simon mcIlwaine [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, August 23, 2004 3:12 PM
Subject: Re: Lucene Search Applet

Hi Stephane,

A bit of a stupid question, but what do you mean by setting the system property disableLuceneLocks=true? Can I do it with a call to the FSDirectory API, or do I have to actually hack the code? Also, if I do use RODirectory, how do I go about using it? Do I have to update the Lucene JAR file to include the RODirectory class? I tried using it and it's not recognising the class.

Many Thanks
Simon

----- Original Message -----
From: Stephane James Vaucher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, August 23, 2004 2:22 PM
Subject: Re: Lucene Search Applet

Hi Simon,

Does this work? From the FSDirectory API docs: "If the system property 'disableLuceneLocks' has the String value of true, lock creation will be disabled."

Otherwise, I think there was a Read-Only Directory hack: http://www.mail-archive.com/[EMAIL PROTECTED]/msg05148.html

HTH,
sv

On Mon, 23 Aug 2004, Simon mcIlwaine wrote:
Thanks Jon, that works by putting the jar file in the archive attribute. Now I'm getting the disablelock error because of the unsigned applet. Do I just comment out the code anywhere System.getProperty() appears in the files that you specified and then update the JAR archive? Is it possible you could show me one of the hacked files so that I know what I'm modifying? Does anyone else know if there is another way of doing this without having to hack the source code?

Many thanks.
Simon

----- Original Message -----
From: Jon Schuster [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Saturday, August 21, 2004 2:08 AM
Subject: Re: Lucene Search Applet

I have Lucene working in an applet and I've seen this problem only when the jar file really was not available (typo in the jar name), which is what you'd expect. It's possible that the classpath for your application is not the same as the classpath for the applet; perhaps they're using different VMs or JREs from different locations. Try referencing the Lucene jar file in the archive attribute of the applet tag.

Also, to get Lucene to work from an unsigned applet, I had to modify a few classes that call System.getProperty(), because the properties that were being requested were disallowed for applets. I think the classes were IndexWriter, FSDirectory, and BooleanQuery.

--Jon

On Aug 20, 2004, at 6:57 AM, Simon mcIlwaine wrote:
I'm a new Lucene user and I'm not too familiar with applets either, but I've been doing a bit of testing on Java applet security, and if I'm correct in saying that applets can read anything below their codebase then my problem is not a security
Re: Lucene Search Applet
I have Lucene working in an applet and I've seen this problem only when the jar file really was not available (typo in the jar name), which is what you'd expect. It's possible that the classpath for your application is not the same as the classpath for the applet; perhaps they're using different VMs or JREs from different locations. Try referencing the Lucene jar file in the archive attribute of the applet tag.

Also, to get Lucene to work from an unsigned applet, I had to modify a few classes that call System.getProperty(), because the properties that were being requested were disallowed for applets. I think the classes were IndexWriter, FSDirectory, and BooleanQuery.

--Jon

On Aug 20, 2004, at 6:57 AM, Simon mcIlwaine wrote:
I'm a new Lucene user and I'm not too familiar with applets either, but I've been doing a bit of testing on Java applet security, and if I'm correct in saying that applets can read anything below their codebase then my problem is not a security restriction one. The error reads java.lang.NoClassDefFoundError, and the classpath is set, as I have it working in a Swing app. Does someone actually have Lucene working in an applet? Can it be done? Please help.

Thanks
Simon

----- Original Message -----
From: Terry Steichen [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, August 18, 2004 4:17 PM
Subject: Re: Lucene Search Applet

I suspect it has to do with the security restrictions of the applet, because it doesn't appear to be finding your Lucene jar file. Also, regarding the lock files, I believe you can disable the locking stuff just for purposes like yours (read-only index).

Regards,
Terry

----- Original Message -----
From: Simon mcIlwaine
To: Lucene Users List
Sent: Wednesday, August 18, 2004 11:03 AM
Subject: Lucene Search Applet

I'm developing a Lucene CD-ROM based search which will search HTML pages on CD-ROM, using an applet as the UI. I know that there's a problem with lock files and also security restrictions on applets, so I am using the RAMDirectory. I have it working in a Swing application; however, when I put it into an applet it gives me problems. It compiles, but when I go to run the applet I get the error below. Can anyone help? Thanks in advance.

Simon

Error:

    java.lang.NoClassDefFoundError: org/apache/lucene/store/Directory
        at java.lang.Class.getDeclaredConstructors0(Native Method)
        at java.lang.Class.privateGetDeclaredConstructors(Class.java:1610)
        at java.lang.Class.getConstructor0(Class.java:1922)
        at java.lang.Class.newInstance0(Class.java:278)
        at java.lang.Class.newInstance(Class.java:261)
        at sun.applet.AppletPanel.createApplet(AppletPanel.java:617)
        at sun.applet.AppletPanel.runloader(AppletPanel.java:546)
        at sun.applet.AppletPanel.run(AppletPanel.java:298)
        at java.lang.Thread.run(Thread.java:534)

Code:

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import java.awt.*;
    import java.awt.event.*;
    import javax.swing.*;
    import java.io.*;

    public class MemorialApp2 extends JApplet implements ActionListener {
        JLabel prompt;
        JTextField input;
        JButton search;
        JPanel panel;
        String indexDir = "C:/Java/lucene/index-list";
        private static RAMDirectory idx;

        // Build the search form: a label, a text field, and a button.
        public void init() {
            Container cp = getContentPane();
            panel = new JPanel();
            panel.setLayout(new FlowLayout(FlowLayout.CENTER, 4, 4));
            prompt = new JLabel("Keyword search:");
            input = new JTextField("", 20);
            search = new JButton("Search");
            search.addActionListener(this);
            panel.add(prompt);
            panel.add(input);
            panel.add(search);
            cp.add(panel);
        }

        // Run the search when the button is pressed.
        public void actionPerformed(ActionEvent e) {
            if (e.getSource() == search) {
                String surname = input.getText();
                try {
                    findSurname(indexDir, surname);
                } catch (Exception ex) {
                    System.err.println(ex);
                }
            }
        }

        // Load the index into memory and print every hit on the "surname" field.
        public static void findSurname(String indexDir, String surname) throws Exception {
            idx = new RAMDirectory(indexDir);
            IndexSearcher searcher = new IndexSearcher(idx);
            Query query = new TermQuery(new Term("surname", surname));
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++) {
                //Document doc = hits.doc(i);
                System.out.println("Surname: " + hits.doc(i).get("surname"));
            }
        }
    }
RE: Problems indexing Japanese with CJKAnalyzer ... Or French with UTF-8 and MetaData
If you're specifying the correct encoding in HTMLDocument when you create an InputStreamReader for the HTML file, and if you're specifying UTF8 as the encoding for the InputStreamReader and OutputStreamWriter in HTMLParser.getReader, I don't see how the meta tag data would have a different encoding than the other content that gets indexed.

The stuff in the JDK docs about Properties being stored as 8859-1 applies only when properties are saved to or loaded from a stream using the Properties.save and Properties.load methods. In Lucene, meta tag information is stored in a Properties structure only while parsing and tokenizing.

Some strategically placed System.out.println calls should let you see whether the meta tag strings are what you think they are.

--Jon

-----Original Message-----
From: Bruno Tirel [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 15, 2004 8:07 AM
To: 'Lucene Users List'
Subject: RE: Problems indexing Japanese with CJKAnalyzer ... Or French with UTF-8 and MetaData

I don't think I understand your proposal correctly. As a basis, I am using Demo3 with IndexHTML, HTMLDocument and HTMLParser. Inside the HTML parser, I am calling getMetaTags (which calls addMetaData), which returns a Properties object. My issue comes from this definition: Properties are stored in ISO-8859-1 encoding, while all my data encodings, inside and outside, are UTF-8. I have not succeeded in getting UTF-8 values from getMetaTags() through any conversion. These data are extracted from an HTML page, with UTF-8 encoding declared at the beginning of the file. I do not see how to call request.setCharacterEncoding("UTF-8"): I need the parser to know about the UTF-8 encoding, and that doesn't seem to happen when using a Properties object. Any feedback?

-----Original Message-----
From: Praveen Peddi [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 15, 2004 15:12
To: Lucene Users List
Subject: Re: Problems indexing Japanese with CJKAnalyzer

If it's a web application, you have to call request.setCharacterEncoding("UTF-8") before reading any parameters. Also make sure the HTML page encoding is specified as UTF-8 in the meta tag. Most web app servers decode the request parameters using the system's default encoding. If you call the above method, I think it will solve your problem.

Praveen

----- Original Message -----
From: Bruno Tirel [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Thursday, July 15, 2004 6:15 AM
Subject: RE: Problems indexing Japanese with CJKAnalyzer

Hi All,

I am also trying to localize everything for a French application, using UTF-8 encoding. I have already applied what Jon described. I fully confirm his recommendation for the HTMLParser and HTMLDocument changes with Unicode and UTF-8 encoding specification.

In my case, one thing is still not working: using meta-data from the HTML document, as in the demo3 example. Whether I try to convert to UTF-8 or to ISO-8859-1, it is still not correctly encoded when I check with Luke. The word Propriété is seen either as Propri?t? with a square, or as Propriã©tã©. My local codepage is Cp1252, so it should be viewed as ISO-8859-1. I get the same result when I use the local file.encoding parameter. All the other fields are correctly encoded in UTF-8, tokenized, and successfully searched through a JSP page.

Has anybody already faced this issue? Any help available?

Best regards,
Bruno

-----Original Message-----
From: Jon Schuster [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 14, 2004 22:51
To: 'Lucene Users List'
Subject: RE: Problems indexing Japanese with CJKAnalyzer

Hi all,

Thanks for the help on indexing Japanese documents. I eventually got things working, and here's an update so that other folks might have an easier time in similar situations.

The problem I had was indeed with the encoding, but it was more than just the encoding on the initial creation of the HTMLParser (from the Lucene demo package). In HTMLDocument, doing this:

    InputStreamReader reader = new InputStreamReader(new FileInputStream(f), "SJIS");
    HTMLParser parser = new HTMLParser(reader);

creates the parser and feeds it Unicode from the original Shift-JIS encoded document, but then when the document contents are fetched using this line:

    Field fld = Field.Text("contents", parser.getReader());

HTMLParser.getReader creates an InputStreamReader and OutputStreamWriter using the default encoding, which in my case was Windows 1252 (essentially Latin-1). That was bad.

In the HTMLParser.jj grammar file, adding an explicit encoding of UTF8 on both the Reader and Writer got things mostly working. The one missing piece was in the options section of the HTMLParser.jj file. The original grammar file generates an input character stream class that treats the input as a stream of 1-byte characters. To have JavaCC generate a stream class that handles double-byte characters, you need the option UNICODE_INPUT=true.

So, there were essentially three changes in two files:

HTMLParser.jj - add UNICODE_INPUT=true to options section; add explicit UTF8 encoding
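As a quick way to run the check Jon suggests, here is a small self-contained sketch. It does not use Lucene or the demo HTMLParser; the class and method names are made up for this example. It shows two things under those assumptions: a string placed into a Properties object in memory comes back unchanged, and the doubled-character garbling reported in this thread is what you get when UTF-8 bytes are decoded as ISO-8859-1 somewhere along the way.

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class MetaTagEncodingCheck {
    // Store a value in a Properties object and read it back, with no stream
    // save/load involved. Java strings are kept as-is in memory.
    static String viaProperties(String value) {
        Properties props = new Properties();
        props.setProperty("keywords", value); // "keywords" is an arbitrary example key
        return props.getProperty("keywords");
    }

    // Simulate the suspected bug: text is written as UTF-8 bytes but the
    // bytes are then decoded as ISO-8859-1.
    static String utf8ReadAsLatin1(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The French word from the thread, written with escapes so the
        // example does not depend on the source file's encoding.
        String word = "Propri\u00e9t\u00e9";

        System.out.println(viaProperties(word).equals(word));    // unchanged in memory
        System.out.println(utf8ReadAsLatin1(word).equals(word)); // garbled by the wrong decoder
        System.out.println(utf8ReadAsLatin1(word).length());     // each accented letter became two chars
    }
}
```

If the in-memory check holds in your setup too, the place to look is more likely the Reader that feeds the parser than the Properties object itself.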
RE: Problems indexing Japanese with CJKAnalyzer
Hi all,

Thanks for the help on indexing Japanese documents. I eventually got things working, and here's an update so that other folks might have an easier time in similar situations.

The problem I had was indeed with the encoding, but it was more than just the encoding on the initial creation of the HTMLParser (from the Lucene demo package). In HTMLDocument, doing this:

    InputStreamReader reader = new InputStreamReader(new FileInputStream(f), "SJIS");
    HTMLParser parser = new HTMLParser(reader);

creates the parser and feeds it Unicode from the original Shift-JIS encoded document, but then when the document contents are fetched using this line:

    Field fld = Field.Text("contents", parser.getReader());

HTMLParser.getReader creates an InputStreamReader and OutputStreamWriter using the default encoding, which in my case was Windows 1252 (essentially Latin-1). That was bad.

In the HTMLParser.jj grammar file, adding an explicit encoding of UTF8 on both the Reader and Writer got things mostly working. The one missing piece was in the options section of the HTMLParser.jj file. The original grammar file generates an input character stream class that treats the input as a stream of 1-byte characters. To have JavaCC generate a stream class that handles double-byte characters, you need the option UNICODE_INPUT=true.

So, there were essentially three changes in two files:

HTMLParser.jj - add UNICODE_INPUT=true to options section; add explicit UTF8 encoding on Reader and Writer creation in getReader(). As far as I can tell, this change works fine for all of the languages I need to handle, which are English, French, German, and Japanese.

HTMLDocument - add explicit encoding of SJIS when creating the Reader used to create the HTMLParser. (For western languages, I use encoding of ISO8859_1.)

And of course, use the right language tokenizer.

--Jon

[earlier responses snipped; see the list archive]
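The getReader() part of the fix above can be illustrated without the demo parser. The following is a minimal sketch, assuming a byte-array buffer as a stand-in for the piped streams inside HTMLParser; the class and method names are hypothetical. It pushes text through an OutputStreamWriter and reads it back through an InputStreamReader, once with UTF-8 on both ends and once with a Latin-1 charset standing in for a Western default encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitEncodingRoundTrip {
    // Write the text with one charset, then read the bytes back with another.
    static String roundTrip(String text, Charset writeAs, Charset readAs) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (Writer out = new OutputStreamWriter(buffer, writeAs)) {
                out.write(text);
            }
            StringBuilder result = new StringBuilder();
            try (Reader in = new InputStreamReader(
                    new ByteArrayInputStream(buffer.toByteArray()), readAs)) {
                int c;
                while ((c = in.read()) != -1) {
                    result.append((char) c);
                }
            }
            return result.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two Japanese characters, written with escapes to stay encoding-safe.
        String japanese = "\u5c5e\u6027";

        // Explicit UTF-8 on both ends preserves the text.
        System.out.println(
            roundTrip(japanese, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(japanese));

        // A Latin-1 writer cannot represent these characters and substitutes '?'.
        System.out.println(
            roundTrip(japanese, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1));
    }
}
```

The second call shows why relying on a Western default encoding inside the pipe loses Japanese text before the tokenizer ever sees it, whatever encoding was used to read the original file.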
RE: Compile errors in FrenchAnalyzer
I ran into this problem as well. I just added a throws IOException clause to the constructor and to the setStemExclusionTable method, and everything seems to work fine.

The FrenchAnalyzer has dependencies on the GermanAnalyzer, and from the CVS history, it appears that the throws clauses were added to the GermanAnalyzer at the end of March, but the current version of the FrenchAnalyzer was checked in a couple of weeks before the edits to GermanAnalyzer.

--Jon

-----Original Message-----
From: Praveen Peddi [mailto:[EMAIL PROTECTED]
Sent: Friday, July 02, 2004 6:08 AM
To: Lucene Users List
Subject: Compile errors in FrenchAnalyzer

I get compile-time errors with FrenchAnalyzer, in the constructor that takes a file name and in the method setStemExclusionTable:

    Unhandled exception type IOException

How do I fix these errors? Should I just throw IOException, or catch the exception in the method and ignore it? I am using Lucene 1.4 final.

Praveen
Problems indexing Japanese with CJKAnalyzer
Hi,

I've gone through all of the past messages regarding the CJKAnalyzer, but I still must be doing something wrong because my searches don't work.

I'm using the IndexHTML application from the org.apache.lucene.demo package to do the indexing, and I've changed the analyzer to use the CJKAnalyzer. I've also tried with and without setting file.encoding to Shift-JIS.

I've tried indexing the HTML files, which contain Shift-JIS, without conversion to Unicode, and I get assorted "Parse Aborted: Lexical error..." messages. I've also tried converting the Shift-JIS HTML files to Unicode by first running them through the native2ascii tool. When the files are converted via native2ascii, they index without errors, but the index appears to contain the Unicode characters as literal strings such as \u7aef, \u7af6, etc. Searching for an English word produces results that have text like \u5c5e\u6027.

Since others have gotten Japanese indexing to work, what's the secret I'm missing?

Thanks,
Jon
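The literal strings in the index are what native2ascii is designed to produce: it rewrites every non-ASCII character as a six-character ASCII escape, so the indexer sees plain ASCII text rather than Japanese. As an illustration only, here is a small sketch of a decoder that turns such escapes back into characters, roughly what the tool's -reverse option does. UnicodeEscapeDecoder is a hypothetical helper, not part of Lucene or the JDK, and it handles only the simple four-hex-digit form.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapeDecoder {
    // Matches a backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replace every escape sequence with the character it names;
    // everything else is copied through untouched.
    static String decode(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against decoded '$' or backslash characters.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The escaped text from the message above, as twelve literal ASCII characters.
        String escaped = "\\u5c5e\\u6027";
        String decoded = decode(escaped);

        System.out.println(escaped.length()); // length of the escaped ASCII form
        System.out.println(decoded.length()); // length after decoding to real characters
    }
}
```

Decoding after the fact is a workaround at best; reading the original Shift-JIS files through a Reader with an explicit encoding avoids the escaped form altogether.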