RE: searching using the CJKAnalyzer

2004-10-11 Thread Jon Schuster
I didn't need to make any changes to Entities to get Japanese searches working. Are 
you using the CJKAnalyzer when you perform the search, not only when building the 
index?
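
(For example, something along these lines, where the index path, field name, and query string are placeholders; the point is that the same analyzer that built the index is handed to the QueryParser:)

import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class CjkSearchExample {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("index");
        // Parse the user's query with the same analyzer used at index time.
        Query query = QueryParser.parse(args[0], "contents", new CJKAnalyzer());
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " hits");
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("title"));
        }
        searcher.close();
    }
}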

-Original Message-
From: Daan Hoogland [mailto:[EMAIL PROTECTED] 
Sent: Sunday, October 10, 2004 10:48 PM
To: Lucene Users List
Subject: Re: searching using the CJKAnalyzer
Importance: Low


Che Dong wrote:

 It seems to be not an Analyzer problem but an HTML parser charset detection error.

 Could you show me the details of the problem?

Thanks Che,
I got it working by making the decode() method from the Entities class in the demo 
public. I wrote a scanner to translate any entities in the query.
I want to translate back to entities in the results, but I'm not sure 
what the criteria should be. It seems to be just binary data.
How can I conclude that 04?03?04 means ?


 Thanks

 Che Dong

 Daan Hoogland wrote:

 LS,
 in
 http://issues.apache.org/eyebrowse/ReadMsg?listId=30&msgNo=8980
 Jon Schuster explains how to get a Japanese search system working. I 
 followed his advice and got an index that Luke shows as what I 
 expected it to be.
 I don't know how to enter a search so that it gets passed to the 
 engine properly. It works in Luke but not in WebLucene or in my own app.











RE: Lucene Search Applet

2004-08-23 Thread Jon Schuster
Hi all,

The changes I made to get past the System.getProperty issues are essentially
the same in the three files org.apache.lucene.index.IndexWriter,
org.apache.lucene.store.FSDirectory, and
org.apache.lucene.search.BooleanQuery.

Change the static initializations from a form like this:

  public static long WRITE_LOCK_TIMEOUT =
      Integer.parseInt(System.getProperty("org.apache.lucene.writeLockTimeout",
          "1000"));

to a separate declaration and static initializer block like this:

  public static long WRITE_LOCK_TIMEOUT;
  static
  {
      try
      {
          WRITE_LOCK_TIMEOUT =
              Integer.parseInt(System.getProperty("org.apache.lucene.writeLockTimeout",
                  "1000"));
      }
      catch ( Exception e )
      {
          WRITE_LOCK_TIMEOUT = 1000;
      }
  };

As before, the variables are initialized when the class is loaded, but if
the System.getProperty fails, the variable still gets initialized to its
default value in the catch block.

You can use a separate static block for each variable, or put them all into
a single static block. You could also add a setter for each variable if you
want the ability to set the value separately from the class init.

In the FSDirectory class, the variables DISABLE_LOCKS and LOCK_DIR are
marked final, which I had to remove to do the initialization as described.
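
For illustration, the FSDirectory portion might end up looking something like the sketch below (the property names and defaults here are from memory rather than copied from the 1.4 source, so treat them as approximate):

  // Was: ... static final ... DISABLE_LOCKS and LOCK_DIR initialized inline.
  public static boolean DISABLE_LOCKS;
  public static String LOCK_DIR;
  static
  {
      try
      {
          DISABLE_LOCKS = Boolean.getBoolean("disableLuceneLocks");
          LOCK_DIR = System.getProperty("org.apache.lucene.lockDir",
                                        System.getProperty("java.io.tmpdir"));
      }
      catch ( Exception e )
      {
          // An unsigned applet can't create lock files anyway, so disabling
          // locks and leaving the lock directory unset is a safe fallback.
          DISABLE_LOCKS = true;
          LOCK_DIR = null;
      }
  }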

I've also attached the three modified files if you want to just copy and
paste.

--Jon

-Original Message-
From: Simon mcIlwaine [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 23, 2004 7:37 AM
To: Lucene Users List
Subject: Re: Lucene Search Applet


Hi,

Just used the RODirectory and I'm now getting the following error:
java.security.AccessControlException: access denied
(java.util.PropertyPermission user.dir read)
I'm reckoning that this is what Jon was on about with System.getProperty() within
certain files, because I'm using an applet. Is this correct, and if so, can someone
show me one of the hacked files so that I know what I need to modify?

Many Thanks

Simon
- Original Message -
From: Simon mcIlwaine [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, August 23, 2004 3:12 PM
Subject: Re: Lucene Search Applet

 Hi Stephane,

 A bit of a stupid question, but how do I set the system property
 disableLuceneLocks=true? Can I do it via a call to the FSDirectory API, or do
 I have to actually hack the code? Also, if I do use RODirectory, how do I go
 about using it? Do I have to update the Lucene JAR archive file with the
 RODirectory class included, as I tried using it and it's not recognising the
 class?

 Many Thanks

 Simon

 - Original Message -
 From: Stephane James Vaucher [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Monday, August 23, 2004 2:22 PM
 Subject: Re: Lucene Search Applet


  Hi Simon,
 
  Does this work? From FSDirectory api:
 
  If the system property 'disableLuceneLocks' has the String value of
  "true", lock creation will be disabled.
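 
  (As a minimal sketch for a standalone app, the property just needs to be set
  before the index is opened; note that an unsigned applet may not be allowed to
  call System.setProperty either, hence the read-only Directory hack mentioned
  below. The index path here is a placeholder.)
 
      public class DisableLocksExample {
          public static void main(String[] args) throws Exception {
              // Must run before FSDirectory/IndexSearcher opens the index.
              System.setProperty("disableLuceneLocks", "true");

              org.apache.lucene.search.IndexSearcher searcher =
                  new org.apache.lucene.search.IndexSearcher("C:/lucene/index");
              // ... run queries here ...
              searcher.close();
          }
      }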
 
  Otherwise, I think there was a Read-Only Directory hack:
 
  http://www.mail-archive.com/[EMAIL PROTECTED]/msg05148.html
 
  HTH,
  sv
 
  On Mon, 23 Aug 2004, Simon mcIlwaine wrote:
 
   Thanks Jon, that works by putting the jar file in the archive attribute. Now
   I'm getting the disable-lock error because of the unsigned applet. Do I just
   comment out the code anywhere System.getProperty() appears in the files that
   you specified and then update the JAR archive? Is it possible you could show
   me one of the hacked files so that I know what I'm modifying? Does anyone
   else know if there is another way of doing this without having to hack the
   source code?
  
   Many thanks.
  
   Simon
  
   - Original Message -
   From: Jon Schuster [EMAIL PROTECTED]
   To: Lucene Users List [EMAIL PROTECTED]
   Sent: Saturday, August 21, 2004 2:08 AM
   Subject: Re: Lucene Search Applet
  
  
    I have Lucene working in an applet and I've seen this problem only when
    the jar file really was not available (typo in the jar name), which is
    what you'd expect. It's possible that the classpath for your application
    is not the same as the classpath for the applet; perhaps they're using
    different VMs or JREs from different locations.
   
    Try referencing the Lucene jar file in the archive attribute of the
    applet tag.
   
    Also, to get Lucene to work from an unsigned applet, I had to modify a
    few classes that call System.getProperty(), because the properties that
    were being requested were disallowed for applets. I think the classes
    were IndexWriter, FSDirectory, and BooleanQuery.
   
    --Jon
   
   
On Aug 20, 2004, at 6:57 AM, Simon mcIlwaine wrote:
   
  I'm a new Lucene user and I'm not too familiar with applets either, but
  I've been doing a bit of testing on Java applet security, and if I'm correct
  in saying that applets can read anything below their codebase, then my
  problem is not a security

Re: Lucene Search Applet

2004-08-20 Thread Jon Schuster
I have Lucene working in an applet and I've seen this problem only when 
the jar file really was not available (typo in the jar name), which is 
what you'd expect. It's possible that the classpath for your 
application is not the same as the classpath for the applet; perhaps 
they're using different VMs or JREs from different locations.

Try referencing the Lucene jar file in the archive attribute of the 
applet tag.

Also, to get Lucene to work from an unsigned applet, I had to modify a 
few classes that call System.getProperty(), because the properties that 
were being requested were disallowed for applets. I think the classes 
were IndexWriter, FSDirectory, and BooleanQuery.

--Jon


On Aug 20, 2004, at 6:57 AM, Simon mcIlwaine wrote:

 I'm a new Lucene user and I'm not too familiar with applets either, but I've
 been doing a bit of testing on Java applet security, and if I'm correct in
 saying that applets can read anything below their codebase, then my problem
 is not a security restriction one. The error reads
 java.lang.NoClassDefFoundError, and the classpath is set, as I have it working
 in a Swing app. Does someone actually have Lucene working in an applet? Can
 it be done? Please help.

 Thanks

 Simon

 - Original Message -

 From: Terry Steichen [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Wednesday, August 18, 2004 4:17 PM
 Subject: Re: Lucene Search Applet


 I suspect it has to do with the security restrictions of the applet, 
 'cause
 it doesn't appear to be finding your Lucene jar file.  Also, regarding 
 the
 lock files, I believe you can disable the locking stuff just for 
 purposes
 like yours (read-only index).

 Regards,

 Terry
   - Original Message -
   From: Simon mcIlwaine
   To: Lucene Users List
   Sent: Wednesday, August 18, 2004 11:03 AM
   Subject: Lucene Search Applet


   I'm developing a Lucene CD-ROM based search which will search HTML pages on
 CD-ROM, using an applet as the UI. I know that there's a problem with lock
 files and also security restrictions on applets, so I am using the
 RAMDirectory. I have it working in a Swing application; however, when I put it
 into an applet it's giving me problems. It compiles, but when I go to run the
 applet I get the error below. Can anyone help? Thanks in advance.
   Simon

   Error:

   java.lang.NoClassDefFoundError: org/apache/lucene/store/Directory
       at java.lang.Class.getDeclaredConstructors0(Native Method)
       at java.lang.Class.privateGetDeclaredConstructors(Class.java:1610)
       at java.lang.Class.getConstructor0(Class.java:1922)
       at java.lang.Class.newInstance0(Class.java:278)
       at java.lang.Class.newInstance(Class.java:261)
       at sun.applet.AppletPanel.createApplet(AppletPanel.java:617)
       at sun.applet.AppletPanel.runLoader(AppletPanel.java:546)
       at sun.applet.AppletPanel.run(AppletPanel.java:298)
       at java.lang.Thread.run(Thread.java:534)

   Code:

   import org.apache.lucene.search.IndexSearcher;
   import org.apache.lucene.search.Query;
   import org.apache.lucene.search.TermQuery;
   import org.apache.lucene.store.RAMDirectory;
   import org.apache.lucene.store.Directory;
   import org.apache.lucene.index.Term;
   import org.apache.lucene.search.Hits;
   import java.awt.*;
   import java.awt.event.*;
   import javax.swing.*;
   import java.io.*;

   public class MemorialApp2 extends JApplet implements ActionListener {

       JLabel prompt;
       JTextField input;
       JButton search;
       JPanel panel;
       String indexDir = "C:/Java/lucene/index-list";
       private static RAMDirectory idx;

       public void init() {
           Container cp = getContentPane();
           panel = new JPanel();
           panel.setLayout(new FlowLayout(FlowLayout.CENTER, 4, 4));
           prompt = new JLabel("Keyword search:");
           input = new JTextField("", 20);
           search = new JButton("Search");
           search.addActionListener(this);
           panel.add(prompt);
           panel.add(input);
           panel.add(search);
           cp.add(panel);
       }

       public void actionPerformed(ActionEvent e) {
           if (e.getSource() == search) {
               String surname = input.getText();
               try {
                   findSurname(indexDir, surname);
               } catch (Exception ex) {
                   System.err.println(ex);
               }
           }
       }

       public static void findSurname(String indexDir, String surname) throws Exception {
           idx = new RAMDirectory(indexDir);
           IndexSearcher searcher = new IndexSearcher(idx);
           Query query = new TermQuery(new Term("surname", surname));
           Hits hits = searcher.search(query);
           for (int i = 0; i < hits.length(); i++) {
               //Document doc = hits.doc(i);
               System.out.println("Surname: " + hits.doc(i).get("surname"));
           }
       }
   }




RE: Problems indexing Japanese with CJKAnalyzer ... Or French with UTF-8 and MetaData

2004-07-16 Thread Jon Schuster
If you're specifying the correct encoding in HTMLDocument when you create an
InputStreamReader for the HTML file, and if you're specifying UTF8 as the
encoding for the InputStreamReader and OutputStreamWriter in
HTMLParser.getReader, I don't see how the meta tag data would have a
different encoding than the other content that gets indexed.

The stuff in the JDK docs about Properties being stored as 8859-1 applies
only when properties are saved or loaded from a stream using the
Properties.save and Properties.load methods. In Lucene, meta tag information
is stored in a Properties structure only while parsing and tokenizing.

Some strategically placed System.out.printlns should let you see if the meta
tag strings are what you think they are.
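
For example, something along these lines (a throwaway debugging sketch; the file name is a placeholder and getMetaTags is the demo parser method mentioned below) makes it easy to see whether the meta values are already mangled at parse time, independent of the console's own encoding:

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.Enumeration;
import java.util.Properties;

import org.apache.lucene.demo.html.HTMLParser;

public class DumpMetaTags {
    public static void main(String[] args) throws Exception {
        // Read the page with the encoding it declares (UTF-8 in this case).
        HTMLParser parser = new HTMLParser(
                new InputStreamReader(new FileInputStream("page.html"), "UTF-8"));
        Properties meta = parser.getMetaTags();
        for (Enumeration e = meta.propertyNames(); e.hasMoreElements(); ) {
            String name = (String) e.nextElement();
            String value = meta.getProperty(name);
            // Dump the raw code points so the terminal encoding can't hide anything.
            System.out.print(name + " =");
            for (int i = 0; i < value.length(); i++) {
                System.out.print(" U+" + Integer.toHexString(value.charAt(i)));
            }
            System.out.println();
        }
    }
}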

--Jon 

-Original Message-
From: Bruno Tirel [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 15, 2004 8:07 AM
To: 'Lucene Users List'
Subject: RE: Problems indexing Japanese with CJKAnalyzer ... Or French with
UTF-8 and MetaData


I don't think I understand your proposal correctly.
As a basis, I am using Demo3 with IndexHTML, HTMLDocument and HTMLParser.
Inside the HTML parser, I am calling getMetaTags (which calls addMetaData) and
returns a Properties object. My issue comes from this definition: Properties
are stored in ISO-8859-1 encoding, while all my data, inside and outside, is
encoded as UTF-8.
I have not been successful in getting UTF-8 values out of Parser.getMetaTags()
through any conversion.
These data are extracted from an HTML page with UTF-8 encoding declared at
the beginning of the file.
I do not see how to call request.setEncoding("UTF-8") here: I need the parser
to have knowledge of the UTF-8 encoding, and it doesn't appear when using the
Properties object.

Any feedback?

-Original Message-
From: Praveen Peddi [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 15, 2004 3:12 PM
To: Lucene Users List
Subject: Re: Problems indexing Japanese with CJKAnalyzer

If it's a web application, you have to call request.setEncoding("UTF-8")
before reading any parameters. Also make sure the HTML page encoding is
specified as UTF-8 in the meta tag. Most web app servers decode the request
parameters in the system's default encoding. If you call the above method,
I think it will solve your problem.
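
(One note: in the Servlet 2.3 API the method is request.setCharacterEncoding. A minimal sketch of a servlet doing this, with the parameter name "query" as a placeholder:)

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SearchServlet extends HttpServlet {
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // Must be called before the first getParameter(), otherwise the
        // container has already decoded the parameters in its default charset.
        request.setCharacterEncoding("UTF-8");
        String query = request.getParameter("query");

        response.setContentType("text/html; charset=UTF-8");
        response.getWriter().println("query = " + query);
    }
}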

Praveen
- Original Message -
From: Bruno Tirel [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Thursday, July 15, 2004 6:15 AM
Subject: RE: Problems indexing Japanese with CJKAnalyzer


Hi All,

I am also trying to localize everything for a French application, using UTF-8
encoding. I have already applied what Jon described, and I fully confirm his
recommendation for the HTMLParser and HTMLDocument changes with the UNICODE and
UTF-8 encoding specification.

In my case, one case is still not functional: using meta-data from the HTML
document, as in the demo3 example. Whether I try converting to UTF-8 or to
ISO-8859-1, it is still not correctly encoded when I check with Luke.
The word Propriété is seen either as Propri?t? with a square, or as
Propriã©tã©.
My local codepage is Cp1252, so it should be viewed as ISO-8859-1. I get the
same result when I use the local file.encoding parameter.
All the other fields are correctly encoded in UTF-8, tokenized, and
successfully searched through the JSP page.

Is anybody else facing this issue? Is any help available?
Best regards,

Bruno


-Original Message-
From: Jon Schuster [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 14, 2004 10:51 PM
To: 'Lucene Users List'
Subject: RE: Problems indexing Japanese with CJKAnalyzer

Hi all,

Thanks for the help on indexing Japanese documents. I eventually got things
working, and here's an update so that other folks might have an easier time
in similar situations.

The problem I had was indeed with the encoding, but it was more than just
the encoding on the initial creation of the HTMLParser (from the Lucene demo
package). In HTMLDocument, doing this:

InputStreamReader reader = new InputStreamReader( new
FileInputStream(f), "SJIS" );
HTMLParser parser = new HTMLParser( reader );

creates the parser and feeds it Unicode from the original Shift-JIS-encoded
document, but then when the document contents are fetched using this line:

Field fld = Field.Text("contents", parser.getReader() );

HTMLParser.getReader creates an InputStreamReader and OutputStreamWriter
using the default encoding, which in my case was Windows 1252 (essentially
Latin-1). That was bad.

In the HTMLParser.jj grammar file, adding an explicit encoding of UTF8 on
both the Reader and Writer got things mostly working. The one missing piece
was in the options section of the HTMLParser.jj file. The original grammar
file generates an input character stream class that treats the input as a
stream of 1-byte characters. To have JavaCC generate a stream class that
handles double-byte characters, you need the option UNICODE_INPUT=true.

So, there were essentially three changes in two files:

HTMLParser.jj - add UNICODE_INPUT=true to options section; add explicit
UTF8 encoding

RE: Problems indexing Japanese with CJKAnalyzer

2004-07-14 Thread Jon Schuster
Hi all,

Thanks for the help on indexing Japanese documents. I eventually got things
working, and here's an update so that other folks might have an easier time
in similar situations.

The problem I had was indeed with the encoding, but it was more than just
the encoding on the initial creation of the HTMLParser (from the Lucene demo
package). In HTMLDocument, doing this:

InputStreamReader reader = new InputStreamReader( new
FileInputStream(f), "SJIS" );
HTMLParser parser = new HTMLParser( reader );

creates the parser and feeds it Unicode from the original Shift-JIS-encoded
document, but then when the document contents are fetched using this line:

Field fld = Field.Text("contents", parser.getReader() );

HTMLParser.getReader creates an InputStreamReader and OutputStreamWriter
using the default encoding, which in my case was Windows 1252 (essentially
Latin-1). That was bad.

In the HTMLParser.jj grammar file, adding an explicit encoding of UTF8 on
both the Reader and Writer got things mostly working. The one missing piece
was in the options section of the HTMLParser.jj file. The original grammar
file generates an input character stream class that treats the input as a
stream of 1-byte characters. To have JavaCC generate a stream class that
handles double-byte characters, you need the option UNICODE_INPUT=true.

So, there were essentially three changes in two files:

HTMLParser.jj - add UNICODE_INPUT=true to the options section; add an explicit
"UTF8" encoding on Reader and Writer creation in getReader(). As far as I
can tell, this change works fine for all of the languages I need to handle,
which are English, French, German, and Japanese.

HTMLDocument - add an explicit encoding of "SJIS" when creating the Reader used
to create the HTMLParser. (For western languages, I use an encoding of
"ISO8859_1".)

And of course, use the right language tokenizer.
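
As a rough end-to-end sketch of the approach (the index path, file name, and analyzer choice are placeholders, and this assumes the modified demo HTMLParser described above):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;

import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.demo.html.HTMLParser;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexSjisHtml {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("index", new CJKAnalyzer(), true);

        // Read the Shift-JIS source with an explicit encoding so the parser
        // sees proper Unicode rather than platform-default-decoded bytes.
        Reader reader = new InputStreamReader(
                new FileInputStream("docs/page.html"), "SJIS");
        HTMLParser parser = new HTMLParser(reader);

        Document doc = new Document();
        // parser.getReader() must also use an explicit encoding internally,
        // per the HTMLParser.jj changes described above.
        doc.add(Field.Text("contents", parser.getReader()));
        writer.addDocument(doc);

        writer.optimize();
        writer.close();
    }
}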

--Jon

earlier responses snipped; see the list archive




RE: Compile errors in FrenchAnalyzer

2004-07-02 Thread Jon Schuster
I ran into this problem as well. I just added the throws IOException to the
constructor and to the setStemExclusionTable method and everything seems to
work fine.

The FrenchAnalyzer has dependencies on the GermanAnalyzer, and from the CVS
history, it appears that the throws clauses were added to the GermanAnalyzer
at the end of March, but the current version of the FrenchAnalyzer was
checked in a couple of weeks before the edits to the GermanAnalyzer.
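
(For Praveen's question below: once the throws clauses are added as described, callers just have to handle the checked exception. A caller-side sketch, assuming the patched analyzer and with the class name as a placeholder:)

import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;

public class FrenchAnalyzerFactory {
    public static FrenchAnalyzer create(File stopwords, File exclusions) {
        try {
            FrenchAnalyzer analyzer = new FrenchAnalyzer(stopwords);
            analyzer.setStemExclusionTable(exclusions);
            return analyzer;
        } catch (IOException e) {
            // Fall back to the built-in stop word list if the files can't be read.
            return new FrenchAnalyzer();
        }
    }
}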

--Jon

-Original Message-
From: Praveen Peddi [mailto:[EMAIL PROTECTED] 
Sent: Friday, July 02, 2004 6:08 AM
To: Lucene Users List
Subject: Compile errors in FrenchAnalyzer


I get compile-time errors with FrenchAnalyzer in the constructor that takes a
file name and in the setStemExclusionTable method:
Unhandled exception type IOException

How do I fix these errors? Should I just declare throws IOException, or catch
the exception in the method and ignore it?

I am using Lucene 1.4 final.

Praveen




Problems indexing Japanese with CJKAnalyzer

2004-07-02 Thread Jon Schuster
Hi,

I've gone through all of the past messages regarding the CJKAnalyzer but I
still must be doing something wrong because my searches don't work.

I'm using the IndexHTML application from the org.apache.lucene.demo package
to do the indexing, and I've changed the analyzer to use the CJKAnalyzer.
I've also tried with and without setting the file.encoding to Shift-JIS.
I've tried indexing the HTML files, which contain Shift-JIS, without
conversion to Unicode, and I get assorted "Parse Aborted: Lexical error..."
messages. I've also tried converting the Shift-JIS HTML files to Unicode by
first running them through the native2ascii tool.

When the files are converted via native2ascii, they index without errors,
but the index appears to contain the Unicode escapes as literal strings
such as \u7aef, \u7af6, etc. Searching for an English word produces
results that have text like \u5c5e\u6027.

Since others have gotten Japanese indexing to work, what's the secret I'm
missing?

Thanks,
Jon

