permission control or category-wise search with Lucene

2005-08-30 Thread seema pai
Hi. My site has a large database of television and movie titles in English and Spanish. The movie data runs from 1928 to date for selected studios such as MGM, Disney, etc. The site user should be able to search for a movie or TV series by title, description, actors, or characters. The

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
The temporary char[] buffer is cached per InputStream instance, so the extra memory allocation shouldn't be a big deal. One could also use String(byte[],offset,len,"UTF-8"), and that creates a char[] that is used directly by the string instead of being copied. It remains to be seen how fast the

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
> How will the difference impact String memory allocations? Looking at the > String code, I can't see where it would make an impact. This is from Lucene InputStream: public final String readString() throws IOException { int length = readVInt(); if (chars == null || length > chars.length) chars =
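The readString() quoted above is cut off in this archive. A minimal reconstruction of the pattern being discussed might look like the following sketch; readVInt() and readChars() are stood in by plain DataInputStream calls, so the wire format here is illustrative, not Lucene's, and the encode/roundTrip helpers are hypothetical test scaffolding:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ReadStringSketch {
    private char[] chars;               // buffer cached per instance
    private final DataInputStream in;

    public ReadStringSketch(byte[] data) {
        this.in = new DataInputStream(new ByteArrayInputStream(data));
    }

    public String readString() throws IOException {
        int length = in.readInt();      // stand-in for readVInt()
        if (chars == null || length > chars.length)
            chars = new char[length];   // grow and keep the buffer
        for (int i = 0; i < length; i++)
            chars[i] = in.readChar();   // stand-in for readChars()
        return new String(chars, 0, length); // copies chars into the String
    }

    // Hypothetical helpers for a round trip; not Lucene API.
    public static byte[] encode(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bos);
        dos.writeInt(s.length());       // length in Java chars
        dos.writeChars(s);              // UTF-16 code units
        return bos.toByteArray();
    }

    public static String roundTrip(String s) {
        try {
            return new ReadStringSketch(encode(s)).readString();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

In steady state the only per-call allocation is the returned String (which copies the chars), which is the point Yonik makes above about the cached buffer.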

RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Robert Engels
That method could easily be changed to: public final String readString() throws IOException { int length = readVInt(); return new String(readBytes(length), "UTF-8"); } readBytes() could reuse the same array if it were large enough. Then only the single char[] is created in the String code. -Ori

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Doug Cutting
[EMAIL PROTECTED] wrote: How will the difference impact String memory allocations? Looking at the String code, I can't see where it would make an impact. I spoke a bit too soon. I should have looked at the code first. You're right, I don't think it would require more allocations. When con

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Ken Krugler
Ken Krugler wrote: The remaining issue is dealing with old-format indexes. I think that revving the version number on the segments file would be a good start. This file must be read before any others. Its current version is -1 and would become -2. (All positive values are version 0, for b

Re: Lucene does NOT use UTF-8

2005-08-30 Thread Ken Krugler
On Monday 29 August 2005 19:56, Ken Krugler wrote: "Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data." But wouldn't UTF-16 mean 2 bytes per character? Yes, UTF-16 means two bytes per code unit. A Unicod
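The code-unit/character distinction above can be made concrete: a supplementary character (outside the BMP) occupies two UTF-16 code units, so String.length() reports 2 even though there is only one Unicode character. This small sketch uses the musical G clef, U+1D11E, as the example:

```java
// One Unicode character, two UTF-16 code units.
public class CodeUnitsVsChars {
    /** Length in Java chars (UTF-16 code units). */
    public static int codeUnits(String s)  { return s.length(); }

    /** Length in Unicode characters (code points). */
    public static int codePoints(String s) { return s.codePointCount(0, s.length()); }
}
```

This is exactly why "length in Java chars" and "length in characters" diverge for a VInt prefix.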

Re: Lucene does NOT use UTF-8

2005-08-30 Thread DM Smith
Daniel Naber wrote: On Monday 29 August 2005 19:56, Ken Krugler wrote: "Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data." But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the

RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Ken Krugler
I think the VInt should be the number of bytes to be stored using the UTF-8 encoding. It is trivial to use the String methods identified before to do the conversion. The String(char[]) allocates a new char array. For performance, you can use the actual CharSet encoding classes - avoiding all of

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread DM Smith
Ken Krugler wrote: I think the VInt should be the number of bytes to be stored using the UTF-8 encoding. It is trivial to use the String methods identified before to do the conversion. The String(char[]) allocates a new char array. For performance, you can use the actual CharSet encoding c

Re: Lucene does NOT use UTF-8

2005-08-30 Thread Steven Rowe
DM Smith wrote: Daniel Naber wrote: But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the case. UTF-16 is a fixed 2 byte/char representation. Except when it's not. I.e., above the BMP. From the Unicode 4.0 standard

RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Robert Engels
I think you guys are WAY overcomplicating things, or you just don't know enough about the Java class libraries. If you use the java.nio.charset.CharsetEncoder class, then you can reuse the byte[] array, and then it is a simple write of the length, and a blast copy of the required number of bytes t

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
> Sure you can. Do a "tell" to get the position. Write any number. The representation of the number is variable sized... you can't use a placeholder. -Yonik Now hiring -- http://tinyurl.com/7m67g
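Yonik's objection can be shown directly: a VInt takes one byte for values below 128 but two bytes for 128..16383, so you cannot reserve placeholder space, write the string, then seek back and fill in the length. A sketch of Lucene-style VInt encoding (7 bits per byte, high bit meaning "more bytes follow"):

```java
import java.io.ByteArrayOutputStream;

// Variable-length int: the encoded size depends on the value,
// which is why a fixed-size placeholder cannot be overwritten later.
public class VInt {
    public static byte[] encode(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {        // more than 7 bits remain
            out.write((i & 0x7F) | 0x80); // low 7 bits, continuation flag set
            i >>>= 7;
        }
        out.write(i);                     // final byte, high bit clear
        return out.toByteArray();
    }
}
```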

RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Robert Engels
A bit more clarity... Using CharBuffer and ByteBuffer allows for easy reuse and expansion. You also need to use the CharsetDecoder class as well. -Original Message- From: Robert Engels [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 30, 2005 12:40 PM To: java-dev@lucene.apache.org Subjec

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
> I think you guys are WAY overcomplicating things, or you just don't know > enough about the Java class libraries. People were just pointing out that if the vint isn't String.length(), then one has to either buffer the entire string, or pre-scan it. It's a valid point, and CharsetEncoder doesn

RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Robert Engels
Not true. You do not need to pre-scan it. When you use CharsetEncoder, it will write the bytes to a buffer (expanding as needed). At the end of the encoding you can get the actual number of bytes needed. The pseudo-code is: use CharsetEncoder to write the String to a ByteBuffer; write the VInt using the ByteBu
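Robert's pseudo-code could be sketched roughly as below: encode into a reusable ByteBuffer, then read the byte count off the buffer's position, so no pre-scan of the string is needed. The buffer here is a fixed 64 KB for brevity; a real version would grow it on CoderResult.OVERFLOW, and the class/method names are illustrative:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class EncodeThenMeasure {
    private final CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
    private final ByteBuffer buf = ByteBuffer.allocate(1 << 16); // reused across calls

    /** Encodes s as UTF-8 into the shared buffer; returns the byte count. */
    public int encodedLength(String s) {
        buf.clear();
        encoder.reset();
        encoder.encode(CharBuffer.wrap(s), buf, true); // overflow handling elided
        encoder.flush(buf);
        return buf.position(); // "write VInt using ByteBuffer position"
    }
}
```

This matches Yonik's framing: the whole string is buffered (once, in a reusable buffer) rather than pre-scanned.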

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
On 8/30/05, Robert Engels <[EMAIL PROTECTED]> wrote: > > Not true. You do not need to pre-scan it. What I previously wrote, with emphasis on key words added: "one has to *either* buffer the entire string, *or* pre-scan it." -Yonik Now hiring -- http://tinyurl.com/7m67g

RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Robert Engels
Since the buffer can be reused, seems that is the proper choice, and the "increased memory" you cited originally is not an issue. -Original Message- From: Yonik Seeley [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 30, 2005 1:07 PM To: java-dev@lucene.apache.org; [EMAIL PROTECTED] Subject

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
I've been looking around... do you have a pointer to the source where just the suffix is converted from UTF-8? I understand the index format, but I'm not sure I understand the problem that would be posed by the prefix length being a byte count. -Yonik Now hiring -- http://tinyurl.com/7m67g On

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Doug Cutting
Yonik Seeley wrote: I've been looking around... do you have a pointer to the source where just the suffix is converted from UTF-8? I understand the index format, but I'm not sure I understand the problem that would be posed by the prefix length being a byte count. TermBuffer.java:66 Things

Re: Lucene does NOT use UTF-8

2005-08-30 Thread Ken Krugler
Daniel Naber wrote: On Monday 29 August 2005 19:56, Ken Krugler wrote: "Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data." But wouldn't UTF-16 mean 2 bytes per character? That doesn't seem to be the cas

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
> The inefficiency would be if prefix were re-converted from UTF-8 > for each term, e.g., in order to compare it to the target. Ahhh, gotcha. A related problem exists even if the prefix length vInt is changed to represent the number of unicode chars (as opposed to number of java chars), right?

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Doug Cutting
Yonik Seeley wrote: A related problem exists even if the prefix length vInt is changed to represent the number of unicode chars (as opposed to number of java chars), right? The prefix length is no longer the offset into the char[] to put the suffix. Yes, I suppose this is a problem too. Sigh
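The offset problem Yonik and Doug agree on can be stated in code: if the shared-prefix length were counted in Unicode code points rather than Java chars, it would no longer be the index in the char[] at which to copy the suffix whenever the prefix contains a supplementary character. A small sketch:

```java
// Converts a length-in-code-points into the char[] index it corresponds to.
// The two differ as soon as a supplementary character appears in the prefix.
public class PrefixOffset {
    public static int charOffset(String s, int codePoints) {
        return s.offsetByCodePoints(0, codePoints);
    }
}
```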

Re: Lucene does NOT use UTF-8

2005-08-30 Thread Tom White
On 8/30/05, Ken Krugler <[EMAIL PROTECTED]> wrote: > > >Daniel Naber wrote: > > > >>On Monday 29 August 2005 19:56, Ken Krugler wrote: > >> > >>>"Lucene writes strings as a VInt representing the length of the > >>>string in Java chars (UTF-16 code units), followed by the character > >>>data." > >>

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Ken Krugler
Yonik Seeley wrote: A related problem exists even if the prefix length vInt is changed to represent the number of unicode chars (as opposed to number of java chars), right? The prefix length is no longer the offset into the char[] to put the suffix. Yes, I suppose this is a problem too. Sigh

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Yonik Seeley
> Where/how is the Lucene ordering of terms used? An ordering is necessary to be able to find things in the index. For the most part, the ordering doesn't seem to matter... the only query that comes to mind where it does matter is RangeQuery. For sorting queries, one is able to specify a Locale. -

Re: Lucene does NOT use UTF-8.

2005-08-30 Thread Doug Cutting
Yonik Seeley wrote: Where/how is the Lucene ordering of terms used? An ordering is necessary to be able to find things in the index. For the most part, the ordering doesn't seem to matter... the only query that comes to mind where it does matter is RangeQuery. For back-compatibility it would be
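Where the two orderings the thread is circling around actually diverge: Java's String.compareTo sorts by UTF-16 code unit, so a supplementary character (stored as surrogates in 0xD800-0xDFFF) sorts *before* U+E000..U+FFFF, while UTF-8 byte order (which equals code point order) puts it after. A sketch, comparing only the first code point for simplicity:

```java
// UTF-16 code unit order vs. code point (= UTF-8 byte) order.
public class OrderDemo {
    public static int utf16Order(String a, String b) {
        return Integer.signum(a.compareTo(b)); // compares char by char
    }

    public static int codePointOrder(String a, String b) {
        // Illustrative: compares just the leading code point of each string.
        return Integer.signum(Integer.compare(a.codePointAt(0), b.codePointAt(0)));
    }
}
```

This is the kind of discrepancy a RangeQuery over an index whose terms mix BMP and supplementary characters would expose.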