SV: Changing the Scoring api

2006-09-12 Thread Marcus Falck
Hi Hoss,

No, there wasn't anything wrong with your suggestions, except that they had
landed in my junk mail for some reason (stupid Outlook).

I haven't had a chance to test all of your suggestions yet, but I had already
implemented my own Similarity class that has coord fixed to 1, and it
doesn't work as expected.


/
Marcus

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: 11 September 2006 20:15
To: Lucene Users
Subject: Re: Changing the Scoring api


: I want to override the default scoring when it comes to queries
: containing the OR operator.

this message seems to be an exact repost of your question from last friday
... was there something wrong with the suggestions i included in my reply
to it?

http://www.nabble.com/Changing-the-Scoring-api-for-OR-parameters-tf2237565.html



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







SV: Changing the Scoring api

2006-09-12 Thread Marcus Falck
However, BooleanQuery's disableCoord does seem to have an effect.
But I still have the problem when I'm constructing queries with wildcards.


/
Marcus



Re: Highligher Example

2006-09-12 Thread Tom Emerson

Autonomy's KeyView is an alternative to Stellent. It does not cover all of
the file formats that Stellent does, though many of them are probably not
interesting for most applications. When I last looked at it it did not
handle mail archives, though there was a plan to add it. I found it more
stable than Stellent, and it has a JNI interface that works quite well. It
is still quite expensive, however.

PDFBox works, but we found it to be really really slow.

YMMV,

-tree

--
Tom Emerson
[EMAIL PROTECTED]
http://www.dreamersrealm.net/~tree


Re: getCurrentVersion question

2006-09-12 Thread Tom Emerson

As far as I know there isn't a way to do this. What we do is add a
"metadata" document to each index that includes the creation date, the user
name of the creating user, and various other tidbits. This gets updated on
incremental updates to the index as well. Easily done and makes it easy to
query.

On 9/9/06, Mag Gam <[EMAIL PROTECTED]> wrote:


Hi All,

I am trying to get the exact date when my index was created. I am assuming
getCurrentVersion() is the right way of doing it. However, I am getting a
result something like this: 1157817833085

According to the API reference,
"Reads version number from segments files. The version number is initialized
with a timestamp and then increased by one for each change of the index."

So, to get the date of this, I should be doing something like this:
date=1157817833085-1;

Any thoughts?
tia
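
A note on the arithmetic proposed above: the quoted API text only says the
version starts out as a timestamp and grows by one per change, so at best it
approximates the creation time. Below is a minimal sketch, assuming the
initial value is in epoch milliseconds (the API text does not state the unit)
and that only a few changes have been made since. VersionTimestamp and
approximateCreation are hypothetical names, not part of Lucene.

```java
import java.time.Instant;

final class VersionTimestamp {
    // Interprets an index version number as epoch milliseconds.
    // Assumption: the version was initialized from a millisecond clock and has
    // been incremented only a few times, so the error is a few milliseconds.
    static Instant approximateCreation(long version) {
        return Instant.ofEpochMilli(version);
    }

    public static void main(String[] args) {
        // The value reported in the question.
        System.out.println(approximateCreation(1157817833085L));
    }
}
```

For the value in the question this prints 2006-09-09T16:03:53.085Z. Subtracting
one per change would only give the exact creation time if the number of
changes were also known, which the version number alone does not tell you.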





--
Tom Emerson
[EMAIL PROTECTED]
http://www.dreamersrealm.net/~tree


Storing fields without term positions

2006-09-12 Thread Timo Nentwig
Hi everybody,

is it possible to store fields without term position data (the .prx file)? We
store a kind of custom data in the field and use it as a sort of filter for
queries, so we don't need any term position data, and it bloats the index
size by nearly a factor of 3.

Thanks
Timo




Re: SV: Changing the Scoring api

2006-09-12 Thread Chris Hostetter

: However, BooleanQuery's disableCoord does seem to have an effect.
: But I still have the problem when I'm constructing queries with wildcards.

really? ... that's strange, WildcardQuery uses the disableCoord feature of
BooleanQuery.  Do you have an example of what you mean?

: already had implemented my own similarity class that has the coord fixed
: to 1. And it doesn't work as expected.

are you setting your Similarity as the default on your IndexSearcher prior
to executing your Queries?


-Hoss





group field selection of the form field:(a b c)

2006-09-12 Thread Pramodh Shenoy
Hi Eric/Usergroup,

I am working on a help content index-search project based on Lucene.
One of my requirements is to search for a particular text in the content
of files from specific directories.

When I index the content, e.g. guides/accountmanagement/index.htm and
guides/databasemanagement/index.htm, I add the following fields:

doc.add(new Field("booktype", "guides", Field.Store.YES,
        Field.Index.UN_TOKENIZED));
doc.add(new Field("subtype", "accountmanagement", Field.Store.YES,
        Field.Index.UN_TOKENIZED));
doc.add(new Field("subtype", "databasemanagement", Field.Store.YES,
        Field.Index.UN_TOKENIZED));
doc.add(new Field("content",
        all-content-read-from-html-body-as-a-string, Field.Store.NO,
        Field.Index.TOKENIZED));

Now I want to search for all occurrences of "management" in the
"content" field (which occurs in the body of both of the above index.htm
files), in files under subtype/accountmanagement and under
subtype/databasemanagement.

I am creating the query as below:

String[] queries = new String[3];   // use new String[4] for option B
String[] fields = new String[3];    // use new String[4] for option B
BooleanClause.Occur[] flags = new BooleanClause.Occur[3];   // use [4] for option B

queries[0] = " guides ";
fields[0] = " booktype ";
flags[0] = BooleanClause.Occur.MUST;

queries[1] = " management ";
fields[1] = "content";
flags[1] = BooleanClause.Occur.MUST;

/* ### Option A ### */
queries[2] = " accountmanagement databasemanagement ";
fields[2] = " subtype ";
flags[2] = BooleanClause.Occur.MUST;

/* ### Option B ###
queries[2] = " accountmanagement";
fields[2] = "subtype";
flags[2] = BooleanClause.Occur.MUST;

queries[3] = " databasemanagement ";
fields[3] = " subtype ";
flags[3] = BooleanClause.Occur.MUST;
*/

Query queryObj = null;

// parse the query strings
try {
    queryObj = MultiFieldQueryParser.parse(queries, fields,
            flags, new StandardAnalyzer());
} catch (ParseException exp) { }

 

 

With option A, the generated query looks like:

+booktype:guides +content:management +(subtype: accountmanagement
subtype: databasemanagement)

With option B, the generated query looks like:

+booktype:guides +content:management +subtype: accountmanagement
+subtype: databasemanagement

Both return no hits!

Any idea how I should create the query? In Lucene in Action, this is
explained as "you can group field selection over several terms using
field:(a b c)". How can I achieve this with the code above?

Thanks
Pramodh


Re: getCurrentVersion question

2006-09-12 Thread Mag Gam

Tom:

great! Now how do you add metadata? I am new to the Lucene API and Java, but
willing to learn.

Got an example?

TIA





Re: Using Hibernate to store Lucene Indexes in a Database

2006-09-12 Thread Beady Geraghty

I don't know if the use of a DATALINK data type would be relevant in your
case.
Here are some references.
http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/start/c0005450.htm
http://www.oracle.com/technology/sample_code/tech/java/codesnippet/jdbc/datalink/readme.html





On 9/9/06, Néstor Boscán <[EMAIL PROTECTED]> wrote:


Tomi, thanks for your thoughts. I'm new to Lucene, so coming from an Oracle
background my mind is set that everything goes inside the database. Now that
I know some of the losses I can have a better picture.

Regards,

Néstor Boscán

-Original Message-
From: Tomi NA [mailto:[EMAIL PROTECTED]
Sent: Friday, 08 September 2006 05:21 p.m.
To: java-user@lucene.apache.org
Subject: Re: Using Hibernate to store Lucene Indexes in a Database

On 9/8/06, Néstor Boscán <[EMAIL PROTECTED]> wrote:
> To reduce administration tasks. If you want to move your application from
> server to server you'll have to move the index files. I want to be able to
> move my application by just moving my database schema and deploying an EAR.
>
> Regards,
>
> Néstor Boscán

Funny, I felt the same way about file-based storage: you simply pack
it up using any of the numerous file transfer tools available and you
don't have to worry about any of the database issues (possible
uncompressed large dump over the network, is the database server
running etc.).
On the other hand, if your application utilizes a database anyway, it
might be doable, assuming the app can take the performance penalty.
I'd be hard pressed to come up with a scenario where the gains
(simpler backup) would outweigh the losses (having to learn to store
the index into the database, performance, database bloat), though.
Still, it might only be my lack of imagination that's the problem. :)

t.n.a.





Re: group field selection of the form field:(a b c)

2006-09-12 Thread Erick Erickson

Interestingly, you have extra spaces when you construct your queries, e.g.
queries[2]= " accountmanagement" has an extra space at the beginning but
when you index the document, there are no spaces. I believe that since
you're indexing the fields UN_TOKENIZED, that the spaces are preserved in
the query (but I'm not entirely clear on this point, so don't take my word
for it completely ).

Have you used Luke to examine your index? You can also put parsed form of
the query into Luke and play around with that to see what *should* work.
Google lucene luke and you'll find it right away.

Best
Erick



Re: getCurrentVersion question

2006-09-12 Thread Erick Erickson

Just add another document (I do something similar). The key is to remember
that documents in the same index do NOT have to have the same fields. So,
say for your "regular" documents, you have fields (f1, f2, f3, f4). For your
meta-data document, you index fields (md1, md2, md3...). The value for one
of these fields should be a known value (the value itself is arbitrary; it
just has to be something you can search for later). Say, index a value of
"1" for md1 in your meta-data document.

Now, to get your meta-data document, do a simple search on your known value
(e.g. md1="1") and read the rest of the document in whatever form is most
convenient. You can stuff anything you want in there, however you want. You
could index one field for everything you care about, or put it in a glob
that you parse. It's completely up to you.

The beauty of this is that, if you want to change your meta-data, all you
have to do is delete your meta-data doc and re-add it with new values, you
don't have to regenerate your index. And since your fields are orthogonal,
there's no danger of getting your meta-data doc as part of your regular
search.

One word of warning. Do NOT depend on the internal Lucene doc IDs (e.g.
reader.doc(idx)) being consistent. These internal numbers are not guaranteed
to be the same across an index optimize.

Hope this helps
Erick




RE: group field selection of the form field:(a b c)

2006-09-12 Thread Pramodh Shenoy
The spaces just crept in, I guess, when I copied the code to Outlook :-);
actually there aren't any. Let me take a look at Luke, especially to test what
should be returned when I run the parsed query. Sounds very interesting.
 
Thanks a lot
Pramodh






UTF8 accents & umlauts filter?

2006-09-12 Thread Michael Imbeault
Right now Lucene has an accent filter (ISOLatin1AccentFilter) that
removes accents from ISO-8859-1 text. What about a UTF8AccentFilter? Are
there plans to add such a filter? It would be very useful, as
ISOLatin1AccentFilter isn't able to remove some complex accents in some
languages encoded in UTF8 (I would paste examples, but I'm not sure that
they would display correctly). I think I saw a post long ago on this
mailing list about something like that, but it has never been released
officially.


See

2001, first post about utf8 accents: 
http://www.gossamer-threads.com/lists/lucene/java-user/648?search_string=accent;#648
2004, a good solution, but still incomplete : 
http://www.gossamer-threads.com/lists/lucene/java-user/10792?search_string=accent;#10792
2006, best attempt yet, but sadly undelivered : 
http://www.gossamer-threads.com/lists/lucene/java-user/32142?search_string=accent;#32142


I think Lucene would benefit from a complete UTF8 accents remover... 
right now the best solution I have is to process everything in PHP 
before indexing and at query time (and it's a little slow).


Thanks,

--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212





Re: UTF8 accents & umlauts filter?

2006-09-12 Thread Yonik Seeley

Thanks for the links Michael... this one does look interesting:
http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt
The challenge would be to make it fast... perhaps a custom hash table,
or look into the cost of a perfect hash function.
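
For the Latin-1 range alone, a plain array indexed by code point may already
be fast enough, with no hashing at all. The sketch below is only an
illustration of that idea, not how ISOLatin1AccentFilter or the linked
mapping table works. Latin1FoldTable is a hypothetical name, and the table is
filled from java.text.Normalizer (Java 6 or later) instead of being typed by
hand.

```java
import java.text.Normalizer;

final class Latin1FoldTable {
    // One entry per code point 0..255; built once, then each lookup is an array read.
    private static final char[] TABLE = new char[256];

    static {
        for (int c = 0; c < 256; c++) {
            String decomposed =
                    Normalizer.normalize(String.valueOf((char) c), Normalizer.Form.NFD);
            char base = decomposed.charAt(0);
            // Fold only characters that decompose into an ASCII letter followed
            // by further characters (the accents); everything else maps to itself.
            boolean accentedLetter =
                    decomposed.length() > 1 && base < 128 && Character.isLetter(base);
            TABLE[c] = accentedLetter ? base : (char) c;
        }
    }

    // Replaces accented Latin-1 letters by their base letter; other chars pass through.
    static String fold(String input) {
        char[] out = input.toCharArray();
        for (int i = 0; i < out.length; i++) {
            if (out[i] < 256) {
                out[i] = TABLE[out[i]];
            }
        }
        return new String(out);
    }
}
```

Characters without a canonical decomposition, such as the German sharp s or
the ligature AE, are left unchanged here; a hand-written table would be
needed to map those as well. Code points above 255 pass through untouched,
which is where a hash table or a larger table would come in.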

Just to clear up some unicode/terminology issues:

There are latin1 characters (the actual glyphs) represented by unicode
code points 0->255
There is also a latin1 encoding for unicode (which can only represent
unicode code points 0->255)
UTF8 is another encoding for unicode characters (or code points), but
that's not really relevant to a filter.

So ISOLatin1AccentFilter removes accents from characters <= 255, and
it doesn't matter what the original encoding was (ascii, latin1, UTF8,
UTF16, etc)
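
The point about the original encoding can be checked with the standard
library alone. This is a small sketch (EncodingIndependence is a hypothetical
name): the same word is stored as Latin-1 bytes and as UTF-8 bytes, and once
both are decoded with the right charset the resulting Java strings are
identical, so a filter working on those strings cannot tell them apart.

```java
import java.nio.charset.StandardCharsets;

final class EncodingIndependence {
    // Returns true when the two byte forms differ but the decoded strings are equal.
    static boolean sameAfterDecoding() {
        String word = "caf\u00e9"; // e with acute accent, escaped to avoid source-encoding issues
        byte[] latin1Bytes = word.getBytes(StandardCharsets.ISO_8859_1); // 4 bytes
        byte[] utf8Bytes = word.getBytes(StandardCharsets.UTF_8);        // 5 bytes
        String fromLatin1 = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        String fromUtf8 = new String(utf8Bytes, StandardCharsets.UTF_8);
        return latin1Bytes.length != utf8Bytes.length && fromLatin1.equals(fromUtf8);
    }
}
```

This assumes the bytes are decoded with the charset they were written in;
decoding UTF-8 bytes as Latin-1 (or the reverse) yields different strings,
and no accent filter can repair that afterwards.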

-Yonik





RE: group field selection of the form field:(a b c)

2006-09-12 Thread Doron Cohen
I think option B cannot work because, due to the MUST operator, it requires
both "databasemanagement" and "accountmanagement" to be in the subtype
field.

Option A however should work, once the padding blank spaces are removed
from the field name - notice that while the standard analyzer would trim
spaces from the processed query text, the field names provided remain
unchanged - in this case - most probably - with the spaces.
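
One way to rule out the padding problem is to trim field names and values
before they reach the parser, and to build the grouped field:(a b c) form as
a plain query string. The following is a rough sketch only:
GroupedFieldQuery, requiredTerm and requiredGroup are hypothetical helpers,
they do no escaping of query-syntax characters, and the resulting string
would still have to be handed to a query parser with a suitable analyzer.

```java
import java.util.Arrays;
import java.util.List;

final class GroupedFieldQuery {
    // Builds "+field:(v1 v2 ...)", trimming padding from the field name and values.
    static String requiredGroup(String field, List<String> values) {
        StringBuilder sb = new StringBuilder("+").append(field.trim()).append(":(");
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(values.get(i).trim());
        }
        return sb.append(')').toString();
    }

    // Builds "+field:value" for a single required term.
    static String requiredTerm(String field, String value) {
        return "+" + field.trim() + ":" + value.trim();
    }

    public static void main(String[] args) {
        String query = requiredTerm(" booktype ", " guides ") + " "
                + requiredTerm("content", " management ") + " "
                + requiredGroup(" subtype ",
                        Arrays.asList(" accountmanagement", " databasemanagement "));
        System.out.println(query);
    }
}
```

With the padded inputs from the original post this prints
+booktype:guides +content:management +subtype:(accountmanagement databasemanagement).
Whether that query finds anything still depends on how the fields were
indexed and which analyzer is used at query time.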

An additional comment on the solution to this problem, though I'm not sure
it is relevant to your case:
In this case you had two paths:
/a/b
/a/c
And you managed (or will soon manage :-) to ask for a document in either of
the two paths by asking for "a" in the first part and "b" or "c" in the
second part.
However, if the "taxonomy" becomes more complex this may turn trickier.
For instance, suppose the scenario had the following possible paths:
   /a/b/c/d/e
   /a/b/c/x/y/z
   /a/b/d/x/f
etc., and assume you want all docs under the sub-tree defined by /a/b/c.
One possibility would be to index, for each document, all of its path
prefixes - i.e. for a document in /a/b/c/d/e add the (un-tokenized) path
tokens / , /a/ , /a/b/ , /a/b/c/ , /a/b/c/d/ , /a/b/c/d/e/ , /a/b/c/d/e
(the last token also allows searching for "exact node" matches, i.e. not
sub-tree matches.) I believe you can find useful discussions on this by searching in
the user mailing list for "path" or "hierarchy", and for sure there are
other approaches.
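
The prefix scheme described above can be sketched with the standard library
only. PathPrefixes and tokens are hypothetical names, and the sketch assumes
absolute paths with '/' separators and no trailing slash; each returned
string would then be added to the document as its own un-tokenized field
value.

```java
import java.util.ArrayList;
import java.util.List;

final class PathPrefixes {
    // For "/a/b/c" returns "/", "/a/", "/a/b/", "/a/b/c/" and the exact path "/a/b/c".
    static List<String> tokens(String path) {
        List<String> out = new ArrayList<String>();
        out.add("/");
        StringBuilder prefix = new StringBuilder("/");
        for (String part : path.split("/")) {
            if (part.isEmpty()) {
                continue; // skips the empty string before the leading '/'
            }
            prefix.append(part).append('/');
            out.add(prefix.toString());
        }
        if (out.size() > 1) {
            out.add(path); // exact-node token, without the trailing slash
        }
        return out;
    }
}
```

With these tokens indexed, a sub-tree search for everything under /a/b/c
becomes a single term lookup for "/a/b/c/", while a lookup for "/a/b/c"
matches only documents located exactly at that node. The cost is one extra
term per path level for every document.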

Re: UTF8 accents & umlauts filter?

2006-09-12 Thread Ken Krugler

> Thanks for the links Michael... this one does look interesting:
> http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt
> The challenge would be to make it fast... perhaps a custom hash table,
> or look into the cost of a perfect hash function.
>
> Just to clear up some unicode/terminology issues:

Some additional clarification below:

> There are latin1 characters (the actual glyphs) represented by unicode
> code points 0->255

Just U+00A0 -> U+00FF would be considered Latin-1 by Unicode.

Unicode calls the block of Unicode code points from U+0000 -> U+007F
"C0 Controls and Basic Latin".

From U+0080 to U+00FF is "C1 Controls and Latin-1 Supplement".

> There is also a latin1 encoding for unicode (which can only represent
> unicode code points 0->255)

There's an ISO 8859-1 charset (combination of character set, code
points and encoding) that matches Unicode code points for 0x00 ->
0x7F and 0xA0 -> 0xFF. Or rather, the Unicode code points for these
two ranges were selected to match ISO 8859-1.

> UTF8 is another encoding for unicode characters (or code points), but
> that's not really relevant to a filter.
>
> So ISOLatin1AccentFilter removes accents from characters <= 255, and
> it doesn't matter what the original encoding was (ascii, latin1, UTF8,
> UTF16, etc)

This isn't really about the "original encoding" - by the time
ISOLatin1AccentFilter is called, it's dealing with Java strings,
which use the UTF-16 Unicode encoding.

I think what Michael is asking for is the implementation of one of 
the Unicode-defined normalization forms (see 
http://www.unicode.org/reports/tr15/) along with diacritical 
stripping and other folding. Basically it's a way of mapping 
characters to a primary sort key.


This is pretty complex, especially when you start considering 
locale-specific details - we used ICU support for this in the past, 
which is where I'd probably start. ICU needs a lot of data to handle 
this properly across most locales, so it's not lightweight, but it 
would give you a general (albeit slower) solution.
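
As a much smaller stand-in for the ICU route, the JDK's own
java.text.Normalizer (Java 6 or later) can do the decomposition step, after
which the combining marks are simply dropped. This is a sketch under that
assumption, not a substitute for ICU: DiacriticStripper is a hypothetical
name, there is no locale-specific behaviour, and letters that have no
canonical decomposition (the Polish l with stroke, the sharp s, and so on)
stay as they are.

```java
import java.text.Normalizer;

final class DiacriticStripper {
    // Decomposes to NFD, then deletes all combining marks (Unicode category M).
    static String strip(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Unlike a Latin-1-only filter, this handles accented letters anywhere in
Unicode, including letters carrying several marks at once. It would have to
be applied both at index time and at query time so that the two sides agree.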


-- Ken



On 9/12/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:

Right now Lucene has an accent filter (ISOLatin1AccentFilter) that
remove accents on ISO-8859-1 text. What about a UTF8AccentFilter? Is it
planned to add such a filter (which would be very useful, as
ISOLatin1AccentFilter isn't able to remove some complex accents on some
languages encoded in UTF8. I would paste examples but I'm not sure that
they would display correctly).? I think I saw a post long ago on this
mailing list about something like that, but it has never been released
officially.

See

2001, first post about utf8 accents:
http://www.gossamer-threads.com/lists/lucene/java-user/648?search_string=accent;#648
2004, a good solution, but still incomplete :
http://www.gossamer-threads.com/lists/lucene/java-user/10792?search_string=accent;#10792
2006, best attempt yet, but sadly undelivered :
http://www.gossamer-threads.com/lists/lucene/java-user/32142?search_string=accent;#32142

I think Lucene would benefit from a complete UTF-8 accent remover...
right now the best solution I have is to process everything in PHP
before indexing and at query time (and it's a little slow).





--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"




Re: group field selection of the form field:(a b c)

2006-09-12 Thread Erick Erickson

As long as the field is added to the *same* document, I don't see a problem
with option B, although I'll admit that I haven't used
MultiFieldQueryParser. But there was a discussion a while ago about adding
tokens with the same field name to a document via document.add being exactly
the same as adding a larger batch of text in a single doc.add.

That said, though, I'm totally unclear about how that interacts with
UN_TOKENIZED. Hm.

Pramodh:
I hadn't thought of this. You may want to index the field as TOKENIZED first
to see what happens; that's more intuitive. UN_TOKENIZED may be the culprit
here. Luke should tell you a lot.

Erick

On 9/12/06, Doron Cohen <[EMAIL PROTECTED]> wrote:


I think option B cannot work because, due to the MUST operator, it requires
both "databasemanagement" and "accountmanagement" to be in the subtype
field.

Option A however should work, once the padding blank spaces are removed
from the field name - notice that while the standard analyzer would trim
spaces from the processed query text, the field names provided remain
unchanged - in this case - most probably - with the spaces.
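
A minimal sketch of the fix suggested above, assuming the padded strings really are what gets passed to the parser (the helper name is made up, and the Lucene call itself is left out): trim the field names before building the query, since a field name is matched as an exact string.

```java
// Hypothetical helper, not part of Lucene.
final class FieldNameTrimSketch {
    // Return a copy of the array with leading/trailing whitespace removed
    // from every entry. Assumes no entry is null.
    static String[] trimAll(String[] values) {
        String[] trimmed = new String[values.length];
        for (int i = 0; i < values.length; i++) {
            trimmed[i] = values[i].trim();
        }
        return trimmed;
    }

    public static void main(String[] args) {
        String[] fields = { " booktype ", "content", " subtype " };
        String[] clean = trimAll(fields);

        System.out.println(" booktype ".equals("booktype")); // prints "false"
        System.out.println(clean[0].equals("booktype"));     // prints "true"
    }
}
```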

An additional comment on the solution to this problem, though I'm not sure
it is relevant to your case:
In this case you had two paths:
/a/b
/a/c
And you managed (or would soon manage :-) to ask for a document in either
of the two paths by asking for "a" in the first part and "b" or "c" in the
second part. However, if the "taxonomy" becomes more complex, this may get
trickier. For instance, suppose the scenario had the following possible paths:
   /a/b/c/d/e
   /a/b/c/x/y/z
   /a/b/d/x/f
etc., and assume you want all docs under the sub-tree defined by /a/b/c.
One possibility would be to index, for each document, all path prefixes -
i.e. for a document in /a/b/c/d/e add the path tokens (un-tokenized) / ,
/a/ , /a/b/ , /a/b/c/ , /a/b/c/d/ , /a/b/c/d/e/ , /a/b/c/d/e   (the last
token would also allow searching for "exact node" matches, i.e. not
sub-tree matches). I believe you can find useful discussions on this by
searching the user mailing list for "path" or "hierarchy", and there are
surely other approaches.
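
The prefix idea above can be sketched in a few lines of plain Java. This is only an illustration: the class and method names are invented, and the Lucene side (adding each token as an un-tokenized field value on the document) is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class PathPrefixSketch {
    // For "/a/b/c" returns: "/", "/a/", "/a/b/", "/a/b/c/", "/a/b/c".
    // Prefixes end with a slash (sub-tree matches); the final token has
    // no trailing slash (exact-node match).
    static List<String> pathTokens(String path) {
        List<String> tokens = new ArrayList<>();
        tokens.add("/");
        StringBuilder prefix = new StringBuilder("/");
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip the empty piece before the leading slash
            }
            prefix.append(segment).append('/');
            tokens.add(prefix.toString());
        }
        if (prefix.length() > 1) {
            tokens.add(prefix.substring(0, prefix.length() - 1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : pathTokens("/a/b/c/d/e")) {
            System.out.println(token);
        }
    }
}
```

With tokens like these indexed on every document, asking for everything under /a/b/c would presumably become a single exact-term match on "/a/b/c/", while "/a/b/c" (no trailing slash) would match only documents located exactly at that node.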

"Pramodh Shenoy" <[EMAIL PROTECTED]> wrote on 12/09/2006 10:54:13:

> The spaces just crept in, I guess, when I copied the code to Outlook :-);
> actually there aren't any. Let me take a look at Luke, especially
> testing to see what should be returned when I run the parsed query.
> Sounds very interesting.
>
> Thanks a lot
> Pramodh
>
> 
>
> From: Erick Erickson [mailto:[EMAIL PROTECTED]
> Sent: Tue 9/12/2006 11:19 PM
> To: java-user@lucene.apache.org
> Subject: Re: group field selection of the form field:(a b c)
>
>
>
> Interestingly, you have extra spaces when you construct your queries, e.g.
> queries[2]= " accountmanagement" has an extra space at the beginning, but
> when you index the document, there are no spaces. I believe that since
> you're indexing the fields UN_TOKENIZED, the spaces are preserved in
> the query (but I'm not entirely clear on this point, so don't take my word
> for it completely).
>
> Have you used Luke to examine your index? You can also put the parsed form
> of the query into Luke and play around with that to see what *should* work.
> Google lucene luke and you'll find it right away.
>
> Best
> Erick
>
> On 9/12/06, Pramodh Shenoy <[EMAIL PROTECTED]> wrote:
> >
> > Hi Erick/Usergroup,
> >
> > I am working on a help content index-search project based on Lucene.
> > One of my requirements is to search for a particular text in the content
> > of files from specific directories. When I index the content
> >
> >
> >
> > Eg. guides/accountmanagement/index.htm and
> > guides/databasemanagement/index.htm
> >
> >
> >
> > doc.add(new Field("booktype", "guides", Field.Store.YES,
> > Field.Index.UN_TOKENIZED))
> >
> > doc.add(new Field("subtype", "accountmanagement", Field.Store.YES,
> > Field.Index.UN_TOKENIZED))
> >
> > doc.add(new Field("subtype", "databasemanagement", Field.Store.YES,
> > Field.Index.UN_TOKENIZED))
> >
> > doc.add(new Field("content",
> > all-content-read-from-html-body-as-a-string, Field.Store.NO,
> > Field.Index.TOKENIZED))
> >
> >
> >
> > Now I want to search for all occurrences of "management" in the
> > "content" field (which already exists in the body of both of the above
> > index.htm files), in files under subtype/accountmanagement and under
> > subtype/databasemanagement
> >
> >
> >
> > I am creating the query as below:
> >
> >
> >
> > String [] queries = new String [3];// =new String[4]
> >
> > String [] fields = new String [3];// =new String[4]
> >
> > BooleanClause.Occur[] flags = new BooleanClause.Occur[3];// =new String[4]
> >
> >
> >
> > queries[0]= " guides ";
> >
> > fields[0]=" booktype ";
> >
> > flags[0] = BooleanClause.Occur.MUST;
> >
> >
> >
> > queries[1]= " management ";
> >
> > fields[1]="content";
> >
> > flags[1] = BooleanClause.Occur.MUST;
> >
> >
> >
>