SV: Changing the Scoring api
Hi Hoss, No, there wasn't anything wrong with your suggestions, except that they had landed in my junk mail for some reason; stupid Outlook. I haven't had a chance to test all of your suggestions yet, but I had already implemented my own Similarity class that has the coord fixed to 1, and it doesn't work as expected. / Marcus

-Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: 11 September 2006 20:15 To: Lucene Users Subject: Re: Changing the Scoring api

: I want to override the default scoring when it comes to queries
: containing the OR operator.

this message seems to be an exact repost of your question from last Friday ... was there something wrong with the suggestions I included in my reply to it? http://www.nabble.com/Changing-the-Scoring-api-for-OR-parameters-tf2237565.html -Hoss
SV: Changing the Scoring api
However, BooleanQuery's disableCoord does seem to have an effect. But I still have the problem when I'm constructing queries with wildcards. / Marcus
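For reference, a minimal sketch (against the Lucene 2.x API, not code from this thread) of what disableCoord controls when building an OR query by hand; the field name and terms are illustrative only. Note this only disables coord for this particular query, whereas a custom Similarity changes it globally.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class NoCoordQuery {
    public static Query build() {
        // Passing true to the constructor disables the coord factor for this query.
        BooleanQuery bq = new BooleanQuery(true);
        bq.add(new TermQuery(new Term("content", "foo")), BooleanClause.Occur.SHOULD);
        bq.add(new TermQuery(new Term("content", "bar")), BooleanClause.Occur.SHOULD);
        return bq;
    }
}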
Re: Highlighter Example
Autonomy's KeyView is an alternative to Stellent. It does not cover all of the file formats that Stellent does, though many of them are probably not interesting for most applications. When I last looked at it, it did not handle mail archives, though there was a plan to add that. I found it more stable than Stellent, and it has a JNI interface that works quite well. It is still quite expensive, however. PDFBox works, but we found it to be really, really slow. YMMV, -tree -- Tom Emerson [EMAIL PROTECTED] http://www.dreamersrealm.net/~tree
Re: getCurrentVersion question
As far as I know there isn't a way to do this. What we do is add a "metadata" document to each index that includes the creation date, the user name of the creating user, and various other tidbits. This gets updated on incremental updates to the index as well. Easily done and makes it easy to query. On 9/9/06, Mag Gam <[EMAIL PROTECTED]> wrote: Hi All, I am trying to get the exact date when my index was created. I am assuming getCurrentVersion() is the right way of doing it. However, I am getting a result something like this: 1157817833085 According to the API reference, "Reads version number from segments files. The version number is initialized with a timestamp and then increased by one for each change of the index." So, to get the date of this, I should be doing something like this: date=1157817833085-1; Any thoughts? tia -- Tom Emerson [EMAIL PROTECTED] http://www.dreamersrealm.net/~tree
Storing fields without term positions
Hi everybody, is it possible to store fields without term position data (the .prx file)? We store a sort of custom data in the field and use it as a kind of filter for queries, so we just don't need any term position data, and it bloats the index size by nearly a factor of 3. Thanks Timo
Re: SV: Changing the Scoring api
: However, BooleanQuery's disableCoord does seem to have an effect.
: But I still have the problem when I'm constructing queries with wildcards.

really? ... that's strange, WildcardQuery uses the disableCoord feature of BooleanQuery. Do you have an example of what you mean?

: already had implemented my own Similarity class that has the coord fixed
: to 1, and it doesn't work as expected.

are you setting your Similarity as the default on your IndexSearcher prior to executing your Queries? -Hoss
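A minimal sketch of what Hoss is asking about, against the Lucene 2.x API: a Similarity with coord fixed to 1, set on the searcher before the query runs. The class name and index path are illustrative, not from the thread.

import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;

// A Similarity whose coord factor is always 1, so the number of matching
// optional clauses no longer scales the score.
public class NoCoordSimilarity extends DefaultSimilarity {
    public float coord(int overlap, int maxOverlap) {
        return 1.0f;
    }
}

// Usage: the Similarity must be set before the search is executed.
// IndexSearcher searcher = new IndexSearcher("/path/to/index");
// searcher.setSimilarity(new NoCoordSimilarity());
// Hits hits = searcher.search(query);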
group field selection of the form field:(a b c)
Hi Eric/Usergroup, I am working on a help content index-search project based on Lucene. One of my requirements is to search for a particular text in the content of files from specific directories. When I index the content, e.g. guides/accountmanagement/index.htm and guides/databasemanagement/index.htm:

doc.add(new Field("booktype", "guides", Field.Store.YES, Field.Index.UN_TOKENIZED))
doc.add(new Field("subtype", "accountmanagement", Field.Store.YES, Field.Index.UN_TOKENIZED))
doc.add(new Field("subtype", "databasemanagement", Field.Store.YES, Field.Index.UN_TOKENIZED))
doc.add(new Field("content", all-content-read-from-html-body-as-a-string, Field.Store.NO, Field.Index.TOKENIZED))

Now I want to search for all occurrences of "management" in the "content" field (which already exists in the body of both of the above index.htm files), in files under subtype/accountmanagement and under subtype/databasemanagement. I am creating the query as below:

String[] queries = new String[3]; // = new String[4] for option B
String[] fields = new String[3]; // = new String[4] for option B
BooleanClause.Occur[] flags = new BooleanClause.Occur[3]; // = new BooleanClause.Occur[4] for option B

queries[0]= " guides "; fields[0]=" booktype "; flags[0] = BooleanClause.Occur.MUST;
queries[1]= " management "; fields[1]="content"; flags[1] = BooleanClause.Occur.MUST;

/* A ### */
queries[2]= " accountmanagement databasemanagement "; fields[2]=" subtype "; flags[2] = BooleanClause.Occur.MUST;

/* # B ###
queries[2]= " accountmanagement"; fields[2]="subtype"; flags[2] = BooleanClause.Occur.MUST;
queries[3]= " databasemanagement "; fields[3]=" subtype "; flags[3] = BooleanClause.Occur.MUST;
*/

Query queryObj = null;
// parse the query string
try {
    queryObj = MultiFieldQueryParser.parse(queries, fields, flags, new StandardAnalyzer());
} catch (ParseException exp) { }

With option A, the query generated looks like: +booktype:guides +content:management +(subtype: accountmanagement subtype: databasemanagement)

With option B, the query generated looks like: +booktype:guides +content:management +subtype: accountmanagement +subtype: databasemanagement

Both return no hits! Any idea how I should create the query? In Lucene in Action, this is explained as "you can group field selection over several terms using field:(a b c)". How can I achieve this with the code above? Thanks Pramodh
Re: getCurrentVersion question
Tom: great! Now how do you add metadata? I am new to the Lucene API + Java, but willing to learn. Got an example? TIA
Re: Using Hibernate to store Lucene Indexes in a Database
I don't know if the use of a DATALINK data type would be relevant in your case. Here are some references. http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/start/c0005450.htm http://www.oracle.com/technology/sample_code/tech/java/codesnippet/jdbc/datalink/readme.html

On 9/9/06, Néstor Boscán <[EMAIL PROTECTED]> wrote: Tomi, thanks for your thoughts. I'm new to Lucene, so coming from an Oracle background my mind is set that everything goes inside the database. Now that I know some of the losses, I have a better picture. Regards, Néstor Boscán

-Original Message- From: Tomi NA [mailto:[EMAIL PROTECTED] Sent: Friday, 08 September 2006 05:21 p.m. To: java-user@lucene.apache.org Subject: Re: Using Hibernate to store Lucene Indexes in a Database

On 9/8/06, Néstor Boscán <[EMAIL PROTECTED]> wrote: > To reduce administration tasks. If you want to move your application from server to server you'll have to move the index files. I want to be able to move my application by just moving my database schema and deploying an ear. > Regards, > Néstor Boscán

Funny, I felt the same way about file-based storage: you simply pack it up using any of the numerous file transfer tools available, and you don't have to worry about any of the database issues (a possibly uncompressed large dump over the network, whether the database server is running, etc.). On the other hand, if your application utilizes a database anyway, it might be doable, assuming the app can take the performance penalty. I'd be hard pressed to come up with a scenario where the gains (simpler backup) would outweigh the losses (having to learn to store the index into the database, performance, database bloat), though. Still, it might just be my lack of imagination that's the problem. :) t.n.a.
Re: group field selection of the form field:(a b c)
Interestingly, you have extra spaces when you construct your queries, e.g. queries[2]= " accountmanagement" has an extra space at the beginning, but when you index the document there are no spaces. I believe that since you're indexing the fields UN_TOKENIZED, the spaces are preserved in the query (but I'm not entirely clear on this point, so don't take my word for it completely).

Have you used Luke to examine your index? You can also put the parsed form of the query into Luke and play around with that to see what *should* work. Google "lucene luke" and you'll find it right away.

Best Erick
Re: getCurrentVersion question
Just add another document (I do something similar). The key is to remember that documents in the same index do NOT have to have the same fields. So, say for your "regular" documents, you have fields (f1, f2, f3, f4). For your meta-data document, you index fields (md1, md2, md3...). The value for one of these fields should be a known value (note, the value is completely bogus, just so you remember it). Say you index a value of "1" for md1 in your meta-data document. Now, to get your meta-data document, do a simple search on your known value (e.g. md1="1") and read the rest of the document in whatever form is most convenient. You can stuff anything you want in there, however you want. You could index one field for everything you care about, or put it all in a glob that you parse. It's completely up to you.

The beauty of this is that, if you want to change your meta-data, all you have to do is delete your meta-data doc and re-add it with new values; you don't have to regenerate your index. And since your fields are orthogonal, there's no danger of getting your meta-data doc as part of your regular search.

One word of warning: do NOT depend on the internal Lucene doc IDs (e.g. reader.doc(idx)) being consistent. These internal numbers are not guaranteed to be the same across an index optimize.

Hope this helps. Erick
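A minimal sketch of the metadata-document approach Tom and Erick describe, against the Lucene 2.x API; the field names (md_marker, md_created, md_creator) and the marker value "1" are made up for illustration.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class IndexMetadata {
    // Add the metadata document; none of its fields overlap with the "regular" fields.
    public static void writeMetadata(IndexWriter writer) throws Exception {
        Document md = new Document();
        // Known, bogus marker value so the document can be found again later.
        md.add(new Field("md_marker", "1", Field.Store.YES, Field.Index.UN_TOKENIZED));
        md.add(new Field("md_created", Long.toString(System.currentTimeMillis()),
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        md.add(new Field("md_creator", System.getProperty("user.name"),
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(md);
    }

    // Fetch the metadata document with a simple TermQuery on the marker field.
    public static Document readMetadata(IndexSearcher searcher) throws Exception {
        Hits hits = searcher.search(new TermQuery(new Term("md_marker", "1")));
        return hits.length() > 0 ? hits.doc(0) : null;
    }
}

To change the metadata later, delete the old document (for example via IndexReader.deleteDocuments(new Term("md_marker", "1"))) and add a fresh one; the rest of the index is untouched.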
RE: group field selection of the form field:(a b c)
The spaces just came, I guess, when I copied the code to Outlook :-), actually there aren't any. Let me take a look at Luke, especially testing to see what should be returned when I run the parsed query... sounds very interesting. Thanks a lot Pramodh
UTF8 accents & umlauts filter?
Right now Lucene has an accent filter (ISOLatin1AccentFilter) that removes accents from ISO-8859-1 text. What about a UTF8AccentFilter? Is it planned to add such a filter? It would be very useful, as ISOLatin1AccentFilter isn't able to remove some complex accents in some languages encoded in UTF-8 (I would paste examples, but I'm not sure they would display correctly). I think I saw a post long ago on this mailing list about something like that, but it has never been released officially. See:

2001, first post about utf8 accents: http://www.gossamer-threads.com/lists/lucene/java-user/648?search_string=accent;#648
2004, a good solution, but still incomplete: http://www.gossamer-threads.com/lists/lucene/java-user/10792?search_string=accent;#10792
2006, best attempt yet, but sadly undelivered: http://www.gossamer-threads.com/lists/lucene/java-user/32142?search_string=accent;#32142

I think Lucene would benefit from a complete UTF8 accents remover... right now the best solution I have is to process everything in PHP before indexing and at query time (and it's a little slow). Thanks, -- Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212
Re: UTF8 accents & umlauts filter?
Thanks for the links Michael... this one does look interesting: http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt The challenge would be to make it fast... perhaps a custom hash table, or look into the cost of a perfect hash function.

Just to clear up some unicode/terminology issues: There are latin1 characters (the actual glyphs) represented by unicode code points 0->255. There is also a latin1 encoding for unicode (which can only represent unicode code points 0->255). UTF8 is another encoding for unicode characters (or code points), but that's not really relevant to a filter. So ISOLatin1AccentFilter removes accents from characters <= 255, and it doesn't matter what the original encoding was (ascii, latin1, UTF8, UTF16, etc).

-Yonik
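As a usage note (not part of Yonik's message): wiring the existing filter into an analyzer chain looks roughly like this against the Lucene 2.x API; the class name is made up, and the same analyzer should be used at index and query time.

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ISOLatin1AccentFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Wraps StandardAnalyzer so accented characters in the Latin-1 range are folded.
public class AccentFoldingAnalyzer extends Analyzer {
    private final Analyzer base = new StandardAnalyzer();

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new ISOLatin1AccentFilter(base.tokenStream(fieldName, reader));
    }
}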
RE: group field selection of the form field:(a b c)
I think option B cannot work because, due to the MUST operator, it requires both "databasemanagement" and "accountmanagement" to be in the subtype field. Option A, however, should work once the padding blank spaces are removed from the field name - notice that while the standard analyzer would trim spaces from the processed query text, the field names provided remain unchanged - in this case, most probably, with the spaces.

An additional comment - which I'm not sure is relevant to your case - on the solution to this problem: here you had two paths, /a/b and /a/c, and you managed (or will soon manage :-) to ask for a document in either of the two paths by asking for "a" in the first part and "b" or "c" in the second part. However, if the "taxonomy" becomes more complex this may turn trickier. For instance, suppose the scenario had the following possible paths: /a/b/c/d/e, /a/b/c/x/y/z, /a/b/d/x/f, etc., and assume you want all docs under the sub-tree defined by /a/b/c. One possibility would be to index for each document all path prefixes - i.e. for a document in /a/b/c/d/e add the (un-tokenized) path tokens /, /a/, /a/b/, /a/b/c/, /a/b/c/d/, /a/b/c/d/e/, /a/b/c/d/e (the latter token would also allow searching for "exact node" matches, i.e. not sub-tree matches). I believe you can find useful discussions on this by searching the user mailing list for "path" or "hierarchy", and for sure there are other approaches.
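A minimal sketch of the path-prefix indexing Doron describes (not code from the thread); the field name "path" is illustrative, and paths are assumed to be /a/b/c-style with no trailing slash.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class PathPrefixIndexer {
    // For a document under /a/b/c/d/e, adds the un-tokenized "path" values
    // /, /a/, /a/b/, /a/b/c/, /a/b/c/d/, /a/b/c/d/e/ and /a/b/c/d/e, so a
    // TermQuery on path:/a/b/c/ matches the whole sub-tree and the last
    // value supports exact-node matches.
    public static void addPathPrefixes(Document doc, String path) {
        doc.add(new Field("path", "/", Field.Store.YES, Field.Index.UN_TOKENIZED));
        String[] parts = path.split("/");
        StringBuffer prefix = new StringBuffer("/");
        for (int i = 1; i < parts.length; i++) {
            prefix.append(parts[i]).append("/");
            doc.add(new Field("path", prefix.toString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
        }
        doc.add(new Field("path", path, Field.Store.YES, Field.Index.UN_TOKENIZED));
    }
}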
Re: UTF8 accents & umlauts filter?
> Thanks for the links Michael... this one does look interesting:
> http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt
> The challenge would be to make it fast... perhaps a custom hash table, or
> look into the cost of a perfect hash function.
> Just to clear up some unicode/terminology issues:

Some additional clarification below:

> There are latin1 characters (the actual glyphs) represented by unicode code points 0->255

Just U+00A0 -> U+00FF would be considered Latin-1 by Unicode. Unicode calls the block of Unicode code points from U+0000 -> U+007F "C0 Controls and Basic Latin". From U+0080 to U+00FF is "C1 Controls and Latin-1 Supplement".

> There is also a latin1 encoding for unicode (which can only represent unicode code points 0->255)

There's an ISO 8859-1 charset (combination of character set, code points and encoding) that matches Unicode code points for 0x00 -> 0x7F and 0xA0 -> 0xFF. Or rather, the Unicode code points for these two ranges were selected to match ISO 8859-1.

> UTF8 is another encoding for unicode characters (or code points), but that's not really relevant to a filter.
> So ISOLatin1AccentFilter removes accents from characters <= 255, and it doesn't matter what the original encoding was (ascii, latin1, UTF8, UTF16, etc)

This isn't really about the "original encoding" - by the time ISOLatin1AccentFilter is called, it's dealing with Java strings, which use the UTF-16 Unicode encoding. I think what Michael is asking for is the implementation of one of the Unicode-defined normalization forms (see http://www.unicode.org/reports/tr15/) along with diacritical stripping and other folding. Basically it's a way of mapping characters to a primary sort key. This is pretty complex, especially when you start considering locale-specific details - we used ICU support for this in the past, which is where I'd probably start. ICU needs a lot of data to handle this properly across most locales, so it's not lightweight, but it would give you a general (albeit slower) solution.

-- Ken Krugler Krugle, Inc. +1 530-210-6378 "Find Code, Find Answers"
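One concrete folding approach along the lines Ken describes (a sketch, not the ICU-based solution he used): decompose to NFD and strip the combining marks. This assumes java.text.Normalizer from Java 6; on older JDKs, ICU4J's Normalizer provides the same decomposition.

import java.text.Normalizer;

public class AccentStripper {
    // Decompose accented characters (NFD), then remove the combining
    // diacritical marks left behind, e.g. an accented "e" becomes "e".
    // Locale-specific foldings, as Ken notes, need more than this.
    public static String strip(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }
}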
Re: group field selection of the form field:(a b c)
As long as the field is added to the *same* document, I don't see a problem with option B, although I'll admit that I haven't used MultiFieldQueryParser. But there was a discussion a while ago about adding tokens with the same field name to a document via document.add being exactly the same as adding a larger batch of text in a single doc.add. That said, though, I'm totally unclear about how that interacts with UN_TOKENIZED. Hm.

Pramodh: I hadn't thought of this. You may want to index the field TOKENIZED first to see what happens; that's more intuitive. UN_TOKENIZED may be a culprit here. Luke should tell you a lot.

Erick
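For what it's worth, a minimal sketch of the multi-valued field Erick is referring to: two Field instances with the same name added to one Document (Lucene 2.x API; the values are taken from the original question).

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class MultiValuedFieldExample {
    public static Document build() {
        Document doc = new Document();
        // Both values live under the single field name "subtype", so a
        // TermQuery on either value matches this document.
        doc.add(new Field("subtype", "accountmanagement", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("subtype", "databasemanagement", Field.Store.YES, Field.Index.UN_TOKENIZED));
        return doc;
    }
}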