RE: Rebuild after corruption
Make sure you close your IndexWriter: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#close()

-----Original Message-----
From: Steve Rajavuori [mailto:[EMAIL PROTECTED]]
Sent: Friday, May 21, 2004 7:49 PM
To: '[EMAIL PROTECTED]'
Subject: Rebuild after corruption

I have a problem periodically where the process updating my Lucene files terminates abnormally. When I try to open the Lucene files afterward I get an exception indicating that files are missing. Does anyone know how I can recover at this point, without having to rebuild the whole index from scratch?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
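The advice above (always close the IndexWriter, even when indexing fails) boils down to a try/finally pattern. A minimal sketch, with a hypothetical StubWriter standing in for org.apache.lucene.index.IndexWriter so only the shape of the pattern is shown:

```java
import java.io.Closeable;

public class SafeClose {
    // Stub standing in for org.apache.lucene.index.IndexWriter (an assumption
    // for this example; the real class writes segment files on close()).
    static class StubWriter implements Closeable {
        boolean closed = false;
        void addDocument(String doc) {
            if (doc == null) throw new IllegalArgumentException("bad doc");
        }
        public void close() { closed = true; }
    }

    // Always close the writer, even if indexing throws: an abnormally
    // terminated, unclosed writer is what leaves the index with missing
    // or partial files.
    static StubWriter indexSafely(String[] docs) {
        StubWriter writer = new StubWriter();
        try {
            for (String d : docs) writer.addDocument(d);
        } catch (IllegalArgumentException e) {
            // indexing failed, but the writer is still closed below
        } finally {
            writer.close();
        }
        return writer;
    }
}
```

The same finally-block discipline applies to the real IndexWriter; this stub only demonstrates that close() runs on both the success and failure paths.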
Rebuild after corruption
I have a problem periodically where the process updating my Lucene files terminates abnormally. When I try to open the Lucene files afterward I get an exception indicating that files are missing. Does anyone know how I can recover at this point, without having to rebuild the whole index from scratch?
Re: asktog on search problems
This is not specific advice, but an idea that I think Google leverages to build up search corrections. If a user searches for "100AW" and it doesn't match, but a moment later they try something different and immediately get to a product page, the system can make a loose connection between their original search and the product they soon thereafter found. Over time, the connections get stronger because others will do the same thing. I think term vectors could factor into making latent connections somehow also. Just postulating...

Erik

On May 21, 2004, at 12:09 PM, David Spencer wrote:

Haven't seen this discussed here. See 7a at the link below:
http://www.asktog.com/columns/062top10ReasonsToNotShop.html

7a talks about searching on a camera site for the "Lowepro 100 AW". He says this query works: "Lowepro 100 AW", and this query does not work: "Lowepro 100AW". Cross-checking with Google indeed shows that the 1st form is much more popular; however, the 2nd form is used, and if you're a commerce site, or a site that wants to make it easier for users to find things, you should help them out.

So the discussion question is: what's the best way to handle this? I guess the somewhat general form of this is that in a query, any term might be split into 2 terms that are individually indexed (so "100AW" is not indexed, but "100" and "AW" are). In a way the flip side of this is that any 2 terms could be concatenated to form another term that was indexed (so in another universe it might be that passing "100 AW" is not as precise as passing "100AW", but how's the user to know?).
In the context of Lucene, ways to handle this seem to be:
- automagically run a fuzzy query (so if a query doesn't work, transform "Lowepro 100AW" to "Lowepro~ 100AW~")
- write a query parser that breaks apart unindexed tokens into ones that are indexed (so "100AW" becomes "100 AW")
- write a tokenizer that inserts dummy tokens for every pair of tokens, so the stream "Lowepro 100 AW" would also have "Lowepro100" and "100AW" inserted, presumably via magic w/ TokenStream.next()

Comments on the best way to handle this?
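The second and third options are mostly string manipulation; in a real analyzer the variants would be emitted as additional tokens via TokenStream.next(). A rough sketch of just the splitting and pair-concatenation logic, outside any Lucene API (the class and method names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class TokenVariants {
    // Split a token like "100AW" at digit/letter boundaries -> ["100", "AW"],
    // mirroring the "break apart unindexed tokens" idea.
    static List<String> splitAlphaNum(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (cur.length() > 0
                    && Character.isDigit(c) != Character.isDigit(cur.charAt(cur.length() - 1))) {
                parts.add(cur.toString());
                cur.setLength(0);
            }
            cur.append(c);
        }
        if (cur.length() > 0) parts.add(cur.toString());
        return parts;
    }

    // Concatenate each adjacent pair, mirroring the "dummy tokens" idea:
    // ["Lowepro", "100", "AW"] -> ["Lowepro100", "100AW"].
    static List<String> adjacentPairs(List<String> tokens) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            pairs.add(tokens.get(i) + tokens.get(i + 1));
        }
        return pairs;
    }
}
```

In a custom tokenizer, the extra variants would presumably be emitted at the same position as the originals so phrase queries are unaffected.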
Re: StandardTokenizer and e-mail
Further on this... If you are using StandardTokenizer, the token for an e-mail address has the type value of "<EMAIL>", which you could use to pick up specifically in a custom TokenFilter implementation and split it how you like, passing through everything else. Take a look at StandardFilter's source code for an example of keying off the types emitted by StandardTokenizer.

Erik

On May 21, 2004, at 11:50 AM, Otis Gospodnetic wrote:

Si, si. Write your own TokenFilter subclass that overrides next(), extracts those other elements/tokens from an email address token, and uses Token's setPositionIncrement(0) to store the extracted tokens in the same position as the original email.

Otis

--- Albert Vila <[EMAIL PROTECTED]> wrote:

Hi all, I want to achieve the following: when indexing '[EMAIL PROTECTED]', I want to index the '[EMAIL PROTECTED]' token, then the 'xyz' token, the 'company' token, and the 'com' token. This way, you'll be able to find the document searching for '[EMAIL PROTECTED]', for 'xyz' only, or for 'company' only. How can I achieve that? Do I need to write my own tokenizer?

Thanks
Albert

--
Albert Vila
Director de proyectos I+D
http://www.imente.com
902 933 242
[iMente La información con más beneficios]
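The splitting itself is straightforward; the Lucene-specific part is emitting the extra tokens with setPositionIncrement(0) inside a TokenFilter. A sketch of just the e-mail decomposition, using a hypothetical helper class (not a Lucene API):

```java
import java.util.ArrayList;
import java.util.List;

public class EmailSubTokens {
    // Given "albert@xyz.company.com", return the whole address plus each
    // component, so a search for "xyz" or "company" alone can match.
    // In a TokenFilter, the components would be emitted as tokens with
    // position increment 0 so they overlay the original token's position.
    static List<String> expand(String email) {
        List<String> tokens = new ArrayList<>();
        tokens.add(email);                        // the original token
        for (String part : email.split("[@.]")) { // split on '@' and '.'
            if (!part.isEmpty()) tokens.add(part);
        }
        return tokens;
    }
}
```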
Re: asktog on search problems
I don't think the first solution will work, because the "100AW~" term must match either "100" or "AW", which are your index terms.

Coincidentally, I have been trying to deal with this very problem over the past few days. In my situation, I'm trying to help users find things when the spacing of their queries doesn't match the spacing in an indexed term. Possible errors can be divided into 2 classes.

1) The user leaves out a space where there ought to be one. Let's say the user is trying to find "blue bird" but types in the query "bluebird", thinking it is a single word. Lucene won't catch this because "blue" and "bird" are stored as single index tokens.

2) The user errantly inserts a space where there shouldn't be one. An example would be an index where the word "blackbird" is stored, but the user types in "black bird" as a query.

What I tried to do was create an alternate tokenizer which stored the entire string in the index in a different field, and perform a fuzzy search on the entire string. This is possible because I am only doing searches on strings of less than 40 characters on average. To take the "black bird" example, I would store the entire string into a field which doesn't tokenize on word boundaries. The query, in turn, would look something like this:

+title:black +title:bird OR fulltitle:black bird~

where the tilde applies to the entire "black bird" term. When I tested it, it appeared to work, but was really slow for large indexes. At about 4 entries, this query started to take 1 or 2 seconds, which was worse than my performance requirement.

Actually, I also thought of the last 2 things you suggested, and I was about to try them out. However, you do need to apply both of them. Adding additional concatenated index terms addresses the problem where users leave out spaces. Adding split terms helps users match terms in your index when they inject spaces incorrectly. This may balloon the memory consumption of your Lucene index.
However, you can use heuristics to avoid inserting extra terms which won't match likely errors. For example, you could decide that you only want to concatenate terms that are parts of model numbers. Or, if you are dealing with compound words, you can choose to only concatenate terms which are English words. For example, in my situation, concatenating "blue bird" as an extra term is useful, while doing the same with "Roy Orbison" is not, since people aren't likely to neglect the space in that situation.

Hope this helps.

Jeff

On Fri, 21 May 2004, David Spencer wrote:

> In the context of Lucene, ways to handle this seem to be:
> - automagically run a fuzzy query (so if a query doesn't work, transform "Lowepro 100AW" to "Lowepro~ 100AW~")
> - write a query parser that breaks apart unindexed tokens into ones that are indexed (so "100AW" becomes "100 AW")
> - write a tokenizer that inserts dummy tokens for every pair of tokens, so the stream "Lowepro 100 AW" would also have "Lowepro100" and "100AW" inserted, presumably via magic w/ TokenStream.next()
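The English-words heuristic above might look something like this sketch (the tiny word set and class name are invented for the example; a real implementation would consult a full dictionary):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ConcatHeuristic {
    // Tiny stand-in dictionary; a real one would be much larger.
    static final Set<String> WORDS =
            new HashSet<>(Arrays.asList("blue", "bird", "black"));

    // Only emit a concatenated variant when both parts are ordinary
    // dictionary words: "bluebird" is a likely user error for "blue bird",
    // while nobody is likely to type "royorbison".
    static boolean worthConcatenating(String a, String b) {
        return WORDS.contains(a.toLowerCase()) && WORDS.contains(b.toLowerCase());
    }
}
```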
Re: org.apache.lucene.search.highlight.Highlighter
Hi Claude, the example code you provided is out of date. For all concerned: the highlighter code was refactored about a month ago and then moved into the Sandbox.

Want the latest version? Get the latest code from the Sandbox CVS.
Want the latest docs? Run javadoc on the above.

There is a basic example of highlighter use in the package-level javadocs, and more extensive examples in the JUnit test that accompanies the source code. Hope this helps clarify things.

Mark

P.S. Bruce, I know you were interested in providing an alternative Fragmenter implementation for the highlighter that detects sentence boundaries. You may want to look at LingPipe, which has "a heuristic sentence boundary detector" (http://threattracker.com:8080/lingpipe-demo/demo.html). I took a quick look at it, but it has its own tokenizer that would be difficult to make work with the TokenStream used to identify query terms. At least the code gives some examples of the heuristics involved in detecting sentence boundaries. For my own apps I find the standard Fragmenter implementation suffices.
Re: org.apache.lucene.search.highlight.Highlighter
Arrgh, the attachment didn't make it; here it goes, sorry:

//perform a standard lucene query
searcher = new IndexSearcher(ramDir);
Analyzer analyzer = new StandardAnalyzer();
Query query = QueryParser.parse("Kenne*", FIELD_NAME, analyzer);
query = query.rewrite(reader); //necessary to expand search terms
Hits hits = searcher.search(query);

//create an instance of the highlighter with the tags used to surround highlighted text
QueryHighlightExtractor highlighter = new QueryHighlightExtractor(query, new StandardAnalyzer(), "<B>", "</B>");
for (int i = 0; i < hits.length(); i++)
{
    String text = hits.doc(i).get(FIELD_NAME);
    //call to highlight text with chosen tags
    String highlightedText = highlighter.highlightText(text);
    System.out.println(highlightedText);
}

If your documents are large, you can select only the best fragments from each document like this:

//...as above example
int highlightFragmentSizeInBytes = 80;
int maxNumFragmentsRequired = 4;
String fragmentSeparator = "...";
for (int i = 0; i < hits.length(); i++)
{
    String text = hits.doc(i).get(FIELD_NAME);
    String highlightedText = highlighter.getBestFragments(text, highlightFragmentSizeInBytes, maxNumFragmentsRequired, fragmentSeparator);
    System.out.println(highlightedText);
}

On May 21, 2004, at 9:22 AM, Claude Devarenne wrote:

Hi, here is the documentation Mark Harwood included in the original package. I followed his directions and it worked for me. Let me know if this doesn't do it for you.

Claude

On May 21, 2004, at 4:29 AM, Karthik N S wrote:

Hi, please can somebody give me a simple example of org.apache.lucene.search.highlight.Highlighter? I am trying to use it but am unsuccessful.

Karthik

WITH WARM REGARDS
HAVE A NICE DAY
[ N.S.KARTHIK]
Re: org.apache.lucene.search.highlight.Highlighter
Hi, here is the documentation Mark Harwood included in the original package. I followed his directions and it worked for me. Let me know if this doesn't do it for you.

Claude

On May 21, 2004, at 4:29 AM, Karthik N S wrote:

Hi, please can somebody give me a simple example of org.apache.lucene.search.highlight.Highlighter? I am trying to use it but am unsuccessful.

Karthik

WITH WARM REGARDS
HAVE A NICE DAY
[ N.S.KARTHIK]
asktog on search problems
Haven't seen this discussed here. See 7a at the link below:
http://www.asktog.com/columns/062top10ReasonsToNotShop.html

7a talks about searching on a camera site for the "Lowepro 100 AW". He says this query works: "Lowepro 100 AW", and this query does not work: "Lowepro 100AW". Cross-checking with Google indeed shows that the 1st form is much more popular; however, the 2nd form is used, and if you're a commerce site, or a site that wants to make it easier for users to find things, you should help them out.

So the discussion question is: what's the best way to handle this? I guess the somewhat general form of this is that in a query, any term might be split into 2 terms that are individually indexed (so "100AW" is not indexed, but "100" and "AW" are). In a way the flip side of this is that any 2 terms could be concatenated to form another term that was indexed (so in another universe it might be that passing "100 AW" is not as precise as passing "100AW", but how's the user to know?).

In the context of Lucene, ways to handle this seem to be:
- automagically run a fuzzy query (so if a query doesn't work, transform "Lowepro 100AW" to "Lowepro~ 100AW~")
- write a query parser that breaks apart unindexed tokens into ones that are indexed (so "100AW" becomes "100 AW")
- write a tokenizer that inserts dummy tokens for every pair of tokens, so the stream "Lowepro 100 AW" would also have "Lowepro100" and "100AW" inserted, presumably via magic w/ TokenStream.next()

Comments on the best way to handle this?
RE: Memo: RE: Query parser and minus signs
Doesn't "en UK" as a phrase query work? You're probably indexing it as a text field, so it's being tokenised.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: 21 May 2004 16:47
To: Lucene Users List
Subject: Memo: RE: Query parser and minus signs

Hmm, we may have to if there is no workaround. We're not using Java locales, but were trying to stick to the ISO standard, which uses hyphens.

"Ryan Sonnek" <[EMAIL PROTECTED]> on 21 May 2004 16:38:

if you're dealing with locales, why not use java's built in locale syntax (ex: en_UK, zh_HK)?

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> Sent: Friday, May 21, 2004 10:36 AM
> Subject: Query parser and minus signs
>
> Hi All,
>
> I'm using Lucene on a site that has split content, with a branch containing pages in English and a separate branch in Chinese. Some of the Chinese pages include some (untranslatable) English words, so when a search is carried out in either language you can get pages from the wrong branch. To combat this we introduced a language field into the index which contains the standard language codes: en-UK and zh-HK.
>
> When you parse a query, e.g. language:"en\-UK", you could reasonably expect the search to recover all pages with the language field set to "en-UK" (the minus symbol should be escaped by the backslash according to the FAQ). Unfortunately the parser seems to return "en UK" as the parsed query and hence returns no documents.
>
> Has anyone else had this problem, or could suggest a workaround? I have yet to find a solution in the mailing archives or elsewhere.
> Many thanks in advance,
>
> Alex Bourne
>
> _________________________________
>
> This transmission has been issued by a member of the HSBC Group ("HSBC") for the information of the addressee only and should not be reproduced and/or distributed to any other person. Each page attached hereto must be read in conjunction with any disclaimer which forms part of it. This transmission is neither an offer nor the solicitation of an offer to sell or purchase any investment. Its contents are based on information obtained from sources believed to be reliable but HSBC makes no representation and accepts no responsibility or liability as to its completeness or accuracy.

** This message originated from the Internet. Its originator may or may not be who they claim to be and the information contained in the message and any attachments may or may not be accurate. **
now maybe Mozilla/IMAP URLs - Re: StandardTokenizer and e-mail
This reminds me: if you have a search engine that indexes a mail store and you present results in a web page to a browser, you want to (of course... well, I think this is obvious) send back a URL that would cause the user's native mail client to pull up the message. IMAP has a URL format, and I use Mozilla on Windows to browse and read mail; however, when I've presented IMAP URLs on a results page, the IMAP URL doesn't work: either nothing happens, or the cursor changes to busy but still no mail comes up. Has anyone come across this? This may be more appropriate for a moz list, but it's definitely a search issue.

This page mentions the problem:
http://www.mozilla.org/projects/security/known-vulnerabilities.html

A writeup on an IMAP indexer I did a while ago:
http://www.tropo.com/techno/java/lucene/imap.html

Albert Vila wrote:

Hi all, I want to achieve the following: when indexing '[EMAIL PROTECTED]', I want to index the '[EMAIL PROTECTED]' token, then the 'xyz' token, the 'company' token, and the 'com' token. This way, you'll be able to find the document searching for '[EMAIL PROTECTED]', for 'xyz' only, or for 'company' only. How can I achieve that? Do I need to write my own tokenizer?

Thanks
Albert
Re: Query parser and minus signs
----- Original Message -----
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, May 21, 2004 11:36 AM
Subject: Query parser and minus signs

> Hi All,
>
> I'm using Lucene on a site that has split content, with a branch containing pages in English and a separate branch in Chinese. Some of the Chinese pages include some (untranslatable) English words, so when a search is carried out in either language you can get pages from the wrong branch. To combat this we introduced a language field into the index which contains the standard language codes: en-UK and zh-HK.
>
> When you parse a query, e.g. language:"en\-UK", you could reasonably expect the search to recover all pages with the language field set to "en-UK" (the minus symbol should be escaped by the backslash according to the FAQ). Unfortunately the parser seems to return "en UK" as the parsed query and hence returns no documents.
>
> Has anyone else had this problem, or could suggest a workaround? I have yet to find a solution in the mailing archives or elsewhere.

Index the standard language code as a new Field(fieldName, code, false, true, false). This will bypass the Analyzer at indexing time, since tokenization is set to false. Then, when you create your queries, add a new TermQuery(new Term(fieldName, desiredLanguageCode)) to the user query object. This will bypass the Analyzer at query time and give you the desired result.
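A toy illustration of why the untokenized (keyword) field works: a typical analyzer splits "en-UK" at the hyphen, which is why the parsed query comes back as "en UK", while a keyword field stores the value verbatim so an exact TermQuery can match it. No Lucene involved; the class and method names below are invented for the example:

```java
import java.util.Arrays;
import java.util.List;

public class LanguageCodeField {
    // What a typical analyzer does to "en-UK": splits on non-alphanumerics,
    // so the hyphenated code is broken into two separate terms.
    static List<String> analyzed(String value) {
        return Arrays.asList(value.split("[^A-Za-z0-9]+"));
    }

    // What an untokenized (keyword) field stores: the value verbatim,
    // so an exact TermQuery on "en-UK" can match.
    static String keyword(String value) {
        return value;
    }
}
```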
Re: StandardTokenizer and e-mail
Si, si. Write your own TokenFilter subclass that overrides next(), extracts those other elements/tokens from an email address token, and uses Token's setPositionIncrement(0) to store the extracted tokens in the same position as the original email.

Otis

--- Albert Vila <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I want to achieve the following: when indexing '[EMAIL PROTECTED]', I want to index the '[EMAIL PROTECTED]' token, then the 'xyz' token, the 'company' token, and the 'com' token. This way, you'll be able to find the document searching for '[EMAIL PROTECTED]', for 'xyz' only, or for 'company' only.
>
> How can I achieve that? Do I need to write my own tokenizer?
>
> Thanks
> Albert
>
> --
> Albert Vila
> Director de proyectos I+D
> http://www.imente.com
> 902 933 242
> [iMente La información con más beneficios]
Memo: RE: Query parser and minus signs
Hmm, we may have to if there is no workaround. We're not using Java locales, but were trying to stick to the ISO standard, which uses hyphens.

"Ryan Sonnek" <[EMAIL PROTECTED]> on 21 May 2004 16:38:

if you're dealing with locales, why not use java's built in locale syntax (ex: en_UK, zh_HK)?

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> Sent: Friday, May 21, 2004 10:36 AM
> Subject: Query parser and minus signs
>
> Hi All,
>
> I'm using Lucene on a site that has split content, with a branch containing pages in English and a separate branch in Chinese. Some of the Chinese pages include some (untranslatable) English words, so when a search is carried out in either language you can get pages from the wrong branch. To combat this we introduced a language field into the index which contains the standard language codes: en-UK and zh-HK.
>
> When you parse a query, e.g. language:"en\-UK", you could reasonably expect the search to recover all pages with the language field set to "en-UK" (the minus symbol should be escaped by the backslash according to the FAQ). Unfortunately the parser seems to return "en UK" as the parsed query and hence returns no documents.
>
> Has anyone else had this problem, or could suggest a workaround? I have yet to find a solution in the mailing archives or elsewhere.
>
> Many thanks in advance,
>
> Alex Bourne
RE: Query parser and minus signs
if you're dealing with locales, why not use java's built in locale syntax (ex: en_UK, zh_HK)?

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> Sent: Friday, May 21, 2004 10:36 AM
> Subject: Query parser and minus signs
>
> Hi All,
>
> I'm using Lucene on a site that has split content, with a branch containing pages in English and a separate branch in Chinese. Some of the Chinese pages include some (untranslatable) English words, so when a search is carried out in either language you can get pages from the wrong branch. To combat this we introduced a language field into the index which contains the standard language codes: en-UK and zh-HK.
>
> When you parse a query, e.g. language:"en\-UK", you could reasonably expect the search to recover all pages with the language field set to "en-UK" (the minus symbol should be escaped by the backslash according to the FAQ). Unfortunately the parser seems to return "en UK" as the parsed query and hence returns no documents.
>
> Has anyone else had this problem, or could suggest a workaround? I have yet to find a solution in the mailing archives or elsewhere.
>
> Many thanks in advance,
>
> Alex Bourne
Query parser and minus signs
Hi All,

I'm using Lucene on a site that has split content, with a branch containing pages in English and a separate branch in Chinese. Some of the Chinese pages include some (untranslatable) English words, so when a search is carried out in either language you can get pages from the wrong branch. To combat this we introduced a language field into the index which contains the standard language codes: en-UK and zh-HK.

When you parse a query, e.g. language:"en\-UK", you could reasonably expect the search to recover all pages with the language field set to "en-UK" (the minus symbol should be escaped by the backslash according to the FAQ). Unfortunately the parser seems to return "en UK" as the parsed query and hence returns no documents.

Has anyone else had this problem, or could suggest a workaround? I have yet to find a solution in the mailing archives or elsewhere.

Many thanks in advance,

Alex Bourne
StandardTokenizer and e-mail
Hi all,

I want to achieve the following: when indexing '[EMAIL PROTECTED]', I want to index the '[EMAIL PROTECTED]' token, then the 'xyz' token, the 'company' token, and the 'com' token. This way, you'll be able to find the document searching for '[EMAIL PROTECTED]', for 'xyz' only, or for 'company' only.

How can I achieve that? Do I need to write my own tokenizer?

Thanks
Albert

--
Albert Vila
Director de proyectos I+D
http://www.imente.com
902 933 242
[iMente “La información con más beneficios”]
org.apache.lucene.search.highlight.Highlighter
Hi, please can somebody give me a simple example of org.apache.lucene.search.highlight.Highlighter? I am trying to use it but am unsuccessful.

Karthik

WITH WARM REGARDS
HAVE A NICE DAY
[ N.S.KARTHIK]
AW: Problem indexing Spanish Characters
Hi all,

Martin was right. I just adapted the HTML demo as Wallen recommended, and it worked. Now I only have to deal with some crazy documents which are UTF-8 encoded, mixed with entities. Does anyone know a class which can translate entities into UTF-8 or any other encoding?

Peter MH

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]

Here is an example method in org.apache.lucene.demo.html.HTMLParser that uses a different buffered reader for a different encoding:

public Reader getReader() throws IOException {
  if (pipeIn == null) {
    pipeInStream = new MyPipedInputStream();
    pipeOutStream = new PipedOutputStream(pipeInStream);
    pipeIn = new InputStreamReader(pipeInStream);
    pipeOut = new OutputStreamWriter(pipeOutStream);
    // check the first 4 bytes for the FFFE marker; if it's there, we know it's UTF-16 encoding
    if (useUTF16) {
      try {
        pipeIn = new BufferedReader(new InputStreamReader(pipeInStream, "UTF-16"));
      } catch (Exception e) {
      }
    }
    Thread thread = new ParserThread(this);
    thread.start(); // start parsing
  }
  return pipeIn;
}

-----Original Message-----
From: Martin Remy [mailto:[EMAIL PROTECTED]]

The tokenizers deal with Unicode characters (CharStream, char), so the problem is not there. This problem must be solved at the point where the bytes from your source files are turned into CharSequences/Strings, i.e. by connecting an InputStreamReader to your FileReader (or whatever you're using) and specifying "UTF-8" (or whatever encoding is appropriate) in the InputStreamReader constructor. You must either detect the encoding from HTTP headers or XML declarations or, if you know that it's the same for all of your source files, just hardcode UTF-8, for example.

Martin
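On the entity question: a small decoder is easy to sketch. The class below is an illustration, not an existing library class; only a handful of named entities are mapped (a real table would cover all of HTML's), and the decoded String is plain Unicode that can then be written out as UTF-8 or any other encoding:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoder {
    // A few named entities; a complete decoder would map the full HTML set.
    static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("aacute", "\u00E1");
        NAMED.put("ntilde", "\u00F1");
    }

    // Matches &name;, &#123;, and &#x7B; forms.
    static final Pattern ENTITY = Pattern.compile("&(#?)(x?)([0-9a-zA-Z]+);");

    // Replace named and numeric entities with their characters.
    static String decode(String in) {
        Matcher m = ENTITY.matcher(in);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String rep;
            if (m.group(1).isEmpty()) {
                // named entity, e.g. &ntilde; (unknown names pass through)
                rep = NAMED.getOrDefault(m.group(2) + m.group(3), m.group(0));
            } else if (m.group(2).isEmpty()) {
                // decimal numeric, e.g. &#225;
                rep = String.valueOf((char) Integer.parseInt(m.group(3)));
            } else {
                // hex numeric, e.g. &#xF1;
                rep = String.valueOf((char) Integer.parseInt(m.group(3), 16));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(rep));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```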
Re: documentation fix for website
Thanks for catching this. I fixed it, and the change should show up on the site with the next Lucene release.

Otis

--- Ryan Sonnek <[EMAIL PROTECTED]> wrote:
> Is this the right place to submit a problem with the website documentation?
> http://jakarta.apache.org/lucene/docs/systemproperties.html lists mergeFactor twice with different property names. The second occurrence should be updated to lockDir (the underlying href link is correct).
>
> Ryan
RE: Searching Microsoft Word, Excel and PPT files for Japanese
Thanks Chandan. I tried using POI for text extraction. I used the WordDocument.writeAllText method, but it didn't work for Japanese. Is there any other way to extract the Japanese text? Regards, Ankur

-Original Message- From: Chandan Tamrakar [mailto:[EMAIL PROTECTED] Sent: Friday, May 21, 2004 3:51 PM To: Lucene Users List; [EMAIL PROTECTED] Subject: Re: Searching Microsoft Word, Excel and PPT files for Japanese

For Microsoft Word documents and Excel, use the POI APIs from Jakarta Apache. First you need to extract the text and convert it into a suitable encoding before you pass it to Lucene for indexing. It worked for me.

- Original Message - From: "Ankur Goel" <[EMAIL PROTECTED]> To: "'Lucene Users List'" <[EMAIL PROTECTED]> Sent: Thursday, May 20, 2004 10:55 PM Subject: Searching Microsoft Word, Excel and PPT files for Japanese

> Hi,
> I am using the CJK Tokenizer for searching Japanese documents. I am able to search Japanese documents which are text files, but I am not able to search Microsoft Word and Excel files with content in Japanese.
> Can you tell me how I can search Japanese content in Microsoft Word, Excel, and PPT files?
> Thanks,
> Ankur
>
> -Original Message-
> From: Ankur Goel [mailto:[EMAIL PROTECTED]
> Sent: Sunday, April 04, 2004 1:36 AM
> To: 'Lucene Users List'
> Subject: RE: Boolean Phrase Query question
>
> Thanks Erik for the solution. I have a filename field, as I have to give the end user the facility to search on file name also. That's why I am using a Text field for the file name.
>
> "By using true on the finalQuery.add calls, you have said that both fields must have the word "temp" in them. Is that what you meant? Or did you mean an OR type of query?"
>
> I need an OR type of query. I mean the word can be in the filename or in the contents of the file, but I am not able to do this. Can you tell me how to do it?
> Regards,
> Ankur
>
> -Original Message-
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Sunday, April 04, 2004 1:27 AM
> To: Lucene Users List
> Subject: Re: Boolean Phrase Query question
>
> On Apr 3, 2004, at 12:13 PM, Ankur Goel wrote:
> > Hi,
> > I have to provide a functionality which provides search on both the file name and the contents of the file.
> >
> > For indexing I use the following code:
> >
> > org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
> > doc.add(Field.Keyword("fileId", "" + document.getFileId()));
> > doc.add(Field.Text("fileName", fileName));
> > doc.add(Field.Text("contents", new FileReader(new File(fileName))));
>
> I'm not sure what you plan on doing with the fileName field, but you probably want to use a Keyword field for it.
>
> And you may want to glue the file name and contents together into a single field to facilitate searches that span both. (Be sure to put a space in between if you do this.)
>
> > For searching a text, say "temp", I use the following code to look both in the file name and the contents of the file:
> >
> > BooleanQuery finalQuery = new BooleanQuery();
> > Query titleQuery = QueryParser.parse("temp", "fileName", analyzer);
> > Query mainQuery = QueryParser.parse("temp", "contents", analyzer);
> >
> > finalQuery.add(titleQuery, true, false);
> > finalQuery.add(mainQuery, true, false);
> >
> > Hits hits = is.search(finalQuery);
>
> By using true on the finalQuery.add calls, you have said that both fields must have the word "temp" in them. Is that what you meant? Or did you mean an OR type of query?
>
> Erik
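To answer the OR question raised in the quoted thread: in the Lucene 1.x BooleanQuery.add(Query, boolean required, boolean prohibited) signature, passing false for both flags makes a clause optional, so a document matching either field matches the overall query. A sketch of the search side, reusing the field names from the thread (the index setup and IndexSearcher are assumed to exist as in the original code):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class OrQueryExample {
    public static Query buildOrQuery(String term) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Query titleQuery = QueryParser.parse(term, "fileName", analyzer);
        Query mainQuery = QueryParser.parse(term, "contents", analyzer);

        BooleanQuery finalQuery = new BooleanQuery();
        // required=false, prohibited=false: each clause is optional,
        // giving OR semantics across the two fields.
        finalQuery.add(titleQuery, false, false);
        finalQuery.add(mainQuery, false, false);
        return finalQuery; // then: Hits hits = is.search(finalQuery);
    }
}
```

Documents matching both clauses will simply score higher than documents matching only one, which is usually the desired ranking behavior.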
Re: Searching Microsoft Word, Excel and PPT files for Japanese
For Microsoft Word documents and Excel, use the POI APIs from Jakarta Apache. First you need to extract the text and convert it into a suitable encoding before you pass it to Lucene for indexing. It worked for me.

- Original Message - From: "Ankur Goel" <[EMAIL PROTECTED]> To: "'Lucene Users List'" <[EMAIL PROTECTED]> Sent: Thursday, May 20, 2004 10:55 PM Subject: Searching Microsoft Word, Excel and PPT files for Japanese

> Hi,
> I am using the CJK Tokenizer for searching Japanese documents. I am able to search Japanese documents which are text files, but I am not able to search Microsoft Word and Excel files with content in Japanese.
> Can you tell me how I can search Japanese content in Microsoft Word, Excel, and PPT files?
> Thanks,
> Ankur
RE: org.apache.lucene.search.highlight.Highlighter
Hi, can somebody please give me a simple example of org.apache.lucene.search.highlight.Highlighter? I am trying to use it, but so far unsuccessfully. Karthik

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Thursday, May 20, 2004 2:08 AM To: [EMAIL PROTECTED] Subject: Re: org.apache.lucene.search.highlight.Highlighter

>> Was investigating, found some compile-time errors...

I see the code you have is taken from the example in the javadocs. Unfortunately that example wasn't complete, because the class didn't include the method defined in the Formatter interface. I have updated the javadocs to correct this oversight. To correct your problem, either make your class implement the Formatter interface to perform your choice of custom formatting, or remove the "this" parameter from your call to create a new Highlighter with the default Formatter implementation. Thanks for "highlighting" the problem with the javadocs...

Cheers
Mark
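For Karthik's request, a minimal sketch of Highlighter usage with the default Formatter (which wraps matched terms in <B>...</B>). This follows the sandbox highlighter's documented pattern; the "contents" field name and the sample text are illustrative, and exact signatures may differ between highlighter versions:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class HighlightExample {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        // Parse the user's query; "contents" is an illustrative field name.
        Query query = QueryParser.parse("lucene", "contents", analyzer);

        // No Formatter argument: the default implementation is used,
        // which avoids the compile error from the incomplete javadoc example.
        Highlighter highlighter = new Highlighter(new QueryScorer(query));

        String text = "Lucene is a full-text search library.";
        String fragment = highlighter.getBestFragment(analyzer, "contents", text);
        System.out.println(fragment);
    }
}
```

In a real application you would typically pull the text from a stored field of each hit and call getBestFragment (or getBestFragments) per document when rendering results.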