RE: Rebuild after corruption

2004-05-21 Thread wallen
Make sure you close your indexwriter.

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#close()

-Original Message-
From: Steve Rajavuori [mailto:[EMAIL PROTECTED]
Sent: Friday, May 21, 2004 7:49 PM
To: '[EMAIL PROTECTED]'
Subject: Rebuild after corruption


I have a problem periodically where the process updating my Lucene files
terminates abnormally. When I try to open the Lucene files afterward I get
an exception indicating that files are missing. Does anyone know how I can
recover at this point, without having to rebuild the whole index from
scratch?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Rebuild after corruption

2004-05-21 Thread Steve Rajavuori
I have a problem periodically where the process updating my Lucene files
terminates abnormally. When I try to open the Lucene files afterward I get
an exception indicating that files are missing. Does anyone know how I can
recover at this point, without having to rebuild the whole index from
scratch?


Re: asktog on search problems

2004-05-21 Thread Erik Hatcher
This is not specific advice, but an idea that I think Google leverages 
to build up search corrections.  If a user searches for "100AW" and it 
doesn't match, but a moment later they try something different and 
immediately get to a product page, the system can make a loose 
connection between their original search and the product they soon 
thereafter found.  Over time, the connections get stronger because 
others will do the same thing.

I think term vectors could factor into making latent connections 
somehow also.

Just postulating...
Erik

On May 21, 2004, at 12:09 PM, David Spencer wrote:
Haven't seen this discussed here.
See 7a at the link below:
http://www.asktog.com/columns/062top10ReasonsToNotShop.html
7a talks about searching on a camera site for the "Lowepro 100 AW".
He says this query works: "Lowepro 100 AW"
and this query does not work: "Lowepro 100AW"
Cross checking with Google indeed shows that the 1st form is much more 
popular; however, the 2nd form is used, and if you're a commerce site 
or a site that wants to make it easier for users to find things, you 
should help them out.

So the discussion question is what's the best way to handle this.
I guess the somewhat general form of this is that in a query, a term 
might be split into 2 terms that are individually indexed (so "100AW" 
is not indexed, but "100" and "AW" are).
In a way the flip side of this is that any 2 terms could be 
concatenated to form another term that was indexed (so in another 
universe it might be that passing "100 AW" is not as precise as 
passing "100AW" but how's the user to know).

In the context of Lucene ways to handle this seem to be:
- automagically run a fuzzy query (so if a query doesn't work, 
transform "Lowepro 100AW" to "Lowepro~ 100AW~")
- write a query parser that breaks apart unindexed tokens into ones 
that are indexed (so "100AW" becomes "100 AW")
- write a tokenizer that inserts dummy tokens for every pair of 
tokens, so the stream "Lowepro 100 AW" would also have "Lowepro100" 
and "100AW" inserted, presumably via magic w/ TokenStream.next()

Comments on best way to handle this?





Re: StandardTokenizer and e-mail

2004-05-21 Thread Erik Hatcher
Further on this...
If you are using StandardTokenizer, the token for an e-mail address has 
the type value of "<EMAIL>", which you could use to pick up 
specifically in a custom TokenFilter implementation and split it how 
you like, passing through everything else.  Take a look at 
StandardFilter's source code for an example of keying off the types 
emitted by StandardTokenizer.
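As a rough sketch of the splitting step Erik and Otis describe — in plain Java, leaving out the Lucene TokenFilter plumbing (which varies by version); the class and method names here are illustrative, not part of any API:

```java
import java.util.ArrayList;
import java.util.List;

public class EmailSplitter {
    // Split an e-mail address token into the sub-tokens a custom filter
    // would emit at the same position (via setPositionIncrement(0)):
    // "xyz@company.com" -> ["xyz", "company", "com"].
    static List<String> splitEmail(String email) {
        List<String> parts = new ArrayList<>();
        for (String part : email.split("[@.]")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // The original full address token would also be kept in the stream.
        System.out.println(splitEmail("xyz@company.com"));
    }
}
```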

Erik
On May 21, 2004, at 11:50 AM, Otis Gospodnetic wrote:
Si, si.
Write your own TokenFilter sub-class that overrides next() and extracts
those other elements/tokens from an email address token and uses
Token's setPositionIncrement(0) to store the extracted tokens in the
same position as the original email.
Otis
--- Albert Vila <[EMAIL PROTECTED]> wrote:
Hi all,
I want to achieve the following: when I index the
'[EMAIL PROTECTED]',
I want to index the '[EMAIL PROTECTED]' token, then the 'xyz' token,
the
'company' token and the 'com' token.
This way, you'll be able to find the document searching for
'[EMAIL PROTECTED]', for 'xyz' only, or for 'company' only.
How can I achieve that? Do I need to write my own tokenizer?
Thanks
Albert
--
Albert Vila
Director de proyectos I+D
http://www.imente.com
902 933 242
[iMente “La información con más beneficios”]


Re: asktog on search problems

2004-05-21 Thread Jeff Wong
I don't think the first solution will work because the "100AW~" term must
match either "100" or "AW", which are your index terms.

Coincidentally,  I have been trying to deal with this very problem over
the past few days.  

In my situation, I'm trying to help users find things when the spacing of
their queries doesn't match the spacing in an indexed term.  Possible
errors can be divided into two classes.

1) User leaves out a space where there ought to be one.  Let's say the
user is trying to find "blue bird" but types in the query "bluebird",
thinking it is a single word.  Lucene won't catch this because "blue" and
"bird" are stored as separate index tokens.

2) User errantly inserts a space where there shouldn't be one.  An example
would be an index where the word "blackbird" is stored but the user types
in "black bird" as a query.

What I tried to do was create an alternate tokenizer which stored the
entire string in a different index field and performed a fuzzy search
on the entire string.  This is possible because I am only doing searches
on strings of less than 40 characters on average.  To take the "black
bird" example, I would store the entire string into a field which doesn't
tokenize on word boundaries.  The query, in turn, would look something
like this:

+title:black +title:bird OR fulltitle:black bird~

Where the tilde applies to the entire "black bird" term.  When I tested it
it appeared to work, but was really slow for large indexes.  At about
4 entries, this query started to take 1 or 2 seconds which was worse
than my performance requirement.

Actually, I also thought of the last two things you suggested and was
about to try them out.  However, you do need to apply both of them:
adding concatenated index terms addresses the problem where users leave
out spaces, while splitting unmatched terms at query time helps users
match terms in your index when they inject spaces incorrectly.

This may balloon the memory consumption of your Lucene index.  However,
you can use heuristics to avoid inserting extra terms which won't match
likely errors.  For example, you could decide that you only want to
concatenate terms that are parts of model numbers.  Or, if you are dealing
with compound words, you can choose to only concatenate terms which are
English words.  For example,  in my situation, concatenating "blue bird"
as an extra term is useful while doing the same with  "Roy Orbison" is
not since people aren't likely to neglect the space in that situation.
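Jeff's dictionary heuristic — only concatenate adjacent terms when both halves are ordinary words — might look like this in outline (the word list here is a stand-in for a real dictionary, and `shouldConcatenate` is an illustrative name, not an existing API):

```java
import java.util.Set;

public class ConcatHeuristic {
    // Stand-in for a real English dictionary lookup.
    static final Set<String> DICTIONARY = Set.of("blue", "bird", "black");

    // Concatenate a pair into an extra index term only when both terms are
    // dictionary words, i.e. when a user might plausibly type them as one
    // compound word ("blue bird" -> "bluebird"). Proper names don't qualify.
    static boolean shouldConcatenate(String a, String b) {
        return DICTIONARY.contains(a.toLowerCase()) && DICTIONARY.contains(b.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(shouldConcatenate("blue", "bird"));   // true
        System.out.println(shouldConcatenate("Roy", "Orbison")); // false
    }
}
```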

Hope this helps.

Jeff


On Fri, 21 May 2004, David Spencer wrote:

> In the context of Lucene ways to handle this seem to be:
> - automagically run a fuzzy query (so if a query doesn't work, transform 
> "Lowepro 100AW" to "Lowepro~ 100AW~")
> - write a query parser that breaks apart unindexed tokens into ones that 
> are indexed (so "100AW" becomes "100 AW")
> - write a tokenizer that inserts dummy tokens for every pair of tokens, 
> so the stream "Lowepro 100 AW" would also have "Lowepro100" and "100AW" 
> inserted, presumably via magic w/ TokenStream.next()





Re: org.apache.lucene.search.highlight.Highlighter

2004-05-21 Thread markharw00d
Hi Claude, that example code you provided is out of date.

For all concerned - the highlighter code was refactored about a month ago and then 
moved into the Sandbox.

Want the latest version? - get the latest code from the sandbox CVS.
Want the latest docs? - Run javadoc on the above.

There is a basic example of highlighter use in the package-level javadocs and more 
extensive examples 
in the JUnit test that accompanies the source code.

Hope this helps clarify things.

Mark

P.S. Bruce, I know you were interested in providing an alternative Fragmenter
implementation for the highlighter that detects sentence boundaries.
You may want to look at LingPipe which has "a heuristic sentence boundary detector".
( http://threattracker.com:8080/lingpipe-demo/demo.html )
I took a quick look at it but it has its own tokenizer that would be difficult
to make work with the tokenstream used to identify query terms. At least the
code gives some examples of the heuristics involved in detecting sentence
boundaries. For my own apps I find the standard Fragmenter implementation
suffices.
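As a rough illustration of the kind of heuristic such detectors use (a real one like LingPipe's also handles abbreviations, initials, quotes, etc.), a naive rule splits after sentence-ending punctuation followed by whitespace and a capital letter — this sketch is mine, not LingPipe's code:

```java
import java.util.Arrays;
import java.util.List;

public class NaiveSentenceSplitter {
    // Naive heuristic: a sentence ends at '.', '!' or '?' when followed by
    // whitespace and an upper-case letter. This wrongly splits after
    // abbreviations like "Dr." -- real detectors need exception lists.
    static List<String> split(String text) {
        return Arrays.asList(text.split("(?<=[.!?])\\s+(?=[A-Z])"));
    }

    public static void main(String[] args) {
        System.out.println(split("Lucene rocks. Highlighting helps users! Fragments follow sentences."));
    }
}
```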





Re: org.apache.lucene.search.highlight.Highlighter

2004-05-21 Thread Claude Devarenne
Arrgh, the attachment didn't make it. Here it goes, sorry:
//perform a standard lucene query
searcher = new IndexSearcher(ramDir);
Analyzer analyzer = new StandardAnalyzer();
Query query = QueryParser.parse("Kenne*", FIELD_NAME, analyzer);
query = query.rewrite(reader); //necessary to expand search terms
Hits hits = searcher.search(query);

//create an instance of the highlighter with the tags used to surround highlighted text
QueryHighlightExtractor highlighter =
    new QueryHighlightExtractor(query, new StandardAnalyzer(), "<B>", "</B>");

for (int i = 0; i < hits.length(); i++)
{
    String text = hits.doc(i).get(FIELD_NAME);
    //call to highlight text with chosen tags
    String highlightedText = highlighter.highlightText(text);
    System.out.println(highlightedText);
}

If your documents are large you can select only the best fragments from
each document like this:

//...as above example
int highlightFragmentSizeInBytes = 80;
int maxNumFragmentsRequired = 4;
String fragmentSeparator = "...";
for (int i = 0; i < hits.length(); i++)
{
    String text = hits.doc(i).get(FIELD_NAME);
    String highlightedText = highlighter.getBestFragments(text,
        highlightFragmentSizeInBytes, maxNumFragmentsRequired, fragmentSeparator);
    System.out.println(highlightedText);
}

On May 21, 2004, at 9:22 AM, Claude Devarenne wrote:
Hi,
Here is the documentation Mark Harwood included in the original  
package.  I followed his directions and it worked for me.  Let me  
know if this doesn't do it for you.

Claude

On May 21, 2004, at 4:29 AM, Karthik N S wrote:

Hi
 Please can somebody give me a simple example of
 org.apache.lucene.search.highlight.Highlighter
 I am trying to use it but unsuccessful

Karthik

WITH WARM REGARDS
HAVE A NICE DAY
[ N.S.KARTHIK]

Re: org.apache.lucene.search.highlight.Highlighter

2004-05-21 Thread Claude Devarenne
Hi,

Here is the documentation Mark Harwood included in the original package.  I followed his directions and it worked for me.  Let me know if this doesn't do it for you.

Claude



On May 21, 2004, at 4:29 AM, Karthik N S wrote:

Hi

 Please can somebody give me a simple example of
 org.apache.lucene.search.highlight.Highlighter
 I am trying to use it but unsuccessful

Karthik

WITH WARM REGARDS
HAVE A NICE DAY
[ N.S.KARTHIK]


asktog on search problems

2004-05-21 Thread David Spencer
Haven't seen this discussed here.
See 7a at the link below:
http://www.asktog.com/columns/062top10ReasonsToNotShop.html
7a talks about searching on a camera site for the "Lowepro 100 AW".
He says this query works: "Lowepro 100 AW"
and this query does not work: "Lowepro 100AW"
Cross checking with Google indeed shows that the 1st form is much more 
popular; however, the 2nd form is used, and if you're a commerce site or 
a site that wants to make it easier for users to find things, you should 
help them out.

So the discussion question is what's the best way to handle this.
I guess the somewhat general form of this is that in a query, a term 
might be split into 2 terms that are individually indexed (so "100AW" is 
not indexed, but "100" and "AW" are).
In a way the flip side of this is that any 2 terms could be concatenated 
to form another term that was indexed (so in another universe it might 
be that passing "100 AW" is not as precise as passing "100AW" but how's 
the user to know).

In the context of Lucene ways to handle this seem to be:
- automagically run a fuzzy query (so if a query doesn't work, transform 
"Lowepro 100AW" to "Lowepro~ 100AW~")
- write a query parser that breaks apart unindexed tokens into ones that 
are indexed (so "100AW" becomes "100 AW")
- write a tokenizer that inserts dummy tokens for every pair of tokens, 
so the stream "Lowepro 100 AW" would also have "Lowepro100" and "100AW" 
inserted, presumably via magic w/ TokenStream.next()

Comments on best way to handle this?
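The third option — inserting dummy concatenated tokens for every adjacent pair — can be sketched independent of Lucene's TokenStream plumbing (a real filter would emit the extra tokens with a position increment of 0; `withPairTokens` is an illustrative name):

```java
import java.util.ArrayList;
import java.util.List;

public class PairConcatenator {
    // Given the token stream ["Lowepro", "100", "AW"], also produce the
    // concatenation of each adjacent pair: "Lowepro100" and "100AW".
    // A query for "100AW" then matches even though the source text had a space.
    static List<String> withPairTokens(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens);
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + tokens.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(withPairTokens(List.of("Lowepro", "100", "AW")));
    }
}
```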





RE: Memo: RE: Query parser and minus signs

2004-05-21 Thread David Townsend
Doesn't "en UK" as a phrase query work?

You're probably indexing it as a text field so it's being tokenised.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: 21 May 2004 16:47
To: Lucene Users List
Subject: Memo: RE: Query parser and minus signs






Hmm, we may have to if there is no workaround. We're not using Java
locales, but were trying to stick to the ISO standard, which uses hyphens.




"Ryan Sonnek" <[EMAIL PROTECTED]> on 21 May 2004 16:38

Please respond to "Lucene Users List" <[EMAIL PROTECTED]>

To:"Lucene Users List" <[EMAIL PROTECTED]>
cc:
bcc:

Subject:RE: Query parser and minus signs


if you're dealing with locales, why not use java's built in locale syntax
(ex: en_UK, zh_HK)?

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Friday, May 21, 2004 10:36 AM
> To: [EMAIL PROTECTED]
> Subject: Query parser and minus signs
>
>
>
>
>
>
> Hi All,
>
> I'm using Lucene on a site that has split content with a
> branch containing
> pages in English and a separate branch in Chinese.  Some of
> the chinese
> pages include some (untranslatable) English words, so when a search is
> carried out in either language you can get pages from the
> wrong branch. To
> combat this we introduced a language field into the index
> which contains
> the standard language codes: en-UK and zh-HK.
>
> When you parse a query  e.g. language:"en\-UK" you could
> reasonably expect
> the search to recover all pages with the language field set
> to "en-UK" (the
> minus symbol should be escaped by the backslash according to the FAQ).
> Unfortunately the parser seems to return "en UK" as the
> parsed query and
> hence returns no documents.
>
> Has anyone else had this problem, or could suggest a
> workaround ?? as I
> have
> yet to find a solution in the mailing archives or elsewhere.
>
> Many thanks in advance,
>
> Alex Bourne
>
>
>
> _
>
> This transmission has been issued by a member of the HSBC Group
> ("HSBC") for the information of the addressee only and should not be
> reproduced and / or distributed to any other person. Each page
> attached hereto must be read in conjunction with any disclaimer which
> forms part of it. This transmission is neither an offer nor
> the solicitation
> of an offer to sell or purchase any investment. Its contents
> are based
> on information obtained from sources believed to be reliable but HSBC
> makes no representation and accepts no responsibility or
> liability as to
> its completeness or accuracy.
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>




**
 This message originated from the Internet. Its originator may or
 may not be who they claim to be and the information contained in
 the message and any attachments may or may not be accurate.
**













now maybe Mozilla/IMAP URLs - Re: StandardTokenizer and e-mail

2004-05-21 Thread David Spencer
This reminds me - if you have a search engine that indexes a mail store 
and you present results in a web page to a browser, you want to (of 
course... well I think this is obvious) send back a URL that would cause 
the user's native mail client to pull up the message.
IMAP has a URL format, and I use Mozilla on Windows to browse & read 
mail; however, when I've presented IMAP URLs on a results page the IMAP 
URL doesn't work - either nothing happens or the cursor changes to busy 
but still no mail comes up. Has anyone come across this? This may be 
more appropriate for a moz list but it's definitely a search issue.

This page mentions the problem:
http://www.mozilla.org/projects/security/known-vulnerabilities.html
A writeup on an IMAP indexer I did a while ago:
http://www.tropo.com/techno/java/lucene/imap.html

Albert Vila wrote:
Hi all,
I want to achieve the following, when I indexing the 
'[EMAIL PROTECTED]', I want to index the '[EMAIL PROTECTED]' token, then 
the 'xyz' token, the 'company' token and the 'com'token.
This way, you'll be able to find the document searching for 
'[EMAIL PROTECTED]', for 'xyz' only, or for 'company' only.

How can I achieve that?, I need to write my own tokenizer?
Thanks
Albert



Re: Query parser and minus signs

2004-05-21 Thread Peter M Cipollone

- Original Message - 
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, May 21, 2004 11:36 AM
Subject: Query parser and minus signs


>
>
>
>
> Hi All,
>
> I'm using Lucene on a site that has split content with a branch containing
> pages in English and a separate branch in Chinese.  Some of the chinese
> pages include some (untranslatable) English words, so when a search is
> carried out in either language you can get pages from the wrong branch. To
> combat this we introduced a language field into the index which contains
> the standard language codes: en-UK and zh-HK.
>
> When you parse a query  e.g. language:"en\-UK" you could reasonably expect
> the search to recover all pages with the language field set to "en-UK"
(the
> minus symbol should be escaped by the backslash according to the FAQ).
> Unfortunately the parser seems to return "en UK" as the parsed query and
> hence returns no documents.
>
> Has anyone else had this problem, or could suggest a workaround ?? as I
> have
> yet to find a solution in the mailing archives or elsewhere.

Index the standard language code as a

new Field(fieldName, code, false, true, false)

This will bypass the Analyzer at indexing time, since tokenization is set to
false.  Then when you create your queries, add a

new TermQuery(new Term(fieldName, desiredLanguageCode))

to the user query object.  This will bypass the Analyzer at query time and
give you the desired result.

>
> Many thanks in advance,
>
> Alex Bourne
>
>
>
> _
>
> This transmission has been issued by a member of the HSBC Group
> ("HSBC") for the information of the addressee only and should not be
> reproduced and / or distributed to any other person. Each page
> attached hereto must be read in conjunction with any disclaimer which
> forms part of it. This transmission is neither an offer nor the
solicitation
> of an offer to sell or purchase any investment. Its contents are based
> on information obtained from sources believed to be reliable but HSBC
> makes no representation and accepts no responsibility or liability as to
> its completeness or accuracy.
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>





Re: StandardTokenizer and e-mail

2004-05-21 Thread Otis Gospodnetic
Si, si.
Write your own TokenFilter sub-class that overrides next() and extracts
those other elements/tokens from an email address token and uses
Token's setPositionIncrement(0) to store the extracted tokens in the
same position as the original email.

Otis

--- Albert Vila <[EMAIL PROTECTED]> wrote:
> Hi all,
> 
> I want to achieve the following, when I indexing the
> '[EMAIL PROTECTED]', 
> I want to index the '[EMAIL PROTECTED]' token, then the 'xyz' token,
> the 
> 'company' token and the 'com'token.
> This way, you'll be able to find the document searching for 
> '[EMAIL PROTECTED]', for 'xyz' only, or for 'company' only.
> 
> How can I achieve that?, I need to write my own tokenizer?
> 
> Thanks
> Albert
> 
> -- 
> Albert Vila
> Director de proyectos I+D
> http://www.imente.com
> 902 933 242
> [iMente “La información con más beneficios”]
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 





Memo: RE: Query parser and minus signs

2004-05-21 Thread alex . bourne




Hmm, we may have to if there is no workaround. We're not using Java
locales, but were trying to stick to the ISO standard, which uses hyphens.




"Ryan Sonnek" <[EMAIL PROTECTED]> on 21 May 2004 16:38

Please respond to "Lucene Users List" <[EMAIL PROTECTED]>

To:"Lucene Users List" <[EMAIL PROTECTED]>
cc:
bcc:

Subject:RE: Query parser and minus signs


if you're dealing with locales, why not use java's built in locale syntax
(ex: en_UK, zh_HK)?

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Friday, May 21, 2004 10:36 AM
> To: [EMAIL PROTECTED]
> Subject: Query parser and minus signs
>
>
>
>
>
>
> Hi All,
>
> I'm using Lucene on a site that has split content with a
> branch containing
> pages in English and a separate branch in Chinese.  Some of
> the chinese
> pages include some (untranslatable) English words, so when a search is
> carried out in either language you can get pages from the
> wrong branch. To
> combat this we introduced a language field into the index
> which contains
> the standard language codes: en-UK and zh-HK.
>
> When you parse a query  e.g. language:"en\-UK" you could
> reasonably expect
> the search to recover all pages with the language field set
> to "en-UK" (the
> minus symbol should be escaped by the backslash according to the FAQ).
> Unfortunately the parser seems to return "en UK" as the
> parsed query and
> hence returns no documents.
>
> Has anyone else had this problem, or could suggest a
> workaround ?? as I
> have
> yet to find a solution in the mailing archives or elsewhere.
>
> Many thanks in advance,
>
> Alex Bourne
>
>
>
> _
>
> This transmission has been issued by a member of the HSBC Group
> ("HSBC") for the information of the addressee only and should not be
> reproduced and / or distributed to any other person. Each page
> attached hereto must be read in conjunction with any disclaimer which
> forms part of it. This transmission is neither an offer nor
> the solicitation
> of an offer to sell or purchase any investment. Its contents
> are based
> on information obtained from sources believed to be reliable but HSBC
> makes no representation and accepts no responsibility or
> liability as to
> its completeness or accuracy.
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

















RE: Query parser and minus signs

2004-05-21 Thread Ryan Sonnek
if you're dealing with locales, why not use java's built in locale syntax (ex: en_UK, 
zh_HK)?

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Friday, May 21, 2004 10:36 AM
> To: [EMAIL PROTECTED]
> Subject: Query parser and minus signs
> 
> 
> 
> 
> 
> 
> Hi All,
> 
> I'm using Lucene on a site that has split content with a 
> branch containing
> pages in English and a separate branch in Chinese.  Some of 
> the chinese
> pages include some (untranslatable) English words, so when a search is
> carried out in either language you can get pages from the 
> wrong branch. To
> combat this we introduced a language field into the index 
> which contains
> the standard language codes: en-UK and zh-HK.
> 
> When you parse a query  e.g. language:"en\-UK" you could 
> reasonably expect
> the search to recover all pages with the language field set 
> to "en-UK" (the
> minus symbol should be escaped by the backslash according to the FAQ).
> Unfortunately the parser seems to return "en UK" as the 
> parsed query and
> hence returns no documents.
> 
> Has anyone else had this problem, or could suggest a 
> workaround ?? as I
> have
> yet to find a solution in the mailing archives or elsewhere.
> 
> Many thanks in advance,
> 
> Alex Bourne
> 
> 
> 
> _
> 
> This transmission has been issued by a member of the HSBC Group 
> ("HSBC") for the information of the addressee only and should not be 
> reproduced and / or distributed to any other person. Each page 
> attached hereto must be read in conjunction with any disclaimer which 
> forms part of it. This transmission is neither an offer nor 
> the solicitation 
> of an offer to sell or purchase any investment. Its contents 
> are based 
> on information obtained from sources believed to be reliable but HSBC 
> makes no representation and accepts no responsibility or 
> liability as to 
> its completeness or accuracy.
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 




Query parser and minus signs

2004-05-21 Thread alex . bourne




Hi All,

I'm using Lucene on a site that has split content with a branch containing
pages in English and a separate branch in Chinese.  Some of the chinese
pages include some (untranslatable) English words, so when a search is
carried out in either language you can get pages from the wrong branch. To
combat this we introduced a language field into the index which contains
the standard language codes: en-UK and zh-HK.

When you parse a query  e.g. language:"en\-UK" you could reasonably expect
the search to recover all pages with the language field set to "en-UK" (the
minus symbol should be escaped by the backslash according to the FAQ).
Unfortunately the parser seems to return "en UK" as the parsed query and
hence returns no documents.

Has anyone else had this problem, or could suggest a workaround ?? as I
have
yet to find a solution in the mailing archives or elsewhere.

Many thanks in advance,

Alex Bourne








StandardTokenizer and e-mail

2004-05-21 Thread Albert Vila
Hi all,
I want to achieve the following: when I index '[EMAIL PROTECTED]', 
I want to index the '[EMAIL PROTECTED]' token, then the 'xyz' token, the 
'company' token and the 'com' token.
This way, you'll be able to find the document searching for 
'[EMAIL PROTECTED]', for 'xyz' only, or for 'company' only.

How can I achieve that? Do I need to write my own tokenizer?
Thanks
Albert
--
Albert Vila
Director de proyectos I+D
http://www.imente.com
902 933 242
[iMente “La información con más beneficios”]


org.apache.lucene.search.highlight.Highlighter

2004-05-21 Thread Karthik N S



Hi
Please can somebody give me a simple example of
org.apache.lucene.search.highlight.Highlighter
I am trying to use it but unsuccessful

Karthik

WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK]


AW: Problem indexing Spanish Characters

2004-05-21 Thread PEP AD Server Administrator
Hi all,
Martin was right. I just adapted the HTML demo as Wallen recommended and it
worked. Now I only have to deal with some crazy documents which are UTF-8
encoded mixed with entities.
Does anyone know a class which can translate entities into UTF-8 or any
other encoding?
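I'm not aware of a standard class for this, but a minimal decoder for numeric character references plus a handful of named entities is easy to sketch; a real application would want a complete named-entity table (this class and its name are illustrative, not an existing library):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoder {
    // Decode numeric character references like "&#233;" and a few common
    // named entities into their Unicode characters. Only a tiny named-entity
    // table is included here; extend it for real documents.
    static String decode(String s) {
        StringBuffer out = new StringBuffer();
        Matcher m = Pattern.compile("&(#(\\d+)|amp|lt|gt|quot);").matcher(s);
        while (m.find()) {
            String rep;
            if (m.group(2) != null) {
                rep = String.valueOf((char) Integer.parseInt(m.group(2)));
            } else if (m.group(1).equals("amp")) rep = "&";
            else if (m.group(1).equals("lt")) rep = "<";
            else if (m.group(1).equals("gt")) rep = ">";
            else rep = "\"";
            m.appendReplacement(out, Matcher.quoteReplacement(rep));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("caf&#233; &amp; bar")); // café & bar
    }
}
```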

Peter MH

-Original Message-
Von: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]

Here is an example method in org.apache.lucene.demo.html HTMLParser that
uses a different buffered reader for a different encoding. 

public Reader getReader() throws IOException
{
    if (pipeIn == null)
    {
        pipeInStream = new MyPipedInputStream();
        pipeOutStream = new PipedOutputStream(pipeInStream);
        pipeIn = new InputStreamReader(pipeInStream);
        pipeOut = new OutputStreamWriter(pipeOutStream);

        // check the first 4 bytes for the FFFE marker; if it's there
        // we know it's UTF-16 encoding
        if (useUTF16)
        {
            try
            {
                pipeIn = new BufferedReader(
                    new InputStreamReader(pipeInStream, "UTF-16"));
            }
            catch (Exception e)
            {
            }
        }
        Thread thread = new ParserThread(this);
        thread.start(); // start parsing
    }
    return pipeIn;
}
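The BOM check the comment refers to happens on the raw bytes before any Reader is attached; as a standalone sketch (method and class names are mine, not from the demo):

```java
public class BomSniffer {
    /** Returns the UTF-16 variant indicated by a byte-order mark, or null if none. */
    static String sniffUtf16(byte[] head) {
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF, b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";  // big-endian BOM
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";  // little-endian BOM
        }
        return null;  // no BOM: fall back to a configured default encoding
    }

    public static void main(String[] args) {
        System.out.println(sniffUtf16(new byte[] {(byte) 0xFF, (byte) 0xFE, 0, 0}));  // UTF-16LE
        System.out.println(sniffUtf16("plain".getBytes()));                           // null
    }
}
```

The sniffed name can then be passed to the InputStreamReader constructor, as in the demo code above.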

-Original Message-
From: Martin Remy [mailto:[EMAIL PROTECTED]

The tokenizers deal with unicode characters (CharStream, char), so the
problem is not there.  This problem must be solved at the point where the
bytes from your source files are turned into CharSequences/Strings, i.e. by
connecting an InputStreamReader to your FileReader (or whatever you're
using) and specifying "UTF-8" (or whatever encoding is appropriate) in the
InputStreamReader constructor.  

You must either detect the encoding from HTTP headers or XML declarations
or, if you know that it's the same for all of your source files, then just
hardcode UTF-8, for example.  

Martin
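In code, the fix Martin describes is simply never using a bare FileReader, which silently applies the platform default encoding. A self-contained round-trip sketch (file name and contents are made up):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class ReadUtf8 {
    /** Writes s to a temp file as UTF-8, then reads it back naming the charset. */
    static String roundTrip(String s) throws IOException {
        File f = File.createTempFile("demo", ".txt");
        f.deleteOnExit();

        // write non-ASCII text out as UTF-8 bytes
        Writer w = new OutputStreamWriter(new FileOutputStream(f), "UTF-8");
        w.write(s);
        w.close();

        // read it back with the encoding stated explicitly
        // (a bare FileReader here would use the platform default instead)
        BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), "UTF-8"));
        String line = r.readLine();
        r.close();
        return line;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("se\u00f1or").equals("se\u00f1or"));  // true
    }
}
```

The same InputStreamReader constructor accepts "UTF-16", "ISO-8859-1", or whatever encoding the detection step reports.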

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: documentation fix for website

2004-05-21 Thread Otis Gospodnetic
Thanks for catching this.  I fix it, and the change should show up on
the site with the next Lucene release.

Otis

--- Ryan Sonnek <[EMAIL PROTECTED]> wrote:
> Is this the right place to submit a problem with the website
> documentation? 
> http://jakarta.apache.org/lucene/docs/systemproperties.html lists
> mergeFactor twice with different property names.  the second
> occurance should be updated to lockDir (the underlying href link is
> correct).
> 
> Ryan
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Searching Microsoft Word , Excel and PPT files for Japanese

2004-05-21 Thread Ankur Goel
Thanks Chandan.
I tried using POI for text extraction. I used the
WordDocument.writeAllText method but it didn't work for Japanese.
Is there any other way of extracting the Japanese text?
Regards,
Ankur 

-Original Message-
From: Chandan Tamrakar [mailto:[EMAIL PROTECTED] 
Sent: Friday, May 21, 2004 3:51 PM
To: Lucene Users List; [EMAIL PROTECTED]
Subject: Re: Searching Microsoft Word , Excel and PPT files for Japanese

For Microsoft Word and Excel documents, use the POI APIs from Jakarta
Apache.
   First you need to extract the text and convert it into a suitable
encoding before you put it into Lucene for indexing.
   It worked for me.
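The "convert into a suitable encoding" step is where Japanese text usually goes wrong: the extracted bytes must be decoded with the charset the document actually uses, not the platform default. A sketch, assuming (purely for illustration) Shift_JIS bytes:

```java
public class JapaneseDecode {
    public static void main(String[] args) throws Exception {
        String original = "\u65e5\u672c\u8a9e";  // "Japanese" written in Japanese

        // simulate bytes as extracted from a document, Shift_JIS-encoded
        byte[] raw = original.getBytes("Shift_JIS");

        // decoding with the wrong charset mangles the text...
        String wrong = new String(raw, "ISO-8859-1");
        // ...while naming the right one recovers it; this String is what
        // should be handed to the analyzer/Field for indexing
        String right = new String(raw, "Shift_JIS");

        System.out.println(right.equals(original));  // true
        System.out.println(wrong.equals(original));  // false
    }
}
```

The same applies to EUC-JP or UTF-16 sources; the only hard part is knowing which charset the source document used.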


- Original Message -
From: "Ankur Goel" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Thursday, May 20, 2004 10:55 PM
Subject: Searching Microsoft Word , Excel and PPT files for Japanese


> Hi,
>
> I am using the CJK Tokenizer for searching Japanese documents. I am able
> to search Japanese documents which are text files. But I am not able to
> search Microsoft Word and Excel files with content in Japanese.
>
> Can you tell me how I can search Japanese content in Microsoft Word,
> Excel and PPT files.
>
> Thanks,
> Ankur
>
> -Original Message-
> From: Ankur Goel [mailto:[EMAIL PROTECTED]
> Sent: Sunday, April 04, 2004 1:36 AM
> To: 'Lucene Users List'
> Subject: RE: Boolean Phrase Query question
>
> Thanks Erik for the solution. I have a fileName field as I have to give
> the end user the facility to search on file name also. That's why I am
> using TEXT for fileName also.
>
> "By using true on the finalQuery.add calls, you have said that both fields
> must have the word "temp" in them.  Is that what you meant?  Or did you
> mean an OR type of query?"
>
> I need an OR type of query. I mean the word can be in the filename or in
> the contents of the file. But I am not able to do this. Can you tell me
> how to do it?
>
> Regards,
> Ankur
>
> -Original Message-
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Sunday, April 04, 2004 1:27 AM
> To: Lucene Users List
> Subject: Re: Boolean Phrase Query question
>
> On Apr 3, 2004, at 12:13 PM, Ankur Goel wrote:
> >
> > Hi,
> > I have to provide a functionality which provides search on both file
> > name and contents of the file.
> >
> > For indexing I use the following code:
> >
> >
> > org.apache.lucene.document.Document doc = new org.apache.
> > lucene.document.Document();
> > doc.add(Field.Keyword("fileId","" + document.getFileId()));
> > doc.add(Field.Text("fileName", fileName));
> > doc.add(Field.Text("contents", new FileReader(new File(fileName))));
>
> I'm not sure what you plan on doing with the fileName field, but you
> probably want to use a Keyword field for it.
>
> And you may want to glue the file name and contents together into a single
> field to facilitate searches to span both.  (be sure to put a space in
> between if you do this)
>
> > For searching a text say  "temp" I use the following code to look both
> > in file Name and contents of the file:
> >
> > BooleanQuery finalQuery = new BooleanQuery(); Query titleQuery =
> > QueryParser.parse("temp","fileName",analyzer);
> > Query mainQuery = QueryParser.parse("temp","contents",analyzer);
> >
> > finalQuery.add(titleQuery, true, false); finalQuery.add(mainQuery,
> > true, false);
> >
> > Hits hits = is.search(finalQuery);
>
> By using true on the finalQuery.add calls, you have said that both fields
> must have the word "temp" in them.  Is that what you meant?  Or did you
> mean an OR type of query?
>
> Erik
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
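The OR form Ankur asked for never got an explicit answer in the thread. With the Lucene 1.4-era BooleanQuery.add(query, required, prohibited) signature it would look roughly like this (a sketch against that old API, not tested here; field names follow the thread):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class OrQueryDemo {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();

        Query titleQuery = QueryParser.parse("temp", "fileName", analyzer);
        Query mainQuery  = QueryParser.parse("temp", "contents", analyzer);

        // required=false, prohibited=false on every clause: a document
        // matches if EITHER field contains "temp" (OR semantics).
        // Passing required=true on both, as in the thread, demands that
        // BOTH fields match (AND semantics).
        BooleanQuery finalQuery = new BooleanQuery();
        finalQuery.add(titleQuery, false, false);
        finalQuery.add(mainQuery, false, false);

        System.out.println(finalQuery.toString());
    }
}
```

(Later Lucene versions replaced the two booleans with BooleanClause.Occur.SHOULD/MUST/MUST_NOT, which makes the same distinction more readable.)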


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Searching Microsoft Word , Excel and PPT files for Japanese

2004-05-21 Thread Chandan Tamrakar
For Microsoft Word and Excel documents, use the POI APIs from Jakarta
Apache.
   First you need to extract the text and convert it into a suitable
encoding before you put it into Lucene for indexing.
   It worked for me.





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: org.apache.lucene.search.highlight.Highlighter

2004-05-21 Thread Karthik N S
Hi,

  Please can somebody give me a simple example of
  org.apache.lucene.search.highlight.Highlighter?

  I am trying to use it but have been unsuccessful.

Karthik


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 20, 2004 2:08 AM
To: [EMAIL PROTECTED]
Subject: Re: org.apache.lucene.search.highlight.Highlighter


>> Was investigating, found some compile time errors...

I see the code you have is taken from the example in the javadocs.
Unfortunately that example wasn't complete because the class didn't
include the method defined in the Formatter interface. I have updated the
Javadocs to correct this oversight.

To correct your problem either make your class implement the Formatter
interface to perform your choice of custom formatting or remove the "this"
parameter from your call to create a new Highlighter with the default
Formatter implementation.

Thanks for "highlighting" the problem with the Javadocs...

Cheers
Mark
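For anyone finding this thread, a minimal usage sketch against the 1.4-era sandbox highlighter (the field name and sample text are made up); as Mark says, passing no Formatter picks up the default one, which wraps matches in <B>...</B>:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class HighlightDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        String text = "Lucene is a full-text search library.";

        // the query whose terms should be highlighted
        Query query = QueryParser.parse("lucene", "contents", analyzer);

        // no Formatter argument: the default <B>...</B> formatter is used
        Highlighter highlighter = new Highlighter(new QueryScorer(query));

        // re-tokenize the stored text and ask for the best-scoring fragment
        TokenStream tokens = analyzer.tokenStream("contents", new StringReader(text));
        System.out.println(highlighter.getBestFragment(tokens, text));
        // e.g. <B>Lucene</B> is a full-text search library.
    }
}
```

To use custom markup instead, implement the Formatter interface and pass it as the first Highlighter constructor argument.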


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]