Re: Possible bug in SpanNearQuery

2007-05-07 Thread Moti Nisenson

Paul,

The comment should be moved up into SpanNearQuery itself (as opposed to the
comments in the package-private implementation classes). Even so, that
comment is inaccurate regarding overlap: only "exact" overlap is handled.
Here are some additional tests for SpanNearQuery. They all fail except for
testNotExactOverlapInOrder, testTermOverlapStartInOrder and
testTermOverlapEndInOrder (note that the failures for the NotInOrder case
may be acceptable; there is no documentation indicating the desired behavior).


import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import junit.framework.TestCase;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.store.RAMDirectory;

public class SpanNearQueryTest extends TestCase {

    private RAMDirectory dir;

    @Override
    protected void setUp() throws Exception {
        super.setUp();
        dir = new RAMDirectory();
        Document doc = new Document();
        doc.add(new Field("field", new StringReader("one two two three four five")));
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer());
        writer.addDocument(doc);
        writer.close();
    }

    public void testNearQueryInOrder() throws Exception {
        checkNearQuery(true);
    }

    public void testNearQueryNotInOrder() throws Exception {
        checkNearQuery(false);
    }

    private void checkNearQuery(boolean inOrder) throws Exception {
        SpanNearQuery query = buildQuery(5, inOrder, "one", "two");

        IndexReader reader = IndexReader.open(dir);
        Spans spans = query.getSpans(reader);

        int numSpans = countSpans(spans);

        reader.close();

        assertEquals(2, numSpans);
    }

    private int countSpans(Spans spans) throws IOException {
        int numSpans = 0;
        while (spans.next())
            numSpans++;
        return numSpans;
    }

    public void testMinimalSpanInOrder() throws Exception {
        checkMinimalSpan(true);
    }

    public void testMinimalSpanNotInOrder() throws Exception {
        checkMinimalSpan(false);
    }

    private void checkMinimalSpan(boolean inOrder) throws Exception {
        SpanNearQuery query = buildQuery(5, inOrder, "two", "three");

        IndexReader reader = IndexReader.open(dir);
        Spans spans = query.getSpans(reader);

        boolean firstSpan = true;
        int firstSlop = -1;
        int numSpans = 0;
        while (spans.next()) {
            numSpans++;
            if (firstSpan) {
                firstSlop = spans.end() - spans.start();
                firstSpan = false;
            }
        }

        reader.close();

        assertEquals(1, numSpans);
        assertEquals(1, firstSlop);
    }

    public void testNotContainingStartInOrder() throws Exception {
        checkNotContainingStart(true);
    }

    public void testNotContainingStartNotInOrder() throws Exception {
        checkNotContainingStart(false);
    }

    public void testNotContainingEndInOrder() throws Exception {
        checkNotContainingEnd(true);
    }

    public void testNotContainingEndNotInOrder() throws Exception {
        checkNotContainingEnd(false);
    }

    public void testNotOverlappingInOrder() throws Exception {
        checkNotOverlapping(true);
    }

    public void testNotOverlappingNotInOrder() throws Exception {
        checkNotOverlapping(false);
    }

    public void testNotExactOverlapInOrder() throws Exception {
        checkNotExactOverlap(true);
    }

    public void testNotExactOverlapNotInOrder() throws Exception {
        checkNotExactOverlap(false);
    }

    private void checkNotContainingEnd(boolean inOrder) throws Exception {
        SpanNearQuery query1 = buildQuery(5, inOrder, "one", "three");
        SpanNearQuery query2 = buildQuery(5, inOrder, "two", "three");

        SpanNearQuery query = new SpanNearQuery(new SpanQuery[] {query1, query2}, 5, inOrder);

        IndexReader reader = IndexReader.open(dir);
        Spans spans = query.getSpans(reader);

        int numSpans = countSpans(spans);

        reader.close();

        assertEquals(0, numSpans);
    }

    private void checkNotContainingStart(boolean inOrder) throws Exception {
        SpanNearQuery query1 = buildQuery(5, inOrder, "three", "four");
        SpanNearQuery query2 = buildQuery(5, inOrder, "three", "five");

        SpanNearQuery query = new SpanNearQuery(new SpanQuery[] {query1, query2}, 5, inOrder);

        IndexReader reader = IndexReader.open(dir);
        Spans spans = query.getSpans(reader);

        int numSpans = countSpans(spans);

        reader.close();

        assertEquals(0, numSpans);
    }
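The archived message is cut off here, and the buildQuery helper the tests rely on is not shown. A minimal sketch consistent with the calls above might look like the following (the varargs signature and the "field" field name from setUp are assumptions):

    private SpanNearQuery buildQuery(int slop, boolean inOrder, String... terms) {
        // Wrap each term in a SpanTermQuery against the test field, then
        // combine them into a single SpanNearQuery with the given slop.
        SpanQuery[] clauses = new SpanQuery[terms.length];
        for (int i = 0; i < terms.length; i++) {
            clauses[i] = new SpanTermQuery(new Term("field", terms[i]));
        }
        return new SpanNearQuery(clauses, slop, inOrder);
    }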

Multi language indexing

2007-05-07 Thread bhecht

Hello all,

I need to index a table containing company details (name, address, city ... country).
Each record contains data written in the language appropriate to the record's country.
I was thinking of indexing each record using an analyzer chosen according to the record's country value. Then, when searching, I would again pick the needed analyzer according to the entered country. This means I index and search using the same analyzer.

I was interested to know whether this is the way to go.

I am trying to implement this using "Hibernate Search", and it seems to be difficult to change analyzers according to a specific value in the record being indexed.

Before I break my head understanding how this can be implemented, I wanted to know whether this approach is correct.

Thanks in advance.




Porter 2 stemming algorithm in Java

2007-05-07 Thread sandeep chawla

Hi All,

Is there an implementation of the Porter2 stemming algorithm in Java?

I don't want to make a SnowballFilter based on the Snowball English stemmer.

Thanks
Sandeep


--
SANDEEP CHAWLA
House No- 23
10th main   
BTM 1st  Stage  
Bangalore   Mobile: 91-9986150603




Re: Multi language indexing

2007-05-07 Thread karl wettin


On 7 May 2007 at 10.02, bhecht wrote:


This means I index and search using the same analyzer.

I was interested to know whether this is the way to go.


That would be the way to go (unless you are really sure what you're doing).


--
karl




Re: Multi language indexing

2007-05-07 Thread bhecht

I know indexing and searching need to use the same analyzer.

My question regarding "the way to go" was whether it is a good solution to index the contents of a table using more than one analyzer, determining the analyzer by the country value of each record.

I couldn't find a post that describes exactly my problem, and I just want to be sure this is how people with Lucene experience would approach it.

Thanks



Re: Multi language indexing

2007-05-07 Thread karl wettin


On 7 May 2007 at 12.16, bhecht wrote:

My question regarding "the way to go" was whether it is a good solution to index the contents of a table using more than one analyzer, determining the analyzer by the country value of each record.


I'm not sure what you mean, but I'll try.

Do you ask if it makes sense to stem text based on the language of the text and put it in the same field no matter what language it is?

For the record, it usually makes very little sense to search text stemmed for one language with a query stemmed for another language. This is what you will do if you store the stemmed text, no matter the language, in the same field. You could add another field called "language_iso" and add a boolean clause, but that would just be overkill and would increase the response time.

In essence, it depends on your needs. For instance, are users supposed to find documents written in languages other than the one specified? Do you want to limit searches to a content language?

My guess is that you probably want to index unstemmed text in "unstemmed_text" and stemmed text in a language-specific field "stemmed_text_[language iso]", or so, querying the unstemmed field and the user's language-specific field when searching, boosting the stemmed field.
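As a rough sketch of that query shape (the term text, the "en" language code and the boost value are placeholders, not from the discussion above):

    // Query both fields; boost the language-specific stemmed field.
    // "knive" stands in for the stemmed form a query-time analyzer would produce.
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("unstemmed_text", "knives")),
              BooleanClause.Occur.SHOULD);
    TermQuery stemmed = new TermQuery(new Term("stemmed_text_en", "knive"));
    stemmed.setBoost(2.0f); // assumed boost; tune per application
    query.add(stemmed, BooleanClause.Occur.SHOULD);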


I hope this helps.

--
karl





Re: Multi language indexing

2007-05-07 Thread bhecht

OK, thanks for the reply.
The last option seems to be the right one for me: using a stemmed and an unstemmed field.
I assume that by "unstemmed" you mean indexing the field using the UN_TOKENIZED parameter.

Now my problem starts when trying to implement this with "Hibernate Search", which allows only one analyzer to be defined.

Thanks, I will post my problem now in the Hibernate Search forum.

Good day.



Scope-based crawling and indexing

2007-05-07 Thread Vikas

Hi All:

Can I make Nutch crawl and create separate indices based on scope, where the scope is determined from the query string?

For example:

Let's assume I have a URL like:

http://localhost/admin/orchindex/crawl.asp?lCrpID=0&lPrjID=609&lStrtID=3605&l

Then lCrpID=0 is one scope and lCrpID=1 is another, and for all URLs with lCrpID=0 we want one index, and for those with lCrpID=1 we want another.


Regards,
Vikas Kumar



Re: Multi language indexing

2007-05-07 Thread karl wettin


On 7 May 2007 at 13.27, bhecht wrote:


The last option seems to be the right one for me: using a stemmed and an unstemmed field.
I assume that by "unstemmed" you mean indexing the field using the UN_TOKENIZED parameter.


No, I mean TOKENIZED, but not using a stemmer analyzer.

--
karl




Re: Multi language indexing

2007-05-07 Thread bhecht

OK, thanks, I think I got it.

Just to see if I understood correctly, when I search both the stemmed and unstemmed fields:

1) If I know the country of the requested search, I will use the stemmed analyzer, and then the unstemmed field might not be matched (the stemmed field will be).

2) If I don't know the country of the requested search, I will use the unstemmed analyzer, and then the stemmed field might not be matched.

Am I correct?



Re: Multi language indexing

2007-05-07 Thread karl wettin


On 7 May 2007 at 15.45, bhecht wrote:



OK, thanks, I think I got it.

Just to see if I understood correctly, when I search both the stemmed and unstemmed fields:

1) If I know the country of the requested search, I will use the stemmed analyzer, and then the unstemmed field might not be matched (the stemmed field will be).

2) If I don't know the country of the requested search, I will use the unstemmed analyzer, and then the stemmed field might not be matched.

Am I correct?


The above sounds very confused and I'm afraid you have it all mixed up. Please explain in detail what your data looks like and what effect you are looking for; it will make things easier for all parties.

I have no idea how Hibernate Search works; perhaps that is why I don't understand what you are trying to do.



--
karl




Re: Multi language indexing

2007-05-07 Thread bhecht

Sorry,

I didn't understand that I needed to use the PerFieldAnalyzerWrapper for this task, and I tried to index the document twice. Sorry for the previous post, and thanks for the great help.

But since you asked, I will be happy to explain what my goal is, and maybe see if I'm approaching this correctly:

I have a database table containing records of company information: company name, address, city, state ... country.
The company information may be written in different languages, but I can determine the language from the country field of each record (an exception to this are countries that use more than one language).

I have a JSF form containing input fields for each column, so users can search for companies.
I have my own metadata (stop words ...) and matching algorithms for each country, which I want to use during Lucene's analysis process; I have implemented my own analyzer for each country.
So as I see it, when I index these records, I want to provide Lucene with a specific analyzer per record I'm indexing.
When a user performs a query in my JSF form, I will use the country value he entered to get the needed analyzer, and query Lucene with the user's query and that analyzer.
The user may also choose not to enter a country value in his search, and here the solution you gave me comes in: duplicating each field and indexing it with a non-stemming analyzer (a standard analyzer without stop words defined).
Then, with no country entered in a search, I will use the non-stemming analyzer.

Am I going in the right direction?




Re: Language detection library

2007-05-07 Thread Bob Carpenter



Does anyone know of a good language detection library that can detect what language a document (text) is in?


Language detection is easy. It's just a simple text classification problem.

One way you can do this is with Lucene itself. Create a so-called pseudo-document for each language consisting of lots of text (1 MB or more, ideally). Then build a Lucene index using a character n-gram tokenizer. E.g. "John Smith" tokenizes to "Jo", "oh", "hn", "n ", " S", "Sm", "mi", "it", "th" with 2-grams.

You'll have to make sure to index beyond the first 1000 tokens or whatever Lucene is set to by default.

To do language ID, just treat the text to be identified as the basis of a query. Parse it using the same character n-gram tokenizer. The highest-scoring result is the answer, and if two languages score high, you know there may be some ambiguity. You can't trust Lucene's normalized scoring for rejection, though.

Make sure the tokenizer includes spaces as well as non-space characters (though all spaces may be normalized to a single whitespace). Using more orders (1-grams, 2-grams, 3-grams, etc.) gives more accuracy; the IDF weighting is quite sensible here and will work out the details of the counts for you.
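As a concrete illustration of the 2-gram step (a standalone sketch, not from any particular Lucene tokenizer API; the method name is made up):

    // Produce character 2-grams, e.g. "John Smith" ->
    // "Jo", "oh", "hn", "n ", " S", "Sm", "mi", "it", "th".
    static java.util.List<String> bigrams(String text) {
        // Normalize runs of whitespace to a single space first.
        String s = text.replaceAll("\\s+", " ");
        java.util.List<String> grams = new java.util.ArrayList<String>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }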

For a more sophisticated approach, check out LingPipe's language ID tutorial, which is based on probabilistic character language models. Think of it as similar to the Lucene model but with different term weighting.

   http://www.alias-i.com/lingpipe/demos/tutorial/langid/read-me.html

Here's accuracy vs. input length on a set of 15 languages from the Leipzig Corpus collection (just one of the many evals in the tutorial):

#chars  accuracy
1   22.59%
2   34.82%
4   58.55%
8   81.17%
16  92.45%
32  97.33%
64  98.99%
128 99.67%

The end of the tutorial has references to other popular language ID packages online (e.g. TextCat, which is Gertjan van Noord's Perl package). It also has references to the technical background on TF/IDF classification with n-grams and character language models.

- Bob Carpenter
  Alias-i




Re: Possible bug in SpanNearQuery

2007-05-07 Thread Paul Elschot
Moti,

I have not yet looked into all the details of your comments, but I remember I had some trouble trying to define the precise semantics of NearSpansOrdered. I'll have another look at being more precise about the overlaps.

NearSpansUnordered is a specialisation of the previous NearSpans for the unordered case. The ordered case had a bug, which was fixed by the introduction of NearSpansOrdered.

Your wish to have "one two two" match twice against a query for "one two" with sufficient slop could probably be implemented in a (hopefully) minor variation of NearSpansOrdered. NearSpansUnordered has a very different implementation, and I cannot say off the top of my head whether it could be varied in a similar way.

Providing only the shortest possible match gives some efficiency and is linguistically easy to understand.

Nested span queries, SpanOrQuery and multiple terms indexed at the same position (i.e. overlapping in the index) can make these things quite tricky to implement correctly.

I think it would be worthwhile to open a JIRA issue for these things. Could you do that and add your test code there under APL 2? To make it work as a JUnit test with the existing ant build.xml might require renaming the class to start with Test... instead of ending in ...Test.

Shall we move further discussion to the java-dev list?

Regards,
Paul Elschot



On Monday 07 May 2007 09:44, Moti Nisenson wrote:
> [original message and test code quoted in full; snipped]

Re: Possible bug in SpanNearQuery

2007-05-07 Thread Moti Nisenson

Sure thing. I actually haven't taken a sufficiently close look at NearSpansOrdered (I was concentrating more on NearSpansUnordered, which has next to no documentation).

- Moti

On 5/7/07, Paul Elschot <[EMAIL PROTECTED]> wrote:


[Paul's message and the quoted test code snipped; see the previous message in this thread]

Re: Questions regarding Lucene query syntax

2007-05-07 Thread Doron Cohen
> > > Is there a way to require a portion of a query only if there are values for that field in the document?
> > > e.g. If I know that I only want to match movies made between 1973 and 1975, I would like to be able to say in my query that if the document has a year, it must be in that range, but if the document has no year at all, don't fail the document for that reason alone.
> > > This is also important in the director name part. If a document has a director given, and it doesn't match what I'm searching for, that should be a fail, but if the document has no director field, I don't want to fail the document for that reason alone.
> >
> > You'll have to include a dummy value I think. Remember that you're searching for stuff with Lucene, so saying "match even if there's nothing there" is, er, ABnormal..
> >
> > I'd think about putting a dummy value in those fields you want to handle this way. For instance, add "matchall" to documents with no date. Then you'd need to add an 'or date:matchall' clause to all the dates you query on. Make sure it's a value that behaves reasonably when you want to include all dates, or all dates before , or all dates after .
> >
> Hrm.  I'll keep this idea on the cheat sheet for now. It turns out that

Just to note that, in case you do want this: while it would be more efficient to index a matchall word (as Erik suggested), in case it is too late for that (the index already exists, etc.), it is still possible to phrase a query that applies a range filter only to docs containing the range-filtered field.

With a query parser set to allowLeadingWildcard, this should do:

( +item -price:* ) ( +item +price:[0100 TO 0150] )

or, to avoid the too-many-clauses risk:

( +item -price:[MIN TO MAX] ) ( +item +price:[0100 TO 0150] )

where MIN and MAX cover (at least) the full range of the ranged field.
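Built programmatically, the second form might look like this sketch (the "item" term, the "contents" field, and the "0000"/"9999" bounds are placeholders; the bounds must cover the field's full value range):

    // ( +item -price:[MIN TO MAX] ) ( +item +price:[0100 TO 0150] )
    BooleanQuery hasNoPrice = new BooleanQuery();
    hasNoPrice.add(new TermQuery(new Term("contents", "item")),
                   BooleanClause.Occur.MUST);
    hasNoPrice.add(new RangeQuery(new Term("price", "0000"),
                                  new Term("price", "9999"), true),
                   BooleanClause.Occur.MUST_NOT);

    BooleanQuery priceInRange = new BooleanQuery();
    priceInRange.add(new TermQuery(new Term("contents", "item")),
                     BooleanClause.Occur.MUST);
    priceInRange.add(new RangeQuery(new Term("price", "0100"),
                                    new Term("price", "0150"), true),
                     BooleanClause.Occur.MUST);

    // Either branch may match: docs with no price, or docs priced in range.
    BooleanQuery query = new BooleanQuery();
    query.add(hasNoPrice, BooleanClause.Occur.SHOULD);
    query.add(priceInRange, BooleanClause.Occur.SHOULD);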

Doron





Re: Porter 2 stemming algorithm in Java

2007-05-07 Thread Mark Miller

http://snowball.tartarus.org/

That is the Snowball page. There is a Snowball version of the Porter2 stemming algorithm; if you hunt around the download page you will find it.
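For what it's worth, Snowball's "English" stemmer is the Porter2 algorithm (the original algorithm is published there as "Porter"), so if the Snowball route turns out to be acceptable after all, Lucene's snowball contrib can apply it directly. A rough sketch, using classes from the org.apache.lucene.analysis.snowball contrib package:

    // "English" selects the Porter2 stemmer.
    Analyzer analyzer = new SnowballAnalyzer("English");
    // Or, wrapping an existing TokenStream:
    TokenStream stemmed = new SnowballFilter(tokenStream, "English");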


- Mark

sandeep chawla wrote:

Hi All,

Is there an implementation of the Porter2 stemming algorithm in Java?

I don't want to make a SnowballFilter based on the Snowball English stemmer.

Thanks
Sandeep







Re: Keyphrase Extraction

2007-05-07 Thread Mark Miller
The only commercial options I have seen do not have a web presence (that I know of or can find), and I don't recall the company names (I was only peripherally involved).

Here is a web page where a guy does a nice writeup on a few options:
http://dsanalytics.com/dsblog/the-start-of-the-art-in-keyphrase-extraction_99


- Mark

[EMAIL PROTECTED] wrote:

Hi Mark,

Do you know of a good paid product that does this?

Thanks,
Arsen


- Original Message 
From: Mark Miller <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, May 2, 2007 7:52:36 AM
Subject: Re: Keyphrase Extraction


From what I know you generally have to pay if you want something that does this really well. Or check out http://www.nzdl.org/Kea/. Unfortunately, the license is GPL. Really too bad; now that it is all Java, it would make a great combo with Lucene.


- Mark

mark harwood wrote:
  

I believe the code Otis is referring to is here: http://issues.apache.org/jira/browse/LUCENE-474

This is index-level analysis but could be adapted to work for just a single document. The implementation is optimised for speed rather than being a thorough examination of phrase significance.


Cheers
Mark

- Original Message 
From: Otis Gospodnetic <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Monday, 30 April, 2007 4:11:36 AM
Subject: Re: Keyphrase Extraction

Av, look at Lucene's JIRA and search for Mark Harwood. I believe he once contributed something that does this in JIRA. If you are interested in a commercial solution, I can recommend LingPipe.

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Lucene Consulting - http://lucene-consulting.com/


- Original Message 
From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Sunday, April 29, 2007 5:24:17 PM
Subject: Keyphrase Extraction

Hi,

I tried using the MoreLikeThis contrib feature to extract "interesting terms" from a document. This works very well, but only for SINGLE words.

I am looking for a way to extract "keyPHRASES" from a document. Is there an easy way to achieve this using a Lucene index?

Thanks in advance!
Av















  





 


  





Re: Multi language indexing

2007-05-07 Thread Doron Cohen
bhecht <[EMAIL PROTECTED]> wrote on 07/05/2007 10:26:27:

> I have implemented my own analyzer for each country.
> So as I see it, when I index these records, I want to
> provide lucene, with a specific analyzer per record
> i'm indexing.
>
> When a user performs a query in my JSF form, I will
> use the country value he entered, to get the needed
> analyzer, and query lucene with the users query and
> the needed analyzer.
>
> The user may also choose not to enter a country value
> to his search, and here comes in the solution you gave
> me, to duplicate each field, and index it using a non
> stemming analyzer (A standard analyzer without stop
> words defined).
>
> Am I going the right direction?

Sounds OK to me, except that there seems to be a mix-up between stemming and stop-word elimination. Perhaps it is just a typo in the above text, but anyhow: while the StandardAnalyzer constructor takes a stop-words list parameter and would eliminate those words (e.g. "is"), it would not do stemming (e.g. "knives" --> "knive"). (Though both a stop list and a stemming algorithm are language specific.)

So, rephrasing the discussion so far, assuming:

1) a single field "F" (for simplicity),
2) the (doc) language is always known at indexing time,
3) the (user) language is sometimes known at search time,

I think a reasonable solution might be:

1) use PerFieldAnalyzerWrapper;
2) index each doc to F and to F_LAN;
3) F would be language neutral: no stemming and no stop-word elimination;
4) F_LAN (e.g. F_en) would be language specific, using a language-specific
   stop-words list and a language-specific stemmer;
5) search would go to F_LAN when the language is known and to F when it is
   not, using the same language-specific analysis as at indexing time;
6) note Karl's suggestion of searching both F and F_LAN, assigning a higher
   boost to F_LAN; this is useful when there is some uncertainty about the
   detected language.

There can be other considerations, for instance (1) the certainty of language identification, and (2) falling back to English when the language is unknown...

Note that SnowballFilter can be used to apply stemming to the output of StandardAnalyzer.
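A sketch of the indexing side of points 1-4 for English (the WhitespaceAnalyzer default, the stop list, and the variable names are placeholder choices; SnowballAnalyzer lives in the snowball contrib):

    // F gets the language-neutral default; F_en gets English stop words
    // plus the Snowball "English" (Porter2) stemmer.
    PerFieldAnalyzerWrapper analyzer =
        new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer());
    analyzer.addAnalyzer("F_en",
        new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS));

    // Index the same text into both fields.
    Document doc = new Document();
    doc.add(new Field("F", text, Field.Store.NO, Field.Index.TOKENIZED));
    doc.add(new Field("F_en", text, Field.Store.NO, Field.Index.TOKENIZED));

    IndexWriter writer = new IndexWriter(dir, analyzer, true);
    writer.addDocument(doc);
    writer.close();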

Doron


