RE: Using MLT feature

2011-04-08 Thread Frederico Azeiteiro
Yes, I guess that could be an option, but I'm not very experienced with Java 
development and SOLR modifications.
As my main goal was to create a similar sig in C#, I just use the C# method to 
create the sig myself before indexing, instead of the SOLR Deduplicate function.

That way, when searching, I can use the same method with the certainty that the 
sig is the same. 
As the algorithm used is the same as in TextProfileSignature, the result is the 
same as using SOLR deduplicate. 

Frederico 
 



Re: Using MLT feature

2011-04-08 Thread lboutros
Couldn't you extend TextProfileSignature and modify the TokenComparator
class to use lexical order when tokens have the same frequency?

Ludovic.
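
A minimal sketch of that idea, assuming a simplified Token class with a `val` string and a `cnt` count (hypothetical names modeled on the snippet quoted in this thread, not copied from the Solr source): sort by descending count and break ties by lexical order, so the result no longer depends on the order in which the tokens were collected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class TieBreakSketch {
    // Simplified stand-in for the signature's token type (assumption).
    static class Token {
        final String val;
        final int cnt;
        Token(String val, int cnt) { this.val = val; this.cnt = cnt; }
    }

    // Descending by count; equal counts fall back to lexical order of the text.
    static class LexicalTieComparator implements Comparator<Token> {
        public int compare(Token t1, Token t2) {
            int byCount = Integer.compare(t2.cnt, t1.cnt);
            return byCount != 0 ? byCount : t1.val.compareTo(t2.val);
        }
    }

    // Returns the token texts in sorted order without modifying the input list.
    static List<String> sortedValues(List<Token> tokens) {
        List<Token> copy = new ArrayList<Token>(tokens);
        Collections.sort(copy, new LexicalTieComparator());
        List<String> out = new ArrayList<String>();
        for (Token t : copy) {
            out.add(t.val);
        }
        return out;
    }

    public static void main(String[] args) {
        // Same tokens, collected in two different orders.
        List<Token> a = Arrays.asList(new Token("war", 1), new Token("army", 1), new Token("clan", 2));
        List<Token> b = Arrays.asList(new Token("army", 1), new Token("clan", 2), new Token("war", 1));
        System.out.println(sortedValues(a));
        System.out.println(sortedValues(b));
    }
}
```

Both calls print the same list, which is the property a port to another language would need. A signature built this way would not necessarily match the one stock Solr produces, so the same comparator would have to be used on both sides.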


RE: Using MLT feature

2011-04-08 Thread Frederico Azeiteiro
Hi.

Yes, I managed to create a stable comparator in C# for the profile. 
The problem comes before that, at: 

...
tokens.put(s, tok);
...

Imagine you have 2 tokens with the same frequency: with the stable sort
comparator for the profile, they keep their original order. 
The problem is that the original order comes from the way they are
stored in the HashMap 'tokens' and not from the order in which the tokens
appear in the original text.

Frederico
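
One way to make the tie order reproducible in a port, sketched here under the assumption that the token table is a plain map from token text to count (the `countTokens` helper and the whitespace tokenizer are hypothetical, not Solr's code): a `LinkedHashMap` iterates in insertion order, that is, the order of first appearance in the text, whereas `HashMap` iteration order depends on hash codes and bucket layout.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InsertionOrderSketch {
    // Counts whitespace-separated tokens, remembering first-appearance order.
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String s : text.toLowerCase().split("\\s+")) {
            if (s.isEmpty()) {
                continue; // leading whitespace produces an empty first element
            }
            Integer old = counts.get(s);
            counts.put(s, old == null ? 1 : old + 1);
        }
        return counts;
    }

    // Token texts in the order they first appeared in the input.
    static List<String> keysInOrder(String text) {
        return new ArrayList<String>(countTokens(text).keySet());
    }

    public static void main(String[] args) {
        System.out.println(keysInOrder("war clan army war"));
    }
}
```

With an ordered map feeding a stable sort, tokens with equal counts stay in text order on any platform. The resulting signature may differ from what stock Solr computes, so this only helps if the same variant is used for both indexing and lookup.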


Re: Using MLT feature

2011-04-08 Thread lboutros
It seems that tokens are sorted by frequency:

...
Collections.sort(profile, new TokenComparator());
...


and

private static class TokenComparator implements Comparator<Token> {
  // Higher counts sort first; equal counts compare as equal.
  public int compare(Token t1, Token t2) {
    return t2.cnt - t1.cnt;
  }
}

where cnt is the token count.

Ludovic.


RE: Using MLT feature

2011-04-07 Thread Frederico Azeiteiro
Well, at this point I'm more focused on the Deduplicate issue.

Using a Min_token_len of 4 I'm getting nice comparison results. MLT returns a 
lot of similar docs that I don't consider similar, even after tuning the 
parameters.

While finishing this issue, I found out that the signature also includes the 
field name, meaning that if you want to sign both the title and text fields, 
your signature will be a hash of ("text"+"text value"+"title"+"title value").

In any case, I found that the HashMap used in the hash algorithm stores the 
tokens in some internal HashMap order that I can't understand :), which makes 
it impossible to copy to a C# implementation.

Thank you for all your help,
Frederico 
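
To illustrate why hashing a bare value gives a different result from hashing the field name plus the value, here is a small sketch using plain MD5 from the JDK. It assumes, as observed above, that each field name is fed into the hash before its value; the `sign` helper and the fixed UTF-8 encoding are illustrative choices. The real TextProfileSignature hashes a token profile rather than the raw text, so these hex strings will not match Solr's.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FieldNameHashSketch {
    // Feeds each part into one MD5 digest, in order, and hex-encodes the result.
    static String sign(String... parts) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String p : parts) {
                md5.update(p.getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String valueOnly = sign("frederico");
        String nameAndValue = sign("HeadLine", "frederico");
        System.out.println(valueOnly);
        System.out.println(nameAndValue);
        System.out.println(valueOnly.equals(nameAndValue)); // prints false
    }
}
```

A client that wants to reproduce a server-side signature would therefore need to feed in the same field names, in the same order, as the processor's `fields` setting.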



Re: Using MLT feature

2011-04-06 Thread Lance Norskog
A "fuzzy signature" system will not work here. You are right, you want
to try MLT instead.

Lance


RE: Using MLT feature

2011-04-06 Thread Frederico Azeiteiro
Yes, I had already checked the code for it and used it to write a C# method 
that returns the same signature.

But I have a strange issue:
For instance, using MinTokenLength=2 and the default QUANT_RATE, passing the 
text "frederico" (simple text, no big deal here): 

1. Using my C# app returns "8b92e01d67591dfc60adf9576f76a055".
2. Using SOLR, passing a doc with HeadLine "frederico", I get 
"8d9a5c35812ba75b8383d4538b91080f" in my signature field.
3. I created a Java app (I'm not a Java expert...), using the code from the 
SOLR SignatureUpdateProcessorFactory class (please check the code below), and 
I get "8b92e01d67591dfc60adf9576f76a055".

Java app code:

TextProfileSignature textProfileSignature = new TextProfileSignature();
NamedList params = new NamedList();
params.add("", "");
SolrParams solrParams = SolrParams.toSolrParams(params);
textProfileSignature.init(solrParams);
textProfileSignature.add("frederico");

// Hex-encode the signature bytes, two characters per byte
byte[] signature = textProfileSignature.getSignature();
char[] arr = new char[signature.length << 1];
for (int i = 0; i < signature.length; i++) {
    int b = signature[i];
    int idx = i << 1;
    arr[idx] = StrUtils.HEX_DIGITS[(b >> 4) & 0xf];
    arr[idx + 1] = StrUtils.HEX_DIGITS[b & 0xf];
}
String sigString = new String(arr);
System.out.println(sigString);




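Here's my processor config:

<updateRequestProcessorChain ...>
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">HeadLine</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
    <str name="minTokenLen">2</str>
  </processor>
  ...
</updateRequestProcessorChain>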


So both my apps (Java and C#) return the same signature, but SOLR returns a 
different one. 
Can anyone see what I might be doing wrong?

Thank you once again.

Frederico


Re: Using MLT feature

2011-04-05 Thread Markus Jelsma
If you check the code for TextProfileSignature [1] you'll notice the init 
method reading params. You can set those params as you did. Reading the Javadoc 
[2] might help as well. But what's not documented in the Javadoc is how QUANT 
is computed; it rounds.

[1]: 
http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.4/src/java/org/apache/solr/update/processor/TextProfileSignature.java?view=markup
[2]: 
http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html
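
For readers who do not want to open the source, the rounding can be sketched roughly as follows. This is a paraphrase from memory of the Nutch-derived logic, with hypothetical method names, so treat it as an assumption and verify it against [1]: QUANT is the highest token frequency times quantRate, rounded to the nearest integer, with a floor of 2 (or 1 when no token repeats); each token count is then rounded down to a multiple of QUANT.

```java
public class QuantSketch {
    // Assumed rule: round(maxFreq * quantRate), never below 2 unless maxFreq is 1.
    static int quant(int maxFreq, float quantRate) {
        int q = Math.round(maxFreq * quantRate);
        if (q < 2) {
            q = (maxFreq > 1) ? 2 : 1;
        }
        return q;
    }

    // Rounds a token count down to the nearest multiple of quant.
    static int quantize(int cnt, int quant) {
        return (cnt / quant) * quant;
    }

    public static void main(String[] args) {
        // Short text where every token occurs once.
        System.out.println(quant(1, 0.01f));
        // Long text with one very frequent token.
        System.out.println(quant(500, 0.01f));
        // A count of 7 with QUANT 5 is rounded down to one multiple of 5.
        System.out.println(quantize(7, 5));
    }
}
```

Under this reading, a short text in which every token occurs once always ends up with QUANT 1, so changing quantRate has no effect there.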


RE: Using MLT feature

2011-04-05 Thread Frederico Azeiteiro
Thank you, I'll try to write a C# method that creates the same sig as SOLR, and 
then compare both sigs before indexing the doc. This way I can avoid 
indexing docs that already exist.

If anyone needs to use this parameter (as this info is not on the wiki), you 
can add the option

<str name="minTokenLen">5</str>

on the processor tag.

Best regards,
Frederico 
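
The check-before-index idea could look roughly like this on the client side. The in-memory set below is only a stand-in for a lookup against the signature field of the real index, and `shouldIndex` is a hypothetical helper; computing the signature itself is out of scope here.

```java
import java.util.HashSet;
import java.util.Set;

public class DedupeBeforeIndexSketch {
    // Stand-in for "signatures already present in the index" (assumption).
    private final Set<String> knownSignatures = new HashSet<String>();

    // Returns true the first time a signature is seen, false for duplicates.
    boolean shouldIndex(String signature) {
        return knownSignatures.add(signature);
    }

    public static void main(String[] args) {
        DedupeBeforeIndexSketch dedupe = new DedupeBeforeIndexSketch();
        String sig = "8b92e01d67591dfc60adf9576f76a055";
        System.out.println(dedupe.shouldIndex(sig)); // first document with this sig
        System.out.println(dedupe.shouldIndex(sig)); // same sig again
    }
}
```

Unlike overwriteDupes, this keeps the document that was indexed first and simply skips later ones with the same signature.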


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Tuesday, April 5, 2011 12:01
To: solr-user@lucene.apache.org
Cc: Frederico Azeiteiro
Subject: Re: Using MLT feature



On Tuesday 05 April 2011 12:19:33 Frederico Azeiteiro wrote:
> Sorry, the reply I made yesterday was directed to Markus and not the
> list...
> 
> Here are my thoughts on this. At this point I'm a little unsure whether
> SOLR is a good option for finding near-duplicate docs.
> 
> >> Yes there is, try setting overwriteDupes to true and documents yielding
> >> the same signature will be overwritten
> 
> The problem is that I don't want to overwrite the doc, I need to
> keep the original version (because the doc has other fields I need
> to keep).
> 
> >> If you need both fuzzy and exact matching then add a second
> >> update processor inside the chain and create another signature field.
> 
> I just need the fuzzy search, but the quick tests I made return
> different signatures for what I consider duplicate docs.
> "Army deploys as clan war kills 11 in Philippine south"
> "Army deploys as clan war kills 11 in Philippine south."
> 
> Same sig for the above 2 strings, that's ok.
> 
> But a different sig was created for:
> "Army deploys as clan war kills 11 in Philippine south the."
> 
> Is there a way to set up the TextProfileSignature parameters to adjust
> the "sensitivity" on SOLR (QUANT_RATE or MIN_TOKEN_LEN)?
> 
> Do you think that these parameters can help create the same sig for
> the above example?

You can only fix this by increasing minTokenLen to 4 to prevent `the` from 
being added to the list of tokens but this may affect other signatures. 
Possibly more documents will then get the same signature. Messing around with 
quantRate won't do much good because all your tokens have the same frequency 
(1) so quant will always be 1 in this short text. That's why 
TextProfileSignature works less well for short texts.

http://nutch.apache.org/apidocs-1.2/org/apache/nutch/crawl/TextProfileSignature.html

> 
> Is anyone using the TextProfileSignature with success?
> 
> Thank you,
> Frederico
> 
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Monday, April 4, 2011 16:47
> To: solr-user@lucene.apache.org
> Cc: Frederico Azeiteiro
> Subject: Re: Using MLT feature
> 
> > Hi again,
> > I guess I was wrong in my earlier post... There's no automated way to
> > avoid indexing the duplicate doc.
> 
> Yes there is, try setting overwriteDupes to true and documents yielding the
> same signature will be overwritten. If you need both fuzzy and exact
> matching then add a second update processor inside the chain and create
> another signature field.
> 
> > I guess I have 2 options:
> > 
> > 1. Create a temp index with signatures and then have an app that for
> > each new doc verifies if the sig exists on my primary index. If not, add
> > the article.
> > 
> > 2. Before adding the doc, create a signature (using the same algorithm
> > that SOLR uses) in my indexing app and then verify if the signature
> > exists before adding.
> > 
> > Am I thinking the right way here? :)
> > 
> > Thank you,
> > Frederico
> > 
> > 
> > 
> > -Original Message-
> > From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com]
> > Sent: Monday, April 4, 2011 11:59
> > To: solr-user@lucene.apache.org
> > Subject: RE: Using MLT feature
> > 
> > Thank you Markus it looks great.
> > 
> > But the wiki is not very detailed on this.
> > Do you mean if I:
> > 
> > 1. Create:
> > 
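> > <updateRequestProcessorChain ...>
> >   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
> >     <bool name="enabled">true</bool>
> >     <bool name="overwriteDupes">false</bool>
> >     <str name="signatureField">signature</str>
> >     <str name="fields">headline,body,medianame</str>
> >     <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
> >   </processor>
> >   ...
> > </updateRequestProcessorChain>
> > 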
> > 
> > 2. Add the request as the default update request
>

Re: Using MLT feature

2011-04-05 Thread Markus Jelsma


On Tuesday 05 April 2011 12:19:33 Frederico Azeiteiro wrote:
> Sorry, the reply I made yesterday was directed to Markus and not the
> list...
> 
> Here's my thoughts on this. At this point I'm a little confused if SOLR
> is a good option to find near duplicate docs.
> 
> >> Yes there is, try set overwriteDupes to true and documents yielding
> the same signature will be overwritten
> 
> The problem is that I don't want to overwrite the doc, I need to
> maintain the original version (because the doc has others fields I need
> to maintain).
> 
> >>If you have need both fuzzy and exact matching then add a second
> update processor inside the chain and create another signature field.
> 
> I just need the fuzzy search but the quick tests I made, return
> different signatures for what I consider duplicate docs.
> "Army deploys as clan war kills 11 in Philippine south"
> "Army deploys as clan war kills 11 in Philippine south."
> 
> Same sig for the above 2 strings, that's ok.
> 
> But a different sig was created for:
> "Army deploys as clan war kills 11 in Philippine south the."
> 
> Is there a way to setup the TextProfileSignature parameters to adjust
> the "sensibility" on SOLR (QUANT_RATE or MIN_TOKEN_LEN)?
> 
> Do you think that these parameters can help creating the same sig for
> the above example?

You can only fix this by increasing minTokenLen to 4 to prevent `the` from 
being added to the list of tokens, but this may affect other signatures: 
more documents may then get the same signature. Messing around with 
quantRate won't do much good because all your tokens have the same frequency 
(1), so quant will always be 1 in this short text. That's why 
TextProfileSignature works less well for short texts.

http://nutch.apache.org/apidocs-1.2/org/apache/nutch/crawl/TextProfileSignature.html
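
To make the effect of minTokenLen and quantRate concrete, here is a minimal Java sketch of a profile-style signature. It is not the Solr or Nutch source: the class name ProfileSketch, the method names, the "keep tokens with at least minTokenLen characters" boundary, the alphabetical tie-break and the parameter values used in main are all assumptions made for illustration. It only mirrors the behaviour described above: when every token occurs once, quant comes out as 1, so only the token-length filter decides whether `the` changes the signature.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class ProfileSketch {

    /** Tokens that survive the length and quantisation filters, most frequent first. */
    static List<String> profile(String text, int minTokenLen, float quantRate) {
        Map<String, Integer> counts = new HashMap<>();
        int maxFreq = 0;
        // Lower-case the text and split on anything that is not a letter or digit.
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            // Assumed boundary: keep tokens with at least minTokenLen characters.
            if (tok.isEmpty() || tok.length() < minTokenLen) continue;
            int c = counts.merge(tok, 1, Integer::sum);
            maxFreq = Math.max(maxFreq, c);
        }
        // quant is derived from the highest frequency; for short texts it ends up as 1.
        int quant = Math.round(maxFreq * quantRate);
        if (quant < 2) quant = (maxFreq > 1) ? 2 : 1;

        // Round each count down to a multiple of quant and drop counts below quant.
        Map<String, Integer> rounded = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int r = (e.getValue() / quant) * quant;
            if (r >= quant) rounded.put(e.getKey(), r);
        }
        List<String> tokens = new ArrayList<>(rounded.keySet());
        // Highest count first; ties broken alphabetically (an assumption, for determinism).
        tokens.sort(Comparator.comparingInt((String t) -> -rounded.get(t))
                .thenComparing(Comparator.<String>naturalOrder()));
        return tokens;
    }

    /** MD5 over the newline-joined profile tokens, as a hex string. */
    static String signature(String text, int minTokenLen, float quantRate) {
        String joined = String.join("\n", profile(text, minTokenLen, quantRate));
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(joined.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String a = "Army deploys as clan war kills 11 in Philippine south";
        String b = a + ".";
        String c = a + " the.";
        System.out.println(signature(a, 3, 0.01f).equals(signature(b, 3, 0.01f))); // true
        System.out.println(signature(a, 3, 0.01f).equals(signature(c, 3, 0.01f))); // false: "the" is kept
        System.out.println(signature(a, 4, 0.01f).equals(signature(c, 4, 0.01f))); // true: "the" and "war" are dropped
    }
}
```

Note the trade-off visible in the last line: raising the threshold also drops `war`, which is how other, genuinely different short texts can end up sharing a signature.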

> 
> Is anyone using the TextProfileSignature with success?
> 
> Thank you,
> Frederico
> 
> 
> -----Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: segunda-feira, 4 de Abril de 2011 16:47
> To: solr-user@lucene.apache.org
> Cc: Frederico Azeiteiro
> Subject: Re: Using MLT feature
> 
> > Hi again,
> > I guess I was wrong on my early post... There's no automated way to
> > avoid the indexation of the duplicate doc.
> 
> Yes there is, try set overwriteDupes to true and documents yielding the
> same signature will be overwritten. If you have need both fuzzy and exact
> matching then add a second update processor inside the chain and create
> another signature field.
> 
> > I guess I have 2 options:
> > 
> > 1. Create a temp index with signatures and then have an app that for
> > each new doc verifies if sig exists on my primary index. If not, add the
> > article.
> > 
> > 2. Before adding the doc, create a signature (using the same algorithm
> > that SOLR uses) on my indexing app and then verify if signature exists
> > before adding.
> > 
> > I'm way thinking the right way here? :)
> > 
> > Thank you,
> > Frederico
> > 
> > 
> > 
> > -Original Message-
> > From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com]
> > Sent: segunda-feira, 4 de Abril de 2011 11:59
> > To: solr-user@lucene.apache.org
> > Subject: RE: Using MLT feature
> > 
> > Thank you Markus it looks great.
> > 
> > But the wiki is not very detailed on this.
> > Do you mean if I:
> > 
> > 1. Create:
> > 
> > 
> >  
> >   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
> >     <bool name="enabled">true</bool>
> >     <bool name="overwriteDupes">false</bool>
> >     <str name="signatureField">signature</str>
> >     <str name="fields">headline,body,medianame</str>
> >     <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
> >   </processor>
> > 
> > 2. Add the request as the default update request
> > 3. Add a "signature" indexed field to my schema.
> > 
> > Then,
> > When adding a new doc to my index, it is only added of not considered a
> > duplicate using a Lookup3Signature on the field defined? All duplicates
> > are ignored and not added to my index?
> > Is it so simple as that?
> > 
> > Does it works even if the medianame should be an exact match (not similar
> > match as the headline and bodytext are)?
> > 
> > Thank

RE: Using MLT feature

2011-04-05 Thread Frederico Azeiteiro
Sorry, the reply I made yesterday was directed to Markus and not the
list...

Here are my thoughts on this. At this point I'm not sure whether SOLR
is a good option for finding near-duplicate docs.

>> Yes there is, try set overwriteDupes to true and documents yielding
the same signature will be overwritten

The problem is that I don't want to overwrite the doc; I need to
keep the original version (because the doc has other fields I need
to keep).

>>If you have need both fuzzy and exact matching then add a second
update processor inside the chain and create another signature field.

I just need the fuzzy matching, but the quick tests I made return
different signatures for what I consider duplicate docs.
"Army deploys as clan war kills 11 in Philippine south"
"Army deploys as clan war kills 11 in Philippine south."

Same sig for the above 2 strings, that's ok.

But a different sig was created for:
"Army deploys as clan war kills 11 in Philippine south the."
 
Is there a way to set up the TextProfileSignature parameters to adjust
the "sensitivity" on SOLR (QUANT_RATE or MIN_TOKEN_LEN)?

Do you think that these parameters can help create the same sig for
the above example?

Is anyone using the TextProfileSignature with success?

Thank you,
Frederico 


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: segunda-feira, 4 de Abril de 2011 16:47
To: solr-user@lucene.apache.org
Cc: Frederico Azeiteiro
Subject: Re: Using MLT feature


> Hi again,
> I guess I was wrong on my early post... There's no automated way to avoid
> the indexation of the duplicate doc.

Yes there is, try set overwriteDupes to true and documents yielding the same
signature will be overwritten. If you have need both fuzzy and exact matching
then add a second update processor inside the chain and create another
signature field.

> 
> I guess I have 2 options:
> 
> 1. Create a temp index with signatures and then have an app that for each
> new doc verifies if sig exists on my primary index. If not, add the
> article.
> 
> 2. Before adding the doc, create a signature (using the same algorithm that
> SOLR uses) on my indexing app and then verify if signature exists before
> adding.
> 
> I'm way thinking the right way here? :)
> 
> Thank you,
> Frederico
>  
> 
> 
> -Original Message-
> From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com]
> Sent: segunda-feira, 4 de Abril de 2011 11:59
> To: solr-user@lucene.apache.org
> Subject: RE: Using MLT feature
> 
> Thank you Markus it looks great.
> 
> But the wiki is not very detailed on this.
> Do you mean if I:
> 
> 1. Create:
> 
> 
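>   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>     <bool name="enabled">true</bool>
>     <bool name="overwriteDupes">false</bool>
>     <str name="signatureField">signature</str>
>     <str name="fields">headline,body,medianame</str>
>     <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
>   </processor>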
> 
> 2. Add the request as the default update request
> 3. Add a "signature" indexed field to my schema.
> 
> Then,
> When adding a new doc to my index, it is only added of not considered a
> duplicate using a Lookup3Signature on the field defined? All duplicates
> are ignored and not added to my index?
> Is it so simple as that?
> 
> Does it works even if the medianame should be an exact match (not similar
> match as the headline and bodytext are)?
> 
> Thank you for your help,
> 
> 
> Frederico Azeiteiro
> Developer
>  
> 
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: segunda-feira, 4 de Abril de 2011 10:48
> To: solr-user@lucene.apache.org
> Subject: Re: Using MLT feature
> 
> http://wiki.apache.org/solr/Deduplication
> 
> On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
> > Hi,
> > 
> > The ideia is don't index if something similar (headline+bodytext) for
> > the same exact medianame.
> > 
> > Do you mean I would need to index the doc first (maybe in a temp index)
> > and then use the MLT feature to find similar docs before adding to final
> > index?
> > 
> > Thanks,
> > Frederico
> > 
> > 
> > -Original Message-
> > From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
> > Sent: segunda-feira, 4 de Abril de 2011 10:22
> > To: solr-user@lucene.apache.org
> > Subject: Re: Using MLT feature
> > 
> > Do you want to not index if something similar? Or don't index if
exact.
> > I would look into a hash code of the document if you don't want to
index
> &

Re: Using MLT feature

2011-04-04 Thread Markus Jelsma

> Hi again,
> I guess I was wrong on my early post... There's no automated way to avoid
> the indexation of the duplicate doc.

Yes there is: try setting overwriteDupes to true, and documents yielding the 
same signature will be overwritten. If you need both fuzzy and exact matching, 
add a second update processor inside the chain and create another 
signature field.

> 
> I guess I have 2 options:
> 
> 1. Create a temp index with signatures and then have an app that for each
> new doc verifies if sig exists on my primary index. If not, add the
> article.
> 
> 2. Before adding the doc, create a signature (using the same algorithm that
> SOLR uses) on my indexing app and then verify if signature exists before
> adding.
> 
> I'm way thinking the right way here? :)
> 
> Thank you,
> Frederico
>  
> 
> 
> -Original Message-
> From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com]
> Sent: segunda-feira, 4 de Abril de 2011 11:59
> To: solr-user@lucene.apache.org
> Subject: RE: Using MLT feature
> 
> Thank you Markus it looks great.
> 
> But the wiki is not very detailed on this.
> Do you mean if I:
> 
> 1. Create:
> 
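>   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>     <bool name="enabled">true</bool>
>     <bool name="overwriteDupes">false</bool>
>     <str name="signatureField">signature</str>
>     <str name="fields">headline,body,medianame</str>
>     <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
>   </processor>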
> 
> 2. Add the request as the default update request
> 3. Add a "signature" indexed field to my schema.
> 
> Then,
> When adding a new doc to my index, it is only added of not considered a
> duplicate using a Lookup3Signature on the field defined? All duplicates
> are ignored and not added to my index?
> Is it so simple as that?
> 
> Does it works even if the medianame should be an exact match (not similar
> match as the headline and bodytext are)?
> 
> Thank you for your help,
> 
> ____
> Frederico Azeiteiro
> Developer
>  
> 
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: segunda-feira, 4 de Abril de 2011 10:48
> To: solr-user@lucene.apache.org
> Subject: Re: Using MLT feature
> 
> http://wiki.apache.org/solr/Deduplication
> 
> On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
> > Hi,
> > 
> > The ideia is don't index if something similar (headline+bodytext) for
> > the same exact medianame.
> > 
> > Do you mean I would need to index the doc first (maybe in a temp index)
> > and then use the MLT feature to find similar docs before adding to final
> > index?
> > 
> > Thanks,
> > Frederico
> > 
> > 
> > -Original Message-
> > From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
> > Sent: segunda-feira, 4 de Abril de 2011 10:22
> > To: solr-user@lucene.apache.org
> > Subject: Re: Using MLT feature
> > 
> > Do you want to not index if something similar? Or don't index if exact.
> > I would look into a hash code of the document if you don't want to index
> > exact.Similar though, I think has to be based off a document in the
> > index.
> > 
> > On Apr 4, 2011, at 5:16, Frederico Azeiteiro
> > 
> >  wrote:
> > > Hi,
> > > 
> > > 
> > > 
> > > I would like to hear your opinion about the MLT feature and if it's a
> > > good solution to what I need to implement.
> > > 
> > > 
> > > 
> > > My index has fields like: headline, body and medianame.
> > > 
> > > What I need to do is, before adding a new doc, verify if a similar doc
> > > exists for this media.
> > > 
> > > 
> > > 
> > > My idea is to use the MorelikeThisHandler
> > > (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:
> > > 
> > > For each new doc, perform a MLT search with q= medianame and
> > > stream.body=headline+bodytext.
> > > 
> > > If no similar docs are found than I can safely add the doc.
> > > 
> > > 
> > > 
> > > Is this feasible using the MLT handler? Is it a good approach? Are there
> > > a better way to perform this comparison?
> > > 
> > > 
> > > 
> > > Thank you for your help.
> > > 
> > > 
> > > 
> > > Best regards,
> > > 
> > > 
> > > 
> > > Frederico Azeiteiro


RE: Using MLT feature

2011-04-04 Thread Frederico Azeiteiro
Hi again,
I guess I was wrong in my earlier post... There's no automated way to avoid 
indexing the duplicate doc.

I guess I have 2 options: 

1. Create a temp index with signatures and then have an app that, for each new 
doc, verifies whether the sig exists in my primary index. 
If not, add the article.

2. Before adding the doc, create a signature (using the same algorithm that 
SOLR uses) in my indexing app and then verify whether the signature exists 
before adding.

Am I thinking the right way here? :)
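
Option 2 (compute the signature in the indexing app, look it up, and only then add) could be sketched roughly as below. This is only an illustration under assumptions: DedupGate, addIfNew and toySigner are hypothetical names, the in-memory map stands in for a lookup on the index's signature field, and the toy signer is not the algorithm SOLR uses. The point is the check-then-add flow, which keeps the first document and never overwrites it.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

class DedupGate {
    // Pluggable signature function; in practice this would mirror the server-side algorithm.
    private final Function<String, String> signer;
    // signature -> id of the first document stored under it (stand-in for the index).
    private final Map<String, String> bySignature = new HashMap<>();

    DedupGate(Function<String, String> signer) {
        this.signer = signer;
    }

    /** Adds the doc only if no doc with the same signature exists; never overwrites. */
    boolean addIfNew(String id, String medianame, String headline, String body) {
        String sig = signer.apply(medianame + "\n" + headline + "\n" + body);
        return bySignature.putIfAbsent(sig, id) == null;
    }

    /** Toy signer for the demo only: case-folds and collapses punctuation and whitespace. */
    static String toySigner(String text) {
        return text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{Nd}]+", " ").trim();
    }

    public static void main(String[] args) {
        DedupGate gate = new DedupGate(DedupGate::toySigner);
        System.out.println(gate.addIfNew("1", "Acme News", "Army deploys", "Some body text"));   // true
        System.out.println(gate.addIfNew("2", "Acme News", "Army deploys.", "Some body text"));  // false
        System.out.println(gate.addIfNew("3", "Other Media", "Army deploys", "Some body text")); // true
    }
}
```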

Thank you,
Frederico 
 


-Original Message-
From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com] 
Sent: segunda-feira, 4 de Abril de 2011 11:59
To: solr-user@lucene.apache.org
Subject: RE: Using MLT feature

Thank you Markus it looks great.

But the wiki is not very detailed on this. 
Do you mean if I:

1. Create:


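  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">signature</str>
    <str name="fields">headline,body,medianame</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>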

2. Add the request as the default update request 
3. Add a "signature" indexed field to my schema.

Then,
When adding a new doc to my index, it is only added of not considered a 
duplicate using a Lookup3Signature on the field defined?
All duplicates are ignored and not added to my index? 
Is it so simple as that?

Does it works even if the medianame should be an exact match (not similar match 
as the headline and bodytext are)?

Thank you for your help,


Frederico Azeiteiro
Developer
 


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: segunda-feira, 4 de Abril de 2011 10:48
To: solr-user@lucene.apache.org
Subject: Re: Using MLT feature

http://wiki.apache.org/solr/Deduplication

On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
> Hi,
> 
> The ideia is don't index if something similar (headline+bodytext) for
> the same exact medianame.
> 
> Do you mean I would need to index the doc first (maybe in a temp index)
> and then use the MLT feature to find similar docs before adding to final
> index?
> 
> Thanks,
> Frederico
> 
> 
> -Original Message-
> From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
> Sent: segunda-feira, 4 de Abril de 2011 10:22
> To: solr-user@lucene.apache.org
> Subject: Re: Using MLT feature
> 
> Do you want to not index if something similar? Or don't index if exact.
> I would look into a hash code of the document if you don't want to index
> exact.Similar though, I think has to be based off a document in the
> index.
> 
> On Apr 4, 2011, at 5:16, Frederico Azeiteiro
> 
>  wrote:
> > Hi,
> > 
> > 
> > 
> > I would like to hear your opinion about the MLT feature and if it's a
> > good solution to what I need to implement.
> > 
> > 
> > 
> > My index has fields like: headline, body and medianame.
> > 
> > What I need to do is, before adding a new doc, verify if a similar doc
> > exists for this media.
> > 
> > 
> > 
> > My idea is to use the MorelikeThisHandler
> > (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:
> > 
> > For each new doc, perform a MLT search with q= medianame and
> > stream.body=headline+bodytext.
> > 
> > If no similar docs are found than I can safely add the doc.
> > 
> > 
> > 
> > Is this feasible using the MLT handler? Is it a good approach? Are there
> > a better way to perform this comparison?
> > 
> > 
> > 
> > Thank you for your help.
> > 
> > 
> > 
> > Best regards,
> > 
> > 
> > 
> > Frederico Azeiteiro

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


RE: Using MLT feature

2011-04-04 Thread Frederico Azeiteiro
Thank you Markus, it looks great.

But the wiki is not very detailed on this. 
Do you mean if I:

1. Create:


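  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">signature</str>
    <str name="fields">headline,body,medianame</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>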

2. Add the request as the default update request 
3. Add a "signature" indexed field to my schema.

Then,
When adding a new doc to my index, is it only added if it is not considered a 
duplicate, using a Lookup3Signature on the fields defined?
Are all duplicates ignored and not added to my index? 
Is it as simple as that?

Does it work even if the medianame should be an exact match (not a similar 
match, as the headline and bodytext are)?

Thank you for your help,


Frederico Azeiteiro
Developer
 


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: segunda-feira, 4 de Abril de 2011 10:48
To: solr-user@lucene.apache.org
Subject: Re: Using MLT feature

http://wiki.apache.org/solr/Deduplication

On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
> Hi,
> 
> The ideia is don't index if something similar (headline+bodytext) for
> the same exact medianame.
> 
> Do you mean I would need to index the doc first (maybe in a temp index)
> and then use the MLT feature to find similar docs before adding to final
> index?
> 
> Thanks,
> Frederico
> 
> 
> -Original Message-
> From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
> Sent: segunda-feira, 4 de Abril de 2011 10:22
> To: solr-user@lucene.apache.org
> Subject: Re: Using MLT feature
> 
> Do you want to not index if something similar? Or don't index if exact.
> I would look into a hash code of the document if you don't want to index
> exact.Similar though, I think has to be based off a document in the
> index.
> 
> On Apr 4, 2011, at 5:16, Frederico Azeiteiro
> 
>  wrote:
> > Hi,
> > 
> > 
> > 
> > I would like to hear your opinion about the MLT feature and if it's a
> > good solution to what I need to implement.
> > 
> > 
> > 
> > My index has fields like: headline, body and medianame.
> > 
> > What I need to do is, before adding a new doc, verify if a similar doc
> > exists for this media.
> > 
> > 
> > 
> > My idea is to use the MorelikeThisHandler
> > (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:
> > 
> > For each new doc, perform a MLT search with q= medianame and
> > stream.body=headline+bodytext.
> > 
> > If no similar docs are found than I can safely add the doc.
> > 
> > 
> > 
> > Is this feasible using the MLT handler? Is it a good approach? Are there
> > a better way to perform this comparison?
> > 
> > 
> > 
> > Thank you for your help.
> > 
> > 
> > 
> > Best regards,
> > 
> > 
> > 
> > Frederico Azeiteiro

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Using MLT feature

2011-04-04 Thread Markus Jelsma
http://wiki.apache.org/solr/Deduplication

On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
> Hi,
> 
> The ideia is don't index if something similar (headline+bodytext) for
> the same exact medianame.
> 
> Do you mean I would need to index the doc first (maybe in a temp index)
> and then use the MLT feature to find similar docs before adding to final
> index?
> 
> Thanks,
> Frederico
> 
> 
> -Original Message-
> From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
> Sent: segunda-feira, 4 de Abril de 2011 10:22
> To: solr-user@lucene.apache.org
> Subject: Re: Using MLT feature
> 
> Do you want to not index if something similar? Or don't index if exact.
> I would look into a hash code of the document if you don't want to index
> exact.Similar though, I think has to be based off a document in the
> index.
> 
> On Apr 4, 2011, at 5:16, Frederico Azeiteiro
> 
>  wrote:
> > Hi,
> > 
> > 
> > 
> > I would like to hear your opinion about the MLT feature and if it's a
> > good solution to what I need to implement.
> > 
> > 
> > 
> > My index has fields like: headline, body and medianame.
> > 
> > What I need to do is, before adding a new doc, verify if a similar doc
> > exists for this media.
> > 
> > 
> > 
> > My idea is to use the MorelikeThisHandler
> > (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:
> > 
> > For each new doc, perform a MLT search with q= medianame and
> > stream.body=headline+bodytext.
> > 
> > If no similar docs are found than I can safely add the doc.
> > 
> > 
> > 
> > Is this feasible using the MLT handler? Is it a good approach? Are there
> > a better way to perform this comparison?
> > 
> > 
> > 
> > Thank you for your help.
> > 
> > 
> > 
> > Best regards,
> > 
> > 
> > 
> > Frederico Azeiteiro

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


RE: Using MLT feature

2011-04-04 Thread Frederico Azeiteiro
Hi,

The idea is to skip indexing if something similar (headline+bodytext) already
exists for the exact same medianame.

Do you mean I would need to index the doc first (maybe in a temp index)
and then use the MLT feature to find similar docs before adding to final
index?

Thanks,
Frederico


-Original Message-
From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com] 
Sent: segunda-feira, 4 de Abril de 2011 10:22
To: solr-user@lucene.apache.org
Subject: Re: Using MLT feature

Do you want to not index if something similar? Or don't index if exact.
I would look into a hash code of the document if you don't want to index
exact.Similar though, I think has to be based off a document in the
index.   

On Apr 4, 2011, at 5:16, Frederico Azeiteiro
 wrote:

> Hi,
> 
> 
> 
> I would like to hear your opinion about the MLT feature and if it's a
> good solution to what I need to implement.
> 
> 
> 
> My index has fields like: headline, body and medianame.
> 
> What I need to do is, before adding a new doc, verify if a similar doc
> exists for this media.
> 
> 
> 
> My idea is to use the MorelikeThisHandler
> (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:
> 
> 
> 
> For each new doc, perform a MLT search with q= medianame and
> stream.body=headline+bodytext.
> 
> If no similar docs are found than I can safely add the doc.
> 
> 
> 
> Is this feasible using the MLT handler? Is it a good approach? Are there
> a better way to perform this comparison?
> 
> 
> 
> Thank you for your help.
> 
> 
> 
> Best regards,
> 
> 
> 
> Frederico Azeiteiro
> 
> 
> 


Re: Using MLT feature

2011-04-04 Thread Chris Fauerbach
Do you want to skip indexing when something similar exists, or only when an 
exact duplicate exists? For exact duplicates, I would look into a hash code of 
the document. Similarity matching, though, I think has to be based on a 
document already in the index.
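
For the exact-duplicate case, a hash over the relevant fields might look like the sketch below. The class name ExactHash and the choice of SHA-256 with a zero byte between fields are assumptions for illustration, not anything specific to SOLR. Any single character of difference, such as a trailing period, yields a different hash, which is why this only helps with exact matches.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ExactHash {

    /** Hex SHA-256 over the given fields, in order. */
    static String of(String... fields) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (String f : fields) {
                md.update(f.getBytes(StandardCharsets.UTF_8));
                // Field separator, so ("ab", "c") and ("a", "bc") hash differently.
                md.update((byte) 0);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        String h1 = of("Acme News", "Army deploys", "Some body text");
        String h2 = of("Acme News", "Army deploys", "Some body text");
        String h3 = of("Acme News", "Army deploys.", "Some body text");
        System.out.println(h1.equals(h2)); // true: identical fields
        System.out.println(h1.equals(h3)); // false: one extra character
    }
}
```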

On Apr 4, 2011, at 5:16, Frederico Azeiteiro  
wrote:

> Hi,
> 
> 
> 
> I would like to hear your opinion about the MLT feature and if it's a
> good solution to what I need to implement.
> 
> 
> 
> My index has fields like: headline, body and medianame.
> 
> What I need to do is, before adding a new doc, verify if a similar doc
> exists for this media.
> 
> 
> 
> My idea is to use the MorelikeThisHandler
> (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:
> 
> 
> 
> For each new doc, perform a MLT search with q= medianame and
> stream.body=headline+bodytext.
> 
> If no similar docs are found than I can safely add the doc.
> 
> 
> 
> Is this feasible using the MLT handler? Is it a good approach? Are there
> a better way to perform this comparison?
> 
> 
> 
> Thank you for your help.
> 
> 
> 
> Best regards,
> 
> 
> 
> Frederico Azeiteiro
> 
> 
>