Re: Get distinct results in Solr

Upayavira Tue, 01 Sep 2015 06:16:13 -0700

Have you tried with a completely clean index? Are you deduping, or just
calculating the signature? Is it possible dedup is preventing your
documents from indexing (because it thinks they are dups)?


On Tue, Sep 1, 2015, at 09:46 AM, Zheng Lin Edwin Yeo wrote:
> Hi Upayavira,
> 
> I've tried changing <str name="signatureField">id</str> to <str
> name="signatureField">signature</str>, but still nothing is indexed into
> Solr. Is that what you mean?
> 
> Besides that, I've also added a copyField to copy the content field into
> the signature field. With or without the copyField, nothing is indexed
> into Solr.
> 
> Regards,
> Edwin
> 
> 
> On 1 September 2015 at 15:48, Upayavira <u...@odoko.co.uk> wrote:
> 
> > You are attempting to write your signature to your ID field. That's not
> > a good idea. You are generating your signature from the content field,
> > which seems okay. Change your <str name="signatureField">id</str> to be
> > your 'signature' field instead of id, and something different will
> > happen :-)
> >
> > Upayavira
> >
> > On Tue, Sep 1, 2015, at 04:34 AM, Zheng Lin Edwin Yeo wrote:
> > > I tried to follow the de-duplication guide, but after I configured it in
> > > solrconfig.xml and schema.xml, nothing is indexed into Solr, and there is
> > > no error message. I'm using SimplePostTool to index rich-text documents.
> > >
> > > Below are my configurations:
> > >
> > > In solrconfig.xml
> > >
> > >   <requestHandler name="/update" class="solr.UpdateRequestHandler">
> > >     <lst name="defaults">
> > >       <str name="update.chain">dedupe</str>
> > >     </lst>
> > >   </requestHandler>
> > >
> > >   <updateRequestProcessorChain name="dedupe">
> > >     <processor class="solr.processor.SignatureUpdateProcessorFactory">
> > >       <bool name="enabled">true</bool>
> > >       <str name="signatureField">id</str>
> > >       <bool name="overwriteDupes">false</bool>
> > >       <str name="fields">content</str>
> > >       <str name="signatureClass">solr.processor.Lookup3Signature</str>
> > >     </processor>
> > >   </updateRequestProcessorChain>
> > >
> > >
> > > In schema.xml
> > >
> > >  <field name="signature" type="string" stored="true" indexed="true"
> > > multiValued="false" />
> > >
> > >
> > > Is there anything which I might have missed out or done wrongly?
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > > On 1 September 2015 at 10:46, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > > wrote:
> > >
> > > > Thank you for your advice Alexandre.
> > > >
> > > > Will try out the de-duplication from the link you gave.
> > > >
> > > > Regards,
> > > > Edwin
> > > >
> > > >
> > > > On 1 September 2015 at 10:34, Alexandre Rafalovitch <arafa...@gmail.com>
> > > > wrote:
> > > >
> > > >> Re-read the question. You want to de-dupe on the full text-content.
> > > >>
> > > >> I would actually try to use the dedupe chain as per the link I gave
> > > >> but put results into a separate string field. Then, you group on that
> > > >> field. You cannot actually group on the long text field, that would
> > > >> kill any performance. So a signature is your proxy.
> > > >>
> > > >> Regards,
> > > >>    Alex
> > > >> ----
> > > >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > > >> http://www.solr-start.com/
> > > >>
> > > >>
> > > >> On 31 August 2015 at 22:26, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > > >> wrote:
> > > >> > Hi Alexandre,
> > > >> >
> > > >> > Will treating it as String affect the search or other functions like
> > > >> > highlighting?
> > > >> >
> > > >> > Yes, the content must be in my index, unless I use a copyField and do
> > > >> > de-duplication on that field. Will that help?
> > > >> >
> > > >> > Regards,
> > > >> > Edwin
> > > >> >
> > > >> >
> > > >> > On 1 September 2015 at 10:04, Alexandre Rafalovitch <arafa...@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> >> Can't you just treat it as String?
> > > >> >>
> > > >> >> Also, do you actually want those documents in your index in the
> > > >> >> first place? If not, have you looked at De-duplication:
> > > >> >> https://cwiki.apache.org/confluence/display/solr/De-Duplication
> > > >> >>
> > > >> >> Regards,
> > > >> >>    Alex.
> > > >> >> ----
> > > >> >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > > >> >> http://www.solr-start.com/
> > > >> >>
> > > >> >>
> > > >> >> On 31 August 2015 at 22:00, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > > >> >> wrote:
> > > >> >> > Thanks Jan.
> > > >> >> >
> > > >> >> > But I read that the field being collapsed on must be a
> > > >> >> > single-valued String, Int or Float. As I need to get distinct
> > > >> >> > results from the "content" field, which was indexed from a rich
> > > >> >> > text document, I got the following error:
> > > >> >> >
> > > >> >> >   "error":{
> > > >> >> >     "msg":"java.io.IOException: 64 bit numeric collapse fields are not supported",
> > > >> >> >     "trace":"java.lang.RuntimeException: java.io.IOException: 64 bit numeric collapse fields are not supported\r\n\tat
> > > >> >> >
> > > >> >> >
> > > >> >> > Is it possible to collapse on a field that holds a long body of
> > > >> >> > text, like the content of a rich text document?
> > > >> >> >
> > > >> >> > Regards,
> > > >> >> > Edwin
> > > >> >> >
> > > >> >> >
> > > >> >> > On 31 August 2015 at 18:59, Jan Høydahl <jan....@cominvent.com> wrote:
> > > >> >> >
> > > >> >> >> Hi
> > > >> >> >>
> > > >> >> >> Check out the CollapsingQParser (
> > > >> >> >> https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
> > > >> >> >> ). As long as you have a field that will be the same for all
> > > >> >> >> duplicates, you can “collapse” on that field. If you do not have a
> > > >> >> >> “group id”, you can create one using e.g. an MD5 signature of the
> > > >> >> >> identical body text (
> > > >> >> >> https://cwiki.apache.org/confluence/display/solr/De-Duplication ).
> > > >> >> >>
> > > >> >> >> --
> > > >> >> >> Jan Høydahl, search solution architect
> > > >> >> >> Cominvent AS - www.cominvent.com
> > > >> >> >>
> > > >> >> >> > On 31 Aug 2015, at 12:03, Zheng Lin Edwin Yeo
> > > >> >> >> > <edwinye...@gmail.com> wrote:
> > > >> >> >> >
> > > >> >> >> > Hi,
> > > >> >> >> >
> > > >> >> >> > I'm using Solr 5.2.1, and I would like to find out the best way
> > > >> >> >> > to get Solr to return only distinct results.
> > > >> >> >> >
> > > >> >> >> > Currently, I've indexed several identical documents into Solr,
> > > >> >> >> > differing only in id and title; the content is exactly the same.
> > > >> >> >> > When I do a search, Solr returns all of these documents in the
> > > >> >> >> > result list.
> > > >> >> >> >
> > > >> >> >> > What is the most suitable way to get Solr to return only one of
> > > >> >> >> > these documents during the search?
> > > >> >> >> > I understand that there is result grouping and faceting, but I'm
> > > >> >> >> > not sure if that is the best way.
> > > >> >> >> >
> > > >> >> >> > Regards,
> > > >> >> >> > Edwin
> > > >> >> >>
> > > >> >> >>
> > > >> >>
> > > >>
> > > >
> > > >
> >
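
Editorial note on the configuration quoted above: the posted "dedupe" chain ends after the signature processor. The example chain in the De-Duplication guide also lists solr.LogUpdateProcessorFactory and solr.RunUpdateProcessorFactory after it, and a custom chain without RunUpdateProcessorFactory does not pass documents on to the index. Whether that is what caused the empty index here is an assumption; the thread does not confirm it. A minimal sketch of the chain, reusing the field names from the thread and writing the hash to the separate "signature" field as Upayavira suggested, might look like this:

```xml
<!-- solrconfig.xml: sketch of a chain that only records a signature -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- write the hash to its own field, not to the uniqueKey "id" -->
    <str name="signatureField">signature</str>
    <!-- false: keep every document and only store the signature -->
    <bool name="overwriteDupes">false</bool>
    <!-- the signature is computed from the "content" field -->
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <!-- Assumption: these two processors were missing from the posted chain -->
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

If the signature field is populated after reindexing, duplicates could then be folded at query time with fq={!collapse field=signature}, or with result grouping via group=true&group.field=signature, as Jan and Alexandre suggested. With overwriteDupes set to false, all documents stay in the index and only the search results are collapsed.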
