Custom Handler support in Solr-ruby

2011-06-28 Thread Pranav Prakash
Hi,

I found the solr-ruby gem (http://wiki.apache.org/solr/solr-ruby) really
inflexible in terms of specifying the request handler. The
Solr::Request::Select class defines the handler as 'select' and all other
classes inherit from it. And since the methods in Solr::Connection use one
of the classes from Solr::Request, I don't see a direct way to use a custom
handler (which I have made for MoreLikeThis). Currently, the approach I am
using is to create the query URL, do a cURL request, parse the response and
return it.

Even if I were to extend the classes, I'd end up making a new
Solr::Request::CustomSelect, which would be similar to Solr::Request::Select
except for letting the user provide a handler, defaulting to 'select'. Then
I'd create different classes for DisMax and the rest, each derived from
Solr::Request::CustomSelect. Isn't this too much overhead? Or am I missing
something?

Also, where can I file bugs against solr-ruby?


*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


Include synonyms in Solr

2011-06-28 Thread Romi
Hi, I am using Solr for my searches. In it I found a synonyms.txt file in
which you can manually include synonyms for the words you want.

But I suppose it would be very hard to include synonyms manually for each
word, as my application has a lot of data.

I want to know: is there any way to generate this synonyms.txt file
automatically, covering all dictionary words?

-
Thanks & Regards
Romi
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Include-synonys-in-solr-tp3116836p3116836.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Include synonyms in Solr

2011-06-28 Thread Gora Mohanty
On Tue, Jun 28, 2011 at 12:54 PM, Romi <romijain3...@gmail.com> wrote:
 Hi, I am using Solr for my searches. In it I found a synonyms.txt file in
 which you can manually include synonyms for the words you want.

Please see 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

No offence, but a simple Google search, or a search of the Wiki
would have turned this up. Please try such simpler avenues before
dashing off a message to the list.

Regards,
Gora


Re: Include synonyms in Solr

2011-06-28 Thread Michael Kuhlmann
On 28.06.2011 09:24, Romi wrote:
 But I suppose it would be very hard to include synonyms manually for each
 word, as my application has a lot of data.

 I want to know: is there any way to generate this synonyms.txt file
 automatically, covering all dictionary words?

I don't get the point here. Why would you want to add all dictionary
words to the synonyms? What should they translate to? Just having all
words in synonyms.txt doesn't make much sense.

If you're asking about some kind of translation into another language:
In that case, you'd rather translate the text at index time and put it
into another field which you query as well.

In my last project, we had multi-valued fields like meta_description
and misspelled, where you could add arbitrary synonyms for each
document - maybe that's what you're asking for?

-Kuli


Re: Analyzer creates PhraseQuery

2011-06-28 Thread lboutros
You could add this filter after the NGram filter to prevent the phrase query
creation:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory
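For example, a query-side analyzer chain could look like this (a sketch; the
tokenizer and gram sizes are just placeholders, only the PositionFilter at
the end matters here):

  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
    <!-- flattens all token positions to 0, so no phrase query is built -->
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>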

Ludovic.

-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Analyzer-creates-PhraseQuery-tp3116288p3116885.html
Sent from the Solr - User mailing list archive at Nabble.com.


Find results with or without whitespace

2011-06-28 Thread Frankie
I'm looking for a way to index/search terms that may or may not contain
spaces.
An example will explain better:
- Looking for "healthcare", I want to find both "healthcare" and "health
care".
- Looking for "health care", I want to find both "health care" and
"healthcare".

My other constraints are:
- I will index rather long strings (extracted from Office documents)
- I want to avoid synonym lists (as they may be incomplete)
- I want to avoid specific logic (i.e. query rewriting with as many ORs as
the search term combinations require)
- I don't want to rely on an uppercase/lowercase tokenizer (as users are...
creative)

I have already tried many tokenizer/filter combinations without success,
and I did not find any answer to this problem.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Find-results-with-or-without-whitespace-tp3117144p3117144.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: multiple spatial values

2011-06-28 Thread marthinal

Yonik Seeley-2-2 wrote:
 
 On Sat, Jun 25, 2011 at 5:56 AM, marthinal
 <jm.rodriguez.ve...@gmail.com> wrote:
 sfield, pt and d can all be specified directly in the spatial
 functions/filters too, and that will override the global params.

 Unfortunately one must currently use lucene query syntax to do an OR.
 It just makes it look a bit messier.

 q=_query_:"{!geofilt}" _query_:"{!geofilt sfield=location_2}"

 -Yonik
 http://www.lucidimagination.com


 @Yonik it seems to work like this; I tried hundreds of other possibilities
 without success:

 q={!geofilt sfield=location_1 pt=36.62,-6.23 d=50}&fq={!geofilt
 sfield=location_2 pt=40.51,-5.91 d=500}
 
 Ah, right.  I had thought you wanted docs that matched either geofilt
 (hence OR), not docs that matched both.
 
 -Yonik
 http://www.lucidimagination.com
 

Yes Yonik, what I do now is:

q={!geofilt sfield=location_1 pt=36.62,-6.23 d=50}&fq=_query_:"{!geofilt
sfield=location_2 pt=40.51,-5.91 d=500}" other_filter:value ..

I write the query here because maybe it helps someone who needs to do
something like this ...



--
View this message in context: 
http://lucene.472066.n3.nabble.com/multiple-spatial-values-tp1555668p3117145.html
Sent from the Solr - User mailing list archive at Nabble.com.


Saravanan Chinnadurai/Actionimages is out of the office.

2011-06-28 Thread Saravanan . Chinnadurai
I will be out of the office starting  28/06/2011 and will not return until
30/06/2011.

Please email itsta...@actionimages.com for any urgent issues.


Action Images is a division of Reuters Limited and your data will therefore
be protected in accordance with the Reuters Group Privacy / Data Protection
notice which is available in the privacy footer at www.reuters.com
Registered in England No. 145516   VAT REG: 397000555


Index Version and Epoch Time?

2011-06-28 Thread Pranav Prakash
Hi,

I am not sure what the index version value is. It looks like an epoch time,
but in my case it points to one month back. However, I can see documents
which were added last week in the index.

Even after I did a commit, the index version did not change. Isn't it
supposed to change on every commit? If not, is there a way to look up the
last index time?

Also, this page
http://wiki.apache.org/solr/SolrReplication#Replication_Dashboard shows a
Replication Dashboard. How is this dashboard invoked? Is there any URL which
needs to be called?


*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


Re: Include synonyms in Solr

2011-06-28 Thread Romi
Please see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

No offence, but a simple Google search, or a search of the Wiki
would have turned this up. Please try such simpler avenues before
dashing off a message to the list.


Gora, I have already read the document and have also included synonyms in my
search results :)

My question is: when I use this filter,
<filter class="solr.SynonymFilterFactory" synonyms="syn.txt"
        ignoreCase="true" expand="false"/>,
I need to enter synonyms manually in synonyms.txt, which is really tough
if you have many words with synonyms. I wanted to ask: is there any other
option, so that I need not enter synonyms manually? I hope you got my
point :)
 

-
Thanks & Regards
Romi
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Include-synonyms-in-solr-tp3116836p3117365.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Include synonyms in Solr

2011-06-28 Thread Romi
I don't want to add all dictionary words to my synonyms.txt, but I want to
include synonyms for the words I have in my data. As you can imagine, if I
have, say, 1000 words, it would be very tough to enter synonyms for those
1000 words in synonyms.txt manually. I just want to know how I can solve
this puzzle so that I need not enter synonyms manually.

For example, for GB I am entering gigabyte,
and for ring I am entering the synonyms band and circle.
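In synonyms.txt terms, what I am entering by hand looks like this (lines
with => map the terms on the left to the terms on the right):

  GB => GB, gigabyte
  ring => ring, band, circle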


-
Thanks & Regards
Romi
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Include-synonyms-in-solr-tp3116836p3117373.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Include synonyms in Solr

2011-06-28 Thread François Schiettecatte
Well, you need to find word lists and/or a thesaurus.

This is one place to start:

http://wordlist.sourceforge.net/

I used the US/UK English word list for the synonyms in an index I have
because it contains both US and UK English terms; the list lacks some
medical terms, though, so we just added them.

Cheers

François

On Jun 28, 2011, at 6:55 AM, Romi wrote:

 Gora, I have already read the document and have also included synonyms in
 my search results :)

 My question is: when I use this filter I need to enter synonyms manually
 in synonyms.txt, which is really tough if you have many words with
 synonyms. Is there any other option, so that I need not enter synonyms
 manually? [...]



Re: Find results with or without whitespace

2011-06-28 Thread roySolr
I had the same problem:

http://lucene.472066.n3.nabble.com/Results-with-and-without-whitespace-soccer-club-and-soccerclub-td2934742.html#a2964942
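The gist of the approach that worked there (a sketch from memory; check the
thread for the exact parameters, and note that the tokenSeparator parameter
depends on your Solr version): add shingles at index and query time with an
empty separator, so "health care" also produces the single token
"healthcare", and either written form matches the other:

  <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
          outputUnigrams="true" tokenSeparator=""/>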



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Find-results-with-or-without-whitespace-tp3117144p3117386.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Removing duplicate documents from search results

2011-06-28 Thread Mohammad Shariq
I also have the problem of duplicate docs.
I am indexing news articles; every news article has a source URL.
If two news articles have the same URL, only one needs to be indexed:
removal of duplicates at index time.



On 23 June 2011 21:24, simon <mtnes...@gmail.com> wrote:

 have you checked out the deduplication process that's available at
 indexing time? This includes a fuzzy hash algorithm.

 http://wiki.apache.org/solr/Deduplication

 -Simon

 On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash <pra...@gmail.com> wrote:
  This approach would definitely work if the two documents are *exactly*
  the same. But this is very fragile: even if one extra space has been
  added, the whole hash would change. What I am really looking for is some
  percentage similarity between documents, removing those documents which
  are more than 95% similar.
 
  *Pranav Prakash*
 
  temet nosce
 
  Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
  Google http://www.google.com/profiles/pranny
 
 
  On Thu, Jun 23, 2011 at 15:16, Omri Cohen <o...@yotpo.com> wrote:

   What you need to do is calculate some hash (using any message digest
   algorithm you want: MD5, SHA-1 and so on), then do some reading on
   Solr's field collapsing capabilities. It should not be too complicated.
 
  *Omri Cohen*
 
 
 
  Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 |
 +972-3-6036295
 
 
 
 
  My profiles: [image: LinkedIn] http://www.linkedin.com/in/omric
 [image:
  Twitter] http://www.twitter.com/omricohe [image:
  WordPress]http://omricohen.me
   Please consider your environmental responsibility. Before printing this
  e-mail message, ask yourself whether you really need a hard copy.
  IMPORTANT: The contents of this email and any attachments are
 confidential.
  They are intended for the named recipient(s) only. If you have received
  this
  email by mistake, please notify the sender immediately and do not
 disclose
  the contents to anyone or make copies thereof.
  Signature powered by
  
 
 http://www.wisestamp.com/email-install?utm_source=extensionutm_medium=emailutm_campaign=footer
  
  WiseStamp
 
 http://www.wisestamp.com/email-install?utm_source=extensionutm_medium=emailutm_campaign=footer
  
 
 
 
  -- Forwarded message --
  From: Pranav Prakash <pra...@gmail.com>
  Date: Thu, Jun 23, 2011 at 12:26 PM
  Subject: Removing duplicate documents from search results
  To: solr-user@lucene.apache.org
 
 
  How can I remove very similar documents from search results?

  My scenario is that there are documents in the index which are almost
  similar (people submitting the same stuff multiple times, sometimes
  different people submitting the same stuff). Now when a search is
  performed for a keyword, in the top N results, quite frequently, the
  same document comes up multiple times. I want to remove those duplicate
  (or possibly duplicate) documents, very similar to what Google does when
  they say "In order to show you the most relevant results, duplicates
  have been removed." How can I achieve this functionality using Solr?
  Does Solr have a built-in feature or plugin which could help me with it?
 
 
  *Pranav Prakash*
 
  temet nosce
 
  Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
  Google http://www.google.com/profiles/pranny
 
 




-- 
Thanks and Regards
Mohammad Shariq


Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte
Create a hash from the URL and use that as the unique key; MD5 or SHA-1
would probably be good enough.

Cheers

François

On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:

 I also have the problem of duplicate docs.
 I am indexing news articles; every news article has a source URL.
 If two news articles have the same URL, only one needs to be indexed:
 removal of duplicates at index time. [...]



Re: multiple spatial values

2011-06-28 Thread Darren Govoni
Will it be possible to do spatial searches on multi-valued spatial
fields soon?

I have a latlon (point) field that is multi-valued and don't know how to
search against it such that the lats and lons match up correctly, since
they are split apart. E.g. I have a document with 10 point/latlon values
for the same field.

On 06/28/2011 05:15 AM, marthinal wrote:

[...]




Re: Removing duplicate documents from search results

2011-06-28 Thread Mohammad Shariq
I am making the hash from the URL, but I can't use it as the uniqueKey
because I am using a UUID as the uniqueKey.
Since I am using Solr as an index engine only, with Riak (key-value
storage) as the storage engine, I don't want to overwrite on duplicates;
I just need to discard them.



2011/6/28 François Schiettecatte <fschietteca...@gmail.com>

 Create a hash from the URL and use that as the unique key; MD5 or SHA-1
 would probably be good enough.

 Cheers

 François

 [...]



-- 
Thanks and Regards
Mohammad Shariq


Re: Default schema - 'keywords' not multivalued

2011-06-28 Thread Tod

On 06/27/2011 11:23 AM, lee carroll wrote:

Hi Tod,
A list of keywords would be fine in a non-multi-valued field:

keywords: xxx yyy sss aaa

A multi-valued field would allow you to repeat the field when indexing:

keywords: xxx
keywords: yyy
keywords: sss
etc.



Thanks Lee. The problem is I'm manually pushing a document (via
stream.url) and its metadata from a database to the Solr
/update/extract REST service, via HTTP GET, using Perl.

I'm streaming over the document content (presumably via Tika) and it's
gathering the document's metadata, which includes the keywords metadata
field. Since I'm also passing that field from the DB to the REST call
as a list (as you suggested), there is a collision, because the keywords
field is single-valued.

I can change this behavior using a copy field. What I wanted to know is
if there was a specific reason the default schema defined a field like
keywords as single-valued, so I could make sure I wasn't missing something
before I changed things.
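For context, the call I'm making looks roughly like this (host, core and
values simplified, URL-encoding omitted; the repeated literal.keywords
parameter is what collides with the single-valued field):

  http://localhost:8983/solr/update/extract?stream.url=http://docs.example.com/report.doc
      &literal.id=doc42&literal.keywords=finance&literal.keywords=quarterly&commit=true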


While I'm at it, I'd REALLY like to know how to use DIH to index the
metadata from the database while simultaneously streaming over the
document content and indexing it. I've never quite figured it out, but I
have to believe it is possible.



- Tod


Re: Find results with or without whitespace

2011-06-28 Thread Frankie
Thank you for your answer.

I agree, I can manage predictable values through synonyms.

However, most data in this index are company and product names, sometimes
leading to rather strange syntax (a mix of upper/lower case, misplaced
dashes or spaces). One purpose of using Solr was to help find potential
duplicates before data insertion.

On the other hand, I could write a custom tokenizer/filter and a custom
query builder that would test many combinations. I have the feeling,
however, that it is an inefficient approach.
That is...
Indexing: chelsea soccer club =>
chelsea, soccer, club, chelseasoccer, soccerclub, chelseasoccerclub
Searching: chelsea soccerclub => (chelsea AND soccerclub) OR
chelseasoccerclub
While search expressions are generally short, indexing will be a
nightmare...


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Find-results-with-or-without-whitespace-tp3117144p3117581.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte
Maybe there is a way to get Solr to reject documents that already exist in
the index, but I doubt it; maybe someone else can chime in here. You could
do a search for each document prior to indexing it to see if it is already
in the index, but that is probably non-optimal. Maybe it is easiest to
check whether the document exists in your Riak repository: if not, add it
and index it, and drop it if it already exists.

François

On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote:

 I am making the hash from the URL, but I can't use it as the uniqueKey
 because I am using a UUID as the uniqueKey. I just need to discard the
 duplicates. [...]



Re: Removing duplicate documents from search results

2011-06-28 Thread Pranav Prakash
I found the deduplication thing really useful, although I have not yet
started to work on it, as there are some other low-hanging fruits I have
to capture first. I will share my thoughts soon.


*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


2011/6/28 François Schiettecatte <fschietteca...@gmail.com>

 Maybe there is a way to get Solr to reject documents that already exist
 in the index, but I doubt it; maybe someone else can chime in here. [...]




Re: Analyzer creates PhraseQuery

2011-06-28 Thread Koji Sekiguchi

(11/06/28 16:40), lboutros wrote:

You could add this filter after the NGram filter to prevent the phrase query
creation:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory

Ludovic.


There is an option to avoid producing phrase queries:
autoGeneratePhraseQueries="false".
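It is an attribute on the field type (available on 3.1+/trunk); a sketch,
with a made-up type name:

  <fieldType name="text_ngram" class="solr.TextField"
             autoGeneratePhraseQueries="false">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
    </analyzer>
  </fieldType>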

koji
--
http://www.rondhuit.com/en/


Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte
Indeed, take a look at this:

http://wiki.apache.org/solr/Deduplication

I have not used it but it looks like it will do the trick.
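From the wiki, the configuration boils down to an update chain like this
(untested; I have adapted it to hash a url field, so signatureField and
fields here are assumptions, and the chain still has to be attached to your
update handler -- TextProfileSignature is the fuzzy variant simon mentioned):

  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signature</str>
      <bool name="overwriteDupes">true</bool>
      <str name="fields">url</str>
      <str name="signatureClass">solr.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>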

François

On Jun 28, 2011, at 8:44 AM, Pranav Prakash wrote:

 I found the deduplication thing really useful, although I have not yet
 started to work on it, as there are some other low-hanging fruits I have
 to capture first. [...]

Re: Removing duplicate documents from search results

2011-06-28 Thread Mohammad Shariq
Hey François,
thanks for your suggestion. I followed the same link
(http://wiki.apache.org/solr/Deduplication).

They have two solutions: either make the hash the uniqueKey, OR overwrite
on duplicates. I don't need either.

I need discard on duplicates.



 I have not used it but it looks like it will do the trick. [...]

Re: Include synonyms in Solr

2011-06-28 Thread Romi
Thanks François Schiettecatte, the information you provided is very
helpful. I need to know one more thing: I downloaded one of the given
dictionaries, but it contains many files. Do I need to add all of these
files' data into synonyms.txt?

-
Thanks & Regards
Romi
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Include-synonyms-in-solr-tp3116836p3117733.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Removing duplicate documents from search results

2011-06-28 Thread Paul Libbrecht
Mohammad,

just in case you meant it, I would like to discourage you from trying to
deduplicate *the search result*.
There are many things that go wrong if you do that; we had it in one version
of the ActiveMath search environment (which uses Lucene):
- paging is inappropriate
- the total count is wrong unless you go through all the results
- performance can get really bad if you try to go through all the results
- performance does get bad for some search results if you try to fill the
page (you need to fetch until you find enough)
- you go through all search results again and again when delivering the
next ones

So, as others have suggested, please be sure to deduplicate somehow at
indexing time.

paul

On 28 June 2011 at 14:24, Mohammad Shariq wrote:

 I am making the hash from the URL, but I can't use it as the uniqueKey
 because I am using a UUID as the uniqueKey. I just need to discard the
 duplicates. [...]



Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte
Yeah, I read the overview, which suggests that duplicates can be prevented
from entering the index, and scanned the rest; it does not look like you can
actually drop the incoming document entirely. Maybe I am missing something
here.

François

On Jun 28, 2011, at 9:14 AM, Mohammad Shariq wrote:

 Hey François,
 thanks for your suggestion. I followed the same link
 (http://wiki.apache.org/solr/Deduplication). They have two solutions:
 either make the hash the uniqueKey, OR overwrite on duplicates. I don't
 need either. I need discard on duplicates. [...]

Re: Include synonyms in Solr

2011-06-28 Thread François Schiettecatte
Well no, you need to see which files (if any) will suit your needs; they are
not all synonym files. I only needed the UK/US English file, and I needed to
process it into a format suitable for the synonyms file.

There may well be other word lists on the net suitable for your needs. I
would not recommend the use of synonyms unless you have a specific need for
them. I needed them because we have documents which mix UK/US English, and
we need to be able to search on medical terms, e.g. hemoglobin/haemoglobin,
and get the same results.

Cheers 

François

On Jun 28, 2011, at 9:21 AM, Romi wrote:

 Thanks François Schiettecatte, the information you provided is very helpful.
 I need to know one more thing: I downloaded one of the given dictionaries,
 but it contains many files. Do I need to add all of these files' data into
 synonyms.txt?
 
 -
 Thanks  Regards
 Romi
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Include-synonyms-in-solr-tp3116836p3117733.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: multiple spatial values

2011-06-28 Thread Smiley, David W.
It is precisely this limitation which triggered me to develop a grid indexing 
approach using Geohashes: https://issues.apache.org/jira/browse/SOLR-2155
This patch requires a Solr trunk release.

If you have a small number of distinct points in total, and you only need 
filtering, then the geohash field in Solr 3.1 may be fast enough for you.

~ David Smiley

On Jun 28, 2011, at 7:53 AM, Darren Govoni wrote:

 Will it be possible to do spatial searches on multi-valued spatial 
 fields soon?
 
 I have a latlon field (point) that is multi-valued and don't know how to 
 search against it
 such that the lats and lons match correctly - since they are split apart.
 
 e.g. I have a document with 10 point/latlon values for the same field.
 
 On 06/28/2011 05:15 AM, marthinal wrote:
 Yonik Seeley-2-2 wrote:
 On Sat, Jun 25, 2011 at 5:56 AM, marthinal
 jm.rodriguez.ve...@gmail.com wrote:
 sfield, pt and d can all be specified directly in the spatial
 functions/filters too, and that will override the global params.
 
 Unfortunately one must currently use lucene query syntax to do an OR.
 It just makes it look a bit messier.
 
 q=_query_:"{!geofilt}" _query_:"{!geofilt sfield=location_2}"
 
 -Yonik
 http://www.lucidimagination.com
 
 @Yonik it seems to work like this; I tried hundreds of other possibilities
 without success:
 
 q={!geofilt sfield=location_1 pt=36.62,-6.23 d=50}&fq={!geofilt
 sfield=location_2 pt=40.51,-5.91 d=500}
 Ah, right.  I had thought you wanted docs that matched either geofilt
 (hence OR), not docs that only matched both.
 
 -Yonik
 http://www.lucidimagination.com
 
 Yes Yonik, what I do now is
 
 q={!geofilt sfield=location_1 pt=36.62,-6.23 d=50}&fq=_query_:"{!geofilt
 sfield=location_2 pt=40.51,-5.91 d=500}" other_filter:value ..
 
 I write the query here because maybe it *helps* someone who needs to do
 something like this ...
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/multiple-spatial-values-tp1555668p3117145.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 



Re: Index Version and Epoch Time?

2011-06-28 Thread Shalin Shekhar Mangar
On Tue, Jun 28, 2011 at 4:18 PM, Pranav Prakash pra...@gmail.com wrote:


 I am not sure what the index version value is. It looks like an epoch time,
 but in my case it points to one month back. However, I can see documents
 which were added last week in the index.


The index version shown on the dashboard is the time at which the most
recent index segment was created. I'm not sure why it has a value older than
a month if a commit has happened after that time.


 Even after I did a commit, the index number did not change? Isn't it
 supposed to change on every commit? If not, is there a way to look into the
 last index time?


Yeah, it changes after every commit which added/deleted a document.


 Also, this page
 http://wiki.apache.org/solr/SolrReplication#Replication_Dashboard shows a
 Replication Dashboard. How is this dashboard invoked? Is there any URL
 which
 needs to be called?


If you have configured replication correctly, the admin dashboard should
show a Replication link right next to the Schema Browser link. The path
should be /admin/replication/index.jsp

-- 
Regards,
Shalin Shekhar Mangar.


Using FieldCache in SolrIndexSearcher - crazy idea?

2011-06-28 Thread Michael Ryan
I am a user of Solr 3.2 and I make use of the distributed search capabilities 
of Solr using a fairly simple architecture of a coordinator + some shards.

Correct me if I am wrong:  In a standard distributed search with 
QueryComponent, the first query sent to the shards asks for fl=myUniqueKey or 
fl=myUniqueKey,score.  When the response is being generated to send back to the 
coordinator, SolrIndexSearcher.doc(int i, Set<String> fields) is called for 
each document.  As I understand it, this will read each document from the index 
_on disk_ and retrieve the myUniqueKey field value for each document.

My idea is to have a FieldCache for the myUniqueKey field in SolrIndexSearcher 
(or somewhere else?) that would be used in cases where the only field that 
needs to be retrieved is myUniqueKey.  Is this something that would improve 
performance?
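
Concretely, what I have in mind is something like this (a sketch against the
Lucene 3.x FieldCache API; it assumes myUniqueKey is indexed, single-valued,
and untokenized):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

public class UniqueKeyLookup {
    // The first call populates the cache for this reader; after that,
    // lookups are in-memory array reads instead of stored-field reads
    // against the index on disk.
    public static String uniqueKey(IndexReader reader, int docId) throws IOException {
        String[] keys = FieldCache.DEFAULT.getStrings(reader, "myUniqueKey");
        return keys[docId];
    }
}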

In our actual setup, we are using an extended version of QueryComponent that 
queries for a couple other fields besides myUniqueKey in the initial query to 
the shards, and it asks for a lot of rows when doing so, many more than what the 
user ends up getting back when they see the results.  (The reasons for this are 
complicated and aren't related much to this question.)  We already maintain 
FieldCaches for the fields that we are asking for, but for other purposes.  
Would it make sense to utilize these FieldCaches in SolrIndexSearcher?  Is this 
something that anyone else has done before?

-Michael


Records disappearing

2011-06-28 Thread Brian Lamb
Hi all,

I'm having some weird behavior with my dataimport script. Because of memory
issues, I've taken to doing delta imports by running a full-import with
clean=false. My dataimport config file is set up like:

<entity name="findDelta" rootEntity="false"
        query="SELECT id FROM mytable WHERE date_added > '${dataimporter.last_index_time}'
               OR last_updated > '${dataimporter.last_index_time}'">
  <entity name="mytable"
          pk="id"
          query="SELECT * FROM mytable WHERE id = '${findDelta.id}'"
          deletedPkQuery="SELECT id FROM my_delete_table"
          deltaImportQuery="SELECT id FROM mytable WHERE id='${dataimporter.delta.id}'"
          deltaQuery="SELECT id FROM mytable WHERE date_added > '${dataimporter.last_index_time}'
                      OR last_updated > '${dataimporter.last_index_time}'">
    <field column="id" name="id" />
    <field column="title" name="title" />
    <field column="name" name="name" />
    <field column="summary" name="summary" />
  </entity>
</entity>

I've found that one record (possibly more that I haven't noticed) keeps
disappearing from the index. I will do a full-import with clean=false and
search, and the record will be there. I'll search again a few hours later and
it's there. But then all of a sudden, it's gone. I don't know what is
triggering that one record's disappearance, but it is quite annoying. Any
ideas what's going on?

Thanks,

Brian Lamb


Re: Default schema - 'keywords' not multivalued

2011-06-28 Thread Chris Hostetter

: I'm streaming over the document content (presumably via tika) and its
: gathering the document's metadata which includes the keywords metadata field.
: Since I'm also passing that field from the DB to the REST call as a list (as
: you suggested) there is a collision because the keywords field is single
: valued.
: 
: I can change this behavior using a copy field.  What I wanted to know is if
: there was a specific reason the default schema defined a field like keywords
: single valued so I could make sure I wasn't missing something before I changed
: things.

That file is just an example, you're absolutely free to change it to meet 
your use case.

I'm not very familiar with Tika, but based on the comment in the example 
config...

   <!-- Common metadata fields, named specifically to match up with
     SolrCell metadata when parsing rich documents such as Word, PDF.
     Some fields are multiValued only because Tika currently may return
     multiple values for them.
   -->

...i suspect it was intentional that that field is *not* multiValued (i 
guess Tika always returns a single delimited value?) but if you have 
multiple discrete values you want to send for your DB backed data there is 
no downside to changing that.

: While I'm at it, I'd REALLY like to know how to use DIH to index the metadata
: from the database while simultaneously streaming over the document content and
: indexing it.  I've never quite figured it out yet but I have to believe it is
: a possibility.

There's a TikaEntityProcessor that can be used to have Tika crunch the 
data that comes from an entity and extract out specific fields, and it 
can be used in combination with a JdbcDataSource and a BinFileDataSource 
so that a field in your db data specifies the name of a file on disk to 
use as the TikaEntity -- but i've personally never tried it

Here's a simple example someone posted last year that they got working...

http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-td856965.html



-Hoss


Does Smart Chinese filter work for Traditional Chinese?

2011-06-28 Thread Andy
Hi,

According to the doc:

http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean

solr.SmartChineseWordTokenFilterFactory is for Simplified Chinese.

Does it work for Traditional Chinese too? If not, is there anything equivalent 
for Traditional Chinese?

Thanks.


Re: Analyzer creates PhraseQuery

2011-06-28 Thread entdeveloper
Thanks guys. Both the PositionFilterFactory and the
autoGeneratePhraseQueries=false solutions solved the issue.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Analyzer-creates-PhraseQuery-tp3116288p3118471.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Index Version and Epoch Time?

2011-06-28 Thread Pranav Prakash
Hi,

I am facing multiple issues with solr and I am not sure what happens in each
case. I am quite naive in Solr and there are some scenarios I'd like to
discuss with you.

We have a huge volume of documents to be indexed. Somewhere about 5 million.
We have a full indexer script which essentially picks up all the documents
from the database and updates them into Solr, and an incremental script which
adds new documents to Solr. Relevant areas of my config file go like this:

<unlockOnStartup>false</unlockOnStartup>
<deletionPolicy class="solr.SolrDeletionPolicy">
  <!-- Keep only optimized commit points -->
  <str name="keepOptimizedOnly">false</str>
  <!-- The maximum number of commit points to be kept -->
  <str name="maxCommitsToKeep">1</str>
</deletionPolicy>
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10</maxDocs>
  </autoCommit>
</updateHandler>
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="enable">${enable.master:false}</str>
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>
  </lst>
  <lst name="slave">
    <str name="enable">${enable.slave:false}</str>
    <str name="masterUrl">http://hostname:port/solr/core0/replication</str>
  </lst>
</requestHandler>

Sometimes the full indexer script breaks while adding documents to Solr. The
script adds the documents and then commits the operation. So, when the script
breaks, we have a huge lot of data which has been updated but not committed.
Next, the incremental index script executes, figures out all the new entries,
and adds them to Solr. It works successfully and commits the operation.

   - Will the commit by the incremental indexer script also commit the
   previously uncommitted changes made by the full indexer script before it broke?

Sometimes, during execution, Solr's avg response time (avg resp time for the
last 10 requests, read from the log file) goes as high as 9000ms (which I am
still unclear about; any ideas how to start hunting for the problem?), so the
watchdog process restarts Solr (because it causes a pile-up of requests at the
application server, which causes the app server to crash). On my local
environment, I performed the same experiment by adding docs to Solr, killing
the process and restarting it. I found that the uncommitted changes were
applied and searchable. However, the updates were uncommitted. Could you
explain to me how this is happening, or is there a configuration that can be
adjusted for this? Also, what would the index state be if, after restarting
Solr, a commit is applied or not applied?

I'd be happy to provide any other information that might be needed.

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


On Tue, Jun 28, 2011 at 20:55, Shalin Shekhar Mangar shalinman...@gmail.com
 wrote:

 On Tue, Jun 28, 2011 at 4:18 PM, Pranav Prakash pra...@gmail.com wrote:

 
  I am not sure what the index version value is. It looks like an epoch
 time,
  but in my case it points to one month back. However, I can see
 documents
  which were added last week in the index.
 

 The index version shown on the dashboard is the time at which the most
 recent index segment was created. I'm not sure why it has a value older
 than
 a month if a commit has happened after that time.

 
  Even after I did a commit, the index number did not change? Isn't it
  supposed to change on every commit? If not, is there a way to look into
 the
  last index time?
 

 Yeah, it changes after every commit which added/deleted a document.


  Also, this page
  http://wiki.apache.org/solr/SolrReplication#Replication_Dashboard shows
 a
  Replication Dashboard. How is this dashboard invoked? Is there any URL
  which
  needs to be called?
 
 
 If you have configured replication correctly, the admin dashboard should
 show a Replication link right next to the Schema Browser link. The path
 should be /admin/replication/index.jsp

 --
 Regards,
 Shalin Shekhar Mangar.



Re: Custom Query Processing

2011-06-28 Thread Dmitry Kan
You should modify the SolrCore for this, if I'm not mistaken.

Would extending LuceneQParserPlugin (solr 1.4) be an option for you?

On Tue, Jun 28, 2011 at 12:25 AM, Jamie Johnson jej2...@gmail.com wrote:

 I have a need to take an incoming solr query and apply some additional
 constraints to it on the Solr end.  Our previous implementation used a
 QueryWrapperFilter along with some custom code to build a new Filter from
 the query provided.  How can we plug this filter into Solr?




-- 
Regards,

Dmitry Kan


Re: Unique document count from index?

2011-06-28 Thread Dmitry Kan
can you use facet search?

facet=true&facet.field=order_no&fq=order_no:(1234 OR 5678 OR
...)&fq=artist:"Pink Floyd"



On Mon, Jun 27, 2011 at 6:44 PM, Olson, Ron rol...@lbpc.com wrote:

 Hi all-

 I have a problem that I'm not sure how it can be (if it can be) solved in
 Solr. I am using Solr 3.2 with patch 2524 installed to provide grouping. I
 need to return the count of unique records that match a particular query.

 For an example of what I'm talking about, imagine I have an index of music
 CD orders, created from a SQL database using the DataImportHandler. It's
 possible that the person ordered multiple records by the same artist (e.g.
 order #1234 contains Pink Floyd "Wish You Were Here", Pink Floyd "Meddle",
 Pink Floyd "Obscured by Clouds"). One of the indexed and stored fields in
 the document is Artist. If I do a search for Pink Floyd, using the order
 above, I'd get three documents, all with the same order number, for each of
 the Pink Floyd records. What I'd like to find out is how many unique orders
 have Pink Floyd across the entire index. The index has millions of
 documents.

 I have been trying to see if the result grouping functionality provided by
 patch 2524 will help, but while it does collapse the query above into one
 document, the matches field is still the same as without the grouping (which
 I guess makes sense insofar as it is still reporting the number of documents
 it found for the query). I have also thought a subquery in my
 DataImportHandler might work, though I'm not sure how I'd structure it.

 Thanks for any guidance on how to solve this problem; I know Solr isn't
 meant to be a data-mining tool and I'm guessing I'm skating perilously close
 to using it for that purpose, but anything I can do to take load from the
 actual database is considered a Good Thing by all concerned.

 Ron





-- 
Regards,

Dmitry Kan


Re: Index Version and Epoch Time?

2011-06-28 Thread Jonathan Rochkind

On 6/28/2011 1:38 PM, Pranav Prakash wrote:

- Will the commit by the incremental indexer script also commit the
previously uncommitted changes made by the full indexer script before it broke?


Yes, as long as the Solr instance hasn't crashed.  Anything added but 
not yet committed sticks around and will be committed on next 'commit'. 
There are no 'transactions' for adding docs in Solr; even if multiple 
processes are adding, if any one of them issues a 'commit' they'll all be 
committed.



Sometimes, during execution, Solr's avg response time (avg resp time
for the last 10 requests, read from the log file) goes as high as 9000ms (which I am
still unclear about; any ideas how to start hunting for the problem?),


It could be a Java garbage collection issue. I have found it useful to 
start the JVM with Solr in it using some parameters to tune garbage 
collection. I use these JVM options:
 -server -XX:+AggressiveOpts -d64 -XX:+UseConcMarkSweepGC 
-XX:+UseCompressedOops


You've still got to make sure Solr has enough memory for what you're 
doing with it, which with your 5 million doc index might be more than you 
expect. On the other hand, giving a JVM too _much_ heap can cause 
slowdowns too, although I think the -XX:+UseConcMarkSweepGC should 
ameliorate that to some extent.


Possibly more likely, it could instead be Solr readying the new indexes. 
Do you issue commits in the middle of 'execution', and could the 
slowdown happen right after a commit?  When a commit is issued to Solr, 
Solr's got to switch new indexes in with the newly added documents, and 
'warm' those indexes in various ways. Which can be a CPU (as well as 
RAM) intensive thing. (For these purposes a replication from master 
counts as a commit (because it is), and an optimize can count too 
(because it's close enough)).


This can be especially a problem if you issue multiple commits very 
close together -- Solr's still working away at readying the index from 
the first commit, when the second comes in, and now Solr's trying to get 
ready two indexes at once (one of which will never be used because its' 
already outdated).  Or even more than two if you issue a bunch of 
commits in rapid succession.






  I found that the uncommitted changes were
applied and searchable. However, the updates were uncommitted.


There is in general no way that uncommitted adds could be searchable, 
that's probably not happening.   What is probably happening instead is 
that a commit _is_ happening.  One way a commit can happen even if you 
aren't manually issuing one is with various auto-commit settings in 
solrconfig.xml.  Commit any pending adds after X documents, or after T 
seconds, can both be configured. If they are configured, that could be 
causing commits to happen when you don't realize it, which could also 
trigger the slowdown due to a commit mentioned in the previous paragraph.


Jonathan



moving to multicore without changing existing index

2011-06-28 Thread lee carroll
hi
I'm looking at setting up multi-core indices but also have an existing
index. Can I run this index alongside a new index set up as a core? On a
dev machine I've experimented with simply adding solr.xml in solr home and
listing the new cores in the cores element, but this breaks the existing
index.

container is tomcat and attempted set up was:

solrHome
 conf (existing running index)
 core1 (new core directory)
 solr.xml (cores element has one entry for core1)

Is this a valid approach ?

thanks lee


Re: moving to multicore without changing existing index

2011-06-28 Thread Jonathan Rochkind
Nope. But you can move your existing index into a core in a multi-core 
setup.  But a multi-core setup is a multi-core setup, there's no way to 
have an index accessible at a non-core URL in a multi-core setup.


On 6/28/2011 2:53 PM, lee carroll wrote:

hi
I'm looking at setting up multi-core indices but also have an existing
index. Can I run this index alongside a new index set up as a core? On a
dev machine I've experimented with simply adding solr.xml in solr home and
listing the new cores in the cores element, but this breaks the existing
index.

container is tomcat and attempted set up was:

solrHome
 conf (existing running index)
 core1 (new core directory)
 solr.xml (cores element has one entry for core1)

Is this a valid approach ?

thanks lee



Dynamic Fields vs. Multicore

2011-06-28 Thread Briggs Thompson
Hi All,

I was searching around for documentation of the performance differences of
having a sharded, single schema, dynamic field set up vs. a multi-core,
static multi-schema setup (which I currently have), but I have not had much
luck finding what I am looking for. I understand commits and optimizes will
be more intensive in a single core since there is more data (though I would
offset by sharding heavily), but I am particularly curious about the search
performance implications.

I am interested in moving to the dynamic field setup in order to implement a
better global search, but I want to make sure I understood the drawbacks of
hitting those datasets individually and globally after they are merged
(NOTE: I would have a global field signifying the dataset type, which could
then be added to the filter query in order to create the subset for
individual dataset queries).

Some background about the data: it is extremely variable. Some documents
contain only 2 or 3 sentences, and some are 20 page extracted PDFs. There
would probably only be about 100-150 unique fields.

Any input is greatly appreciated!

Thanks,
Briggs Thompson


Solr - search queries not returning results

2011-06-28 Thread Walter Closenfleight
Hello everyone,

I believe I am missing something very elementary. The following query
returns zero hits:

http://localhost:8983/solr/core0/select/?q=testabc

However, using solritas, it finds many results:

http://localhost:8983/solr/core0/itas?q=testabc

Do you have any idea what the issue may be?

Thanks in advance!


overwrite if not already in index?

2011-06-28 Thread eks dev
Quick question,
Is there a way with Solr to conditionally update a document on unique
id? Meaning: default add behavior if the id is not already in the index,
and *not* touching the index if it is already there.

Deletes are not important (no sync issues).

I am asking because I noticed, with deduplication turned on, that
index files get modified even if I update the same documents again
(same signatures).
I am facing a very high dupes rate (40-50%), and the setup is going to be
master-slave with a high commit rate (the requirement is to reduce
propagation latency for updates). Having unnecessary index
modifications is going to waste effort shipping the same information
again and again.

If there is no standard way, what would be the fastest way to check if
a Term exists in the index from an UpdateRequestProcessor?

I intend to extend SignatureUpdateProcessor to prevent a document from
propagating down the chain if this happens. Would that be a way to deal
with it? I repeat, there are no deletes to cause headaches with
synchronization.
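
Roughly what I have in mind -- an untested sketch against Solr 3.x APIs,
written as a separate processor that would sit after the signature
processor in the chain (the field name "signature" is just from my setup):

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class SkipExistingProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new SkipExistingProcessor(req, next);
  }

  static class SkipExistingProcessor extends UpdateRequestProcessor {
    private final SolrQueryRequest req;

    SkipExistingProcessor(SolrQueryRequest req, UpdateRequestProcessor next) {
      super(next);
      this.req = req;
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      Object sig = cmd.solrDoc.getFieldValue("signature");
      if (sig != null
          && req.getSearcher().getFirstMatch(new Term("signature", sig.toString())) != -1) {
        // Same signature already indexed: swallow the add so the index
        // files are not touched and nothing gets shipped to the slaves.
        return;
      }
      // Caveat: the searcher only sees committed docs, so duplicates
      // arriving within one uncommitted batch would still slip through.
      super.processAdd(cmd);
    }
  }
}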


Thanks,
eks


Re: Solr - search queries not returning results

2011-06-28 Thread Tomás Fernández Löbbe
Hi Walter, probably solritas is using Dismax with a set of fields in the
qf parameter, while with your first query you are just querying the
default field.
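
You can check by adding the dismax parameters explicitly to your first URL,
something like this (the qf fields here are just placeholders; use whatever
the /itas handler has configured in solrconfig.xml):

http://localhost:8983/solr/core0/select/?q=testabc&defType=dismax&qf=title+text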


On Tue, Jun 28, 2011 at 5:07 PM, Walter Closenfleight 
walter.p.closenflei...@gmail.com wrote:

 Hello everyone,

 I believe I am missing something very elementary. The following query
 returns zero hits:

 http://localhost:8983/solr/core0/select/?q=testabc

 However, using solritas, it finds many results:

 http://localhost:8983/solr/core0/itas?q=testabc

 Do you have any idea what the issue may be?

 Thanks in advance!



edismax - Handling collocations mapped to a single token . . ?

2011-06-28 Thread CRB
We are trying to get edismax to handle collocations mapped to a single 
token. To do so we need to manipulate the chunks (as Hoss referred to 
them in http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/) 
generated by the dismax parser. We have numerous collocations (terms of 
speech which do not directly relate to the constituent words that make 
up the saying). For example, at index time "real estate" is mapped to 
"real_estate" to avoid it colliding with searches for "estate" or "real 
value". So we need the chunks to reflect this mapping of multi-word 
phrases to a single token that is done during indexing (via the synonym 
filter).
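
(The index-time mapping itself is an ordinary rule in synonyms.txt, along
the lines of: real estate => real_estate)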


In an ideal world, we would just list the queryAnalyzerFieldType that 
should be used in pre-processing the query string before it is divided 
into chunks (similar to what is done with the SpellChecker Component).


But our impression thus far is that we are off the reservation and will 
need to hack away at 
org.apache.solr.search.ExtendedDismaxQParser.splitIntoClauses(String, 
boolean).


Is it correct that the only pre-processing by dismax is on stopwords?

Is it correct that we can limit our customization to 
splitIntoClauses(String, boolean) to handle this?


Regards,

Christopher







Re: moving to multicore without changing existing index

2011-06-28 Thread Tomás Fernández Löbbe
But a multi-core setup is a multi-core setup, there's no way to have an
index accessible at a non-core URL in a multi-core setup.

Isn't there? What about the defaultCoreName parameter? From the wiki: "The
name of a core that will be used for requests that don't specify a core. If
you have one core and want to use the features specified on this page, then
this provides a way to keep your URLs the same."

You will need to set up the directory structure for that core, something
like:

solrHome
 originalCore (new core directory for the existing index)
  conf (existing running configuration)
 core1 (new core directory)
  conf (new configuration)
 solr.xml (declare both cores, and set originalCore as defaultCoreName)

Haven't tried it, but I think it should work.
See http://wiki.apache.org/solr/CoreAdmin#solr
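
The solr.xml would then be something like this (again untested; names taken
from the layout above):

<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="originalCore">
    <core name="originalCore" instanceDir="originalCore" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>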

On Tue, Jun 28, 2011 at 3:57 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 Nope. But you can move your existing index into a core in a multi-core
 setup.  But a multi-core setup is a multi-core setup, there's no way to have
 an index accessible at a non-core URL in a multi-core setup.


 On 6/28/2011 2:53 PM, lee carroll wrote:

 hi
 I'm looking at setting up multi-core indices but also have an existing
 index. Can I run this index alongside a new index set up as a core? On a
 dev machine I've experimented with simply adding solr.xml in solr home
 and listing the new cores in the cores element, but this breaks the
 existing index.

 container is tomcat and attempted set up was:

 solrHome
  conf (existing running index)
  core1 (new core directory)
  solr.xml (cores element has one entry for core1)

 Is this a valid approach ?

 thanks lee




How to Create a weighted function (dismax or otherwise)

2011-06-28 Thread aster
I am trying to create a feature that allows search results to be ranked by
this formula: sum(weight1 * text relevance score, weight2 * price). weight1 and
weight2 are numeric values that can be changed to influence the search
results.

I am sending the following query params to the Solr instance for searching.

q=red
defType=dismax
qf=name^10+price^2

My understanding is that when using dismax, Solr/Lucene looks for the search
text in all the fields specified in the qf param.

Currently my search results are similar to those I get when qf does not
include price. I think this is because price is a numeric field and
there is no text match.

Is it possible to rank search results based on this formula:
sum(weight1 * text relevance score, weight2 * price)?
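
Would the dismax bf (boost function) parameter be the right direction here?
My understanding is that bf adds the value of a function query to the
relevance score, which looks like the weighted sum above. Something like
this, with illustrative weights:

q=red&defType=dismax&qf=name^10&bf=product(price,0.5)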

Thanks!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-Create-a-weighted-function-dismax-or-otherwise-tp3119977p3119977.html
Sent from the Solr - User mailing list archive at Nabble.com.


Fuzzy Query Param

2011-06-28 Thread entdeveloper
According to the docs on lucene query syntax:

Starting with Lucene 1.9 an additional (optional) parameter can specify the
required similarity. The value is between 0 and 1, with a value closer to 1
only terms with a higher similarity will be matched.

I was messing around with this and started doing queries with values greater
than 1 and it seemed to be doing something. However I haven't been able to
find any documentation on this.

What happens when specifying a fuzzy query with a value > 1?

tiger~2
animal~3

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Fuzzy-Query-Param-tp3120235p3120235.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Using RAMDirectoryFactory in Master/Slave setup

2011-06-28 Thread Lance Norskog
Using RAMDirectory really does not help performance. Java garbage
collection has to work around all of the memory taken by the segments.
It works out that Solr works better (for most indexes) without using
the RAMDirectory.



On Sun, Jun 26, 2011 at 2:07 PM, nipunb ni...@walmartlabs.com wrote:
 PS: Sorry if this is a repost, I was unable to see my message in the mailing
 list - this may have been due to my outgoing email different from the one I
 used to subscribe to the list with.

 Overview – Trying to evaluate if keeping the index in memory using
 RAMDirectoryFactory can help query performance. I am trying to perform the
 indexing on the master using solr.StandardDirectoryFactory and make those
 indexes accessible to the slave using solr.RAMDirectoryFactory

 Details:
 We have set up Solr in a master/slave environment. The index is built on the
 master and then replicated to slaves which are used to serve the queries.
 The replication is done using the in-built Java replication in Solr.
 On the master, in the indexDefaults of solrconfig.xml we have
 <directoryFactory name="DirectoryFactory" class="solr.StandardDirectoryFactory"/>

 On the slave, I tried to use the following in the indexDefaults:

 <directoryFactory name="DirectoryFactory" class="solr.RAMDirectoryFactory"/>

 My slave shows no data for any queries. In solrconfig.xml it is mentioned
 that replication doesn’t work when using RAMDirectoryFactory, however this (
 https://issues.apache.org/jira/browse/SOLR-1379) mentions that you can use
 it to have the index on disk and then load into memory.

 To test the sanity of my set-up, I changed solrconfig.xml in the slave to the
 following and replicated:
 <directoryFactory name="DirectoryFactory" class="solr.StandardDirectoryFactory"/>
 I was able to see the results.

 Shouldn’t RAMDirectoryFactory be used for reading index from disk into
 memory?

 Any help/pointers in the right direction would be appreciated.

 Thanks!

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Using-RAMDirectoryFactory-in-Master-Slave-setup-tp3111792p3111792.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Lance Norskog
goks...@gmail.com