Tweaking boosts for more search results variety

2013-09-05 Thread Sai Gadde
Our index aggregates content from various sites on the web. We want a good
user experience by showing multiple sites in the search results, but in our
setup we are seeing most of the top results come from the same site.

Here is some information regarding queries and schema
site - String field. We have about 1000 sites in the index
sitetype - String field. We have 3 site types
omitNorms=true for both the fields

Doc counts vary widely by site and sitetype, by a factor of 10 to 1000.
Total index size is about 5 million docs.
Solr Version: 4.0

In our queries we have a fixed and preferential boost for certain sites.
sitetype has a different fixed boost for each of its 3 possible values. We
turned off inverse document frequency (IDF) so that these boosts work
properly. Other text fields are boosted based on search keywords only.

With this setup we often see a run of hits from one site, then a run from
the next site, and so on.
Is there any way to get results from a variety of sites while still keeping
the preferential boosts in place?


Re: Solr 4.3: Recovering from Too many values for UnInvertedField faceting on field

2013-09-05 Thread Dmitry Kan
We had a similar case with multivalued fields that in some cases hold a lot
of unique values. Using facet.method=enum instead of facet.method=fc fixed
the problem, though it can run slower.
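
For reference, a minimal SolrJ 4.x sketch of switching the facet method; the
core URL and the class wrapper are illustrative, not from the original thread:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class EnumFacetExample {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; point this at your own core.
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetField("author_exact");
        q.set("facet.method", "enum"); // walk the term dictionary instead of un-inverting the field
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getFacetField("author_exact").getValueCount() + " buckets returned");
    }
}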

Dmitry


On Tue, Sep 3, 2013 at 5:04 PM, Dennis Schafroth den...@indexdata.com wrote:

 We are harvesting and indexing bibliographic data, thus having many
 distinct author names in our index. While testing Solr 4 I believe I had
 pushed a single core to 100 million records (91GB of data) and everything
 was working fine and fast. After adding a little more to the index, the
 following started to happen:

 17328668 [searcherExecutor-4-thread-1] WARN org.apache.solr.core.SolrCore
 – Approaching too many values for UnInvertedField faceting on field
 'author_exact' : bucket size=16726546
 17328701 [searcherExecutor-4-thread-1] INFO org.apache.solr.core.SolrCore
 – UnInverted multi-valued field
 {field=author_exact,memSize=336715415,tindexSize=5001903,time=31595,phase1=31465,nTerms=12048027,bigTerms=0,termInstances=57751332,uses=0}
 18103757 [searcherExecutor-4-thread-1] ERROR org.apache.solr.core.SolrCore
 – org.apache.solr.common.SolrException: Too many values for UnInvertedField
 faceting on field author_exact
 at org.apache.solr.request.UnInvertedField.init(UnInvertedField.java:181)
 at
 org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:664)

 I can see that we reached a limit on bucket size. Is there a way to adjust
 this? The index also seems to have exploded in size (217GB).

 Thinking that I had reached a limit on what a single core could handle in
 terms of faceting, I deleted records from the index, but even now at 1/3 the
 size (32 million) it still fails with the above error. I have optimised with
 expungeDeletes=true. The index is somewhat larger (76GB) than I would have
 expected.

 While we can still use the index and get facets back using the enum method
 on that field, I would still like a way to fix the index if possible. Any
 suggestions?

 cheers,
 :-Dennis


Re: Little XsltResponseWriter documentation bug (Attn: Wiki Admin)

2013-09-05 Thread Stefan Matheis
Dmitri,

I've added you to the https://wiki.apache.org/solr/ContributorsGroup - feel 
free to improve the wiki :)

- Stefan 


On Wednesday, September 4, 2013 at 11:46 PM, Dmitri Popov wrote:

 Upayavira,
 
 I could edit that page myself, but need to be confirmed human according to
 http://wiki.apache.org/solr/FrontPage#How_to_edit_this_Wiki
 
 My wiki account name is 'pin' just in case.
 
 On Wed, Sep 4, 2013 at 5:27 PM, Upayavira u...@odoko.co.uk 
 (mailto:u...@odoko.co.uk) wrote:
 
  It's a wiki. Can't you correct it?
  
  Upayavira
  
  On Wed, Sep 4, 2013, at 08:25 PM, Dmitri Popov wrote:
   Hi,
   
    http://wiki.apache.org/solr/XsltResponseWriter (and the reference manual
    PDF too) has become out of date:
   
   In configuration section
   
    <queryResponseWriter
      name="xslt"
      class="org.apache.solr.request.XSLTResponseWriter">
      <int name="xsltCacheLifetimeSeconds">5</int>
    </queryResponseWriter>
   
   class name
   
   org.apache.solr.request.XSLTResponseWriter
   
   should be replaced by
   
   org.apache.solr.response.XSLTResponseWriter
   
    Otherwise a ClassNotFoundException is thrown. The change is a result of
    https://issues.apache.org/jira/browse/SOLR-1602 as far as I can see.
   
    Apparently I can't update that page myself; could someone else please do
    that?
   
   Thanks! 



Tweaking Edismax on the Phrase Fields

2013-09-05 Thread Bruno René Santos
Hi,

I have a question about the raw query that is produced from an edismax
query. For example, the query:

_query_:{!edismax mm=100% bf='log(div(9900,producttier))'
pf='name_synonyms~100^3 name~100^6 heading~100^20' pf2='name_synonyms~100^3
name~100^6 heading~100^20' qf='name_synonyms^3 name^6 heading^20'}hotel
centro lisboa

is transformed into


(+((DisjunctionMaxQuery((name_synonyms:hotel^3.0 | heading:hotel^20.0
| name:hotel^6.0))
DisjunctionMaxQuery((((name_synonyms:semtr name_synonyms:centr)^3.0)
| ((heading:semtr heading:centr)^20.0) | ((name:semtr name:centr)^6.0)))
DisjunctionMaxQuery((((name_synonyms:lisbon name_synonyms:lisbo)^3.0)
| ((heading:lisbon heading:lisbo)^20.0)
| ((name:lisbon name:lisbo)^6.0)))~3)
DisjunctionMaxQuery((name_synonyms:"hotel (semtr centr) (lisbon lisbo)"~100^3.0))
DisjunctionMaxQuery((name:"hotel (semtr centr) (lisbon lisbo)"~100^6.0))
DisjunctionMaxQuery((heading:"hotel (semtr centr) (lisbon lisbo)"~100^20.0))
(DisjunctionMaxQuery((name_synonyms:"hotel (semtr centr)"~100^3.0))
DisjunctionMaxQuery((name_synonyms:"(semtr centr) (lisbon lisbo)"~100^3.0)))
(DisjunctionMaxQuery((name:"hotel (semtr centr)"~100^6.0))
DisjunctionMaxQuery((name:"(semtr centr) (lisbon lisbo)"~100^6.0)))
(DisjunctionMaxQuery((heading:"hotel (semtr centr)"~100^20.0))
DisjunctionMaxQuery((heading:"(semtr centr) (lisbon lisbo)"~100^20.0)))
FunctionQuery(log(div(const(9900),int(producttier)))))/no_coord

As you can see, for each field in the phrase query a new
DisjunctionMaxQuery is created. Why is the behaviour not the same as for
qf? With qf, the most important field (the max) is what counts; with the
phrase query, all fields contribute to the final score. Is there any way
to emulate the qf behaviour (one DisjunctionMaxQuery per combination) on
the phrase fields, i.e. one DisjunctionMaxQuery for pf, another for pf2,
etc.?

Regards

Bruno

-- 

Bruno René Santos
Lisboa - Portugal


Re: Need help on Joining and sorting syntax and limitations between multiple documents in solr-4.4.0

2013-09-05 Thread Erick Erickson
The very first thing I'd do is see if you can _not_ use joins, especially
if you're coming from an RDBMS background. Joins in Solr are somewhat
specialized and are NOT equivalent to database joins.

First of all, there's no way to get fields from the "from" part of the
join returned in the results. Secondly, there are a number of cases where
the performance isn't stellar. Thirdly...

The first approach is always to explore denormalizing the data so you can
do straight searches rather than joins. Second is to think about your use
case carefully and see if there are clever indexing schemes that let you
avoid joins.

Only after those avenues are exhausted would I rely on joins. There's a
reason they are sometimes referred to as "pseudo joins".

Best,
Erick


On Wed, Sep 4, 2013 at 4:19 AM, Sukanta Dey sukanta@gettyimages.com wrote:

 Hi Team,

 In my project I am going to use Apache Solr 4.4.0 for searching. I need to
 join between multiple Solr documents within the same core on a field the
 documents have in common. I can join the documents successfully using the
 Solr 4.4.0 join syntax and it returns the expected result, but my next
 requirement is to sort the returned results by fields from the documents
 involved in the join condition's "from" clause, which I was not able to do.
 Let me explain the problem in detail along with the files I am using...


 1)  Files being used :

 a.   Picklist_1.xml

 --

 <add><doc>
   <field name="describedObjectId">t1324838</field>
   <field name="describedObjectType">7</field>
   <field name="picklistItemId">956</field>
   <field name="siteId">130712901</field>
   <field name="en">Draft</field>
   <field name="gr">Draoft</field>
 </doc></add>



 b.  Picklist_2.xml

 ---

 <add><doc>
   <field name="describedObjectId">t1324837</field>
   <field name="describedObjectType">7</field>
   <field name="picklistItemId">87749</field>
   <field name="siteId">130712901</field>
   <field name="en">New</field>
   <field name="gr">Neuo</field>
 </doc></add>



 c.   AssetID_1.xml

 ---

 <add><doc>
   <field name="def14227_picklist">t1324837</field>
   <field name="describedObjectId">a180894808</field>
   <field name="describedObjectType">1</field>
   <field name="isMetadataComplete">true</field>
   <field name="lastUpdateDate">2013-09-02T09:28:18Z</field>
   <field name="ownerId">130713716</field>
   <field name="siteId">130712901</field>
 </doc></add>



 d.  AssetID_2.xml

 

 <add><doc>
   <field name="def14227_picklist">t1324838</field>
   <field name="describedObjectId">a171658357</field>
   <field name="describedObjectType">1</field>
   <field name="ownerId">130713716</field>
   <field name="rGroupId">2283961</field>
   <field name="rGroupId">2290309</field>
   <field name="rGroupPermissionLevel">7</field>
   <field name="rGroupPermissionLevel">7</field>
   <field name="rRuleId">13503796</field>
   <field name="rRuleId">15485964</field>
   <field name="rUgpId">38052</field>
   <field name="rUgpId">41133</field>
   <field name="siteId">130712901</field>
 </doc></add>



 2)  Requirement:

 

 i.   We need a join between the files, using the def14227_picklist field
 from AssetID_1.xml and AssetID_2.xml and the describedObjectId field from
 Picklist_1.xml and Picklist_2.xml.

 ii.  After joining we need all the fields from the AssetID_*.xml files,
 plus the en and gr fields from the Picklist_*.xml files.

 iii. While joining we also need to sort the result based on the en field
 value.



 3)  I was trying with q={!join from=inner_id to=outer_id}zzz:vvv
 syntax but no luck.
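
 For illustration, here is a sketch of the join this setup seems to call
 for, with the field names taken from the files above (the query value and
 core URL are illustrative; note that sorting on en cannot work this way,
 since en lives on the "from" side of the join, which is exactly the
 limitation discussed in the reply above):

 import org.apache.solr.client.solrj.SolrQuery;
 import org.apache.solr.client.solrj.impl.HttpSolrServer;

 public class JoinExample {
     public static void main(String[] args) throws Exception {
         // Placeholder URL; point this at the core holding all four documents.
         HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
         // Match picklist docs on en, then join their describedObjectId values
         // against the asset docs' def14227_picklist field.
         SolrQuery q = new SolrQuery("{!join from=describedObjectId to=def14227_picklist}en:Draft");
         System.out.println(server.query(q).getResults().getNumFound() + " assets found");
     }
 }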

 Any help/suggestion would be appreciated.

 Thanks,
 Sukanta Dey







Re: Solr Cloud hangs when replicating updates

2013-09-05 Thread Erick Erickson
If you run into this again, try a jstack trace. You should see
evidence of being stuck in SolrCmdDistributor on a variable
called semaphore... On current 4x this is around line 420.

If you're using SolrJ, then SOLR-4816 is another thing to try.

But Mark's patch would be the best thing of all to test. If that doesn't
fix it, then the jstack suggestion would at least tell us whether it's
the issue we think it is.

FWIW,
Erick


On Wed, Sep 4, 2013 at 12:51 PM, Mark Miller markrmil...@gmail.com wrote:

 It would be great if you could give this patch a try:
 http://pastebin.com/raw.php?i=aaRWwSGP

 - Mark


 On Wed, Sep 4, 2013 at 8:31 AM, Kevin Osborn kevin.osb...@cbsi.com
 wrote:

  Thanks. If there is anything I can do to help you resolve this issue, let
  me know.
 
  -Kevin
 
 
  On Wed, Sep 4, 2013 at 7:51 AM, Mark Miller markrmil...@gmail.com
 wrote:
 
    I'll look at fixing the root issue for 4.5. I've been putting it off for
    way too long.
  
   Mark
  
   Sent from my iPhone
  
   On Sep 3, 2013, at 2:15 PM, Kevin Osborn kevin.osb...@cbsi.com
 wrote:
  
 I was having problems updating SolrCloud with a large batch of records.
 The records are coming in bursts with lulls between updates.

 At first, I just tried large updates of 100,000 records at a time.
 Eventually, this caused Solr to hang. When hung, I can still query Solr.
 But I cannot do any deletes or other updates to the index.

 At first, my updates were going as SolrJ CSV posts. I have also tried
 local file updates and had similar results. I finally slowed things down
 to just use SolrJ's Update feature, which is basically just JavaBin. I am
 also sending over just 100 at a time in 10 threads. Again, it eventually
 hung.

 Sometimes, Solr hangs in the first couple of chunks. Other times, it
 hangs right away.

 These are my commit settings:

 <autoCommit>
   <maxTime>15000</maxTime>
   <maxDocs>5000</maxDocs>
   <openSearcher>false</openSearcher>
 </autoCommit>
 <autoSoftCommit>
   <maxTime>3</maxTime>
 </autoSoftCommit>

 I have tried quite a few variations with the same results. I also tried
 various JVM settings with the same results. The only thing that seems to
 help is reducing the cluster size from 2 to 1.

 I also did a jstack trace. I did not see any explicit deadlocks, but I
 did see quite a few threads in WAITING or TIMED_WAITING. It is typically
 something like this:

 java.lang.Thread.State: WAITING (parking)
   at sun.misc.Unsafe.park(Native Method)
   - parking to wait for 0x00074039a450 (a java.util.concurrent.Semaphore$NonfairSync)
   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
   at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
   at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
   at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
   at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
   at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
   at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
   at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
   at org.apache.solr.update.SolrCmdDistributor.distribAdd(SolrCmdDistributor.java:139)
   at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:474)
   at org.apache.solr.handler.loader.CSVLoaderBase.doAdd(CSVLoaderBase.java:395)
   at org.apache.solr.handler.loader.SingleThreadedCSVLoader.addDoc(CSVLoader.java:44)
   at org.apache.solr.handler.loader.CSVLoaderBase.load(CSVLoaderBase.java:364)
   at org.apache.solr.handler.loader.CSVLoader.load(CSVLoader.java:31)
   at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
   at ...

Re: Tweaking boosts for more search results variety

2013-09-05 Thread Jack Krupansky
The grouping (field collapsing) feature somewhat addresses this: group by a
site field, and then if more than one or a few of the top pages are from the
same site they get grouped or collapsed, so that you can see more sites in a
few results.


See:
http://wiki.apache.org/solr/FieldCollapsing
https://cwiki.apache.org/confluence/display/solr/Result+Grouping
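
For example, a minimal grouping request might look like this in SolrJ (the
site field name is taken from the original post; the URL, query, and class
wrapper are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class GroupBySiteExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // placeholder
        SolrQuery q = new SolrQuery("some user query");
        q.set("group", true);         // enable result grouping
        q.set("group.field", "site"); // collapse hits that share a site value
        q.set("group.limit", 2);      // keep at most 2 documents per site
        q.set("group.main", true);    // flatten the groups back into a plain result list
        System.out.println(server.query(q).getResults().size() + " docs returned");
    }
}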

-- Jack Krupansky

-----Original Message-----
From: Sai Gadde

Sent: Thursday, September 05, 2013 2:27 AM
To: solr-user@lucene.apache.org
Subject: Tweaking boosts for more search results variety

Our index aggregates content from various sites on the web. We want a good
user experience by showing multiple sites in the search results, but in our
setup we are seeing most of the top results come from the same site.

Here is some information regarding queries and schema
   site - String field. We have about 1000 sites in the index
   sitetype - String field. We have 3 site types
omitNorms=true for both the fields

Doc counts vary widely by site and sitetype, by a factor of 10 to 1000.
Total index size is about 5 million docs.
Solr Version: 4.0

In our queries we have a fixed and preferential boost for certain sites.
sitetype has a different fixed boost for each of its 3 possible values. We
turned off inverse document frequency (IDF) so that these boosts work
properly. Other text fields are boosted based on search keywords only.

With this setup we often see a run of hits from one site, then a run from
the next site, and so on.
Is there any way to get results from a variety of sites while still keeping
the preferential boosts in place?



JSON update request handler & commitWithin

2013-09-05 Thread Ryan, Brent
I'm prototyping a search product for us and I was trying to use the 
commitWithin parameter for posting updated JSON documents like so:

curl -v 
'http://localhost:8983/solr/proposal.solr/update/json?commitWithin=1' 
--data-binary @rfp.json -H 'Content-type:application/json'

However, the commit never seems to happen; as you can see below, there are still
2 docsPending (even 1 hour later). Is there a trick to getting this to work
when submitting to the JSON update request handler?



Re: JSON update request handler & commitWithin

2013-09-05 Thread Jack Krupansky
I just tried commitWithin with the standard Solr example in Solr 4.4 and it 
works fine.
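
For comparison, here is a hedged SolrJ sketch of exercising the same
commitWithin behavior programmatically (URL and document are placeholders):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // placeholder
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "test-1");
        UpdateRequest req = new UpdateRequest();
        req.add(doc);
        req.setCommitWithin(10000); // ask Solr to commit within 10 seconds
        req.process(server);
    }
}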

Can you reproduce your problem using the standard Solr example in Solr 4.4?

-- Jack Krupansky

From: Ryan, Brent 
Sent: Thursday, September 05, 2013 10:39 AM
To: solr-user@lucene.apache.org 
Subject: JSON update request handler & commitWithin

I'm prototyping a search product for us and I was trying to use the 
commitWithin parameter for posting updated JSON documents like so:

curl -v 
'http://localhost:8983/solr/proposal.solr/update/json?commitWithin=1' 
--data-binary @rfp.json -H 'Content-type:application/json'

However, the commit never seems to happen; as you can see below, there are still
2 docsPending (even 1 hour later). Is there a trick to getting this to work
when submitting to the JSON update request handler?



Re: JSON update request handler & commitWithin

2013-09-05 Thread Jason Hellman
They have modified the mechanisms for committing documents. Solr in DSE is
not stock Solr, so you are likely encountering a boundary where stock Solr
behavior is not fully supported.

I would definitely reach out to them to find out if they support the request.

On Sep 5, 2013, at 8:27 AM, Ryan, Brent br...@cvent.com wrote:

 Ya, looks like this is a bug in Datastax Enterprise 3.1.2.  I'm using
 their enterprise cluster search product which is built on SOLR 4.
 
 :(
 
 
 
 On 9/5/13 11:24 AM, Jack Krupansky j...@basetechnology.com wrote:
 
 I just tried commitWithin with the standard Solr example in Solr 4.4 and
 it works fine.
 
 Can you reproduce your problem using the standard Solr example in Solr
 4.4?
 
 -- Jack Krupansky
 
 From: Ryan, Brent 
 Sent: Thursday, September 05, 2013 10:39 AM
 To: solr-user@lucene.apache.org
 Subject: JSON update request handler & commitWithin
 
 I'm prototyping a search product for us and I was trying to use the
 commitWithin parameter for posting updated JSON documents like so:
 
 curl -v 
 'http://localhost:8983/solr/proposal.solr/update/json?commitWithin=1'
 --data-binary @rfp.json -H 'Content-type:application/json'
 
 However, the commit never seems to happen; as you can see below, there are
 still 2 docsPending (even 1 hour later). Is there a trick to getting
 this to work when submitting to the JSON update request handler?
 
 



Re: JSON update request handler & commitWithin

2013-09-05 Thread Ryan, Brent
Ya, looks like this is a bug in Datastax Enterprise 3.1.2.  I'm using
their enterprise cluster search product which is built on SOLR 4.

:(



On 9/5/13 11:24 AM, Jack Krupansky j...@basetechnology.com wrote:

I just tried commitWithin with the standard Solr example in Solr 4.4 and
it works fine.

Can you reproduce your problem using the standard Solr example in Solr
4.4?

-- Jack Krupansky

From: Ryan, Brent 
Sent: Thursday, September 05, 2013 10:39 AM
To: solr-user@lucene.apache.org
Subject: JSON update request handler & commitWithin

I'm prototyping a search product for us and I was trying to use the
commitWithin parameter for posting updated JSON documents like so:

curl -v 
'http://localhost:8983/solr/proposal.solr/update/json?commitWithin=1'
--data-binary @rfp.json -H 'Content-type:application/json'

However, the commit never seems to happen; as you can see below, there are
still 2 docsPending (even 1 hour later). Is there a trick to getting
this to work when submitting to the JSON update request handler?




charfilter doesn't do anything

2013-09-05 Thread Andreas Owen
I would like to filter/replace a word during indexing, but it doesn't do
anything and I don't get an error.

In schema.xml I have the following:

<field name="text_html" type="text_cutHtml" indexed="true" stored="true"
multiValued="true"/>

<fieldType name="text_cutHtml" class="solr.TextField">
  <analyzer>
    <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="Zahlungsverkehr" replacement="ASDFGHJK"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

My second question: where can I say that the expression is multiline, as in
JavaScript where I can use /m at the end of the pattern?

Solr 4.3 Startup with Multiple Cores Hangs on Registering Core

2013-09-05 Thread Austin Rasmussen
Hello,

I currently have Solr 4.3 set up with about 400 cores set to load upon
startup. When starting Solr with an empty index for each core, Solr is able
to load all of the cores and start up normally as expected. However, after
running a dataimport on all cores and restarting Solr, it hangs at
"org.apache.solr.core.CoreContainer; registering core: ..." without any type
of error message in the log. The process still exists at this point, but
doesn't make any progress even if left for a period of time. Prior to the
restart, Solr continues to function normally and is searchable.

Solr is currently running in master-slave replication, and this same, exact 
behavior occurs on the master and both slaves.

I've checked all of the system log files and am also unable to find any errors 
or messages that would point to a particular problem.  Originally, I had 
thought it may have been related to an open file limit, but I also tried 
raising the limit to 65k, and Solr continued to hang at the same spot.  It does 
appear to be related to files to an extent, since removing the index/data 
directory of half of the cores does allow Solr to start up normally.

Any help or suggestions are appreciated.

Thanks!


Loading a SpellCheck dynamically

2013-09-05 Thread Mr Havercamp
I currently have multiple spellcheckers configured in my solrconfig.xml to
handle spelling suggestions in different languages.

In the snippet below, I have a catch-all spellcheck as well as an
English-only one for more accurate matching (i.e. my schema.xml is set up
to copy English-only fields to an English-specific textSpell_en field, and
I also copy to a generic textSpell field):


---solrconfig.xml---

<searchComponent name="spellcheck_en" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell_en</str>

  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell_en</str>
    <str name="spellcheckIndexDir">./spellchecker_en</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>

  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

My question is: when I query my Solr index, am I able to load just the
spellcheck values from the spellcheck_en spellchecker rather than from
both? This would be useful if I were to start implementing additional
language spellchecks, e.g. spellcheck_ja, spellcheck_fr, etc. (see the
sketch below).
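
One way to get that selectivity, sketched under the assumption that each
spellcheck component is wired into its own request handler in
solrconfig.xml (the /spell_en handler name here is hypothetical):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class SpellcheckPerLanguageExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // placeholder
        SolrQuery q = new SolrQuery("some englsh query"); // misspelled on purpose
        // Hypothetical handler whose last-components list only the
        // spellcheck_en component, so only English suggestions come back.
        q.setRequestHandler("/spell_en");
        q.set("spellcheck", true);
        server.query(q);
    }
}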


Thanks for any insights.

Cheers


Hayden


bucket count for facets

2013-09-05 Thread Steven Bower
Is there a way to get the count of buckets (i.e. unique values) for a field
facet? The rudimentary approach, of course, is to get back all the buckets
(see the sketch below), but in some cases this is a huge amount of data.
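
For reference, the rudimentary approach counted client-side looks like this
in SolrJ (URL and field name are placeholders); as the question says, the
expensive part is shipping every bucket back just to count them:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BucketCountExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // placeholder
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);                // we only want the facet counts
        q.setFacet(true);
        q.addFacetField("category"); // placeholder field name
        q.setFacetLimit(-1);         // return every bucket -- the expensive part
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getFacetField("category").getValueCount() + " distinct values");
    }
}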

thanks,

steve


Solr documents update on index

2013-09-05 Thread Luis Portela Afonso
Hi,

I'm having a problem when Solr indexes: it is updating documents that are
already indexed. Is this normal behavior? If a document with the same key
already exists, is it supposed to be updated? I was thinking it was supposed
to update only if the information in the RSS feed has changed.

Appreciate your help

-- 
Sent from Gmail Mobile


Odd behavior after adding an additional core.

2013-09-05 Thread mike st. john
Using Solr 4.4, I used the collections API to create a collection with 4
shards and a replication factor of 1.

I did this so I could index my data, then bring in replicas later by adding
cores via the core admin API.


I added a new core via the core admin API. What I noticed shortly after
adding the core: the leader of the shard where the new replica was placed
was marked active, the new core was marked as the leader, and the routing
was now set to implicit.



I've reproduced this on another Solr setup as well.


Any ideas?


Thanks

msj


Re: Numeric fields and payload

2013-09-05 Thread Erick Erickson
Peter:

I don't quite get this. Formatting to display is trivial as it's
usually done for just a few docs anyway. You could also
just store the original unaltered value and add an additional
normalized field.
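
A sketch of that two-field approach at index time; the field names and
values here are made up for illustration:

import org.apache.solr.common.SolrInputDocument;

public class TwoFieldPriceExample {
    public static SolrInputDocument buildDoc() {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "item-1");
        doc.addField("price_display", "$10,000.00"); // stored string, exactly as formatted
        doc.addField("price", 10000.00);             // normalized numeric field for sort/facet/range
        return doc;
    }
}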

Best
Erick


On Wed, Sep 4, 2013 at 2:02 PM, PETER LENAHAN peter_lena...@ibi.com wrote:

 Chris Hostetter hossman_lucene at fucit.org writes:

 
 
  : is it possible to store (text) payload to numeric fields (class
  : solr.TrieDoubleField)?  My goal is to store measure units to numeric
  : features - e.g. '1.5 cm' - and to use faceted search with these fields.
  : But the field type doesn't allow analyzers to add the payload data. I
  : want to avoid database access to load the units. I'm using Solr 4.2 .
 
  I'm not sure if it's possible to add payloads to Trie fields, but even if
  there is i don't think you really want that for your usecase -- i think
 it
  would make a lot more sense to normalize your units so you get consistent
  sorting, range queries, and faceting on the values regardless of whether
  it's 100cm or 1000mm or 1m.
 
  -Hoss
 
 

 Hoss, what you suggest may be fine for specific units, but for monetary
 values with formatting it is not realistic: $10,000.00 would require
 formatting the number for display. It would be much easier to store the
 formatted value as a string payload.


 Peter Lenahan




Re: charfilter doesn't do anything

2013-09-05 Thread Jack Krupansky

And show us an input string and a query that fail.

-- Jack Krupansky

-----Original Message-----
From: Shawn Heisey

Sent: Thursday, September 05, 2013 2:41 PM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

On 9/5/2013 10:03 AM, Andreas Owen wrote:
I would like to filter/replace a word during indexing, but it doesn't do
anything and I don't get an error.


In schema.xml I have the following:

<field name="text_html" type="text_cutHtml" indexed="true" stored="true"
multiValued="true"/>

<fieldType name="text_cutHtml" class="solr.TextField">
  <analyzer>
    <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="Zahlungsverkehr" replacement="ASDFGHJK"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

My second question: where can I say that the expression is multiline, as in
JavaScript where I can use /m at the end of the pattern?


I don't know about your second question.  I don't know if that will be
possible, but I'll leave that to someone who's more expert than I.

As for the first question, here's what I have.  Did you reindex?  That
will be required.

http://wiki.apache.org/solr/HowToReindex

Assuming that you did reindex, are you trying to search for ASDFGHJK in
a field that contains more than just Zahlungsverkehr?  The keyword
tokenizer might not do what you expect - it tokenizes the entire input
string as a single token, which means that you won't be able to search
for single words in a multi-word field without wildcards, which are
pretty slow.

Note that both the pattern and replacement are case sensitive.  This is
how regex works.  You haven't used a lowercase filter, which means that
you won't be able to search for asdfghjk.

Use the analysis tab in the UI on your core to see what Solr does to
your field text.

Thanks,
Shawn 



Re: charfilter doesn't do anything

2013-09-05 Thread Shawn Heisey
On 9/5/2013 10:03 AM, Andreas Owen wrote:
 I would like to filter/replace a word during indexing, but it doesn't do
 anything and I don't get an error.

 In schema.xml I have the following:

 <field name="text_html" type="text_cutHtml" indexed="true" stored="true"
 multiValued="true"/>

 <fieldType name="text_cutHtml" class="solr.TextField">
   <analyzer>
     <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
     <charFilter class="solr.PatternReplaceCharFilterFactory"
                 pattern="Zahlungsverkehr" replacement="ASDFGHJK"/>
     <tokenizer class="solr.KeywordTokenizerFactory"/>
   </analyzer>
 </fieldType>

 My second question: where can I say that the expression is multiline, as in
 JavaScript where I can use /m at the end of the pattern?

I don't know about your second question.  I don't know if that will be
possible, but I'll leave that to someone who's more expert than I.

As for the first question, here's what I have.  Did you reindex?  That
will be required.

http://wiki.apache.org/solr/HowToReindex

Assuming that you did reindex, are you trying to search for ASDFGHJK in
a field that contains more than just Zahlungsverkehr?  The keyword
tokenizer might not do what you expect - it tokenizes the entire input
string as a single token, which means that you won't be able to search
for single words in a multi-word field without wildcards, which are
pretty slow.

Note that both the pattern and replacement are case sensitive.  This is
how regex works.  You haven't used a lowercase filter, which means that
you won't be able to search for asdfghjk.

Use the analysis tab in the UI on your core to see what Solr does to
your field text.

Thanks,
Shawn



Re: More on topic of Meta-search/Federated Search with Solr

2013-09-05 Thread Paul Libbrecht
Hello list,

A student of a friend of mine wrote his master's thesis on that topic,
especially on federated ranking.

I have copied his text here:

http://direct.hoplahup.net/tmp/FederatedRanking-Koblischke-2009.pdf

Feel free to contact me if you would like to reach Robert Koblischke with questions.

Paul


On 28 August 2013, at 20:35, Dan Davis wrote:

 On Mon, Aug 26, 2013 at 9:06 PM, Amit Jha shanuu@gmail.com wrote:
 
 Would you like to create something like
 http://knimbus.com
 
 
 I work at the National Library of Medicine.   We are moving our library
 catalog to a newer platform, and we will probably include articles.   The
 article's content and meta-data are available from a number of web-scale
 discovery services such as PRIMO, Summon, EBSCO's EDS, EBSCO's traditional
 API.   Most libraries use open source solutions to avoid the cost of
 purchasing an expensive enterprise search platform.   We are big; we
 already have a closed-source enterprise search engine (and our own home
 grown Entrez search used for PubMed).Since we can already do Federated
 Search with the above, I am evaluating the effort of adding such to Apache
 Solr.   Because NLM data is used in the open relevancy project, we actually
 have the relevancy decisions to decide whether we have done a good job of
 it.
 
 I obviously think it would be Fun to add Federated Search to Apache Solr.
 
 *Standard disclosure* - my opinions do not represent the opinions of NIH
 or NLM. Fun is no reason to spend tax-payer money. Enhancing Apache
 Solr would reduce the risk of putting all our eggs in one basket, and
 there may be some other relevant benefits.
 
 We do use Apache Solr here for more than one other project... so keep up
 the good work even if my working group decides to go with the closed-source
 solution.



Solr Cell Question

2013-09-05 Thread Jamie Johnson
Is it possible to configure Solr Cell to extract and store only the body of
a document when indexing? I'm currently doing the following, which I
thought would work:

ModifiableSolrParams params = new ModifiableSolrParams();

// Route the extracted text into the "content" field.
params.set("defaultField", "content");

// Keep only nodes under the XHTML body.
params.set("xpath", "/xhtml:html/xhtml:body/descendant::node()");

ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
up.setParams(params);

// FileStream here is org.apache.solr.common.util.ContentStreamBase.FileStream.
ContentStreamBase.FileStream f = new ContentStreamBase.FileStream(new File(..));
up.addContentStream(f);

up.setAction(ACTION.COMMIT, true, true);

solrServer.request(up);


But the result of content is as follows:

<arr name="content_mvtxt">
  <str/>
  <str>null</str>
  <str>ISO-8859-1</str>
  <str>text/plain; charset=ISO-8859-1</str>
  <str>Just a little test</str>
</arr>

What I had hoped for was just:

<arr name="content_mvtxt">
  <str>Just a little test</str>
</arr>
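
One hedged possibility, building on the snippet above: push Tika's metadata
fields into a throw-away prefix so that only the mapped body text survives.
The ignored_* dynamic field is an assumption about the schema, not something
from the original post:

// Hedged sketch: assumes schema.xml defines a dynamicField "ignored_*" with
// an ignored (unindexed, unstored) type, so prefixed metadata is dropped.
params.set("lowernames", "true");            // normalize metadata field names
params.set("uprefix", "ignored_");           // prefix fields unknown to the schema
params.set("fmap.content", "content_mvtxt"); // map the extracted body to the target field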


Solr substring search

2013-09-05 Thread Scott Schneider
Hello,

I'm trying to find out how Solr runs a query for *foo*. Google tells me that
you need to use NGramFilterFactory for that kind of substring search, but I
find that even with very simple fieldTypes, it just works. (Perhaps because
I'm testing on very small data sets, Solr is willing to look through all the
keywords.) For example, this works on the tutorial data.

Can someone tell me exactly how this works and/or point me to the Lucene code 
that implements this?

Thanks,
Scott



Re: Solr 4.3 Startup with Multiple Cores Hangs on Registering Core

2013-09-05 Thread Chris Hostetter

: I currently have Solr 4.3 set up with about 400 cores set to load upon 
: start up.  When starting Solr with an empty index for each core, Solr is 
: able to load all of the cores and start up normally as expected.  
: However, after running a dataimport on all cores and restarting Solr, it 
: hangs at org.apache.solr.core.CoreContainer; registering core: ... 
: without any type of error message in the log.  The process still exists 
: at this point, but doesn't make any progress even if left for a period 
: of time.  Prior to the restart, Solr continues to function normally, and 
: is searchable.

When solr gets into this state, can you generate a thread dump, wait 20-30 
seconds, generate another thread dump, and then send both to the list so 
we can see what's going on at this point?

The easiest way to generate a threaddump is with jstack on the same 
machine...

jstack <pid> > threaddumps.log


: hang at the same spot.  It does appear to be related to files to an 
: extent, since removing the index/data directory of half of the cores 
: does allow Solr to start up normally.

wild shot in the dark -- is it possible you have really large transaction 
logs that are being replayed on startup, because you never did a hard 
commit after indexing?

can you also include in your next email a listing of all the files in all 
the data dirs of the affected solr instance, including file sizes?

something along the lines of this command output from your solr home 
dir...

du -ab */data

?


-Hoss


Re: SolrCloud 4.x hangs under high update volume

2013-09-05 Thread Tim Vaillancourt
Update: it is a bit too soon to tell, but about 6 hours into testing there
are no crashes with this patch. :)

We are pushing 500 batches of 10 updates per second to the 3-node, 3-shard
cluster I mentioned above, i.e. 5000 updates per second in total.

More tomorrow after a 24 hr soak!

Tim

On Wednesday, 4 September 2013, Tim Vaillancourt wrote:

 Thanks so much for the explanation Mark, I owe you one (many)!

 We have this on our high TPS cluster and will run it through it's paces
 tomorrow. I'll provide any feedback I can, more soon! :D

 Cheers,

 Tim



Re: unknown _stream_source_info while indexing rich doc in solr

2013-09-05 Thread Chris Hostetter

: yes sir i did restart the tomcat.

When you look at the Schema Browser for your default Solr core (I'm
guessing it's collection1?), does it list ignored_* as a dynamic field?
Does the URL below show you that ignored_* is using the type "ignored"?
...

http://localhost:8983/solr/#/collection1/schema-browser?dynamic-field=ignored_*

...if not, then you aren't using the schema.xml that you think you are.



-Hoss


solrcloud shards backup/restoration

2013-09-05 Thread Aditya Sakhuja
Hello,

I am looking for a good backup/recovery solution for SolrCloud indexes. I
am specifically looking at restoring the indexes from an index snapshot,
which can be taken using the ReplicationHandler's backup command (see the
sketch at the end of this message).

I am looking for something that works with solrcloud 4.3 eventually, but
still relevant if you tested with a previous version.

I haven't been successful in having the restored index replicate across the
new replicas after I restart all the nodes, with one node holding the
restored index.

Is restoring the indexes on all the nodes the best way to do it?
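
For reference, a sketch of triggering that snapshot per core from SolrJ; the
URL and location are placeholders, and the snapshot lands wherever the
ReplicationHandler is told to put it:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class BackupExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // placeholder
        ModifiableSolrParams p = new ModifiableSolrParams();
        p.set("command", "backup");
        p.set("location", "/backups/solr"); // placeholder directory on the Solr host
        QueryRequest req = new QueryRequest(p);
        req.setPath("/replication");        // address the core's ReplicationHandler
        server.request(req);
    }
}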
-- 
Regards,
-Aditya Sakhuja


data/index naming format

2013-09-05 Thread Aditya Sakhuja
Hello,

I am running Solr 4.1 for now, and am confused about the structure and
naming of the contents of the data dir. I do not see index.properties
being generated on a fresh Solr node start either.

Can someone clarify when one should expect to see

data/index vs. data/index.<timestamp>, and index.properties along with
the second variant?

-- 
Regards,
-Aditya Sakhuja


Re: subindex

2013-09-05 Thread Erick Erickson
Nope. You can do this if you've stored _all_ the fields (with the
exception of _version_ and the destinations of copyField directives). But
there's no way I know of to do what you want if you haven't.

If you have, you'd essentially be spinning through all your docs and
re-indexing just the fields you care about. But if you still have access
to your original docs, this would be slower/more complicated than just
re-indexing from scratch.

Best
Erick


On Wed, Sep 4, 2013 at 1:51 PM, Peyman Faratin pey...@robustlinks.comwrote:

 Hi

 Is there a way to build a new (smaller) index from an existing (larger)
 index where the smaller index contains a subset of the fields of the larger
 index?

 thank you


Re: data/index naming format

2013-09-05 Thread Shawn Heisey
On 9/5/2013 6:48 PM, Aditya Sakhuja wrote:
 I am running Solr 4.1 for now, and am confused about the structure and
 naming of the contents of the data dir. I do not see index.properties
 being generated on a fresh Solr node start either.

 Can someone clarify when one should expect to see

 data/index vs. data/index.<timestamp>, and index.properties along with
 the second variant.

I have never seen an index.properties file get created.  I've used
versions from 1.4.0 through 4.4.0.

Generally when you have an index.<timestamp> directory, it's because
you're doing replication.  There may be other circumstances when it
appears, but I do not know what those are.

As for the other files in the index directory, here's Lucene's file
format documentation:

http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#package_description

Thanks,
Shawn



Re: data/index naming format

2013-09-05 Thread Jason Hellman
The circumstance in which I've most typically seen index.<timestamp> show up
is when an update is sent to a slave server.  Replication then appears to
preserve the updated slave index in a separate folder while still respecting
the correct data from the master.

On Sep 5, 2013, at 8:03 PM, Shawn Heisey s...@elyograg.org wrote:

 On 9/5/2013 6:48 PM, Aditya Sakhuja wrote:
 I am running solr 4.1 for now, and am confused about the structure and
 naming of the contents of the data dir. I do not see the index.properties
 being generated on a fresh solr node start either.
 
 Can someone clarify when should one expect to see
 
  data/index vs. data/index.<timestamp>, and index.properties along with
 the second version.
 
 I have never seen an index.properties file get created.  I've used
 versions from 1.4.0 through 4.4.0.
 
  Generally when you have an index.<timestamp> directory, it's because
 you're doing replication.  There may be other circumstances when it
 appears, but I do not know what those are.
 
 As for the other files in the index directory, here's Lucene's file
 format documentation:
 
 http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#package_description
 
 Thanks,
 Shawn