Re: DIH throws NullPointerException when using dataimporter.functions.escapeSql with parent entities

2012-10-19 Thread Lance Norskog
If it worked before and does not work now, I don't think you are doing anything 
wrong :)

Do you have a different version of your JDBC driver?
Can you make a unit test with a minimal DIH script and schema?
Or, scan through all of the JIRA issues against the DIH from your old Solr 
capture date.


- Original Message -
| From: "Dominik Siebel" 
| To: solr-user@lucene.apache.org
| Sent: Thursday, October 18, 2012 11:22:54 PM
| Subject: Fwd: DIH throws NullPointerException when using 
dataimporter.functions.escapeSql with parent entities
| 
| Hi folks,
| 
| I am currently migrating our Solr servers from a 4.0.0 nightly build
| (approx. November 2011, which worked very well) to the newly released
| 4.0.0 and am running into some issues concerning the existing
| DataImportHandler configurations. Maybe you have an idea where I am
| going wrong here.
| 
| The following lines are a highly simplified excerpt from one of the
| problematic imports:
| 
| [DIH configuration excerpt stripped by the mail archive]
| 
| While this configuration worked without any problem for over half a
| year now, when upgrading to 4.0.0-BETA AND 4.0.0 the import throws the
| following stack trace and exits:
| 
|  SEVERE: Exception while processing: path document :
| null:org.apache.solr.handler.dataimport.DataImportHandlerException:
| java.lang.NullPointerException
| 
| which is caused by
| 
| Caused by: java.lang.NullPointerException
| at
| 
org.apache.solr.handler.dataimport.EvaluatorBag$1.evaluate(EvaluatorBag.java:79)
| 
| In other words: The EvaluatorBag doesn't seem to resolve the given
| path.name variable properly and returns null.
| 
| Does anyone have any idea?
| Appreciate your input!
| 
| Regards
| Dom
| 
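For reference, a minimal parent/child DIH config exercising escapeSql looks roughly like this (the data source, table, and column names below are hypothetical):

  <dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/db"
                user="user" password="pass"/>
    <document>
      <!-- parent entity supplies ${path.name} to the child query -->
      <entity name="path" query="SELECT id, name FROM paths">
        <entity name="item"
                query="SELECT * FROM items
                       WHERE path_name = '${dataimporter.functions.escapeSql(path.name)}'"/>
      </entity>
    </document>
  </dataConfig>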


Re: "diversity" of search results?

2012-10-19 Thread Otis Gospodnetic
Hi Paul,

We've done this for a client in the past via a custom SearchComponent
and it worked well.  Yes, it involved some post-processing, but on the
server, not client.  I *think* we saw 10% performance degradation.

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html


On Fri, Oct 19, 2012 at 3:26 AM, Paul Libbrecht  wrote:
> Hello SOLR expert,
>
> yesterday in our group we realized that a danger we may need to face is that
> a search result includes very similar results.
> Of course, one would expect some skimming so that near-duplicates showing
> almost the same content in a search result are avoided, but we fear that
> this is not possible.
>
> I was wondering whether any technology, plugin, or even research exists
> that would enable a search result to be partially reordered so that
> "diversity" is ensured, at least for the first page of results.
>
> I suppose that might be doable by processing the result page and the next
> (and the five next?) and pushing down some results if they are "too"
> similar to previous ones.
>
> Hope I am being clear.
>
> Paul
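The wiring for such a post-processing component is standard solrconfig.xml plumbing; a minimal sketch, assuming a hypothetical class name (Otis's actual implementation is not public):

  <searchComponent name="diversity" class="com.example.DiversityReorderComponent"/>

  <requestHandler name="/select" class="solr.SearchHandler">
    <arr name="last-components">
      <!-- runs after the standard components, so it can reorder the collected results -->
      <str>diversity</str>
    </arr>
  </requestHandler>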


Re: Benchmarking/Performance Testing question

2012-10-19 Thread Otis Gospodnetic
Hi Amit,

I'm not sure I follow what you are after...
Yes, seeing how queries that result in cache misses perform is
valuable (esp. if you have a low cache hit rate in production).
But figuring out if you chose a bad field type, a bad faceting method,
or the like doesn't require profiling - you can review configs and logs
and such and quickly find performance issues.

In production (or dev, really, too) you can use tools like SPM for
Solr or NewRelic.  SPM will show you performance breakdown over all
Solr SearchComponents used in searches.  NewRelic has non-free plans
that also let you do on-demand profiling, so you could profile Solr in
production, which can be handy.

HTH,
Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html


On Fri, Oct 19, 2012 at 12:02 PM, Amit Nithian  wrote:
> Hi all,
>
> I know there have been many posts about this already and I have done
> my best to read through them but one lingering question remains. When
> doing performance testing on a Solr instance (under normal production
> like circumstances, not the ones where commits are happening more
> frequently than necessary), is there any value in performance testing
> against a server with caches *disabled* with a profiler hooked up to
> see where queries in the absence of a cache are spending the most
> time?
>
> The reason I am asking this is to tune things like field types, using
> tint vs regular int, different precision steps etc. Or maybe sorting
> is taking a long time and the profiler shows an inordinate amount of
> time spent there etc. so either we find a different way to solve that
> particular problem. Perhaps we are faceting on something bad etc. Then
> we can optimize those to at least not be as slow and then ensure that
> caching is tuned properly so that cache misses don't yield these
> expensive spikes.
>
> I'm trying to devise a proper performance test for any new
> features/config changes and wanted to get some feedback on whether or
> not this approach makes sense. Of course performance testing against a
> typical production setup *with* caching will also be done to make sure
> things behave as expected.
>
> Thanks!
> Amit
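As an aside, setting up the no-cache test run is straightforward since Solr's caches are optional; one way (a sketch - commenting the cache entries out of solrconfig.xml entirely also works) is to zero them out:

  <query>
    <!-- size="0" effectively disables each cache for the test run -->
    <filterCache class="solr.FastLRUCache" size="0" initialSize="0" autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>
    <documentCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>
  </query>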


Re: Solr-4.0.0 DIH not indexing xml attributes

2012-10-19 Thread Lance Norskog
Do other fields get added?
Do these fields have type problems? I.e. is 'attr1' a number and you are adding 
a string?
There is a logging EP that I think shows the data found - I don't know how to
use it.
Is it possible to post the whole DIH script?

- Original Message -
| From: "Billy Newman" 
| To: solr-user@lucene.apache.org
| Sent: Friday, October 19, 2012 9:06:08 AM
| Subject: Solr-4.0.0 DIH not indexing xml attributes
| 
| Hello all,
| 
| I am having problems indexing xml attributes using the DIH.
| 
| I have the following xml:
| 
| [XML sample and DIH config stripped by the mail archive]
| 
| However nothing is getting inserted into my index.
| 
| I am pretty sure this should work so I have no idea what is wrong.
| 
| Can anyone else confirm that this is a problem?  Or is it just me?
| 
| Thanks,
| Billy
| 
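For comparison, attribute indexing with the XPathEntityProcessor normally looks like the sketch below (the file name, forEach path, and field names are made up since the original sample was stripped; this sits inside the usual dataConfig/dataSource wrapper):

  <entity name="doc" processor="XPathEntityProcessor"
          url="items.xml" forEach="/items/item">
    <!-- '@' selects an XML attribute rather than element text -->
    <field column="attr1" xpath="/items/item/@attr1"/>
  </entity>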


Re: Transient commit errors during autocommit

2012-10-19 Thread Casey Callendrello

Lance,
I have seen this error when the Solr process hit the maximum number of file 
descriptors (because the commit triggered an optimize). Make sure your 
maxfds is set as high as possible. In my case, 1024 was not nearly 
sufficient.


--Casey
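A quick way to check and raise the limit from the shell that launches Tomcat (raising the hard limit may need root or a limits.conf entry, depending on the OS):

  ulimit -n          # show the current per-process file descriptor limit
  ulimit -n 65536    # raise it before starting Tomcat/Solr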


On 10/19/12 6:20 PM, Lance Norskog wrote:

When a transient error happens during an autocommit, the error does not cause a 
safe rollback or notify the user there was a problem. Instead, there is a write 
lock failure and Solr has to be restarted. It runs fine after restart.

Is this a known problem? Is it fixable? Is it unit-test-able?









Re: need help with exact match search

2012-10-19 Thread geeky2
hello jack,

thank you very much for the reply - i will re-test and let you know.

really appreciate it ;)

thx
mark




--
View this message in context: 
http://lucene.472066.n3.nabble.com/need-help-with-exact-match-search-tp4014832p4014848.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: need help with exact match search

2012-10-19 Thread Jack Krupansky
Because you used solr.StandardTokenizerFactory, which tokenizes terms at 
some delimiters - such as the hyphens that surround your errant 404 case.


Try solr.WhitespaceTokenizerFactory or solr.KeywordTokenizerFactory.

And maybe rename your field type from "text_general_trim" to "text_exact" 
since "general" implies a general text analyzer.


Test your field type changes on the Solr Admin Analysis page.

-- Jack Krupansky
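As an illustration of the above, an exact-match type built on the keyword tokenizer might look like this (a sketch; whether you want the trim filter depends on how the source data is normalized):

  <fieldType name="text_exact" class="solr.TextField">
    <analyzer>
      <!-- keep the whole field value as a single token -->
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.TrimFilterFactory"/>
    </analyzer>
  </fieldType>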

-Original Message- 
From: geeky2

Sent: Friday, October 19, 2012 5:20 PM
To: solr-user@lucene.apache.org
Subject: need help with exact match search

environment: solr 3.5

Hello,

i have a query for an exact match that is bringing back one (1) additional
record that is NOT an exact match.

when i do an exact match search for 404 - i should get back three (3)
documents, *but i get back the additional record, with an
itemModelNoExactMatchStr of DUS-404-19  *

can someone help me understand what i am missing or not setting up
correctly?


response from solr with 4 documents



 
   0
   1
   
 itemModelNoExactMatchStr asc
 itemType:2
 all
 itemModelNoExactMatchStr^30.0
 *:*
 50
 edismax
 true
   *  itemModelNoExactMatchStr:404*
 modelItemNoSearch
 50
 false
   
 
 **
   
 
   Kitchen Equipment*
 
 0212020
 0212020,0431  ,404

 ELECTRIC GENERAL SLICER WITH VACU BASE
* 404*
 404

 2
 13
 
   GENERAL
 
 0431  
 0
   
   
 
   Vacuum, Canister
 
 0642000
 0642000,0517  ,404

 HOOVER 
 404
* 404
*
 2
 48
 
   HOOVER
 
 0517  
 0
   
   
 
   Power roller
 
 0733200
 0733200,1164  ,404

 POWER PAINTER
 404
* 404
*
 2
 39
 
   WAGNER
 
 1164  
 0
   
   
 
   Dishwasher^
 
 013
 013,0164  ,DUS-404-19

 DISHWASHERS
 DUS-404-19 
 *DUS-404-19
*
 2
 185
 
   CALORIC
 
 0164  
 0
   
 
 
   itemModelNoExactMatchStr:404
   itemModelNoExactMatchStr:404
   +itemModelNoExactMatchStr:404
   +itemModelNoExactMatchStr:404
   
 
10.053003 = (MATCH) fieldWeight(itemModelNoExactMatchStr:404 in 4745495),
product of:
 1.0 = tf(termFreq(itemModelNoExactMatchStr:404)=1)
 10.053003 = idf(docFreq=971, maxDocs=8304922)
 1.0 = fieldNorm(field=itemModelNoExactMatchStr, doc=4745495)

 
10.053003 = (MATCH) fieldWeight(itemModelNoExactMatchStr:404 in 4781972),
product of:
 1.0 = tf(termFreq(itemModelNoExactMatchStr:404)=1)
 10.053003 = idf(docFreq=971, maxDocs=8304922)
 1.0 = fieldNorm(field=itemModelNoExactMatchStr, doc=4781972)

 
10.053003 = (MATCH) fieldWeight(itemModelNoExactMatchStr:404 in 8186768),
product of:
 1.0 = tf(termFreq(itemModelNoExactMatchStr:404)=1)
 10.053003 = idf(docFreq=971, maxDocs=8304922)
 1.0 = fieldNorm(field=itemModelNoExactMatchStr, doc=8186768)

 
5.0265017 = (MATCH) fieldWeight(itemModelNoExactMatchStr:404 in 4665718),
product of:
 1.0 = tf(termFreq(itemModelNoExactMatchStr:404)=1)
 10.053003 = idf(docFreq=971, maxDocs=8304922)
 0.5 = fieldNorm(field=itemModelNoExactMatchStr, doc=4665718)

   
   ExtendedDismaxQParser
   
   
   
 itemType:2
   
   
 itemType:2
   
   
 1.0
 
   1.0
   
 1.0
   
   
 0.0
   
   
 0.0
   
   
 0.0
   
   
 0.0
   
   
 0.0
   
 
 
   0.0
   
 0.0
   
   
 0.0
   
   
 0.0
   
   
 0.0
   
   
 0.0
   
   
 0.0
   
 
   
 





i have looked at some of the threads up here related to this topic, but
still do not understand why the additional document is coming back.

here is my query:

http://someserver/somecore/select?qt=modelItemNoSearch&q=itemModelNoExactMatchStr:404&debugQuery=true&rows=50


here is my RH from the solrconfig.xml

 
   
 edismax
 all
 10
 itemModelNoExactMatchStr^30.0
 *:*
   
   
 itemType:2
 itemModelNoExactMatchStr asc
   
   
 false
   
 


here is the field, copyField and text type from schema.xml

   


   
 
   
   
   
 
 
   

   
 
   





--
View this message in context: 
http://lucene.472066.n3.nabble.com/need-help-with-exact-match-search-tp4014832.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Solr 4.0 simultaneous query problem

2012-10-19 Thread Rohit Harchandani
Hi,

The same query is always fired for 500 rows. The only thing different is
the "start" parameter.

The 3 shards are in the same instance on the same server. They all have the
same schema. But the inherent type of the documents is different. Also most
of the app's queries go to shard "A", which has the smallest index size
(4gb).

The query is made to a "master" shard which by default goes to all 3 shards
for results. (also, the query that i am trying matches documents only
in shard "A" mentioned above)

Will try debugQuery now and post it here.

Thanks,
Rohit




On Thu, Oct 18, 2012 at 11:00 PM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> Hi,
>
> Maybe you can narrow this down a little further.  Are there some
> queries that are faster and some slower?  Is there a pattern?  Can you
> share examples of slow queries?  Have you tried &debugQuery=true?
> These 3 shards is each of them on its own server or?  Is the slow
> one always the one that hits the biggest shard?  Do they hold the same
> type of data?  How come their sizes are so different?
>
> Otis
> --
> Search Analytics - http://sematext.com/search-analytics/index.html
> Performance Monitoring - http://sematext.com/spm/index.html
>
>
> On Thu, Oct 18, 2012 at 12:22 PM, Rohit Harchandani 
> wrote:
> > Hi all,
> > I have an application which queries a solr instance having 3 shards(4gb,
> > 13gb and 30gb index size respectively) having 6 million documents in all.
> > When I start 10 threads in my app to make simultaneous queries (with
> > rows=500 and different start parameter, sort on 1 field and no facets) to
> > solr to return 500 different documents in each query, sometimes I see
> that
> > most of the responses come back within no time (500ms-1000ms), but the
> last
> > response takes close to 50 seconds (Qtime).
> > I am using the latest 4.0 release. What is the reason for this delay? Is
> > there a way to prevent this?
> > Thanks and regards,
> > Rohit
>


Re: [/solr] memory leak prevent tomcat shutdown

2012-10-19 Thread Jie Sun
found a solr/lucene bug : TimeLimitingCollector starts thread in static {}
with no way to stop them
 https://issues.apache.org/jira/browse/LUCENE-2822

is this the same issue? it is fixed in Lucene 3.5, but I am using Solr 3.5
with Lucene 2.9.3 (matched Lucene version).

can anyone shed some light on if this means I need to upgrade to lucene 3.5?
thanks
jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-memory-leak-prevent-tomcat-shutdown-tp4014788p4014833.html
Sent from the Solr - User mailing list archive at Nabble.com.


need help with exact match search

2012-10-19 Thread geeky2
environment: solr 3.5

Hello,

i have a query for an exact match that is bringing back one (1) additional
record that is NOT an exact match.  

when i do an exact match search for 404 - i should get back three (3)
documents, *but i get back the additional record, with an
itemModelNoExactMatchStr of DUS-404-19  *   

can someone help me understand what i am missing or not setting up
correctly?


response from solr with 4 documents 



  
0
1

  itemModelNoExactMatchStr asc
  itemType:2
  all
  itemModelNoExactMatchStr^30.0
  *:*
  50
  edismax
  true
*  itemModelNoExactMatchStr:404*
  modelItemNoSearch
  50
  false

  
  **

  
Kitchen Equipment*
  
  0212020
  0212020,0431  ,404   

  ELECTRIC GENERAL SLICER WITH VACU BASE
 * 404*
  404   

  2
  13
  
GENERAL
  
  0431  
  0


  
Vacuum, Canister
  
  0642000
  0642000,0517  ,404   

  HOOVER 
  404
 * 404   
*
  2
  48
  
HOOVER
  
  0517  
  0


  
Power roller
  
  0733200
  0733200,1164  ,404   

  POWER PAINTER
  404
 * 404   
*
  2
  39
  
WAGNER
  
  1164  
  0


  
Dishwasher^
  
  013
  013,0164  ,DUS-404-19

  DISHWASHERS
  DUS-404-19 
  *DUS-404-19
*
  2
  185
  
CALORIC
  
  0164  
  0

  
  
itemModelNoExactMatchStr:404
itemModelNoExactMatchStr:404
+itemModelNoExactMatchStr:404
+itemModelNoExactMatchStr:404

  
10.053003 = (MATCH) fieldWeight(itemModelNoExactMatchStr:404 in 4745495),
product of:
  1.0 = tf(termFreq(itemModelNoExactMatchStr:404)=1)
  10.053003 = idf(docFreq=971, maxDocs=8304922)
  1.0 = fieldNorm(field=itemModelNoExactMatchStr, doc=4745495)

  
10.053003 = (MATCH) fieldWeight(itemModelNoExactMatchStr:404 in 4781972),
product of:
  1.0 = tf(termFreq(itemModelNoExactMatchStr:404)=1)
  10.053003 = idf(docFreq=971, maxDocs=8304922)
  1.0 = fieldNorm(field=itemModelNoExactMatchStr, doc=4781972)

  
10.053003 = (MATCH) fieldWeight(itemModelNoExactMatchStr:404 in 8186768),
product of:
  1.0 = tf(termFreq(itemModelNoExactMatchStr:404)=1)
  10.053003 = idf(docFreq=971, maxDocs=8304922)
  1.0 = fieldNorm(field=itemModelNoExactMatchStr, doc=8186768)

  
5.0265017 = (MATCH) fieldWeight(itemModelNoExactMatchStr:404 in 4665718),
product of:
  1.0 = tf(termFreq(itemModelNoExactMatchStr:404)=1)
  10.053003 = idf(docFreq=971, maxDocs=8304922)
  0.5 = fieldNorm(field=itemModelNoExactMatchStr, doc=4665718)


ExtendedDismaxQParser



  itemType:2


  itemType:2


  1.0
  
1.0

  1.0


  0.0


  0.0


  0.0


  0.0


  0.0

  
  
0.0

  0.0


  0.0


  0.0


  0.0


  0.0


  0.0

  

  





i have looked at some of the threads up here related to this topic, but
still do not understand why the additional document is coming back.

here is my query:

http://someserver/somecore/select?qt=modelItemNoSearch&q=itemModelNoExactMatchStr:404&debugQuery=true&rows=50


here is my RH from the solrconfig.xml

  

  edismax
  all
  10
  itemModelNoExactMatchStr^30.0
  *:*


  itemType:2
  itemModelNoExactMatchStr asc


  false

  


here is the field, copyField and text type from schema.xml


 


  



  
  



  






--
View this message in context: 
http://lucene.472066.n3.nabble.com/need-help-with-exact-match-search-tp4014832.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 4.0 copyField not applying index analyzers

2012-10-19 Thread Jack Krupansky
What exactly is the symptom? Give us an example with the field names of
source and dest and what precise value is in fact being indexed. Is the
entire field value being indexed as a single term/string (i.e., is the
analyzer not being applied)? Or what?


-- Jack Krupansky

-Original Message- 
From: davers

Sent: Friday, October 19, 2012 2:51 PM
To: solr-user@lucene.apache.org
Subject: Solr 4.0 copyField not applying index analyzers

I am upgrading from Solr 3.6 to Solr 4.0 and my copyFields do not seem to be
applying the index analyzers. I'm sure there is something I'm missing in my
schema.xml. I am also using a DIH, but I'm not sure that matters.

[schema.xml excerpt stripped by the mail archive]

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-0-copyField-not-applying-index-analyzers-tp4014811.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Solr 4.0.0 - index version and generation not changed after delete by query on master

2012-10-19 Thread Bill Au
It's not the browser cache.  I have tried reloading the admin page and
accessing the admin page from another machine.  Both show the older index
version and generation.  On the slave, replication did kick in and shows
the new index version and generation for the slave.  But the slave admin
page also shows the older index version and generation for the master.

If I do a second delete by query on the master, the master index generation
reported by the admin UI does go up by one on both the master and slave.  But
it is still one generation behind.

Bill

On Fri, Oct 19, 2012 at 7:09 AM, Erick Erickson wrote:

> I wonder if you're getting hit by the browser caching the admin page and
> serving up the old version? What happens if you try from a different
> browser or purge the browser cache?
>
> Of course you have to refresh the master admin page, there's no
> automatic update but I assume you did that.
>
> Best
> Erick
>
> On Thu, Oct 18, 2012 at 1:59 PM, Bill Au  wrote:
> > Just discovered that the replication admin REST API reports the correct
> > index version and generation:
> >
> > http://master_host:port/solr/replication?command=indexversion
> >
> > So is this a bug in the admin UI?
> >
> > Bill
> >
> > On Thu, Oct 18, 2012 at 11:34 AM, Bill Au  wrote:
> >
> >> I just upgraded to Solr 4.0.0.  I noticed that after a delete by query,
> >> the index version, generation, and size remain unchanged on the master
> even
> >> though the documents have been deleted (num docs changed and those
> deleted
> >> documents no longer show up in query responses).  But on the slave both
> the
> >> index version, generation, and size are updated.  So I thought the master
> >> and slave were out of sync but in reality that is not true.
> >>
> >> What's going on here?
> >>
> >> Bill
> >>
>


number and minus operator

2012-10-19 Thread calmsoul
I have a document with name ABC 102030 XYZ and if i search for this document
with ABC and -"10" then i dont get this document (which is correct behavior)
but when i do ABC and -10 i don't get the correct result back.  Any
explanation around this. 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/number-and-minus-operator-tp4014794.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Easy question ? docs with empty geodata field

2012-10-19 Thread Amit Nithian
So here is my spec for lat/long (similar to yours except I explicitly
define the sub-field names for clarity):

[fieldType definition stripped by the mail archive]
So then the query would be location_0_latLon:[* TO *].

Looking at your schema, my guess would be:
location_0_coordinate:[* TO *]
location_1_coordinate:[* TO *]

Let me know if that helps
Amit

On Fri, Oct 19, 2012 at 9:37 AM, darul  wrote:
> Your idea looks great but with this schema info:
>
> [schema excerpt partially stripped by the mail archive; what survives shows
> subFieldSuffix="_d", subFieldSuffix="_coordinate" and stored="false"]
>
> How can I use it?
>
> fq=location_coordinate:[1 to *] not working, for instance
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Easy-question-docs-with-empty-geodata-field-tp4014751p4014779.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: [/solr] memory leak prevent tomcat shutdown

2012-10-19 Thread Jie Sun
by the way, I am running tomcat 6, solr 3.5 on redhat 2.6.18-274.el5 #1 SMP
Fri Jul 8 17:36:59 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-memory-leak-prevent-tomcat-shutdown-tp4014788p4014792.html
Sent from the Solr - User mailing list archive at Nabble.com.


[/solr] memory leak prevent tomcat shutdown

2012-10-19 Thread Jie Sun
very often when we try to shut down tomcat, we get the following error in
catalina.out indicating a solr thread cannot be stopped; tomcat ends up
hanging and we have to kill -9, which we think leads to some core corruption in
our production environment. please help ...

catalina.out:

... ...

Oct 19, 2012 10:17:22 AM org.apache.catalina.loader.WebappClassLoader
clearReferencesThreads
SEVERE: The web application [/solr] appears to have started a thread named
[pool-69-thread-1] but has failed to stop it. This is very likely to create
a memory leak.

Then I used kill -3 to signal the thread dump, here is what I get (note the
thread [pool-69-thread-1] is hanging) :

2012-10-19 10:18:39
Full thread dump Java HotSpot(TM) 64-Bit Server VM (20.2-b06 mixed mode):

"DestroyJavaVM" prio=10 tid=0x55b39800 nid=0x7e82 waiting on
condition [0x]
   java.lang.Thread.State: RUNNABLE

"pool-69-thread-1" prio=10 tid=0x2aaabcb41800 nid=0x19fa waiting on
condition [0x4205e000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0006de699d80> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(Unknown Source)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Unknown
Source)
at java.util.concurrent.LinkedBlockingQueue.take(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

"JDWP Transport Listener: dt_socket" daemon prio=10 tid=0x578aa000
nid=0x19f9 runnable [0x]
   java.lang.Thread.State: RUNNABLE

... ...



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-memory-leak-prevent-tomcat-shutdown-tp4014788.html
Sent from the Solr - User mailing list archive at Nabble.com.


Highlighter isn't highlighting what is matched in query analyzer

2012-10-19 Thread Ali Nabavi
Hi, all.

The content I'm trying to index contains dollar signs that should be
indexed and matched, e.g., "$1".

I've set up my schema to index the dollar sign, and am able to successfully
match it with the query analyzer; searching for "$1" matches "$1".

However, the highlighter doesn't seem to recognize the dollar sign.  When I
submit a query for "$1", the results do contain highlighted results, but
the highlights appear like "$1"; the dollar sign is not
highlighted.

How can I ensure that the highlighter will highlight the entirety of what
is matched in the query analyzer tool?

-Ali


Re: Easy question ? docs with empty geodata field

2012-10-19 Thread darul
Your idea looks great but with this schema info:

[schema excerpt stripped by the mail archive]

How can I use it?

fq=location_coordinate:[1 to *] not working, for instance





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Easy-question-docs-with-empty-geodata-field-tp4014751p4014779.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Sorl 4.0: ClassNotFoundException DataImportHandler

2012-10-19 Thread srinalluri
Thanks Chris for your reply. I really need some help here.

1) If I put the apache-solr-dataimporthandler-*.jar files in solr/lib
folder, the jar files are loading. I see that in the tomcat logs. But in the
end it says 'ClassNotFoundException DataImportHandler'.

2) So if I remove the apache-solr-dataimporthandler-*.jar files from the solr/lib
folder and place them in the tomcat/lib folder, there is no more ClassNotFoundException. But
this time it says 'Error Instantiating Request Handler,
org.apache.solr.handler.dataimport.DataImportHandler failed to instantiate
org.apache.solr.request.SolrRequestHandler'.
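In case it helps others hitting this: rather than copying the jars into solr/lib or tomcat/lib, the usual Solr 4.0 approach is a <lib> directive in solrconfig.xml (the dir below assumes the stock example layout; adjust the path to wherever the dist jars live):

  <lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />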








--
View this message in context: 
http://lucene.472066.n3.nabble.com/Sorl-4-0-ClassNotFoundException-DataImportHandler-tp4014348p4014770.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Easy question ? docs with empty geodata field

2012-10-19 Thread Amit Nithian
What about querying on the dynamic lat/long field to see if there are
documents that do not have the dynamic _latlon0 or whatever defined?

On Fri, Oct 19, 2012 at 8:17 AM, darul  wrote:
> I have already tried but get a nice exception because of this field type:
>
> [field type and exception stripped by the mail archive]
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Easy-question-docs-with-empty-geodata-field-tp4014751p4014763.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Solr-4.0.0 DIH not indexing xml attributes

2012-10-19 Thread Billy Newman
Hello all,

I am having problems indexing xml attributes using the DIH.

I have the following xml:

[XML sample and DIH config stripped by the mail archive]
However nothing is getting inserted into my index.

I am pretty sure this should work so I have no idea what is wrong.

Can anyone else confirm that this is a problem?  Or is it just me?

Thanks,
Billy


Benchmarking/Performance Testing question

2012-10-19 Thread Amit Nithian
Hi all,

I know there have been many posts about this already and I have done
my best to read through them but one lingering question remains. When
doing performance testing on a Solr instance (under normal production
like circumstances, not the ones where commits are happening more
frequently than necessary), is there any value in performance testing
against a server with caches *disabled* with a profiler hooked up to
see where queries in the absence of a cache are spending the most
time?

The reason I am asking this is to tune things like field types, using
tint vs regular int, different precision steps etc. Or maybe sorting
is taking a long time and the profiler shows an inordinate amount of
time spent there etc. so either we find a different way to solve that
particular problem. Perhaps we are faceting on something bad etc. Then
we can optimize those to at least not be as slow and then ensure that
caching is tuned properly so that cache misses don't yield these
expensive spikes.

I'm trying to devise a proper performance test for any new
features/config changes and wanted to get some feedback on whether or
not this approach makes sense. Of course performance testing against a
typical production setup *with* caching will also be done to make sure
things behave as expected.

Thanks!
Amit


Re: Getting count for Multi-Select Faceting

2012-10-19 Thread fbrisbart
Did you think of using 'facet.query'?
Adding '&facet.query=category:Article' to your url should return what
you expected.

Franck Brisbart
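Applied to the sample query from the original mail, that would look something like this (note the {!ex=scat} exclusion on the facet.query as well, so its count is computed as if the category filter were not applied):

http://192.168.160.2:8983/solr/select?q=*:*&facet=true&fq={!tag=scat}category:Article&facet.field={!ex=scat}category&facet.query={!ex=scat}category:Article&facet.limit=5&facet.mincount=1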



Le vendredi 19 octobre 2012 à 15:18 +0200, Stephane Gamard a écrit :
> Hi all, 
> 
> Congrats on the 4.0.0 delivery, it's a pleasure to work with! 
> 
> I have a small problem that I am trying to "elegantly" resolve: while using 
> multi-select faceting it might happen that a facet is selected which is not 
> part of the facet list (due to limit for example). When executing the query I 
> cannot then get the facet's value count as it is still outside of the scope of 
> the limit. 
> 
> for a sample query: 
> http://192.168.160.2:8983/solr/select?fq={!tag=scat}category:Article&facet.field={!ex=scat}category&q=*:*&facet=true&facet.limit=5&facet.mincount=1
> 
> I have the following results:
> 
> 
> 
> 6225
> 3055
> 236
> 187
> 59
> 
> 
> 
> Note that the facet (category:Article) is not present within the facet_fields 
> result. I've thought of running 2 facet queries where one is not tagged and 
> merging the 2 lists within the UI. Is that the best solution available, or 
> should the facet of fq be present (as sticky) within the facet_list? 
> 
> Cheers, 
> 
> _Stephane




Re: Easy question ? docs with empty geodata field

2012-10-19 Thread Tanguy Moal
Hello,

Did you try q=-geodata:[* TO *] ? (Note the '-' (minus))
This reads as "documents without any value for field named geodata".

Also if you plan to use this intensively, you'd better declare a boolean
field telling whether geodata is set or not and set a value on each doc,
because -field_name:[* TO *] is an expensive query, especially on large
data sets.

Regards,

--
Tanguy
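A sketch of that flag-field approach (the field name is made up, and the flag has to be populated by your indexing code):

  <!-- schema.xml -->
  <field name="has_geodata" type="boolean" indexed="true" stored="false"/>

Docs without geodata are then found with the cheap filter fq=has_geodata:false.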

2012/10/19 darul 

> sorry, I mean this field called "geodata" in my schema
>
> [field definition partially stripped by the mail archive; it shows
> subFieldSuffix="_coordinate"]
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Easy-question-docs-with-empty-geodata-field-tp4014751p4014752.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Data Writing Performance of Solr 4.0

2012-10-19 Thread Mark Miller
On Fri, Oct 19, 2012 at 2:50 AM, higashihara_hdk
 wrote:
> Hello everyone.
>
> I have two questions. I am considering using Solr 4.0 to perform full
> searches on the data output in real-time by a Storm cluster
> (http://storm-project.net/).
>
> 1. In particular, I'm concerned whether Solr would be able to keep up
> with the 2000-message-per-second throughput of the Storm cluster. What
> kind of throughput would I be able to expect from Solr 4.0, for example
> on a Xeon 2.5GHz 4-core with HDD?

It depends on the size of the messages and the analysis you will be applying.

But without any other info, yes, it's possible depending on your data
and how you massage it.

>
> 2. Also, how efficiently would Solr scale with clustering?

That's a pretty general question.


-- 
- Mark


Re: Easy question ? docs with empty geodata field

2012-10-19 Thread darul
sorry, I mean this field called "geodata" in my schema

[field definition stripped by the mail archive]

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Easy-question-docs-with-empty-geodata-field-tp4014751p4014752.html
Sent from the Solr - User mailing list archive at Nabble.com.


Getting count for Multi-Select Faceting

2012-10-19 Thread Stephane Gamard
Hi all, 

Congrats on the 4.0.0 delivery, it's a pleasure to work with! 

I have a small problem that I am trying to "elegantly" resolve: while using 
multi-select faceting it might happen that a facet is selected which is not 
part of the facet list (due to limit for example). When executing the query I 
cannot then get the facet's value count as it is still outside of the scope of the 
limit. 

for a sample query: 
http://192.168.160.2:8983/solr/select?fq={!tag=scat}category:Article&facet.field={!ex=scat}category&q=*:*&facet=true&facet.limit=5&facet.mincount=1

I have the following results:



6225
3055
236
187
59



Note that the facet (category:Article) is not present within the facet_fields 
result. I've thought of running 2 facet queries where one is not tagged and 
merging the 2 lists within the UI. Is that the best solution available, or should 
the facet of fq be present (as sticky) within the facet_list? 

Cheers, 

_Stephane

Easy question ? docs with empty geodata field

2012-10-19 Thread darul
Hello,

Looking to get all documents with empty geolocalisation field, I have not
found any way to do it, with ['' to *], 

geodata being a specific field, do you have any solution ?

Thanks,

Jul



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Easy-question-docs-with-empty-geodata-field-tp4014751.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Query related to Solr XML

2012-10-19 Thread Otis Gospodnetic
Leena,

Please ask on Lucid fora. You'll get better and faster help there.

Otis
--
Performance Monitoring - http://sematext.com/spm
On Oct 19, 2012 5:54 AM, "Leena Jawale" 
wrote:

>
> Hi,
>
> I made a Solr XML data source in LucidWorks Enterprise v2.1. When I search
> in the Solr Admin for text, I am unable to get any results.
> Could you help me in this?
>
>
>
> Thanks & Regards,
> Leena Jawale
> Software Engineer Trainee
> BFS BU
> Phone No. - 9762658130
> Email - leena.jaw...@lntinfotech.com
>
>
> 
> The contents of this e-mail and any attachment(s) may contain confidential
> or privileged information for the intended recipient(s). Unintended
> recipients are prohibited from taking action on the basis of information in
> this e-mail and using or disseminating the information, and must notify the
> sender and delete it from their system. L&T Infotech will not accept
> responsibility or liability for the accuracy or completeness of, or the
> presence of any virus or disabling code in this e-mail"
>


Re: Query related to Solr XML

2012-10-19 Thread Erik Hatcher
Leena -

It's best to ask LucidWorks related questions at http://support.lucidworks.com 
rather than in this e-mail list.

As for your issue more information is needed in order to assist.  Did you 
start the Solr XML crawler?   Does your data source show that there are 
documents in the index?   If you simply press search (with an empty query) do 
you see documents?   (best, again, to respond to these questions at the 
LucidWorks support site)

Erik


On Oct 19, 2012, at 05:54 , Leena Jawale wrote:

> 
> Hi,
> 
> I made a Solr XML data source in LucidWorks Enterprise v2.1. When I search in 
> the Solr Admin for text, I am unable to get any results.
> Could you help me in this?
> 
> 
> 
> Thanks & Regards,
> Leena Jawale
> Software Engineer Trainee
> BFS BU
> Phone No. - 9762658130
> Email - leena.jaw...@lntinfotech.com
> 
> 
> 
> The contents of this e-mail and any attachment(s) may contain confidential or 
> privileged information for the intended recipient(s). Unintended recipients 
> are prohibited from taking action on the basis of information in this e-mail 
> and using or disseminating the information, and must notify the sender and 
> delete it from their system. L&T Infotech will not accept responsibility or 
> liability for the accuracy or completeness of, or the presence of any virus 
> or disabling code in this e-mail"



Re: Apache Solr Quiz

2012-10-19 Thread Dmitry Kan
Thanks for the quiz. It is refreshing. Do you plan on covering other parts
of SOLR management, like various handlers, scoring, plugins, sharding etc?

Dmitry

On Wed, Oct 17, 2012 at 7:12 PM, Yulia Crowder wrote:

> I love Solr!
> I have searched for a quiz about Solr and didn't find any on the net.
> I am pleased to say that I have conducted a Quiz about Solr:
>
> http://www.quizmeup.com/quiz/apache-solr-configuration
>
> It is built on a free wiki-based quiz site. You can, and are welcome to,
> improve my questions and add new questions.
> Hope you find it useful and enjoyable way to learn about Solr.
> Comments?
>


SimpleTextCodec usage tips?

2012-10-19 Thread seralf
Hi

could anybody give some direction / suggestion on how to correctly
configure and use the SimpleTextCodec?
http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/codecs/simpletext/SimpleTextCodec.html

i'd like to do some tests for debugging purposes, but i'm not sure how to
enable the pluggable codecs interface.

as far as i understand, i have to use the codec factory in the schema.xml,
but i didn't understand where to configure and choose the specific codec.

thank you in advance (sorry if this question was posted earlier, i didn't
find any post on that),

Alfredo Serafini
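For anyone trying the same thing, a sketch of what seems to be the Solr 4.0 way - a per-field postings format via the schema-aware codec factory (the field type name is illustrative):

  <!-- solrconfig.xml: let the schema choose per-field postings formats -->
  <codecFactory class="solr.SchemaCodecFactory"/>

  <!-- schema.xml: pick SimpleText for one field type -->
  <fieldType name="string_simpletext" class="solr.StrField"
             postingsFormat="SimpleText"/>

Keep in mind that SimpleText is meant for debugging only and carries no backwards-compatibility guarantees.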


Re: Solr 4.0 segment flush times has bigger difference between tow machines

2012-10-19 Thread Jun Wang
I have found that segment flushing is controlled by
DocumentsWriterFlushControl, and indexing is implemented by
DocumentsWriterPerThread. DocumentsWriterFlushControl has information about
the number of docs and the size of the RAM buffer, but this seems to be shared
by all DocumentsWriterPerThread instances. Is the RAM limit the sum of all
DocumentsWriterPerThread buffers?

2012/10/19 Jun Wang 

> Hi
>
> I have 2 machines for a collection, and it's using DIH to import data. DIH
> is triggered via a URL request on one machine, let's call it A, and A will
> forward some of the index to machine B. Recently I have found that segment
> flushes happen more often on machine B. Here is part of INFOSTREAM.txt.
>
> Machine A:
> 
> DWPT 0 [Thu Oct 18 20:06:20 PDT 2012; Thread-39]: flush postings as
> segment _4r3 numDocs=71616
> DWPT 0 [Thu Oct 18 20:06:21 PDT 2012; Thread-39]: new segment has 0
> deleted docs
> DWPT 0 [Thu Oct 18 20:06:21 PDT 2012; Thread-39]: new segment has no
> vectors; no norms; no docValues; prox; freqs
> DWPT 0 [Thu Oct 18 20:06:21 PDT 2012; Thread-39]:
> flushedFiles=[_4r3_Lucene40_0.prx, _4r3.fdt, _4r3.fdx, _4r3.fnm,
> _4r3_Lucene40_0.tip, _4r3_Lucene40_0.tim, _4r3_Lucene40_0.frq]
> DWPT 0 [Thu Oct 18 20:06:21 PDT 2012; Thread-39]: flushed codec=Lucene40
> D
>
> Machine B
> --
> DWPT 0 [Thu Oct 18 21:41:22 PDT 2012; http-0.0.0.0-8080-3]: flush postings
> as segment _zi0 numDocs=4302
> DWPT 0 [Thu Oct 18 21:41:22 PDT 2012; http-0.0.0.0-8080-3]: new segment
> has 0 deleted docs
> DWPT 0 [Thu Oct 18 21:41:22 PDT 2012; http-0.0.0.0-8080-3]: new segment
> has no vectors; no norms; no docValues; prox; freqs
> DWPT 0 [Thu Oct 18 21:41:22 PDT 2012; http-0.0.0.0-8080-3]:
> flushedFiles=[_zi0_Lucene40_0.prx, _zi0.fdx, _zi0_Lucene40_0.tim, _zi0.fdt,
> _zi0.fnm, _zi0_Lucene40_0.frq, _zi0_Lucene40_0.tip]
> DWPT 0 [Thu Oct 18 21:41:22 PDT 2012; http-0.0.0.0-8080-3]: flushed
> codec=Lucene40
> D
>
> I have found that a flush occurred when the number of docs in RAM reached
> 7000~9000 on machine A, but the number on machine B is very different,
> almost always 4000.  It seems that every doc in the buffer used more RAM on
> machine B than on machine A, resulting in more flushes. Does anyone know why
> this happened?
>
> My conf is here.
>
> 6410
>
>
>
>
> --
> from Jun Wang
>
>
>


-- 
from Jun Wang


Re: Solr 4.0.0 - index version and generation not changed after delete by query on master

2012-10-19 Thread Erick Erickson
I wonder if you're getting hit by the browser caching the admin page and
serving up the old version? What happens if you try from a different
browser or purge the browser cache?

Of course you have to refresh the master admin page, there's no
automatic update but I assume you did that.

Best
Erick

On Thu, Oct 18, 2012 at 1:59 PM, Bill Au  wrote:
> Just discovered that the replication admin REST API reports the correct
> index version and generation:
>
> http://master_host:port/solr/replication?command=indexversion
>
> So is this a bug in the admin UI?
>
> Bill
>
> On Thu, Oct 18, 2012 at 11:34 AM, Bill Au  wrote:
>
>> I just upgraded to Solr 4.0.0.  I noticed that after a delete by query,
>> the index version, generation, and size remain unchanged on the master even
>> though the documents have been deleted (num docs changed and those deleted
>> documents no longer show up in query responses).  But on the slave both the
>> index version, generation, and size are updated.  So I thought the master
>> and slave were out of sync but in reality that is not true.
>>
>> What's going on here?
>>
>> Bill
>>


Re: Antw: Re: How to retrieve field contents as UTF-8 from Solr-Index with SolrJ

2012-10-19 Thread Andreas Kahl
Fetching the same records using a raw HTTP request works fine and the
characters are OK. I am actually considering fetching the data in Java
via raw HTTP requests + XSLTResponseWriter as a workaround, but I want to
try it first using the 'native' way with SolrJ. 

Andreas
 
>>> "Jack Krupansky"  18.10.2012 21:36 >>> 
Have you verified that the data was indexed properly (UTF-8 encoding)?
Try a 
raw HTTP request using the browser or curl and see how that field looks
in 
the resulting XML.

-- Jack Krupansky

-Original Message- 
From: Andreas Kahl
Sent: Thursday, October 18, 2012 1:10 PM
To: j...@basetechnology.com ; solr-user@lucene.apache.org
Subject: Antw: Re: How to retrieve field contents as UTF-8 from
Solr-Index 
with SolrJ

Jack,

Thanks for the hint, but we have already set URIEncoding="UTF-8" on
all
our tomcats, too.

Regards
Andreas

>>> "Jack Krupansky"  18.10.12 17.11 Uhr >>>
It may be that your container does not have UTF-8 enabled. For
example,
with
Tomcat you need something like:



Make sure your "Connector" element has URIEncoding="UTF-8" (for
Tomcat.)

-- Jack Krupansky

-Original Message- 
From: Andreas Kahl
Sent: Thursday, October 18, 2012 10:53 AM
To: solr-user@lucene.apache.org
Subject: How to retrieve field contents as UTF-8 from Solr-Index with
SolrJ

Hello everyone,

we are trying to implement a simple Servlet querying a Solr 3.5-Index
with SolrJ. The Query we send is an identifier in order to retrieve a
single record. From the result we extract one field to return. This
field contains an XML-Document with characters from several european
and
asian alphabets, so we need UTF-8.

Now we have the problem that the string returned by
marcXml = results.get(0).getFirstValue("marcxml").toString();
is not valid UTF-8, so the resulting XML-Document is not well formed.

Here is what we do in Java:
<<
ModifiableSolrParams params = new ModifiableSolrParams();
params.set("q", query.toString());
params.set("fl", "marcxml");
params.set("rows", "1");
try {
QueryResponse result = server.query(params,
SolrRequest.METHOD.POST);
SolrDocumentList results = result.getResults();
if (!results.isEmpty()) {
marcXml =
results.get(0).getFirstValue("marcxml").toString();
}
} catch (Exception ex) {
Logger.getLogger(MarcServer.class.getName()).log(Level.SEVERE,
null, ex);
}
>>

Charset.defaultCharset() is "UTF-8" on both, the querying machine and
the Solr-Server. Also we tried BinaryResponseParser as well as
XMLResponseParser when instantiating CommonsHttpSolrServer.

Does anyone have a solution to this? Is this related to
https://issues.apache.org/jira/browse/SOLR-2034 ? Is there
eventually a workaround?

Regards
Andreas





Saravanan Chinnadurai/Actionimages is out of the office.

2012-10-19 Thread Saravanan . Chinnadurai
I will be out of the office starting  18/10/2012 and will not return until
23/10/2012.

Please email to itsta...@actionimages.com  for any urgent issues.


Action Images is a division of Reuters Limited and your data will therefore be 
protected
in accordance with the Reuters Group Privacy / Data Protection notice which is 
available
in the privacy footer at www.reuters.com
Registered in England No. 145516   VAT REG: 397000555


Query related to Solr XML

2012-10-19 Thread Leena Jawale

Hi,

I made a Solr XML data source in LucidWorks Enterprise v2.1. When I search in 
the Solr Admin for text, I am unable to get any results.
Could you help me in this?



Thanks & Regards,
Leena Jawale
Software Engineer Trainee
BFS BU
Phone No. - 9762658130
Email - leena.jaw...@lntinfotech.com



The contents of this e-mail and any attachment(s) may contain confidential or 
privileged information for the intended recipient(s). Unintended recipients are 
prohibited from taking action on the basis of information in this e-mail and 
using or disseminating the information, and must notify the sender and delete 
it from their system. L&T Infotech will not accept responsibility or liability 
for the accuracy or completeness of, or the presence of any virus or disabling 
code in this e-mail"


Re: "diversity" of search results?

2012-10-19 Thread dirk
Hi Paul,

yes, that's a typical problem in configuring a search engine. A solution
depends on your data. Sometimes you can overcome this problem by fine-tuning
your search engine at the boosting level. That's not easy and always based on
trial-and-error tests.

Another thing you can do is to try some data pre-processing which
compensates for the causes of similar content in certain fields, e.g. in a
title field.
For example, if you have products with very similar titles and you boost such
a field, the result is that you will always find all of those documents in the
result list. But if you go on and add some information (perhaps out of
other search fields) to this title field, you perhaps can reduce the
similarity. (A typical example in my line of business: book titles in
different volumes, where I add the volume number and the year to the title
field.)

Perhaps it is also necessary to cope with this via pre-processed deduplication.
Here you can find an entry point:
http://wiki.apache.org/solr/Deduplication

Dirk
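A related technique, if a per-document similarity signature is stored (for example via the SignatureUpdateProcessorFactory described on the Deduplication page above), is to collapse near-duplicates at query time with result grouping, roughly:

  &group=true&group.field=signature&group.main=true

(group.main=true flattens the groups back into an ordinary result list, with one representative per signature.)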

   



-
my developer logs 
--
View this message in context: 
http://lucene.472066.n3.nabble.com/diversity-of-search-results-tp4014692p4014696.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Building an enterprise quality search engine using Apache Solr

2012-10-19 Thread Ahmet Arslan
Hi Alexandre,

Yes, it is active. ManifoldCF 1.0.1 was released yesterday :)
You can index SharePoint 2010 content into Solr 4.0.0.

'End user documentation' and 'in action book' are two main resources.

http://manifoldcf.apache.org/release/release-1.0.1/en_US/end-user-documentation.html

http://www.manning.com/wright/


--- On Fri, 10/19/12, Alexandre Rafalovitch  wrote:

> From: Alexandre Rafalovitch 
> Subject: Re: Building an enterprise quality search engine using Apache Solr
> To: solr-user@lucene.apache.org
> Date: Friday, October 19, 2012, 7:18 AM
> This is the first time I hear of this
> project. Looks interesting, but
> Is it active?
> 
> The integration FAQ seem to be talking about Solr 1.4, a bit
> out of date.
> 
> Regards,
>    Alex.
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from
> happening all
> at once. Lately, it doesn't seem to be working. 
> (Anonymous  - via GTD
> book)
> 
> 
> On Fri, Oct 19, 2012 at 12:37 AM, Jack Krupansky
> 
> wrote:
> > Take a look at Apache ManifoldCF for crawling
> enterprise repositories such
> > as SharePoint (as well as lighterweight web crawling
> and file system
> > crawling).
> >
> > http://manifoldcf.apache.org/en_US/index.html
> >
> > -- Jack Krupansky
> >
> > -Original Message- From: Venky Naganathan
> > Sent: Thursday, October 18, 2012 2:21 PM
> > To: solr-user@lucene.apache.org
> > Subject: Building an enterprise quality search engine
> using Apache Solr
> >
> >
> > Hello,
> >
> > Can some one please provide me advise on the below ?
> >
> > 1) I am considering building an enterprise search
> engine that indexes
>


"diversity" of search results?

2012-10-19 Thread Paul Libbrecht
Hello SOLR expert,

yesterday in our group we realized that a danger we may need to face is that a
search result includes very similar results.
Of course, one would expect some skimming so that near-duplicates showing almost
the same content in a search result are avoided, but we fear that this is not
possible.

I was wondering whether any technology, plugin, or even research exists that
would enable a search result to be partially reordered so that "diversity" is
ensured, at least for the first page of results.

I suppose that might be doable by processing the result page and the next (and
the five next?) and pushing down some results if they are "too" similar to
previous ones.

Hope I am being clear.

Paul

Re: KeeperException (NodeExists for /overseer): SolrCloud Multiple Collections - is it safe ignore these exceptions?

2012-10-19 Thread Jeevanandam Madanagopal
Thanks Mark! 

Cheers, Jeeva

On Oct 19, 2012, at 8:35 AM, Mark Miller  wrote:

> Yes, those exceptions are fine. These are cases where we try to delete the 
> node if it's there, but don't care if it's not there - things like that. In 
> some of these cases, ZooKeeper logs things we can't stop, even though it's 
> expected that sometimes we will try and remove nodes that are not there or 
> create nodes that are already there.
> 
> - Mark
> 
> On Thu, Oct 18, 2012 at 9:01 AM, Jeevanandam Madanagopal  
> wrote:
> Hello -
> 
> While doing a prototype of SolrCloud with multiple collections.  Each collection 
> represents country-level data.
> - searching within collection represents country level - local search
> - searching across collection represents global search
> 
> Attached is a graph image of the SolrCloud structure.  For the prototype I'm running 
> an embedded ZooKeeper ensemble (5 replicated ZooKeeper servers).
> - Searching and Indexing in respective collection works well
> - Search across collection works well (for global search)
> 
> 
> 
> 
> While joining the 'Collection2' to zookeeper ensemble I noticed the following 
> KeeperException in the logger.
> 
> Question 'is it safe to ignore these exceptions?'
> 
> Exception Log snippet:
> Oct 18, 2012 4:54:26 PM org.apache.zookeeper.server.NIOServerCnxn$Factory run
> INFO: Accepted socket connection from /fe80:0:0:0:0:0:0:1%1:62700
> Oct 18, 2012 4:54:26 PM org.apache.zookeeper.server.NIOServerCnxn 
> readConnectRequest
> INFO: Client attempting to establish new session at 
> /fe80:0:0:0:0:0:0:1%1:62700
> Oct 18, 2012 4:54:26 PM org.apache.zookeeper.server.NIOServerCnxn 
> finishSessionInit
> INFO: Established session 0x13a73521356000a with negotiated timeout 15000 for 
> client /fe80:0:0:0:0:0:0:1%1:62700
> Oct 18, 2012 4:54:26 PM org.apache.zookeeper.server.PrepRequestProcessor 
> pRequest
> INFO: Got user-level KeeperException when processing 
> sessionid:0x13a73521356000a type:create cxid:0x1 zxid:0xfffe 
> txntype:unknown reqpath:n/a Error Path:/overseer Error:KeeperErrorCode = 
> NodeExists for /overseer
> Oct 18, 2012 4:54:26 PM org.apache.zookeeper.server.PrepRequestProcessor 
> pRequest
> INFO: Got user-level KeeperException when processing 
> sessionid:0x13a73521356000a type:create cxid:0x2 zxid:0xfffe 
> txntype:unknown reqpath:n/a Error Path:/overseer Error:KeeperErrorCode = 
> NodeExists for /overseer
> Oct 18, 2012 4:54:26 PM org.apache.zookeeper.server.PrepRequestProcessor 
> pRequest
> INFO: Got user-level KeeperException when processing 
> sessionid:0x13a73521356000a type:delete cxid:0x4 zxid:0xfffe 
> txntype:unknown reqpath:n/a Error 
> Path:/live_nodes/mac-book-pro.local:7500_solr Error:KeeperErrorCode = NoNode 
> for /live_nodes/mac-book-pro.local:7500_solr
> Oct 18, 2012 4:54:26 PM org.apache.solr.common.cloud.ZkStateReader$3 process
> INFO: Updating live nodes
> 
> Cheers, Jeeva
> 
> 
> 
> 
> -- 
> - Mark



Re: Building an enterprise quality search engine using Apache Solr

2012-10-19 Thread dirk
Hi,
your question is not easy to answer. It depends on so many things that
there is no standard way to realize an enterprise solution, and time planning
depends on just as many.

I can try to give you some brief notes about our solution, but there are
some differences in target group and data sources. I am technically responsible
for the system disco (a research and discovery system) at the library of the
University of Münster. (Excuse me, I don't want to make a promotion tour
here, I earn no money with such activities :)). Ok, in this search engine,
based on Lucene, we search about 200 million articles, books, journals and so
on. So we have data sources that differ in structure and also in the way they
are delivered. At the beginning we thought, let's buy a solution in order to
avoid more or less all own development work. So we bought a commercial search
engine, which works on a Lucene core with proprietary business logic in order
to talk to that core. So far so good - or not so good. At that time I was the
only worker on this project and I needed nearly one and a half years
full-time in order to fulfill most features and requirements. And the reason
for that long time is not that I had no experience (I hope so). I have worked
in this area for nearly 15 years in different companies, always as a developer
in J2EE. (That's rare today, because today every experienced developer wants
to work as a "leader" or manager - that sounds better - and fewer project
leaders are outsourced. Ok, other topic.) And other universities (customers)
who realized a comparable search engine in that environment took as long or
longer. So I remain hopeful...

In Germany we say "der Teufel steckt im Detail" (literally: the devil is
hidden in the details), which means you start work and, parallel to that
process, the requirements mostly change - sadly in most cases after development
has laid the software foundation. For example, we needed a lot of time for the
fine-tuning of ranking and for realizing a completely automatic mechanism to
update data sources. And it is one thing to realize the search in development
and run a first developer test; it is a completely different thing to make the
system fit for 24/7 service and run a production system without problems.

Most of our time went into data pre-processing because of the "shit in - shit
out" problem. Work on the quality of the data is expensive but you get no
appreciation for it, because everybody is occupied with search features. This
requirement showed us that it is mostly impossible to avoid your own
development completely.
The next thing is the user interface: not every feature a customer knows from
good old database-backed systems is easy to realize in a search engine,
because of the more or less flat data structure. So we had to develop one
service after the other in order to read additional information - in our
case, for example, runtime holdings information of our library.

Summarized: if you want to estimate a concrete duration for realizing a
complete productive enterprise search solution, you should talk to some people
with similar solutions, think through your own requirements in detail and then
multiply your estimate by 2. Then perhaps you have a realistic estimate.
Dirk



-
my developer logs 
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Building-an-enterprise-quality-search-engine-using-Apache-Solr-tp4014557p4014688.html
Sent from the Solr - User mailing list archive at Nabble.com.