Search Regression Testing

2011-04-06 Thread Mark Mandel
Hey guys,

I'm wondering how people are managing regression testing, in particular with
things like text based search.

I.e. if you change how fields are indexed or change boosts in dismax,
ensuring that doesn't mean that critical queries are showing bad data.

The obvious answer to me was using unit tests. These may be brittle as some
index data can change over time, but I couldn't think of a better way.

How is everyone else solving this problem?

Cheers,

Mark

-- 
E: mark.man...@gmail.com
T: http://www.twitter.com/neurotic
W: www.compoundtheory.com

cf.Objective(ANZ) - Nov 17, 18 - Melbourne Australia
http://www.cfobjective.com.au

Hands-on ColdFusion ORM Training
www.ColdFusionOrmTraining.com


Re: how to start GarbageCollector

2011-04-06 Thread stockii
Why does Solr copy my complete index somewhere when I start a delta-import?

I copy one core, start a full-import of 35 million docs, and then start a
delta-import covering the last hour (~2000 docs).
DIH/Solr then starts to copy the whole index... why? I think it is copying the
index, because my HDD usage starts to increase immediately ...


My live core used to finish a delta in 5-10 seconds!?


I ran jconsole during this time -- what should it be telling me?

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 
1 Core with 31 Million Documents, other Cores < 100.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-start-GarbageCollector-tp2748080p2783923.html
Sent from the Solr - User mailing list archive at Nabble.com.


very slow commit. copy of index ?

2011-04-06 Thread stockii
Hello again ;-)

After a full-import of 36M docs my delta-import no longer works well.

If I start my delta (which runs very fast on another core), the commit takes
very long. I think Solr copies the whole index, commits the new documents into
the index, and then reduces the index size again after these operations!?

I start the delta over DIH with: command=delta-import&optimize=false&commit=true

jconsole is running, but I don't know in which way jconsole can help
me ...


thx ! =) 



-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 
1 Core with 31 Million Documents, other Cores < 100.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/very-slow-commit-copy-of-index-tp2783940p2783940.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: command is still running ? delta-import?

2011-04-06 Thread stockii
I have the same problem. Any resolutions? 

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 7 Cores, 
1 Core with 31 Million Documents, other Cores < 100.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/command-is-still-running-delta-import-tp48p2783986.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Search Regression Testing

2011-04-06 Thread Colin Vipurs
Hi Mark,

What we're doing is using a bunch of acceptance tests with JBehave to
drive our testing.  We run this in a clean room environment, clearing
out the indexes before a test run and inserting the data we're
interested in.  As well as tests to ensure things just work we have a
bunch of tests that insert data and check it comes out in the order
we're expecting to - so unexpected changes to boosts etc. can be caught
early.

While this doesn't tell us what a certain query will return against our
live data set, it does affirm our assertions about the abstract
case.  You could use a similar technique to insert a bunch of data and
then check your critical queries.
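
A minimal sketch of that kind of clean-room ordering test, using SolrJ and JUnit
(the URL, field names and documents below are invented for illustration, and the
assertion pins only the expected ranking, not scores):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;
import org.junit.Assert;
import org.junit.Test;

public class SearchRegressionTest {

  @Test
  public void boostsKeepExpectedOrdering() throws Exception {
    // Clean-room index: wipe everything, then insert only documents we control.
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    solr.deleteByQuery("*:*");
    solr.add(doc("1", "red widget"));
    solr.add(doc("2", "widget widget red"));
    solr.commit();

    // The critical query: assert the expected ranking, not exact scores.
    QueryResponse rsp = solr.query(new SolrQuery("widget"));
    Assert.assertEquals("2", rsp.getResults().get(0).getFieldValue("id"));
  }

  private SolrInputDocument doc(String id, String title) {
    SolrInputDocument d = new SolrInputDocument();
    d.addField("id", id);
    d.addField("title", title);
    return d;
  }
}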


 Hey guys,
 
 I'm wondering how people are managing regression testing, in particular with
 things like text based search.
 
 I.e. if you change how fields are indexed or change boosts in dismax,
 ensuring that doesn't mean that critical queries are showing bad data.
 
 The obvious answer to me was using unit tests. These may be brittle as some
 index data can change over time, but I couldn't think of a better way.
 
 How is everyone else solving this problem?
 
 Cheers,
 
 Mark
 
 -- 
 E: mark.man...@gmail.com
 T: http://www.twitter.com/neurotic
 W: www.compoundtheory.com
 
 cf.Objective(ANZ) - Nov 17, 18 - Melbourne Australia
 http://www.cfobjective.com.au
 
 Hands-on ColdFusion ORM Training
 www.ColdFusionOrmTraining.com
 
 


-- 


Colin Vipurs
Server Team Lead

Shazam Entertainment Ltd   
26-28 Hammersmith Grove, London W6 7HA
m:   +44 (0)  000 000   t: +44 (0) 20 8742 6820
w:www.shazam.com


solr faceted search performance reason

2011-04-06 Thread Robin Palotai
Hello List,

Please see my question at
http://stackoverflow.com/questions/5552919/how-does-lucene-solr-achieve-high-performance-in-multi-field-faceted-search,
I would be interested to know some details.

Thank you,
Robin


Re: Search Regression Testing

2011-04-06 Thread Paul Libbrecht
Mark,

In one project, with Lucene not Solr, I also use a smallish unit test sample 
and apply some queries there. 
It is very limited but is automatable.

I find a better way is to have precision and recall measures from real users, run 
release after release. 
Sadly, I have never yet been able to fully apply this on a recurring basis.

My ideal world would be that the search sample is small enough and that users 
are able to restrict search to this.
Then users have the possibility of checking correctness of each result (say, 
first 10) for each query out of which one can then read results. Often, users 
provide comments along, e.g. missing matches. This is packed as a wiki page.
First samples generally do not use enough of the features, this is adjusted as 
a dialogue.

As a developer I review the test suite run and plan for next adjustments.
The numeric approach allows easy mean precision and mean recall which is good 
for reporting.
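
To make the numbers concrete, here is a rough sketch (a toy, not from any
particular project) of turning per-query relevance judgments into the mean
precision/recall figures mentioned above; "relevant ids" means the result ids
users judged correct for a query:

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PrecisionRecallReport {

  // Precision and recall computed within the first k returned results for one query.
  static double[] prAtK(Set<String> relevantIds, List<String> returnedIds, int k) {
    int cutoff = Math.min(k, returnedIds.size());
    int hits = 0;
    for (String id : returnedIds.subList(0, cutoff)) {
      if (relevantIds.contains(id)) hits++;
    }
    double precision = cutoff == 0 ? 0.0 : (double) hits / cutoff;
    double recall = relevantIds.isEmpty() ? 0.0 : (double) hits / relevantIds.size();
    return new double[] { precision, recall };
  }

  // Mean precision / mean recall over all judged queries, for release-to-release reporting.
  static double[] means(Map<String, Set<String>> judgments,
                        Map<String, List<String>> results, int k) {
    double p = 0, r = 0;
    for (Map.Entry<String, Set<String>> e : judgments.entrySet()) {
      List<String> returned = results.containsKey(e.getKey())
          ? results.get(e.getKey()) : Collections.<String>emptyList();
      double[] pr = prAtK(e.getValue(), returned, k);
      p += pr[0];
      r += pr[1];
    }
    return new double[] { p / judgments.size(), r / judgments.size() };
  }
}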

My best reference for PR testing and other forms of testing is Kavi Mahesh's Text 
Retrieval Quality, a primer: 
http://www.oracle.com/technetwork/database/enterprise-edition/imt-quality-092464.html

I would love to hear more of what the users have been doing.

paul


Le 6 avr. 2011 à 08:10, Mark Mandel a écrit :

 Hey guys,
 
 I'm wondering how people are managing regression testing, in particular with
 things like text based search.
 
 I.e. if you change how fields are indexed or change boosts in dismax,
 ensuring that doesn't mean that critical queries are showing bad data.
 
 The obvious answer to me was using unit tests. These may be brittle as some
 index data can change over time, but I couldn't think of a better way.
 
 How is everyone else solving this problem?
 
 Cheers,
 
 Mark
 
 -- 
 E: mark.man...@gmail.com
 T: http://www.twitter.com/neurotic
 W: www.compoundtheory.com
 
 cf.Objective(ANZ) - Nov 17, 18 - Melbourne Australia
 http://www.cfobjective.com.au
 
 Hands-on ColdFusion ORM Training
 www.ColdFusionOrmTraining.com



Re: help with Jetty log message

2011-04-06 Thread Matthieu Huin

As far as I am aware, licensing issues make that impossible for us ...

On 04/05/2011 07:29 PM, Kaufman Ng wrote:

Looks like you are using openjdk.  Can you try using Sun jdk?

On Mon, Apr 4, 2011 at 6:53 AM, Upayavirau...@odoko.co.uk  wrote:


This is not Solr crashing, per se, it is your JVM. I personally haven't
generally had much success debugging these kinds of failure - see
whether it happens again, and if it does, try updating your
JVM/switching to another/etc.

Anyone have better advice?

Upayavira

On Mon, 04 Apr 2011 11:59 +0200, Matthieu Huin
matthieu.h...@wallix.com  wrote:

Greetings all,

I am currently using solr as the backend behind a log aggregation and
search system my team is developing. All was well and good until I
noticed a test server crashing quite unexpectedly. We'd like to dig more
into the incident but none of us has much experience with Jetty crash
logs - not to mention that our Java is very rusty.

The crash log is joined as an attachment.

Could anyone help us with understanding what went wrong there ?

Also, would it be possible and/or wise to automatically restart the
server in case of such a crash ?


Thanks for your help. If you need any extra info about that case, do not
hesitate to ask !


Matthieu Huin



Email had 1 attachment:
+ hs_err_pid5033.log
   26k (text/x-log)

---
Enterprise Search Consultant at Sourcesense UK,
Making Sense of Open Source




How to avoid Lock file generation - solr 1.4.1

2011-04-06 Thread rajini maski
I am using Solr 1.4.1(windows os) and below are the settings  in my solr
config file:


<writeLockTimeout>1000</writeLockTimeout>
<commitLockTimeout>1</commitLockTimeout>
<ramBufferSizeMB>32</ramBufferSizeMB>
<maxMergeDocs>1</maxMergeDocs>

<lockType>native</lockType>

While writing the index, I am posting the XML with a solr/update HTTP request.

I am getting the following error:

SEVERE: Could not start SOLR. Check solr/home property
java.nio.channels.OverlappingFileLockException
at sun.nio.ch.FileChannelImpl$SharedFileLockTable.checkList(Unknown Source)
at sun.nio.ch.FileChannelImpl$SharedFileLockTable.add(Unknown Source)
at sun.nio.ch.FileChannelImpl.tryLock(Unknown Source)
at java.nio.channels.FileChannel.tryLock(Unknown Source)
at org.apache.lucene.store.NativeFSLock.obtain(NativeFSLockFactory.java:233)
at org.apache.lucene.store.Lock.obtain(Lock.java:73)
at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1545)
at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1402)
at org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:190)
at
org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:98)
at
org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
at
org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:376)
at
org.apache.solr.handler.ReplicationHandler.inform(ReplicationHandler.java:845)
at
org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:486)
at org.apache.solr.core.SolrCore.init(SolrCore.java:588)
at
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
at
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
at
org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
at
org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
at
org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:115)
at
org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4071)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4725)
at
org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:799)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:601)
at
org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:675)
at
org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:601)
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:502)
at org.apache.catalina.startup.HostConfig.check(HostConfig.java:1383)
at
org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:306)
at
org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:142)
at
org.apache.catalina.core.ContainerBase.backgroundProcess(ContainerBase.java:1385)
at
org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1649)
at
org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1658)
at
org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1638)
at java.lang.Thread.run(Unknown Source)

What are the correct settings to be made for avoiding this lock file?


Re: Script to remove all index.* leftovers

2011-04-06 Thread Markus Jelsma
Yes my mistake, you're right about #1.

On Wednesday 06 April 2011 05:25:50 William Bell wrote:
 Thank you for pointing out #2. The commitsToKeep is interesting, but I
 thought each commit would create a segment (before optimized) and be
 self contained in the index.* directory?
 
 I would only run this on the slave.
 
 Bill
 
 
 On Tue, Apr 5, 2011 at 2:54 PM, Markus Jelsma
 
 markus.jel...@openindex.io wrote:
  Hi,
  
  This seems alright as it leaves the current index in place, doesn't mess
  with the spellchecker and leave the properties alone. But, there are two
  problems:
  
  1. it doesn't take into account the commitsToKeep value set in the
  deletion policy, and;
  2. it will remove any directory to which a current downloading
  replication is targetted to.
  
  Issue 1 may not be a big issue as most users leave only one commit on
  disk but 2 is a real problem in master/slave architectures.
  
  Cheers,
  
  There is a bug that leaves old index.* directories in the Solr data
  directory.
  
  Here is a script that will clean it up. I wanted to make sure this is
  okay, without doing a core reload.
  
  Thanks.
  
   #!/bin/bash
   
   DIR=/mnt/servers/solr/data
   LIST=`ls $DIR`
   INDEX=`cat $DIR/index.properties | grep index\= | awk 'BEGIN { FS = "=" } ; { print $2 }'`
   echo $INDEX
   
   for file in $LIST
   do
   if [ "$INDEX" == "$file" -o "$file" == "index" -o "$file" == "index.properties" -o "$file" == "replication.properties" -o "$file" == "spellchecker" ]
   then
   echo "skip: $file"
   else
   echo "rm -rf $DIR/$file"
   rm -rf $DIR/$file
   fi
   done

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: how to reset the index in solr

2011-04-06 Thread Gabriele Kahlout
Hi Marcus,

Your curl cmds don't work in that format on my unix. I convert them as
follows, and they still don't work:

$ curl --fail $solrIndex/update?commit=true -d '<delete><query>*:*</query></delete>'
$ curl --fail $solrIndex/update -d '<commit/>'

From the browser:
http://localhost:8080/solr/update?commit=true%20-d%20%27%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E%27

This is the response I get.

<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">18</int></lst>
</response>



The only thing that works:
$ rm -r SOLR_HOME/solr 
$CATALINA_HOME/bin/catalina.sh stop
$CATALINA_HOME/bin/catalina.sh start

I'm running a single core instance.
I'm using this nutch script [1] and this[2] hints at my solr config.

[1] http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script
[2]
http://wiki.apache.org/solr/Troubleshooting%20HTTP%20Status%20404%20-%20missing%20core%20name%20in%20path?action=recallrev=1

--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-reset-the-index-in-solr-tp496574p2784198.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr architecture diagram

2011-04-06 Thread Jan Høydahl
Hi,

At Cominvent we've often had the need to visualize the internal architecture of 
Apache Solr in order to explain both the relationships of the components as 
well as the flow of data and queries. The result is a conceptual architecture 
diagram, clearly showing how Solr relates to the app-server, how cores relate 
to a Solr instance, how documents enter through an UpdateRequestHandler, 
through an UpdateChain and Analysis and into the Lucene index etc.

The drawing is created using Google draw, and the original is shared on Google 
Docs. We have licensed the diagram under the permissive Creative Commons 
CC-by license which lets you use, modify and re-distribute the diagram, even 
commercially, as long as you attribute us with a link.

Check it out at http://ow.ly/4sOTm
We'd love your comments

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com



Re: how to reset the index in solr

2011-04-06 Thread Gabriele Kahlout
Solved. The correct translation of Marcus' cmd:
$ curl http://localhost:8080/solr/update?commit=true -H 'Content-Type: text/xml' --data-binary '<delete><query>*:*</query></delete>'

http://stackoverflow.com/questions/2358476/solr-delete-not-working-for-some-reason

NB: the response is still not what I'd expect:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">57</int></lst>
</response>
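
For completeness, the same reset can be done from Java with SolrJ instead of curl
(a sketch; the URL just mirrors the example above):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class ResetIndex {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr");
    solr.deleteByQuery("*:*"); // same as posting <delete><query>*:*</query></delete>
    solr.commit();             // same as posting <commit/>
  }
}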


On Wed, Apr 6, 2011 at 11:39 AM, Gabriele Kahlout
gabri...@mysimpatico.comwrote:

 Hi Marcus,

 Your curl cmds don't work in that format on my unix. I conver them as
 follows, and they still don't work:

 $ curl --fail $solrIndex/update?commit=true -d  '*:*'
 $ curl --fail $solrIndex/update -d  ''

 From the browser:

 http://localhost:8080/solr/update?commit=true%20-d%20%27%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E%27

 This is the response I get.

 −

 0
 18



 The only thing that works:
 $rm - r SOLR_HOME/solr
 $CATALINA_HOME/bin/catalina.sh stop
 $CATALINA_HOME/bin/catalina.sh start

 I'm running a single core instance.
 I'm using this nutch script [1] and this[2] hints at my solr config.

 [1]
 http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script
 [2]

 http://wiki.apache.org/solr/Troubleshooting%20HTTP%20Status%20404%20-%20missing%20core%20name%20in%20path?action=recallrev=1

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/how-to-reset-the-index-in-solr-tp496574p2784198.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains [LON] or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
 Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with X.
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).


Re: Solr architecture diagram

2011-04-06 Thread Stevo Slavić
Nice, thank you!

Wish there was something similar or extra to this one depicting where
do SolrJ's CommonsHttpSolrServer and EmbeddedSolrServer fit in.

Regards,
Stevo.

On Wed, Apr 6, 2011 at 11:44 AM, Jan Høydahl jan@cominvent.com wrote:
 Hi,

 At Cominvent we've often had the need to visualize the internal architecture 
 of Apache Solr in order to explain both the relationships of the components 
 as well as the flow of data and queries. The result is a conceptual 
 architecture diagram, clearly showing how Solr relates to the app-server, how 
 cores relate to a Solr instance, how documents enter through an 
 UpdateRequestHandler, through an UpdateChain and Analysis and into the Lucene 
 index etc.

 The drawing is created using Google draw, and the original is shared on 
 Google Docs. We have licensed the diagram under the permissive Creative 
 Commons CC-by license which lets you use, modify and re-distribute the 
 diagram, even commercially, as long as you attribute us with a link.

 Check it out at http://ow.ly/4sOTm
 We'd love your comments

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com




sort by function problem

2011-04-06 Thread ramzesua
I try to use sort by function in a new release of SOLR 3.1, but I have some
problems, for example:
http://localhost:8983/new_search/select?q=mothers
day&indent=true&fl=templateSetId,score,templateSetPopularity&sort=product(templateSetPopularity,query(mothers
day)) desc
templateSetPopularity - my field with popularity rank
query(mothers+day,0.0) - I try to get score value
At result I get error:
HTTP Status 400 - Can't determine Sort Order:
'sum(templateSetPopularity,query(mothers day)) desc', pos=3
Where is my error? 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/sort-by-function-problem-tp2784493p2784493.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: FW: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-06 Thread Ephraim Ofir
Hi all,
I'd love to share the diagram, just not sure how to do that on the list
(it's a word document I tried to send as attachment).

Jens, to answer your questions:
1. Correct, in our setup the source of the data is a DB from which we
pull the data using DIH (search the list for my previous post DIH -
deleting documents, high performance (delta) imports, and passing
parameters if you want info about that).  We were lucky enough to have
the data sharded at the DB level before we started using Solr, so using
the same shards was an easy extension.  Note that we're not (yet...)
using SolrCloud, it was just something I thought you should consider.
2. I got the idea for the aggregator from the Solr book (PACKT).  I
don't remember if that term was used in the book or if I made it up (if
Google doesn't know it, I probably made it up...), but I think it conveys
what this part of the puzzle does.  As you said, this is simply a Solr
instance which doesn't hold its own index, but shares the same schema as
the slaves and masters.  I actually defined the default query handler on
this instance to include the shards parameter (see below), so the client
doesn't have to know anything about the internal workings of the sharded
setup, it just hits the aggregator load balancer with a regular query
and everything is handled behind the scenes.  This simplifies the client
and allows me to change the architecture in the future (i.e. change the
number of shards or their DNS name) without requiring a client change.

Sharded query handler:

  <requestHandler name="sharded" class="solr.SearchHandler"
                  default="${aggregator:false}">
    <!-- default values for query parameters -->
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="shards">${slaveUrls:null}</str>
    </lst>
  </requestHandler>

All of our Solr instances share the same configs (solrconfig.xml,
schema.xml, etc.) and different instances take different roles according
to properties defined in solr.xml which is generated by a script
specifically for each Solr instance (the script has a map of which
instances should be on which host, and has to be run once on each host).
In this case, this is how the generated solr.xml looks:

<solr sharedLib="../lib" persistent="true">
   <!-- "name" is just a name that appears in Solr management, to make it
        easier to know which instance you're on -->
   <property name="name" value="aggregator" />

   <!-- this tells the instance it is an aggregator, so it should use the
        sharded request handler by default; masters and slaves have
        master/slave set accordingly to define replication, a regular default
        search handler for slaves, and DIH on masters -->
   <property name="aggregator" value="true" />

   <!-- this is used by instances which are shards in order to determine which
        DB they should import from (masters) and which master they should
        replicate from (slaves) -->
   <property name="shardID" value="" />

   <!-- used by the sharded request handler -->
   <property name="slaveUrls" value="long,list.of,shard.urls" />

   <!-- used by the load balancer to know if this instance is alive -->
   <property name="HealthCheckDir" value="/data/servers/x_solr/aggregator/core0/conf" />

   <cores adminPath="/admin/cores" defaultCoreName="prod">
      <!-- just one core for this instance; indexers have 2 cores, one prod
           and one for full reindex -->
      <core name="prod" instanceDir="core0"/>
   </cores>
</solr>
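
For reference, the manual equivalent of what the aggregator's default handler does
is simply a query that carries the standard shards parameter; a small SolrJ sketch
(all host names here are invented for illustration, not taken from this setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardedQueryExample {
  public static void main(String[] args) throws Exception {
    // Hitting the aggregator: the shards parameter is already a default on its
    // "sharded" request handler, so the client sends a plain query.
    CommonsHttpSolrServer aggregator =
        new CommonsHttpSolrServer("http://aggregator.example.com:8983/solr");

    SolrQuery q = new SolrQuery("some query");
    // Without an aggregator, the client would have to pass the shard list itself:
    // q.set("shards", "shard1.example.com:8983/solr,shard2.example.com:8983/solr");

    QueryResponse rsp = aggregator.query(q);
    System.out.println(rsp.getResults().getNumFound());
  }
}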


Let me know if I can assist any further.
Ephraim Ofir


-Original Message-
From: Jonathan DeMello [mailto:demello@googlemail.com] 
Sent: Wednesday, April 06, 2011 8:58 AM
To: solr-user@lucene.apache.org
Cc: Isan Fulia; Tirthankar Chatterjee
Subject: Re: FW: Very very large scale Solr Deployment = how to do
(Expert Question)?

I third that request.

Would greatly appreciate taking a look at that diagram!

Regards,

Jonathan

On Wed, Apr 6, 2011 at 9:12 AM, Isan Fulia isan.fu...@germinait.com
wrote:

 Hi Ephraim/Jen,

 Can u share that diagram with all.It may really help all of us.
 Thanks,
 Isan Fulia.

 On 6 April 2011 10:15, Tirthankar Chatterjee
tchatter...@commvault.com
 wrote:

  Hi Jen,
  Can you please forward the diagram attachment too that Ephraim sent.
:-)
  Thanks,
  Tirthankar
 
  -Original Message-
  From: Jens Mueller [mailto:supidupi...@googlemail.com]
  Sent: Tuesday, April 05, 2011 10:30 PM
  To: solr-user@lucene.apache.org
  Subject: Re: FW: Very very large scale Solr Deployment = how to do
 (Expert
  Question)?
 
  Hello Ephraim,
 
  thank you so much for the great Document/Scaling-Concept!!
 
  First I think you really should publish this on the solr wiki. This
  approach is nowhere 

solr-2351 patch

2011-04-06 Thread Isha Garg

Hi,
 Can you tell me which Solr version the patch file
SOLR-2351 (https://issues.apache.org/jira/secure/attachment/12470560/mlt.patch)
is for?


Regards!
Isha


RE: Embedded Solr constructor not returning

2011-04-06 Thread Steven A Rowe
Hi Greg,

 I need the servlet API in my app for it to work, despite being command
 line.
 So adding this to the maven POM fixed everything:
 <dependency>
 <groupId>javax.servlet</groupId>
 <artifactId>servlet-api</artifactId>
 <version>2.5</version>
 </dependency>
 
 Perhaps this dependency could be listed on the wiki? Alongside the sample
 code for using embedded solr?
 http://wiki.apache.org/solr/Solrj

Sounds good.  Please go ahead and make this change yourself.

FYI, the Solr 3.1 POM has a servlet-api dependency, but the scope is 
"provided", because the servlet container includes this dependency.  When *you* 
are the container, you have to provide it.

Steve


Re: Solrj and display which Solr version is used

2011-04-06 Thread Erick Erickson
The only way I know of (and it's a little, well, a lot arcane)
is to ping the admin/system handler. As it happens, I just
had to do something like this.  This uses apache commons
http client 3X, NOT the most recent FWIW...
The URL can be admin/system; see solrconfig.xml.

I'd really like to find out that there's an easier way. This brings
back everything on the admin/info page.

  public static void main(String[] args) {
    HttpMethod method = new GetMethod(
        "http://localhost:8983/solr/admin/system");
    try {
      CommonsHttpSolrServer server = new CommonsHttpSolrServer(
          "http://localhost:");
      HttpClient client = server.getHttpClient();

      int statusCode = client.executeMethod(method);
      // Really, you'd want to do something here.
      byte[] responseBody = method.getResponseBody();
      System.out.println(new String(responseBody));

    } catch (Exception e) {
      e.printStackTrace();
    } finally {
      // Release the connection.
      method.releaseConnection();
    }
  }

On Tue, Apr 5, 2011 at 5:46 AM, Marc SCHNEIDER
marc.schneide...@gmail.comwrote:

 Hi,

 I'm wondering how to find out which version of Solr is currently running
 using the Solrj library?

 Thanks,
 Marc.



Re: what happens to docsPending if stop solr before commit

2011-04-06 Thread Erick Erickson
They're lost, never to be seen again. You'll have to reindex them.

Best
Erick

On Tue, Apr 5, 2011 at 4:25 PM, Robert Petersen rober...@buy.com wrote:

 Hello fellow enthusiastic solr users,



 I tried to find the answer to this simple question online, but failed.
 I was wondering about this, what happens to uncommitted docsPending if I
 stop solr and then restart solr?  Are they lost?  Are they still there
 but still uncommitted?  Do they get committed at startup?  I noticed
 after a restart my 250K pending doc count went to 0 is what got me
 wondering.



 TIA!

 Robi




Re: Synonym-time Reindexing Issues

2011-04-06 Thread Erick Erickson
Hmmm, this should work just fine. Here are my questions.

1 are you absolutely sure that the new synonym file
 is available when reindexing?
2 does the sunspot program do anything wonky with
 the ids? The documents
 will only be replaced if the IDs are identical.
3 are you sure that a commit is done at the end?
4 What happens if you optimize? At that point, maxdocs
 and numdocs should be the same, and should be the count
 of documents. if they differ by a factor of 2, I'd suspect your
 id field isn't being used correctly.

If the hypothesis that your id field isn't working correctly holds, your number
of hits should be going up after re-indexing...

If none of that is relevant, let us know what you find and we'll
try something else

Best
Erick

On Tue, Apr 5, 2011 at 10:46 PM, Preston Marshall pres...@synergyeoc.comwrote:

 Hello all, I am having an issue with Solr and the SynonymFilterFactory.  I
 am using a library to interface with Solr called sunspot.  I realize that
 is not what this list is for, but I believe this may be an issue with Solr,
 not the library (plus the lib author doesn't know the answer). I am using
 the SynonymFilterFactory in my index-time analyzer, and it works great.  My
 only problem is when it comes to changing the synonyms file.  I would expect
 to be able to edit the file, run a reindex (this is through the library),
 and have the new synonyms function when the reindex is complete.
  Unfortunately this is not the case, as changing the synonyms file doesn't
 actually affect the search results.  What DOES work is deleting the existing
 index, and starting from scratch.  This is unacceptable for my usage though,
 because I need the old index to remain online while the new one is being
 built, so there is no downtime.

 Here's my schema in case anyone needs it:
 https://gist.github.com/88f8fb763e99abe4d5b8

 Thanks,
 Preston

 P.S. Sorry if this dupes, first post and I didn't see it show up in the
 archives.



Re: solr faceted search performance reason

2011-04-06 Thread Erick Erickson
Please re-post the question here so others can see
the discussion without going to another list.

Best
Erick

On Wed, Apr 6, 2011 at 4:09 AM, Robin Palotai m.palotai.ro...@gmail.comwrote:

 Hello List,

 Please see my question at

 http://stackoverflow.com/questions/5552919/how-does-lucene-solr-achieve-high-performance-in-multi-field-faceted-search
 ,
 I would be interested to know some details.

 Thank you,
 Robin



Re: sort by function problem

2011-04-06 Thread Yonik Seeley
The problem is query(mothers day)
See http://wiki.apache.org/solr/FunctionQuery#query

You can't directly include query syntax because the function parser
wouldn't know how to get to the end of that syntax.
You could either do
   query($qq)  and then add a qq=mothers day to the request
Or if you really wanted the whole thing inline, you could do
   query({!v='mothers day'})

But the first form is nicer since you don't have to worry about
escaping at all, and I think it's also more readable.
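
A sketch of the first form as a SolrJ request (values copied from the example in
this thread; the core URL is illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class SortByFunctionExample {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/new_search");

    SolrQuery q = new SolrQuery("mothers day");
    // Sort by the product of the popularity field and the score of the $qq query.
    q.set("sort", "product(templateSetPopularity,query($qq)) desc");
    q.set("qq", "mothers day");
    q.setFields("templateSetId", "score", "templateSetPopularity");

    System.out.println(solr.query(q).getResults().getNumFound());
  }
}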

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco


On Wed, Apr 6, 2011 at 6:48 AM, ramzesua michaelnaza...@gmail.com wrote:
 I try to use sort by function in a new release of SOLR 3.1, but I have some
 problems, for example:
 http://localhost:8983/new_search/select?q=mothers
 dayindent=truefl=templateSetId,score,templateSetPopularitysort=product(templateSetPopularity,query(mothers
 day)) desc
 templateSetPopularity - my field with popularity rank
 query(mothers+day,0.0) - I try to get score value
 At result I get error:
 HTTP Status 400 - Can't determine Sort Order:
 'sum(templateSetPopularity,query(mothers day)) desc', pos=3
 Where is my error?

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/sort-by-function-problem-tp2784493p2784493.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Migrating from solr 1.4.1 to 3.1.0

2011-04-06 Thread Isan Fulia
Hi all,

Solr 3.1.0 uses different javabin format from 1.4.1
So if I use Solrj 1.4.1 jar  , then i get javabin error while saving to
3.1.0
and if I use Solrj 3.1.0 jar , then I get javabin error  while reading the
document from solr 1.4.1.

How should I go about reindexing in this situation?

-- 
Thanks & Regards,
Isan Fulia.


Solr: Images, Docs and Binary data

2011-04-06 Thread Ezequiel Calderara
Hello everyone, I need to know if someone has used Solr for indexing and
storing images (up to 16MB) or binary docs.

How does Solr behave with this type of doc? How does it affect performance?

Thanks Everyone

-- 
__
Ezequiel.

Http://www.ironicnet.com


dataimporhandler

2011-04-06 Thread Gastone Penzo
Hello,
I have a problem with DataImportHandler.
I want to index many products directly from the DB with this component.
I want to index the products little by little, and every time I finish a
piece I want to be sure that the indexes are committed before going on with
the next piece.
I see that I can query Solr and it responds with XML which says committed
inside the text, so I tried to go on, but that was not true: I lose some
documents. Do you know why?

thanx


-- 
Gastone Penzo
*www.solr-italia.it*
*The first italian blog about Apache Solr*


Re: Solrj and display which Solr version is used

2011-04-06 Thread Marc SCHNEIDER
Ok thanks, that's an idea :-)
Maybe we should suggest to have a method in CommonsHttpSolrServer that is
returning Solr's version...

Marc.

On Wed, Apr 6, 2011 at 2:58 PM, Erick Erickson erickerick...@gmail.comwrote:

 The only way I know of (and it's a little, well, a lot arcane)
 is to ping the admin/system handler. As it happens, I just
 had to do something like this.  This uses apache commons
 http client 3X, NOT the most recent FWIW...
 The URl can be admin/see solrconfig.xml

 I'd really like to find out that there's an easier way. This brings
 back everything on the admin/info page.

  public static void main(String[] args) {
HttpMethod method = new GetMethod(
 http://localhost:8983/solr/admin/system;);
try {
  CommonsHttpSolrServer server = new CommonsHttpSolrServer(
 http://localhost:;);
  HttpClient client = server.getHttpClient();

  int statusCode = client.executeMethod(method);
  // Really, you'd want to do something here.
  byte[] responseBody = method.getResponseBody();
  System.out.println(new String(responseBody));

} catch (Exception e) {
  e.printStackTrace();
} finally {
  // Release the connection.
  method.releaseConnection();
}
  }

 On Tue, Apr 5, 2011 at 5:46 AM, Marc SCHNEIDER
 marc.schneide...@gmail.comwrote:

  Hi,
 
  I'm wondering how to find out which version of Solr is currently running
  using the Solrj library?
 
  Thanks,
  Marc.
 



Re: solr faceted search performance reason

2011-04-06 Thread Robin Palotai
Carbon copied:

*Context*

This is a question mainly about Lucene (or possibly Solr) internals. The
main topic is *faceted search*, in which search can happen along multiple
independent dimensions (facets) of objects (for example size, speed, price
of a car).

When implemented with a relational database, multi-column indices are not
useful for a large number of facets, since facets can be searched in any
order, so any specific ordered multi-column index has a low chance of being
used, and creating all possible orderings of indices is impractical.

Solr is advertised to cope well with the faceted search task, which if I
think correctly has to be connected with Lucene (supposedly) performing well
on multi-field queries (where fields of a document relate to facets of an
object).

*Question*

The *inverted index* of Lucene can be stored in a relational database, and
naturally taking the intersections of the matching documents can also be
trivially achieved with RDBMS using single-field indices.

Therefore, Lucene supposedly has some advanced technique for multi-field
queries other than just taking the intersection of matching documents based
on the inverted index.

So the question is, what is this technique/trick? More broadly: Why can
Lucene/Solr achieve better faceted search performance theoretically than
RDBMS could (if so)?

*Note: My first guess would be that Lucene would use some space partitioning
method for partitioning a vector space built from the document fields as
dimensions, but as I understand Lucene is not purely vector space based.*
Thanks,
Robin

On Wed, Apr 6, 2011 at 3:15 PM, Erick Erickson erickerick...@gmail.comwrote:

 Please re-post the question here so others can see
 the discussion without going to another list.

 Best
 Erick

 On Wed, Apr 6, 2011 at 4:09 AM, Robin Palotai m.palotai.ro...@gmail.com
 wrote:

  Hello List,
 
  Please see my question at
 
 
 http://stackoverflow.com/questions/5552919/how-does-lucene-solr-achieve-high-performance-in-multi-field-faceted-search
  ,
  I would be interested to know some details.
 
  Thank you,
  Robin
 



Re: dismax boost query not useful?

2011-04-06 Thread Shawn Heisey

On 4/5/2011 1:17 PM, Chris Hostetter wrote:

the boost param of edismax is probably a lot better choice than either
bq/bf -- but it really depends on whether you want an additive boost or a
multiplicative one (of course with the function query syntax add(),
product() and query() can be combined in any way you want)

in terms of the merits of bq vs bf, if we wanted to get rid of one or
the other, i'd argue eliminating bf since it has *very* brittle parsing
rules in place (for historic reasons).  while you can use variable
dereferencing to get the guts of either a bf=query($a) or a bq={!func
v=$a}, promoting the use of bq over bf makes using the param body inline
simpler so people are less likely to run into problems (ie: bq={!func}...
doesn't require any special escaping, but bf=query(...) does)


We aren't yet using dismax in production, but I've had it in my config 
for a while now.  I've changed it to edismax in the 3.1 setup I'm 
putting together now.  It has the following in the bf parameter:


recip(ms(NOW/DAY,pd),3.16e-11,1,1)

Is there a way to do this without bf?  I couldn't make heads or tails of 
what you wrote above.


Thanks,
Shawn



Re: dismax boost query not useful?

2011-04-06 Thread Yonik Seeley
On Wed, Apr 6, 2011 at 12:00 PM, Shawn Heisey s...@elyograg.org wrote:
 We aren't yet using dismax in production, but I've had it in my config for a
 while now.  I've changed it to edismax in the 3.1 setup I'm putting together
 now.  It has the following in the bf parameter:

 recip(ms(NOW/DAY,pd),3.16e-11,1,1)

 Is there a way to do this without bf?  I couldn't make heads or tails of
 what you wrote above.

bf parsing is fragile because it is a space-delimited list of
functions (meaning no functions may have whitespace in them).
bf also adds the function to the query score, but for boosting one
is normally better off multiplying.  With edismax, you can get a
multiplicative boost via
boost=recip(ms(NOW/DAY,pd),3.16e-11,1,1)
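
As a sketch, the same thing expressed through SolrJ (defType and boost are the
essential parameters; the field name pd follows Shawn's example):

import org.apache.solr.client.solrj.SolrQuery;

public class EdismaxBoostExample {
  public static void main(String[] args) {
    SolrQuery q = new SolrQuery("some user query");
    q.set("defType", "edismax");
    // Multiplicative date boost: newer documents (date field pd) score higher.
    q.set("boost", "recip(ms(NOW/DAY,pd),3.16e-11,1,1)");
    System.out.println(q); // prints the request parameters
  }
}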

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco


RE: Using MLT feature

2011-04-06 Thread Frederico Azeiteiro
Yes, I had already checked the code for it and used it to write a C# method that 
returns the same signature.

But I have a strange issue:
For instance, using minTokenLen=2 and the default QUANT_RATE, passing the text 
"frederico" (simple text, no big deal here): 

1. using my c# app returns 8b92e01d67591dfc60adf9576f76a055
2. using SOLR, passing a doc with HeadLine frederico I get 
8d9a5c35812ba75b8383d4538b91080f on my signature field.
3. Created a Java app (i'm not a Java expert..), using the code from SOLR 
SignatureUpdateProcessorFactory class (please check code below) and I get 
8b92e01d67591dfc60adf9576f76a055.

Java app code:
        TextProfileSignature textProfileSignature = new TextProfileSignature();
        NamedList<String> params = new NamedList<String>();
        params.add("", "");
        SolrParams solrParams = SolrParams.toSolrParams(params);
        textProfileSignature.init(solrParams);
        textProfileSignature.add("frederico");

        byte[] signature = textProfileSignature.getSignature();
        // Hex-encode the signature bytes, as SignatureUpdateProcessorFactory does.
        char[] arr = new char[signature.length << 1];
        for (int i = 0; i < signature.length; i++) {
            int b = signature[i];
            int idx = i << 1;
            arr[idx] = StrUtils.HEX_DIGITS[(b >> 4) & 0xf];
            arr[idx + 1] = StrUtils.HEX_DIGITS[b & 0xf];
        }
        String sigString = new String(arr);
        System.out.println(sigString);




Here's my processor configs:

<updateRequestProcessorChain name="dedupe">
 <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
   <bool name="enabled">true</bool>
   <str name="signatureField">sig</str>
   <bool name="overwriteDupes">false</bool>
   <str name="fields">HeadLine</str>
   <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
   <str name="minTokenLen">2</str>
 </processor>
 <processor class="solr.LogUpdateProcessorFactory" />
 <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>


So both my apps (Java and C#)  return the same signature but SOLR returns a 
different one.. 
Can anyone understand what I should be doing wrong?

Thank you once again.

Frederico

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: terça-feira, 5 de Abril de 2011 15:20
To: solr-user@lucene.apache.org
Cc: Frederico Azeiteiro
Subject: Re: Using MLT feature

If you check the code for TextProfileSignature [1] you'll notice the init 
method reading params. You can set those params as you did. Reading the Javadoc 
[2] might help as well. But what's not documented in the Javadoc is how QUANT 
is computed; it rounds.

[1]: 
http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.4/src/java/org/apache/solr/update/processor/TextProfileSignature.java?view=markup
[2]: 
http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html

On Tuesday 05 April 2011 16:10:08 Frederico Azeiteiro wrote:
 Thank you, I'll try to create a c# method to create the same sig of SOLR,
 and then compare both sigs before index the doc. This way I can avoid the
 indexation of existing docs.
 
 If anyone needs to use this parameter (as this info is not on the wiki),
 you can add the option
 
 <str name="minTokenLen">5</str>
 
 On the processor tag.
 
 Best regards,
 Frederico 
 
 
 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io]
 Sent: terça-feira, 5 de Abril de 2011 12:01
 To: solr-user@lucene.apache.org
 Cc: Frederico Azeiteiro
 Subject: Re: Using MLT feature
 
 On Tuesday 05 April 2011 12:19:33 Frederico Azeiteiro wrote:
  Sorry, the reply I made yesterday was directed to Markus and not the
  list...
  
  Here's my thoughts on this. At this point I'm a little confused if SOLR
  is a good option to find near duplicate docs.
  
   Yes there is, try set overwriteDupes to true and documents yielding
  
  the same signature will be overwritten
  
  The problem is that I don't want to overwrite the doc, I need to
  maintain the original version (because the doc has others fields I need
  to maintain).
  
  If you have need both fuzzy and exact matching then add a second
  
  update processor inside the chain and create another signature field.
  
  I just need the fuzzy search but the quick tests I made, return
  different signatures for what I consider duplicate docs.
  Army deploys as clan war kills 11 in Philippine south
  Army deploys as clan war kills 11 in Philippine south.
  
  Same sig for the above 2 strings, that's ok.
  
  But a different sig was created for:
  Army deploys as clan war kills 11 in Philippine south the.
  
  Is there a way to setup the TextProfileSignature parameters to adjust
  the sensibility on SOLR (QUANT_RATE or MIN_TOKEN_LEN)?
  
  Do you think that these parameters can help creating the same sig for
  the 

Re: dataimporhandler

2011-04-06 Thread Erick Erickson
There's not much to go on here, can you provide details
on how you check that you've committed? How are you
configuring DIH? etc.

It might be helpful to review:
 http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Wed, Apr 6, 2011 at 10:11 AM, Gastone Penzo gastone.pe...@gmail.comwrote:

 Hello,
 i have a problem with dataimporthandler.
 i want to index many products directly from db with this component.
 i want to index some products little by little.. and every time i finish a
 piece
 i want to be sure that indexes are committed before go on with the other
 piece.
 i see that i can answer solr and he responds with xml which says to me
 committed inside the text,
 so i i tried to go on but it was not true..i loose some documents..do you
 know whY?

 thanx


 --
 Gastone Penzo
 *www.solr-italia.it*
 *The first italian blog about Apache Solr*



Re: dismax boost query not useful?

2011-04-06 Thread Smiley, David W.

On Apr 5, 2011, at 3:17 PM, Chris Hostetter wrote:

 one of the original use cases for bq was for artificial keyword boosting, 
 in which case it still comes in handy...
 
 bq=meta:promote^100 text:new^10 category:featured^100 (*:* 
 -category:accessories)^10

Yeah I thought of this specific use-case. There are two issues with it though:
1. Each piece is still subject to the IDF component of the score, requiring me 
to make each individual category have a boost factoring that in.  For example, 
if I want meta:promote to be twice as boosted as category:featured, I can't 
simply boost the first to 2 and the second to 1 (the default) -- I have to enable 
debugQuery and carefully skew them appropriately to what I want.  And the IDF 
might change as the data changes.
2. It still adds instead of multiplying, and multiplying is always what I want. (Should I 
not always want it?)

It's hard to actually avoid the IDF irrespective of which parameter you use.  
The only way I know to give a fielded query a constant score is a range query 
which is a total hack, e.g. meta:[promote TO promote] which you could then 
boost.  Ick!

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/






Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Ezequiel Calderara
Another question that may be easier to answer: how can I store binary
data? Any example schema?

2011/4/6 Ezequiel Calderara ezech...@gmail.com

 Hello everyone, i need to know if some has used solr for indexing and
 storing images (upt to 16MB) or binary docs.

 How does solr behaves with this type of docs? How affects performance?

 Thanks Everyone

 --
 __
 Ezequiel.

 Http://www.ironicnet.com




-- 
__
Ezequiel.

Http://www.ironicnet.com


Re: solr faceted search performance reason

2011-04-06 Thread Jonathan Rochkind

On 4/6/2011 10:55 AM, Robin Palotai wrote:

Therefore, Lucene supposedly has some advanced technique for multi-field
queries other than just taking the intersection of matching documents based
on the inverted index.


I don't think so, necessarily.  It's just that Lucene's algorithms for 
doing this are very fast, with some additional optimizations to make them 
even faster. There may be some edge cases where the optimizations take 
some shortcuts on top of this -- i.e., if you ask for only the first ten 
facet values ordered by number of hits, in some cases Solr/Lucene won't 
even calculate the hit counts for facet values it already knows aren't 
going to be in the top 10.  The faceting code in 1.4+ is actually kind 
of tangled, in that several different calculation approaches can be 
taken depending on the nature of the result set and schema.
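
As a toy illustration only (not Solr's actual code, which lives in SimpleFacets
and chooses among several strategies), the field-cache style of counting for a
single-valued field can be pictured as: uninvert the field once into a
per-document ordinal array, then bump one counter per matching document:

import java.util.BitSet;

public class ToyFacetCount {
  // ords[doc] = index of the facet value held by that document (single-valued field).
  static int[] countFacets(int[] ords, int numValues, BitSet matchingDocs) {
    int[] counts = new int[numValues];
    for (int doc = matchingDocs.nextSetBit(0); doc >= 0;
         doc = matchingDocs.nextSetBit(doc + 1)) {
      counts[ords[doc]]++;   // one array lookup and increment per matching doc
    }
    return counts;           // the caller then keeps only the top N counts
  }
}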



But anyway, I think you're right that you could set up an rdbms schema 
to _conceptually_ allow very similar operations to a lucene index. It 
would be unlikely to perform as well, because the devil is in the 
details of the storage formats and algorithms, and lucene has been 
optimized for these particular cases (at the expense of not covering a 
great many cases that an rdbms can cover).


In fact, while I can't find it now on Google, I think someone HAS in the 
past written an extension to lucene to have it store its indexes in an 
rdbms using a schema much like you describe, instead of in the file 
system. I'm not sure why they would want to do this instead of just 
using the rdbms -- either lucene's access algorithms still provide a 
performance benefit even when using an rdbms as the underlying 'file 
system', or lucene provides convenient functions that you wouldn't want 
to have to re-implement yourself solely in terms of an rdbms, or both. 
Ah, here's a brief reference to that approach in the lucene FAQ: 
http://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_store_the_Lucene_index_in_a_relational_database.3F


Jonathan


So the question is, what is this technique/trick? More broadly: Why can
Lucene/Solr achieve better faceted search performance theoretically than
RDBMS could (if so)?

*Note: My first guess would be that Lucene would use some space partitioning
method for partitioning a vector space built from the document fields as
dimensions, but as I understand Lucene is not purely vector space based.*
Thanks,
Robin

On Wed, Apr 6, 2011 at 3:15 PM, Erick Ericksonerickerick...@gmail.comwrote:


Please re-post the question here so others can see
the discussion without going to another list.

Best
Erick

On Wed, Apr 6, 2011 at 4:09 AM, Robin Palotaim.palotai.ro...@gmail.com

wrote:
Hello List,

Please see my question at



http://stackoverflow.com/questions/5552919/how-does-lucene-solr-achieve-high-performance-in-multi-field-faceted-search

,
I would be interested to know some details.

Thank you,
Robin



Re: solr faceted search performance reason

2011-04-06 Thread Jonathan Rochkind
PS: If you want to see how Solr actually computes faceting (the 
faceting code lives in the 'Solr' codebase, not in the lower-level 
Lucene codebase), here's the file to look at. This web snapshot is from 
1.4.1; I don't know if it's been changed more recently, but I don't think 
majorly:


http://www.jarvana.com/jarvana/view/org/apache/solr/solr-core/1.4.1/solr-core-1.4.1-sources.jar!/org/apache/solr/request/SimpleFacets.java?format=ok

It's kind of confusing, precisely because it takes several different 
approaches depending on the nature of the result set and schema, trying 
to pick the most performant approach for the context.  I still haven't 
wrapped my head around it entirely (I am not a Solr/lucene developer, 
just a user).


On 4/6/2011 2:06 PM, Jonathan Rochkind wrote:

On 4/6/2011 10:55 AM, Robin Palotai wrote:

Therefore, Lucene supposedly has some advanced technique for multi-field
queries other than just taking the intersection of matching documents based
on the inverted index.

I don't think so, neccesarily.  It's just that Lucene's algorithms to
doing this is very fast, with some additional  optimizations to make it
even faster. There may be some edge cases where the optimizations take
some shortcuts on top of this -- ie, if you ask for only the first ten
facet values ordered by number of hits, in some cases solr/lucene won't
even calculate the hit counts for facet values it already knows aren't
going to be in the top 10.  The facetting code in 1.4+ is actually kind
of tangled, in that several different calculation approaches can be
taken depending on the nature of the result set and schema.


But anyway, I think you're right that you could set up an rdbms schema
to _conceptually_ allow very similar operations to a lucene index. It
would be unlikely to perform as well, because the devil is in the
details of the storage formats and algorithms, and lucene has been
optimized for these particular cases (at the expense of not covering a
great many cases that an rdbms can cover).

In fact, while I can't find it now on Google, I think someone HAS in the
past written an extension to lucene to have it store it's indexes in an
rdbms using a schema much like you describe, instead of in the file
system. I'm not sure why they would want to do this instead of just
using the rdbms -- either lucene's access algorithms still provide a
performance benefit even when using an rdbms as the underlying 'file
system', or lucene provides convenient functions that you wouldn't want
to have to re-implement yourself solely in terms of an rdbms, or both.
Ah, here's a brief reference to that approach in the lucene FAQ:
http://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_store_the_Lucene_index_in_a_relational_database.3F

Jonathan


So the question is, what is this technique/trick? More broadly: Why can
Lucene/Solr achieve better faceted search performance theoretically than
RDBMS could (if so)?

*Note: My first guess would be that Lucene would use some space partitioning
method for partitioning a vector space built from the document fields as
dimensions, but as I understand Lucene is not purely vector space based.*
Thanks,
Robin

On Wed, Apr 6, 2011 at 3:15 PM, Erick Ericksonerickerick...@gmail.comwrote:


Please re-post the question here so others can see
the discussion without going to another list.

Best
Erick

On Wed, Apr 6, 2011 at 4:09 AM, Robin Palotaim.palotai.ro...@gmail.com

wrote:
Hello List,

Please see my question at



http://stackoverflow.com/questions/5552919/how-does-lucene-solr-achieve-high-performance-in-multi-field-faceted-search

,
I would be interested to know some details.

Thank you,
Robin



Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Ryan McKinley
You can store binary data using a binary field type -- then you need
to send the data base64 encoded.
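
A hedged sketch of that with SolrJ and commons-codec, following the note above
that the data should be sent base64 encoded (the field name "data", its binary
type in schema.xml, the file name and the URL are all assumptions for
illustration):

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import org.apache.commons.codec.binary.Base64;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BinaryFieldExample {
  public static void main(String[] args) throws Exception {
    // Read the raw bytes of the file to be stored.
    FileInputStream in = new FileInputStream("image.jpg");
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    byte[] chunk = new byte[8192];
    for (int n; (n = in.read(chunk)) != -1; ) {
      buf.write(chunk, 0, n);
    }
    in.close();

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "img-1");
    // Assumed field of the binary type; the value is sent as a base64 string.
    doc.addField("data", new String(Base64.encodeBase64(buf.toByteArray()), "US-ASCII"));

    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    solr.add(doc);
    solr.commit();
  }
}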

I would strongly recommend against storing large binary files in solr
-- unless you really don't care about performance -- the file system
is a good option that springs to mind.

ryan




2011/4/6 Ezequiel Calderara ezech...@gmail.com:
 Another question that maybe is easier to answer, how can i store binary
 data? Any example schema?

 2011/4/6 Ezequiel Calderara ezech...@gmail.com

 Hello everyone, i need to know if some has used solr for indexing and
 storing images (upt to 16MB) or binary docs.

 How does solr behaves with this type of docs? How affects performance?

 Thanks Everyone

 --
 __
 Ezequiel.

 Http://www.ironicnet.com




 --
 __
 Ezequiel.

 Http://www.ironicnet.com



Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Jonathan Rochkind
I put binary data in an ordinary Solr stored field, don't need any 
special schema.


I have run into trouble making sure the data is not corrupted on the way 
in during indexing, depending on exactly what form of communication is 
being used to index (SolrJ, SolrJ with EmbeddedSolr, DIH, etc.), as well 
as settings in the container (eg jetty or tomcat) used to house Solr.   
But I think it's possible to get it working no matter what the path, if 
you run into trouble someone may be able to help you.


My binary data is not very large though (generally under 1 meg).

However, in general, _indexing_ large data should be fine, although it 
will create a larger index which can require more RAM, or be slower, 
etc.  But that's geenrally just a function of total size of index, or 
really total number of unique terms, doesn't matter if the docs they 
come from are big or small.


_Storing_ large fields can sometimes be a problem, lucene/Solr are 
really optimized as an index, not a key/value store.  Some people choose 
to _store_ their large objects in some external store (rdbms, nosql 
key/value, whatever), and have the client application look up the 
objects themselves by primary-key/unique-id, after the pk/uid's 
themselves are retrieved from Solr. Use Solr for what it's good at, 
indexing, use something else good at storing for storing large objects.  
But other people sometimes store large objects directly in Solr without 
problems, can depend on the exact nature of your index and use.


On 4/6/2011 2:09 PM, Ezequiel Calderara wrote:

Another question that maybe is easier to answer, how can i store binary
data? Any example schema?

2011/4/6 Ezequiel Calderaraezech...@gmail.com


Hello everyone, i need to know if some has used solr for indexing and
storing images (upt to 16MB) or binary docs.

How does solr behaves with this type of docs? How affects performance?

Thanks Everyone

--
__
Ezequiel.

Http://www.ironicnet.com






Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Jonathan Rochkind

Ha, there's a binary field type?!

I've stored binary data in an ordinary String field type, and it's 
worked.  But there were some headaches to get it to work, might have 
been smoother if I had realized there was actually a binary field type.


But wait I'm talking about Solr 'stored field', not about indexing. I 
didn't try to index my binary data, just store it for later retrieval 
(knowing this can sometimes be a performance problem, doing it anyway 
with relatively small data, got away with it).  Does the field type even 
affect the _stored values_ in a Solr field?


On 4/6/2011 2:25 PM, Ryan McKinley wrote:

You can store binary data using a binary field type -- then you need
to send the data base64 encoded.

I would strongly recommend against storing large binary files in solr
-- unless you really don't care about performance -- the file system
is a good option that springs to mind.

ryan




2011/4/6 Ezequiel Calderaraezech...@gmail.com:

Another question that maybe is easier to answer, how can i store binary
data? Any example schema?

2011/4/6 Ezequiel Calderaraezech...@gmail.com


Hello everyone, i need to know if some has used solr for indexing and
storing images (upt to 16MB) or binary docs.

How does solr behaves with this type of docs? How affects performance?

Thanks Everyone

--
__
Ezequiel.

Http://www.ironicnet.com




--
__
Ezequiel.

Http://www.ironicnet.com



Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Ezequiel Calderara
Hi, your answers were really helpful.

I was thinking of putting the base64-encoded file into a string field, but
was a little worried about Solr trying to stem it or vectorize it or things
like that.

Seen in the example of the schema.xml:
<!--Binary data type. The data should be sent/retrieved in as Base64
encoded Strings -->
<fieldtype name="binary" class="solr.BinaryField"/>

Anyone knows any storage for images that performs well, other than FS ?

Thanks


On Wed, Apr 6, 2011 at 3:31 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 Ha, there's a binary field type?!

 I've stored binary data in an ordinary String field type, and it's
 worked.  But there were some headaches to get it to work, might have been
 smoother if I had realized there was actually a binary field type.

 But wait I'm talking about Solr 'stored field', not about indexing. I
 didn't try to index my binary data, just store it for later retrieval
 (knowing this can sometimes be a performance problem, doing it anyway with
 relatively small data, got away with it).  Does the field type even effect
 the _stored values_ in a Solr field?


 On 4/6/2011 2:25 PM, Ryan McKinley wrote:

 You can store binary data using a binary field type -- then you need
 to send the data base64 encoded.

 I would strongly recommend against storing large binary files in solr
 -- unless you really don't care about performance -- the file system
 is a good option that springs to mind.

 ryan




 2011/4/6 Ezequiel Calderaraezech...@gmail.com:

 Another question that maybe is easier to answer, how can i store binary
 data? Any example schema?

 2011/4/6 Ezequiel Calderaraezech...@gmail.com

  Hello everyone, i need to know if some has used solr for indexing and
 storing images (upt to 16MB) or binary docs.

 How does solr behaves with this type of docs? How affects performance?

 Thanks Everyone

 --
 __
 Ezequiel.

 Http://www.ironicnet.com



 --
 __
 Ezequiel.

 Http://www.ironicnet.com




-- 
__
Ezequiel.

Http://www.ironicnet.com


Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Markus Jelsma

 Ha, there's a binary field type?!
 
 I've stored binary data in an ordinary String field type, and it's
 worked.  But there were some headaches to get it to work, might have
 been smoother if I had realized there was actually a binary field type.

How? You can't just embed control characters in an XML body. They need to be at
least encoded so as not to write tabs, deletes, backspaces and whatever garbage;
base64 in Solr's case.
 
 But wait I'm talking about Solr 'stored field', not about indexing. I
 didn't try to index my binary data, just store it for later retrieval
 (knowing this can sometimes be a performance problem, doing it anyway
 with relatively small data, got away with it).  Does the field type even
 effect the _stored values_ in a Solr field?

Solr decodes the data and stores it. It reencodes the data when writing a 
response.

 
 On 4/6/2011 2:25 PM, Ryan McKinley wrote:
  You can store binary data using a binary field type -- then you need
  to send the data base64 encoded.
  
  I would strongly recommend against storing large binary files in solr
  -- unless you really don't care about performance -- the file system
  is a good option that springs to mind.
  
  ryan
  
  2011/4/6 Ezequiel Calderaraezech...@gmail.com:
  Another question that maybe is easier to answer, how can i store binary
  data? Any example schema?
  
  2011/4/6 Ezequiel Calderaraezech...@gmail.com
  
  Hello everyone, i need to know if some has used solr for indexing and
  storing images (upt to 16MB) or binary docs.
  
  How does solr behaves with this type of docs? How affects performance?
  
  Thanks Everyone
  
  --
  __
  Ezequiel.
  
  Http://www.ironicnet.com
  
  --
  __
  Ezequiel.
  
  Http://www.ironicnet.com


Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Markus Jelsma

 Hi, your answers were really helpfull
 
 I was thinking in putting the base64 encoded file into a string field. But
 was a little worried about solr trying to stem it or vectorize or those
 stuff.

String field types are not analyzed. So it doesn't brutalize your data. Better 
use BinaryField.

 
 Seen in the example of the schema.xml:
 <!--Binary data type. The data should be sent/retrieved in as Base64
 encoded Strings -->
 <fieldtype name="binary" class="solr.BinaryField"/>
 
 Anyone knows any storage for images that performs well, other than FS ?

CouchDB can deliver file attachments over HTTP. It needs to be sent encoded (of 
course).

 
 Thanks
 
 On Wed, Apr 6, 2011 at 3:31 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
  Ha, there's a binary field type?!
  
  I've stored binary data in an ordinary String field type, and it's
  worked.  But there were some headaches to get it to work, might have been
  smoother if I had realized there was actually a binary field type.
  
  But wait I'm talking about Solr 'stored field', not about indexing. I
  didn't try to index my binary data, just store it for later retrieval
  (knowing this can sometimes be a performance problem, doing it anyway
  with relatively small data, got away with it).  Does the field type even
  effect the _stored values_ in a Solr field?
  
  On 4/6/2011 2:25 PM, Ryan McKinley wrote:
  You can store binary data using a binary field type -- then you need
  to send the data base64 encoded.
  
  I would strongly recommend against storing large binary files in solr
  -- unless you really don't care about performance -- the file system
  is a good option that springs to mind.
  
  ryan
  
  2011/4/6 Ezequiel Calderaraezech...@gmail.com:
  Another question that maybe is easier to answer, how can i store binary
  data? Any example schema?
  
  2011/4/6 Ezequiel Calderaraezech...@gmail.com
  
   Hello everyone, i need to know if some has used solr for indexing and
   
  storing images (upt to 16MB) or binary docs.
  
  How does solr behaves with this type of docs? How affects performance?
  
  Thanks Everyone
  
  --
  __
  Ezequiel.
  
  Http://www.ironicnet.com
  
  --
  __
  Ezequiel.
  
  Http://www.ironicnet.com


Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Jonathan Rochkind

On 4/6/2011 2:39 PM, Markus Jelsma wrote:

Ha, there's a binary field type?!

I've stored binary data in an ordinary String field type, and it's
worked.  But there were some headaches to get it to work, might have
been smoother if I had realized there was actually a binary field type.

How, you can't just embed control characters in an XML body? The need to be at
least encoded as not to write tabs, deletes, backspaces and whatever garbage,
base64 in Solr's case.


In my case using SolrJ with BinaryUpdateHandler. I think. That code was 
actually written by someone else, a while ago.


However I've managed to do it at indexing -- ultimately getting it into 
a String-type stored field -- my binary data comes back not UUEncoded, 
but XML-escaped, ie:


&#30;

This works for me because my binary data is actually MOSTLY ascii (so 
this isn't as terribly inefficient as it could be), but it has some 
control characters in it that need to be preserved. And nearly any 
library you use for consuming XML responses will properly un-escape 
things like &#30; when reading.


Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Adam Estrada
Well...by default there is a pretty decent schema that you can use as a
template in the example project that builds with Solr. Tika is the library
that does the actual content extraction so it would be a good idea to try
the example project out first.

Adam
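
For illustration, the extraction handler that ships with that example project
can be exercised with something along these lines (the id and file name are
placeholders):

curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' \
     -F 'myfile=@some-document.pdf'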

2011/4/6 Ezequiel Calderara ezech...@gmail.com

 Another question that maybe is easier to answer, how can i store binary
 data? Any example schema?

 2011/4/6 Ezequiel Calderara ezech...@gmail.com

  Hello everyone, i need to know if some has used solr for indexing and
  storing images (upt to 16MB) or binary docs.
 
  How does solr behaves with this type of docs? How affects performance?
 
  Thanks Everyone
 
  --
  __
  Ezequiel.
 
  Http://www.ironicnet.com
 



 --
 __
 Ezequiel.

 Http://www.ironicnet.com



Re: Concatenate multivalued DIH fields

2011-04-06 Thread alexei
Hi Everyone,

I am having an identical problem concatenating authors' first and last
names stored in an XML blob.
Because this field is multivalued, copyField does not work.

Does anyone have a solution? 
 
Regards,
Alexei

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Concatenate-multivalued-DIH-fields-tp2749988p2786506.html
Sent from the Solr - User mailing list archive at Nabble.com.
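
One approach that is often suggested for this is a DIH ScriptTransformer that
builds the concatenated value per row; a rough, untested sketch (entity, column
and field names are placeholders, and if the names arrive as parallel lists you
would loop over them instead):

<dataConfig>
  <script><![CDATA[
    function joinName(row) {
      var first = row.get('firstName');
      var last  = row.get('lastName');
      if (first != null && last != null) {
        // write the combined value into a (possibly multivalued) fullName field
        row.put('fullName', first + ' ' + last);
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="author" transformer="script:joinName" ...>
      <field column="fullName" name="fullName" />
    </entity>
  </document>
</dataConfig>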


Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Stefan Matheis

Ezequiel,

Am 06.04.2011 20:38, schrieb Ezequiel Calderara:

Anyone knows any storage for images that performs well, other than FS ?


you may have a look on http://www.danga.com/mogilefs/ ? :)

Regards
Stefan


Re: DIH: Indexing multiple datasources with the same schema

2011-04-06 Thread alexei
Sorry about bringing an old thread back, I thought my solution could be
useful.
 
I also had to deal with multiple data sources. If the data source number
could be queried for in one of your parent entities then you could get it
using a variable as follows:

<entity name="ChildEntity" dataSource="db${YourParentEntity.DbId}" ... >

For the above to work I had to modify the 
org.apache.solr.handler.dataimport.ContextImpl.getDataSource()
Here is the replacement code for getDataSource: 


  public DataSource getDataSource() {
    if (ds != null) return ds;
    if (entity == null) return null;

    String dataSourceResolved = this.getResolvedEntityAttribute("dataSource");

    if (entity.dataSrc == null) {
      entity.dataSrc = dataImporter.getDataSourceInstance(entity,
          dataSourceResolved, this);
      entity.dataSource = dataSourceResolved;
    } else if (!dataSourceResolved.equals(entity.dataSource)) {
      entity.dataSrc.close();
      entity.dataSrc = dataImporter.getDataSourceInstance(entity,
          dataSourceResolved, this);
      entity.dataSource = dataSourceResolved;
    }
    if (entity.dataSrc != null && docBuilder != null && docBuilder.verboseDebug
        && Context.FULL_DUMP.equals(currentProcess())) {
      // debug is not yet implemented properly for deltas
      entity.dataSrc = docBuilder.writer.getDebugLogger().wrapDs(entity.dataSrc);
    }
    return entity.dataSrc;
  }


Cheers,
Alexei

--
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-Indexing-multiple-datasources-with-the-same-schema-tp877781p2786599.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Ezequiel Calderara
On Wed, Apr 6, 2011 at 15:31 PM, Adam Estrada estrada.adam.gro...@gmail.com
 wrote:

 Well...by default there is a pretty decent schema that you can use as a
 template in the example project that builds with Solr. Tika is the library
 that does the actual content extraction so it would be a good idea to try
 the example project out first.


I wanted to know how large field's size affects performance.

But i wasn't sure how to design the schema.


On Wed, Apr 6, 2011 at 4:23 PM, Stefan Matheis 
matheis.ste...@googlemail.com wrote:

 Ezequiel,

 Am 06.04.2011 20:38, schrieb Ezequiel Calderara:

  Anyone knows any storage for images that performs well, other than FS ?


 you may have a look on http://www.danga.com/mogilefs/ ? :)

 Regards
 Stefan


Stefan, we looked at mogilefs, also couchdb and mongodb.
AFAIR (As Far as I Read :P), mogilefs runs on *nix OS, while we are using
microsoft as the OS. (yeah, we are the open source evangelist in our
company :P)

Just for the moment we well start using Solr for storing and indexing (some
info at least) images and docs. We have yet to see what are the needs in
terms of scalability to choose between the options.

Thanks all...
If you have more info send it :)

-- 
__
Ezequiel.

Http://www.ironicnet.com


unindexible Chars?

2011-04-06 Thread Charles Wardell

Once in a while, my post.jar seems to fail on commit. During the commit
process I have gotten a few errors: one says an EOF character was found,
another that a semicolon was expected, and I have also come across a
"... was expected" error.

So my question is: what characters do I need to strip out of the source text to
ensure all posts are successful?

One side note: I have placed the text fields within <![CDATA[ ]]> before adding
the document.

Thanks,
Charlie 
 



Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Markus Jelsma

 On Wed, Apr 6, 2011 at 15:31 PM, Adam Estrada
 estrada.adam.gro...@gmail.com
 
 I wanted to know how large field's size affects performance.

If you use replication then it's a huge impact on performance as the data gets 
sent over the network. It's also a memory hog so there's less memory and more 
garbage collection. Indexing and merging is slower because of additional bytes 
being copied. If there's a lot of binary data and performance is important and 
diskspace is not a commodity then you shouldn't store it in the index; the 
index size can double during optimizing.

 
 But i wasn't sure how to design the schema.
 
 
 On Wed, Apr 6, 2011 at 4:23 PM, Stefan Matheis 
 
 matheis.ste...@googlemail.com wrote:
  Ezequiel,
  
  Am 06.04.2011 20:38, schrieb Ezequiel Calderara:
   Anyone knows any storage for images that performs well, other than FS ?
  
  you may have a look on http://www.danga.com/mogilefs/ ? :)
  
  Regards
  Stefan
 
 Stefan, we looked at mogilefs, also couchdb and mongodb.
 AFAIR (As Far as I Read :P), mogilefs runs on *nix OS, while we are using
 microsoft as the OS. (yeah, we are the open source evangelist in our
 company :P)
 
 Just for the moment we well start using Solr for storing and indexing (some
 info at least) images and docs. We have yet to see what are the needs in
 terms of scalability to choose between the options.
 
 Thanks all...
 If you have more info send it :)


Re: unindexible Chars?

2011-04-06 Thread Markus Jelsma

 Once and awhile, my post.jar seems to fail on commit. Durring the commit
 process, I have gotten a few errors. One is that EOF character found, and
 another is that semicolon expected after the. I also have come across a 
 was expected.
 
 So my question is what characters do I need to strip out of the source text
 to ensure all posts are sucessful?

The usual, it _must_ be valid XML.

 
 One side note. I have placed the text fields within ![CDATA[ ]] before
 adding the document.

That's not a bad idea, then at least nothing bad can happen with the data 
embedded in the element. Usually these errors indicate invalid XML.

Try xmllint with some XML body giving errors.
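
A common complementary step (a sketch, not something from this thread) is to
strip the control characters XML 1.0 forbids before building the add document:

// Removes characters that are illegal in XML 1.0: everything below 0x20
// except tab, LF and CR. A minimal sketch; extend it if you also need to
// handle the supplementary-plane exclusions.
public static String stripInvalidXmlChars(String in) {
    StringBuilder out = new StringBuilder(in.length());
    for (int i = 0; i < in.length(); i++) {
        char c = in.charAt(i);
        if (c == 0x9 || c == 0xA || c == 0xD || c >= 0x20) {
            out.append(c);
        }
    }
    return out.toString();
}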


 
 Thanks,
 Charlie


ClobTransformer Issues

2011-04-06 Thread Stephen Garvey
Hi All,

I'm hoping someone can give me some pointers. I've got Solr 1.4.1 and am
using DIH to import a table from and Ingres database. The table contains
a column which is a CLOB type. I've tried to use a CLOB transformer to
transform the CLOB to a String but the index only contains something
like INGRES-CLOB:(Loc 10).

Does anyone have any ideas on why the CLOB transformer is not
transforming this column?

Thanks,

Stephen
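
For reference, the usual wiring is to declare the transformer on the entity and
switch it on per field with clob="true"; a typical configuration looks roughly
like this (table, column and field names are placeholders):

<entity name="doc" transformer="ClobTransformer"
        query="select id, description from my_table">
  <field column="id" name="id" />
  <!-- clob="true" asks the transformer to turn the CLOB into a String -->
  <field column="description" name="description" clob="true" />
</entity>

If the transformer is declared but the clob attribute is missing, the driver's
raw CLOB handle is what ends up in the index, which would match the symptom above.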



Re: Eclipse: Invalid character constant

2011-04-06 Thread Eric Grobler
Hi Stefan,

Thanks, my eclipse is now perfectly configured.
It makes it very easy for amateurs like me!

For other amateurs the steps are:
1. checkout the sources:
*svn checkout
https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_1/*
2. the root folder (lucene_solr_3_1 in this example) contains a special
build.xml to create project settings for either eclipse or IntelliJ
IDEA. (it is not the build.xml in the solr subfolder that compiles
tomcat/jetty)
   run:  *ant eclipse*
3. Create a new Eclipse Java project, we need to specify an external folder.
GALILEO: Create project from existing source
HELIOS: Unclick Use Default Location
*Select the root svn folder *(lucene_solr_3_1)
Click finish and you should have solr configured in eclipse!

Regards
Ericz

On Tue, Apr 5, 2011 at 11:34 PM, Stefan Matheis 
matheis.ste...@googlemail.com wrote:

 Eric,

 have a look at Line #67 in build.xml :)
 <target name="eclipse" description="Setup Eclipse configuration -- Only
 available with SVN checkout">

 Regards
 Stefan

 Am 06.04.2011 00:28, schrieb Eric Grobler:

  Hi Robert,

 Thanks for the fast response!

 I used
 https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_1/
 but did not find 'ant eclipse'.

 However setting my projects Resouce encoding to UTF-8 worked.

 Thanks for your help and have a nice day :-)

 Regards
 Ericz

 On Tue, Apr 5, 2011 at 11:14 PM, Robert Muirrcm...@gmail.com  wrote:

  in eclipse you need to set your project's character encoding to UTF-8.

 if you are checking out the source code from svn, you can run 'ant
 eclipse'
 from the top level, and then hit refresh on your project. it will set
 your
 encoding and your classpath up.

 On Tue, Apr 5, 2011 at 6:10 PM, Eric Groblerimpalah...@googlemail.com

 wrote:


  Hi Everyone,

 Some language specific classes like GermanLightStemmer has invalid
 character
 compiler errors for code like:
  switch(s[i]) {
case 'ä':
case 'Ã ':
case 'á':
 in Eclipse with JDK 1.6

 How do I get rid of these errors?
 Thanks  Regards

 Ericz






where is INFOSTREAM.txt located?

2011-04-06 Thread Tirthankar Chatterjee







Re: Synonym-time Reindexing Issues

2011-04-06 Thread Preston Marshall
Reply Inline:
On Apr 6, 2011, at 8:12 AM, Erick Erickson wrote:

 Hmmm, this should work just fine. Here are my questions.
 
 1 are you absolutely sure that the new synonym file
 is available when reindexing?
Not sure what you mean here, solr is running as root, and the file is never 
moved around or anything crazy.
 2 does the sunspot program do anything wonky with
 the ids? The documents
 will only be replaced if the IDs are identical.
Is there a way I can add debugging to show what it's doing with the IDs or 
something to view the index?  I tried using Luke, but I can't get it to 
actually show me the actual data of the objects, only the name and some other 
basic info.
 3 are you sure that a commit is done at the end?
It appears that it commits a few times during reindexing.
 4 What happens if you optimize? At that point, maxdocs
 and numdocs should be the same, and should be the count
 of documents. if they differ by a factor of 2, I'd suspect your
 id field isn't being used correctly.
I'm unaware of what you mean by optimizing, or even viewing maxdocs and 
numdocs, but I will RTFM to find out.  I did notice something strange earlier 
though that may relate to this.  When I ran a search there were duplicate 
results.
 
 If the hypothesis that you id field isn't working correctly, your number
 of hits should be going up after re-indexing...
 
 If none of that is relevant, let us know what you find and we'll
 try something else
 
 Best
 Erick
 
 On Tue, Apr 5, 2011 at 10:46 PM, Preston Marshall 
 pres...@synergyeoc.comwrote:
 
 Hello all, I am having an issue with Solr and the SynonymFilterFactory.  I
 am using a library to interface with Solr called sunspot.  I realize that
 is not what this list is for, but I believe this may be an issue with Solr,
 not the library (plus the lib author doesn't know the answer). I am using
 the SynonymFilterFactory in my index-time analyzer, and it works great.  My
 only problem is when it comes to changing the synonyms file.  I would expect
 to be able to edit the file, run a reindex (this is through the library),
 and have the new synonyms function when the reindex is complete.
 Unfortunately this is not the case, as changing the synonyms file doesn't
 actually affect the search results.  What DOES work is deleting the existing
 index, and starting from scratch.  This is unacceptable for my usage though,
 because I need the old index to remain online while the new one is being
 built, so there is no downtime.
 
 Here's my schema in case anyone needs it:
 https://gist.github.com/88f8fb763e99abe4d5b8
 
 Thanks,
 Preston
 
 P.S. Sorry if this dupes, first post and I didn't see it show up in the
 archives.
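
(For the record: numDocs and maxDocs are shown on the admin statistics page,
e.g. http://localhost:8983/solr/admin/stats.jsp, and an optimize can be
triggered straight against the update handler, for example:

curl 'http://localhost:8983/solr/update?optimize=true'

or by posting an <optimize/> message to the same URL.)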
 





RE: what happens to docsPending if stop solr before commit

2011-04-06 Thread Robert Petersen
Oh woe is me...  lol NP good to know.  I'll get them on the next go
'round.  :) 

Thanks for the answer!



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, April 06, 2011 6:05 AM
To: solr-user@lucene.apache.org
Subject: Re: what happens to docsPending if stop solr before commit

They're lost, never to be seen again. You'll have to reindex them.

Best
Erick

On Tue, Apr 5, 2011 at 4:25 PM, Robert Petersen rober...@buy.com
wrote:

 Hello fellow enthusiastic solr users,



 I tried to find the answer to this simple question online, but failed.
 I was wondering about this, what happens to uncommitted docsPending if
I
 stop solr and then restart solr?  Are they lost?  Are they still there
 but still uncommitted?  Do they get committed at startup?  I noticed
 after a restart my 250K pending doc count went to 0 is what got me
 wondering.



 TIA!

 Robi




RE: where is INFOSTREAM.txt located?

2011-04-06 Thread Tirthankar Chatterjee
Thanks All, I figured it out. 

http://lucene.472066.n3.nabble.com/general-debugging-techniques-td868300.html

See the last line on this page.

-Original Message-
From: Tirthankar Chatterjee [mailto:tchatter...@commvault.com] 
Sent: Wednesday, April 06, 2011 6:15 PM
To: solr-user@lucene.apache.org
Subject: where is INFOSTREAM.txt located?









Re: Embedded Solr constructor not returning

2011-04-06 Thread Greg Pendlebury
 Sounds good.  Please go ahead and make this change yourself.

Done.

Ta,
Greg

On 6 April 2011 22:52, Steven A Rowe sar...@syr.edu wrote:

 Hi Greg,

  I need the servlet API in my app for it to work, despite being command
  line.
  So adding this to the maven POM fixed everything:
  <dependency>
    <groupId>javax.servlet</groupId>
    <artifactId>servlet-api</artifactId>
    <version>2.5</version>
  </dependency>
 
  Perhaps this dependency could be listed on the wiki? Alongside the sample
  code for using embedded solr?
  http://wiki.apache.org/solr/Solrj

 Sounds good.  Please go ahead and make this change yourself.

 FYI, the Solr 3.1 POM has a servlet-api dependency, but the scope is
 provided, because the servlet container includes this dependency.  When
 *you* are the container, you have to provide it.

 Steve



Re: what happens to docsPending if stop solr before commit

2011-04-06 Thread Koji Sekiguchi

(11/04/06 5:25), Robert Petersen wrote:

I tried to find the answer to this simple question online, but failed.
I was wondering about this, what happens to uncommitted docsPending if I
stop solr and then restart solr?  Are they lost?  Are they still there
but still uncommitted?  Do they get committed at startup?  I noticed
after a restart my 250K pending doc count went to 0 is what got me
wondering.


Robi,

Usually they are never lost, but they are committed.

When you stop Solr, servlet container (Jetty) calls servlets/filters
destroy() methods. This causes closing all SolrCores. Then SolrCore.close()
calls UpdateHandler.close(). It calls SolrIndexWriter.close(). Then
pending docs are flushed, then committed.

Koji
--
http://www.rondhuit.com/en/
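
(If you would rather not rely on the shutdown path, an explicit commit can
always be forced beforehand, for example:

curl 'http://localhost:8983/solr/update?commit=true'

so that nothing is pending when the container goes down.)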


RE: what happens to docsPending if stop solr before commit

2011-04-06 Thread Robert Petersen
Really?  Great!  I was wondering if there was some cleanup cycle like
that which would occur upon shutdown.  That sounds like much more
logical behavior! 

-Original Message-
From: Koji Sekiguchi [mailto:k...@r.email.ne.jp] 
Sent: Wednesday, April 06, 2011 4:03 PM
To: solr-user@lucene.apache.org
Subject: Re: what happens to docsPending if stop solr before commit

(11/04/06 5:25), Robert Petersen wrote:
 I tried to find the answer to this simple question online, but failed.
 I was wondering about this, what happens to uncommitted docsPending if
I
 stop solr and then restart solr?  Are they lost?  Are they still there
 but still uncommitted?  Do they get committed at startup?  I noticed
 after a restart my 250K pending doc count went to 0 is what got me
 wondering.

Robi,

Usually they are never lost, but they are committed.

When you stop Solr, servlet container (Jetty) calls servlets/filters
destroy() methods. This causes closing all SolrCores. Then
SolrCore.close()
calls UpdateHandler.close(). It calls SolrIndexWriter.close(). Then
pending docs are flushed, then committed.

Koji
-- 
http://www.rondhuit.com/en/


Shared conf

2011-04-06 Thread Mark
Is there a configuration value I can specify for multiple cores to use 
the same conf directory?


Thanks
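
One way this is sometimes handled (a sketch, not tested here) is to point
several cores at the same instanceDir in solr.xml, so they share its conf/
directory, and give each core its own dataDir:

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- both cores read conf/ from the shared instanceDir -->
    <core name="core0" instanceDir="shared" dataDir="/var/solr/core0/data" />
    <core name="core1" instanceDir="shared" dataDir="/var/solr/core1/data" />
  </cores>
</solr>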


difference between geospatial search from database angle and from solr angle

2011-04-06 Thread Sean Bigdatafun
I understand Solr can do pretty powerful geospatial search
(http://www.ibm.com/developerworks/java/library/j-spatial/).

But I also understand lots of DB researchers have done lots of
geospatial-related work. Can someone give an overview of the differences
between the two angles?

Thanks,
-- 
--Sean


Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-06 Thread Lance Norskog
I would not use replication. LinkedIn consumer search is a flat system
where one process indexes new entries and does queries simultaneously.
It's a custom Lucene app called Zoie. Their stuff is on Github..

I would get documents to indexers via a multicast IP-based queueing
system. This scales very well and there's a lot of hardware support.

The problem with distributed search is that it is a) inherently slower
and b) has inherently more and longer jitter. The airplane wing
distribution of query times becomes longer and flatter.

This is going to have to be a federated system, where the front-end
app aggregates results rather than Solr.

On Mon, Apr 4, 2011 at 6:25 PM, Jens Mueller supidupi...@googlemail.com wrote:
 Hello Experts,



 I am a Solr newbie but read quite a lot of docs. I still do not understand
 what would be the best way to setup very large scale deployments:



 Goal (theoretical):

  A.) Index-Size: 1 Petabyte (1 Document is about 5 KB in Size)

  B) Queries: 10 Queries/ per Second

  C) Updates: 10 Updates / per Second




 Solr offers:

 1.)    Replication = Scales Well for B)  BUT  A) and C) are not satisfied


 2.)    Sharding = Scales well for A) BUT B) and C) are not satisfied (= As
 I understand the Sharding approach, all goes through a central server that
 dispatches the updates and assembles the queries retrieved from the different
 shards. But this central server also has some capacity limits...)




 What is the right approach to handle such large deployments? I would be
 thankful for just a rough sketch of the concepts so I can experiment/search
 further...


 Maybe I am missing something very trivial as I think some of the "Solr
 Users/Use Cases" on the homepage are that kind of large deployments. How are
 they implemented?



 Thank you very much!!!

 Jens




-- 
Lance Norskog
goks...@gmail.com


Re: Using MLT feature

2011-04-06 Thread Lance Norskog
A fuzzy signature system will not work here. You are right, you want
to try MLT instead.

Lance

On Wed, Apr 6, 2011 at 9:47 AM, Frederico Azeiteiro
frederico.azeite...@cision.com wrote:
 Yes, I had already checked the code for it and used it to compile a C# method
 that returns the same signature.

 But I have a strange issue:
 For instance, using minTokenLen=2 and the default QUANT_RATE, passing the
 text "frederico" (simple text, no big deal here):

 1. using my c# app returns 8b92e01d67591dfc60adf9576f76a055
 2. using SOLR, passing a doc with HeadLine frederico I get 
 8d9a5c35812ba75b8383d4538b91080f on my signature field.
 3. Created a Java app (i'm not a Java expert..), using the code from SOLR 
 SignatureUpdateProcessorFactory class (please check code below) and I get 
 8b92e01d67591dfc60adf9576f76a055.

 Java app code:
                TextProfileSignature textProfileSignature = new TextProfileSignature();
                NamedList<String> params = new NamedList<String>();
                params.add("", "");
                SolrParams solrParams = SolrParams.toSolrParams(params);
                textProfileSignature.init(solrParams);
                textProfileSignature.add("frederico");

                byte[] signature = textProfileSignature.getSignature();
                char[] arr = new char[signature.length << 1];
                for (int i = 0; i < signature.length; i++) {
                        int b = signature[i];
                        int idx = i << 1;
                        arr[idx] = StrUtils.HEX_DIGITS[(b >> 4) & 0xf];
                        arr[idx + 1] = StrUtils.HEX_DIGITS[b & 0xf];
                }
                String sigString = new String(arr);
                System.out.println(sigString);




 Here's my processor configs:

 <updateRequestProcessorChain name="dedupe">
   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
     <bool name="enabled">true</bool>
     <str name="signatureField">sig</str>
     <bool name="overwriteDupes">false</bool>
     <str name="fields">HeadLine</str>
     <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
     <str name="minTokenLen">2</str>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>


 So both my apps (Java and C#)  return the same signature but SOLR returns a 
 different one..
 Can anyone understand what I should be doing wrong?

 Thank you once again.

 Frederico

 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io]
 Sent: terça-feira, 5 de Abril de 2011 15:20
 To: solr-user@lucene.apache.org
 Cc: Frederico Azeiteiro
 Subject: Re: Using MLT feature

 If you check the code for TextProfileSignature [1] your'll notice the init
 method reading params. You can set those params as you did. Reading Javadoc
 [2] might help as well. But what's not documented in the Javadoc is how QUANT
 is computed; it rounds.

 [1]:
 http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.4/src/java/org/apache/solr/update/processor/TextProfileSignature.java?view=markup
 [2]:
 http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html
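
Roughly, the idea of the quantization is this (an illustration only, not a
verbatim copy of the class; quantRate defaults to 0.01 and the quantum is
derived from the most frequent token, so short texts usually end up with a
quantum of 1):

// Illustrative sketch of quantRate-style rounding.
float quantRate = 0.01f;            // default QUANT_RATE
int maxFreq = 7;                    // frequency of the most frequent token
int quant = Math.round(maxFreq * quantRate);
if (quant < 2) {
    quant = (maxFreq > 1) ? 2 : 1;  // floor for small documents
}
// token frequencies are then rounded down to multiples of quant before hashing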

 On Tuesday 05 April 2011 16:10:08 Frederico Azeiteiro wrote:
 Thank you, I'll try to create a c# method to create the same sig of SOLR,
 and then compare both sigs before index the doc. This way I can avoid the
 indexation of existing docs.

 If anyone needs to use this parameter (as this info is not on the wiki),
 you can add the option

 <str name="minTokenLen">5</str>

 On the processor tag.

 Best regards,
 Frederico


 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io]
 Sent: terça-feira, 5 de Abril de 2011 12:01
 To: solr-user@lucene.apache.org
 Cc: Frederico Azeiteiro
 Subject: Re: Using MLT feature

 On Tuesday 05 April 2011 12:19:33 Frederico Azeiteiro wrote:
  Sorry, the reply I made yesterday was directed to Markus and not the
  list...
 
  Here's my thoughts on this. At this point I'm a little confused if SOLR
  is a good option to find near duplicate docs.
 
   Yes there is, try set overwriteDupes to true and documents yielding
 
  the same signature will be overwritten
 
  The problem is that I don't want to overwrite the doc, I need to
  maintain the original version (because the doc has others fields I need
  to maintain).
 
  If you have need both fuzzy and exact matching then add a second
 
  update processor inside the chain and create another signature field.
 
  I just need the fuzzy search but the quick tests I made, return
  different signatures for what I consider duplicate docs.
  Army deploys as clan war kills 11 in Philippine south
  Army deploys as clan war kills 11 in Philippine south.
 
  Same sig for the above 2 strings, that's ok.
 
  But a different sig was created for:
  Army deploys as clan war kills 11 in Philippine south the.
 
  Is there a way to setup the 

Re: SOLR - problems with non-english symbols when extracting HTML

2011-04-06 Thread Lance Norskog
Tomcat has to be configured to use UTF-8.

http://wiki.apache.org/solr/SolrTomcat?highlight=%28tomcat%29#URI_Charset_Config
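
Concretely, that comes down to setting URIEncoding on the HTTP connector in
Tomcat's conf/server.xml, roughly:

<!-- make Tomcat decode request URIs as UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           URIEncoding="UTF-8" />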

On Fri, Mar 25, 2011 at 6:58 PM, kushti sandyl...@gmail.com wrote:

 Grijesh wrote:

 Try to send HTML data using format CDATA .

 Doesn't work with


 $content = ;


 And my goal is not to avoid extraction, but have no problems with
 non-english chars


 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/SOLR-problems-with-non-english-symbols-when-extracting-HTML-tp2729126p2733858.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Lance Norskog
goks...@gmail.com


Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-06 Thread Walter Underwood
The bigger answer is that you cannot get to this size by just configuring Solr. 
You may have to invent a lot of stuff. Like all of Google.

Where did you get these numbers? The proposed query rate is twice as big as 
Google (Feb 2010 estimate, 34K qps).

I work at MarkLogic, and we scale to 100's of terabytes, with fast update and 
query rates. If you want a real system that handles that, you might want to 
look at our product.

wunder

On Apr 6, 2011, at 8:06 PM, Lance Norskog wrote:

 I would not use replication. LinkedIn consumer search is a flat system
 where one process indexes new entries and does queries simultaneously.
 It's a custom Lucene app called Zoie. Their stuff is on Github..
 
 I would get documents to indexers via a multicast IP-based queueing
 system. This scales very well and there's a lot of hardware support.
 
 The problem with distributed search is that it is a) inherently slower
 and b) has inherently more and longer jitter. The airplane wing
 distribution of query times becomes longer and flatter.
 
 This is going to have to be a federated system, where the front-end
 app aggregates results rather than Solr.
 
 On Mon, Apr 4, 2011 at 6:25 PM, Jens Mueller supidupi...@googlemail.com 
 wrote:
 Hello Experts,
 
 
 
 I am a Solr newbie but read quite a lot of docs. I still do not understand
 what would be the best way to setup very large scale deployments:
 
 
 
 Goal (threoretical):
 
  A.) Index-Size: 1 Petabyte (1 Document is about 5 KB in Size)
 
  B) Queries: 10 Queries/ per Second
 
  C) Updates: 10 Updates / per Second
 
 
 
 
 Solr offers:
 
 1.)Replication = Scales Well for B)  BUT  A) and C) are not satisfied
 
 
 2.)Sharding = Scales well for A) BUT B) and C) are not satisfied (= As
 I understand the Sharding approach all goes through a central server, that
 dispatches the updates and assembles the quries retrieved from the different
 shards. But this central server has also some capacity limits...)
 
 
 
 
 What is the right approach to handle such large deployments? I would be
 thankfull for just a rough sketch of the concepts so I can experiment/search
  further...
 
 
  Maybe I am missing something very trivial as I think some of the "Solr
  Users/Use Cases" on the homepage are that kind of large deployments. How are
 they implemented?
 
 
 
 Thanky very much!!!
 
 Jens
 
 






solr-2351 patch

2011-04-06 Thread Isha Garg



Hi,
 Can you tell me which Solr version the patch file for
SOLR-2351 (https://issues.apache.org/jira/secure/attachment/12470560/mlt.patch)
is meant for?

Regards!
Isha



Re: difference between geospatial search from database angle and from solr angle

2011-04-06 Thread David Smiley (@MITRE.org)
Sean,
Geospatial search in Lucene/Solr is of course implemented based on
Lucene's underlying index technology. That technology was originally just
for text but it's been adapted very successfully for numerics and querying
ranges too. The only mature geospatial field type in Solr 3.1 is LatLonType
which under the hood is simply a pair of latitude & longitude numeric
fields.  There really isn't anything sophisticated (geospatially speaking)
in Solr 3.1. I'm not sure what sort of geospatial DB research you have in
mind but I would expect other systems would be free to use an indexing
strategy designed for spatial such as R-Trees. Nevertheless, I think
Lucene offers the underlying primitives to compete with systems using other
technologies.  Case in point is my patch SOLR-2155 which indexes a single
point in the form of a geohash at multiple resolutions (geohash lengths
AKA spatial prefixes / grids) and uses a recursive algorithm to efficiently
query an arbitrary shape.  It's quite fast and bests LatLonType already; and
there's a lot more I can do to make it faster.
This is definitely a field of interest and a growing one in the
Lucene/Solr community.  There are even some external spatial providers
(JTeam, MetaCarta) and I'm partnering with other individuals to create a new
one.  Expect to see more in the coming months.  If you're looking for some
specific geospatial capabilities then let us know.
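
To make the LatLonType side concrete, a 3.1 radius filter typically looks like
this (field name, point and distance are placeholders):

q=*:*&fq={!geofilt}&sfield=store&pt=45.15,-93.85&d=5

with sort=geodist() asc available for distance ordering.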

~ David Smiley 
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/

-
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/difference-between-geospatial-search-from-database-angle-and-from-solr-angle-tp2788442p2788972.html
Sent from the Solr - User mailing list archive at Nabble.com.


Tips for getting unique results?

2011-04-06 Thread Peter Spam
Hi,

I have documents with a field that has 1A2B3C alphanumeric characters.  I can 
query for * and sort results based on this field, however I'd like to uniq 
these results (remove duplicates) so that I can get the 5 largest unique 
values.  I can't use the StatsComponent because my values have letters in them 
too.

Faceting (and ignoring the counts) gets me half of the way there, but I can 
only sort ascending.  If I could also sort facet results descending, I'd be 
done.  I'd rather not return all documents and just parse the last few results 
to work around this.

Any ideas?


-Pete
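
(For reference, the faceting half-way approach described above is a query along
these lines, with the last few facet values then read off client-side; the
field name is a placeholder:

q=*:*&rows=0&facet=true&facet.field=myfield&facet.sort=lex&facet.limit=-1
)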


Re: Tips for getting unique results?

2011-04-06 Thread Otis Gospodnetic
Hi,

I think you are saying dupes are the main problem?  If so, 
http://wiki.apache.org/solr/Deduplication ?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Peter Spam ps...@mac.com
 To: solr-user@lucene.apache.org
 Sent: Thu, April 7, 2011 1:13:44 AM
 Subject: Tips for getting unique results?
 
 Hi,
 
 I have documents with a field that has 1A2B3C alphanumeric  characters.  I 
can query for * and sort results based on this field,  however I'd like to 
uniq these results (remove duplicates) so that I can get  the 5 largest 
unique 
values.  I can't use the StatsComponent because my  values have letters in 
them 
too.
 
 Faceting (and ignoring the counts) gets  me half of the way there, but I can 
only sort ascending.  If I could also  sort facet results descending, I'd be 
done.  I'd rather not return all  documents and just parse the last few 
results 
to work around this.
 
 Any  ideas?
 
 
 -Pete
 


Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-06 Thread Jens Mueller
Hello Ephraim, hello Lance, hello Walter,

thanks for your replies:

Ephraim, thanks very much for the further detailed explanation. I will try
to set up a demo system in the next few days and use your advice.
Load balancers are an important aspect of your design. Can you recommend one
LB specifically? (I would be using haproxy.1wt.eu.) I think the idea of
uploading your document is very good. However, Google Docs seemed not to be
working (at least for me with the docx format?), but maybe you can simply
output the document as PDF and then I think Google Docs will work, so all
the others can also have a look at your concept. The best approach would be
if you could upload your advice directly to the Solr wiki somewhere, as it is
really helpful. I found some other documents meanwhile, but yours is much
clearer and more complete, with the LBs and the Aggregators (
http://lucene-eurocon.org/slides/Solr-In-The-Cloud_Mark-Miller.pdf)

Lance, thanks I will have a look at what linkedin is doing.

Walter, thanks for the advice: well, you are right, mentioning Google. My
question was also to understand how such large systems like Google/Facebook
are actually working. So my numbers are just theoretical and made up. My
system will be smaller, but I would be very happy to understand how such
large systems are built, and I think the approach Ephraim showed should be
working quite well at large scale. If you know of any good documents (besides
the Bigtable research paper that I already know) that technically describe how
Google works in detail, that would be of great interest. You seem to be
working for a company that handles large datasets. Does Google use this
approach, sharding the index into N writers, with the produced index then
replicated to N read-only searchers?

thank you all.
best regards
jens



2011/4/7 Walter Underwood wun...@wunderwood.org

 The bigger answer is that you cannot get to this size by just configuring
 Solr. You may have to invent a lot of stuff. Like all of Google.

 Where did you get these numbers? The proposed query rate is twice as big as
 Google (Feb 2010 estimate, 34K qps).

 I work at MarkLogic, and we scale to 100's of terabytes, with fast update
 and query rates. If you want a real system that handles that, you might want
 to look at our product.

 wunder

 On Apr 6, 2011, at 8:06 PM, Lance Norskog wrote:

  I would not use replication. LinkedIn consumer search is a flat system
  where one process indexes new entries and does queries simultaneously.
  It's a custom Lucene app called Zoie. Their stuff is on Github..
 
  I would get documents to indexers via a multicast IP-based queueing
  system. This scales very well and there's a lot of hardware support.
 
  The problem with distributed search is that it is a) inherently slower
  and b) has inherently more and longer jitter. The airplane wing
  distribution of query times becomes longer and flatter.
 
  This is going to have to be a federated system, where the front-end
  app aggregates results rather than Solr.
 
  On Mon, Apr 4, 2011 at 6:25 PM, Jens Mueller supidupi...@googlemail.com
 wrote:
  Hello Experts,
 
 
 
  I am a Solr newbie but read quite a lot of docs. I still do not
 understand
  what would be the best way to setup very large scale deployments:
 
 
 
  Goal (threoretical):
 
   A.) Index-Size: 1 Petabyte (1 Document is about 5 KB in Size)
 
   B) Queries: 10 Queries/ per Second
 
   C) Updates: 10 Updates / per Second
 
 
 
 
  Solr offers:
 
  1.)Replication = Scales Well for B)  BUT  A) and C) are not
 satisfied
 
 
  2.)Sharding = Scales well for A) BUT B) and C) are not satisfied
 (= As
  I understand the Sharding approach all goes through a central server,
 that
  dispatches the updates and assembles the quries retrieved from the
 different
  shards. But this central server has also some capacity limits...)
 
 
 
 
  What is the right approach to handle such large deployments? I would be
  thankfull for just a rough sketch of the concepts so I can
 experiment/search
   further...
 
 
   Maybe I am missing something very trivial as I think some of the "Solr
   Users/Use Cases" on the homepage are that kind of large deployments. How
 are
  they implemented?
 
 
 
  Thanky very much!!!
 
  Jens
 
 







Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-06 Thread Otis Gospodnetic
Just a quick comment re LinkedIn's stuff.  You can look at Zoie (also covered 
in 
Lucene in Action 2), but you may be more interested in Sensei.

And yes, big systems like that need sharding and replication, multiple master 
and lots of slaves.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Jens Mueller supidupi...@googlemail.com
 To: solr-user@lucene.apache.org
 Sent: Thu, April 7, 2011 1:29:40 AM
 Subject: Re: Very very large scale Solr Deployment = how to do (Expert 
Question)?
 
 Hello Ephraim, hello Lance, hello Walter,
 
 thanks for your  replies:
 
 Ephraim, thanks very much for the further detailed explanation.  I will try
 to setup a demo system in the next few days and use your  advice.
 LoadBalancers are an important aspect of your design. Can you  recommend one
 LB specificallly? (I would be using haproxy.1wt.eu) . I think  the Idea with
 uploading your document is very good. However Google-Docs  seemed not be be
 working (at least for me with the docx format?), but maybe  you can simply
 output the document as PDF and then I think Google Docs is  working, so all
 the others can also have a look at your concept. The best  approach would be
 if you could upload your advice directly somewhere to the  solr wiki as it is
 really helpful.I found some other documents meanwhile, but  yours is much
 clearer and more complete, with the LBs and the Aggregators  (
 http://lucene-eurocon.org/slides/Solr-In-The-Cloud_Mark-Miller.pdf)
 
 Lance,  thanks I will have a look at what linkedin is doing.
 
 Walter, thanks for  the advice: Well you are right, mentioning google. My
 question was also to  understand how such large systems like google/facebook
 are actually working.  So my numbers are just theoretical and made up. My
 system will be  smaller,  but I would be very happy to understand how such
 large systems  are build and I think the approach Ephraim showd should be
 working quite well  at large scale. If you know a good documents (besides the
 bigtable research  paper that I already know) that technically describes how
 google is working  in detail that would be of great interest. You seem to be
 working for a  company that handles large datasets. Does google use this
 approach, sharing  the index into N writers, and the procuded index is then
 replicated to N  read only searchers?
 
 thank you all.
 best  regards
 jens
 
 
 
 2011/4/7 Walter Underwood wun...@wunderwood.org
 
   The bigger answer is that you cannot get to this size by just  configuring
  Solr. You may have to invent a lot of stuff. Like all of  Google.
 
  Where did you get these numbers? The proposed query rate  is twice as big as
  Google (Feb 2010 estimate, 34K qps).
 
   I work at MarkLogic, and we scale to 100's of terabytes, with fast  update
  and query rates. If you want a real system that handles that, you  might 
want
  to look at our product.
 
   wunder
 
  On Apr 6, 2011, at 8:06 PM, Lance Norskog  wrote:
 
   I would not use replication. LinkedIn consumer  search is a flat system
   where one process indexes new entries and  does queries simultaneously.
   It's a custom Lucene app called Zoie.  Their stuff is on Github..
  
   I would get documents to  indexers via a multicast IP-based queueing
   system. This scales very  well and there's a lot of hardware support.
  
   The  problem with distributed search is that it is a) inherently slower
and b) has inherently more and longer jitter. The airplane wing
distribution of query times becomes longer and flatter.
  
This is going to have to be a federated system, where the  front-end
   app aggregates results rather than Solr.
   
   On Mon, Apr 4, 2011 at 6:25 PM, Jens Mueller supidupi...@googlemail.com
   wrote:
   Hello Experts,
  
   
  
   I am a Solr newbie but read quite a  lot of docs. I still do not
  understand
   what would be  the best way to setup very large scale deployments:
  
   
  
   Goal (threoretical):
   
A.) Index-Size: 1 Petabyte (1 Document is about  5 KB in Size)
  
B) Queries: 10  Queries/ per Second
  
C) Updates: 10  Updates / per Second
  
  
  
   
   Solr offers:
  
1.)Replication = Scales Well for B)  BUT  A) and C)  are not
  satisfied
  
  
2.)Sharding = Scales well for A) BUT B) and C) are not  satisfied
  (= As
   I understand the Sharding approach  all goes through a central server,
  that
   dispatches the  updates and assembles the quries retrieved from the
  different
shards. But this central server has also some capacity  limits...)
  
  
  
   
   What is the right approach to handle such large  deployments? I would be
   thankfull for just a rough sketch of  the concepts so I can
  experiment/search
    further...
  
  
    Maybe I am missing  something very trivial as I think some of the "Solr
    Users/Use  Cases" on the homepage are that kind of large deployments. How