date:20110802

Strange, anything out of the ordinary in the syslog?

On Tuesday 02 August 2011 12:01:35 alexander sulz wrote:
 Hello folks,
 
 I'm using the latest stable Solr release - 3.3 and I encounter strange
 phenomena with it.
 After about 19 hours it just crashes, but I can't find anything in the
 logs, no exceptions, no warnings,
 no suspicious info entries..
 
 I have an index-job running from 6am to 8pm every 10 minutes. After each
 job there is a commit.
 An optimize-job is done twice a day at 12:15pm and 9:15pm.
 
 Does anyone have an idea what could possibly be wrong or where to look
 for further debug info?
 
 regards and thank you
   alex

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Solr 3.3 crashes after ~18 hours?

2011-08-02 Thread Pranav Prakash

What do you mean by it just crashes? Does the process stops execution? Does
it takes too long to respond which might result in lots of 503s in your
application? Does the system run out of resources?

Are you indexing and serving from the same server? It happened once with us
that Solr was performing commit and then optimize while the load from app
server was at its peak. This caused slow response from search server, which
caused requests getting stacked up at app server and causing 503s. Could you
look if you have a similar syndrome?

*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


On Tue, Aug 2, 2011 at 15:31, alexander sulz a.s...@digiconcept.net wrote:

 Hello folks,

 I'm using the latest stable Solr release - 3.3 and I encounter strange
 phenomena with it.
 After about 19 hours it just crashes, but I can't find anything in the
 logs, no exceptions, no warnings,
 no suspicious info entries..

 I have an index-job running from 6am to 8pm every 10 minutes. After each
 job there is a commit.
 An optimize-job is done twice a day at 12:15pm and 9:15pm.

 Does anyone have an idea what could possibly be wrong or where to look for
 further debug info?

 regards and thank you
  alex

RE: changing the root directory where solrCloud stores info inside zookeeper File system

2011-08-02 Thread Yatir Ben Shlomo

Thanks A lot mark,
Since My SolrCloud code was old I tried downloading and building the
newest code from here
https://svn.apache.org/repos/asf/lucene/dev/trunk/
I am using tomcat6
I manually created the sc sub-directory in my zooKeeper ensemble
file-system
I used this connection String to my ZK ensemble
zook1:2181/sc,zook2:2181/sc,zook3:2181/sc
but I still get the same problem
here is the entire catalina.out log with the exception

Using CATALINA_BASE:   /opt/tomcat6
Using CATALINA_HOME:   /opt/tomcat6
Using CATALINA_TMPDIR: /opt/tomcat6/temp
Using JRE_HOME:/usr/java/default/
Using CLASSPATH:   /opt/tomcat6/bin/bootstrap.jar
Java HotSpot(TM) 64-Bit Server VM warning: Failed to reserve shared memory
(errno = 12).
Aug 2, 2011 4:28:46 AM org.apache.catalina.core.AprLifecycleListener init
INFO: The APR based Apache Tomcat Native library which allows optimal
performance in production environments was not found on the
java.library.path:
/usr/java/jdk1.6.0_21/jre/lib/amd64/server:/usr/java/jdk1.6.0_21/jre/lib/a
md64:/usr/java/jdk1.6.0_21/jre/../lib/amd64:/usr/java/packages/lib/amd64:/
usr/lib64:/lib64:/lib:/usr/lib
Aug 2, 2011 4:28:46 AM org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-8983
Aug 2, 2011 4:28:46 AM org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-8080
Aug 2, 2011 4:28:46 AM org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 448 ms
Aug 2, 2011 4:28:46 AM org.apache.catalina.core.StandardService start
INFO: Starting service Catalina
Aug 2, 2011 4:28:46 AM org.apache.catalina.core.StandardEngine start
INFO: Starting Servlet Engine: Apache Tomcat/6.0.29
Aug 2, 2011 4:28:46 AM org.apache.catalina.startup.HostConfig
deployDescriptor
INFO: Deploying configuration descriptor solr1.xml
Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader
locateSolrHome
INFO: Using JNDI solr.home: /home/tomcat/solrCloud1
Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader init
INFO: Solr home set to '/home/tomcat/solrCloud1/'
Aug 2, 2011 4:28:46 AM org.apache.solr.servlet.SolrDispatchFilter init
INFO: SolrDispatchFilter.init()
Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader
locateSolrHome
INFO: Using JNDI solr.home: /home/tomcat/solrCloud1
Aug 2, 2011 4:28:46 AM org.apache.solr.core.CoreContainer$Initializer
initialize
INFO: looking for solr.xml: /home/tomcat/solrCloud1/solr.xml
Aug 2, 2011 4:28:46 AM org.apache.solr.core.CoreContainer init
INFO: New CoreContainer 853527367
Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader
locateSolrHome
INFO: Using JNDI solr.home: /home/tomcat/solrCloud1
Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader init
INFO: Solr home set to '/home/tomcat/solrCloud1/'
Aug 2, 2011 4:28:46 AM org.apache.solr.cloud.SolrZkServerProps
getProperties
INFO: Reading configuration from: /home/tomcat/solrCloud1/zoo.cfg
Aug 2, 2011 4:28:46 AM org.apache.solr.core.CoreContainer initZooKeeper
INFO: Zookeeper client=zook1:2181/sc,zook2:2181/sc,zook3:2181/sc
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:zookeeper.version=3.3.1-942149, built on
05/07/2010 17:14 GMT
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:host.name=ob1079.nydc1.outbrain.com
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:java.version=1.6.0_21
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:java.vendor=Sun Microsystems Inc.
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:java.home=/usr/java/jdk1.6.0_21/jre
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:java.class.path=/opt/tomcat6/bin/bootstrap.jar
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client
environment:java.library.path=/usr/java/jdk1.6.0_21/jre/lib/amd64/server:/
usr/java/jdk1.6.0_21/jre/lib/amd64:/usr/java/jdk1.6.0_21/jre/../lib/amd64:
/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:java.io.tmpdir=/opt/tomcat6/temp
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:java.compiler=NA
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:os.name=Linux
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:os.arch=amd64
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:os.version=2.6.18-194.8.1.el5
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:user.name=tomcat
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:user.home=/home/tomcat
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client

Re: SEVERE: org.apache.solr.common.SolrException: Error loading class 'solr.ICUTokenizerFactory'

2011-08-02 Thread Robert Muir

did you add the analysis-extras jar itself? thats what has this factory.

On Tue, Aug 2, 2011 at 5:03 AM, Satish Talim satish.ta...@gmail.com wrote:
 I am using Solr 3.3 on a Windows box.

 I want to use the solr.ICUTokenizerFactory in my schema.xml and added the
 fieldType name=text_icu as per the URL -
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTokenizerFactory

 I also added the following files to my apache-solr-3.3.0\example\lib folder:
 lucene-icu-3.3.0.jar
 lucene-smartcn-3.3.0.jar
 icu4j-4_8.jar
 lucene-stempel-3.3.0.jar

 When I start my Solr server from apache-solr-3.3.0\example folder:
 java -jar start.jar

 I get the following errors:

 SEVERE: org.apache.solr.common.SolrException: Error loading class
 'solr.ICUTokenizerFactory'

 SEVERE: org.apache.solr.common.SolrException: analyzer without class or
 tokenizer  filter list

 SEVERE: org.apache.solr.common.SolrException: Unknown fieldtype 'text_icu'
 specified on field subject

 I tried adding various other jar files to the lib folder but it does not
 help.

 What am I doing wrong?

 Satish




-- 
lucidimagination.com

indexing taking very long time

2011-08-02 Thread Naveen Gupta

Hi

We have a requirement where we are indexing all the messages of a a thread,
a thread may have attachment too . We are adding to the solr for indexing
and searching for applying few business rule.

For a user, we have almost many threads (100k) in number and each thread may
be having 10-20 messages.

Now what we are finding is that it is taking 30 mins to index the entire
threads.

When we run optimize then it is taking faster time.

The question here is that how frequently this optimize should be called and
when ?

Please note that we are following commit strategy (that is every after 10k
threads, commit is called). we are not calling commit after every doc.

Secondly how can we use multi threading from solr perspective in order to
improve jvm and other utilization ?


Thanks
Naveen

DIH + signature

2011-08-02 Thread jodehaes

Hi,

I'm using solr 3.3 and want to add a signature field to solr to later be
able to deduplicate search results using field collapsing.  I'm using DIH to
fill solr.

Extract from solrconfig.xml

updateRequestProcessorChain name=dedupe
processor
class=solr.update.processor.SignatureUpdateProcessorFactory
  bool name=enabledtrue/bool
  bool name=overwriteDupesfalse/bool
  str name=signatureFieldsignature/str
  str name=fieldsctcontent/str
  str
name=signatureClasssolr.update.processor.Lookup3Signature/str
/processor
processor class=solr.LogUpdateProcessorFactory /
processor class=solr.RunUpdateProcessorFactory /
/updateRequestProcessorChain

requestHandler name=/dataimport
class=org.apache.solr.handler.dataimport.DataImportHandler
  lst name=defaults
  str name=configdata-config.xml/str
  str name=update.processordedupe/str
/lst
/requestHandler

in the schema.xml there is:

field name=signature type=string indexed=true stored=true
multiValued=false /
and
field name=ctcontent type=text_nl_splitting indexed=true
stored=true termVectors=on termPositions=on termOffsets=on/

When I run a full-import however the signature field remains empty.  Any
insight on what I'm doing wrong would be greatly appreciated!

Kind regards,

Jo

--
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-signature-tp3218813p3218813.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr 3.3 crashes after ~18 hours?

2011-08-02 Thread wakemaster 39

Monitor your memory usage.  I use to encounter a problem like this before
where nothing was in the logs and the process was just gone.

Turned out my system was out odd memory and swap got used up because of
another process which then forced the kernel to start killing off processes.
Google OOM linux and you will find plenty of other programs and people with
a similar problem.

Cameron
On Aug 2, 2011 6:02 AM, alexander sulz a.s...@digiconcept.net wrote:
 Hello folks,

 I'm using the latest stable Solr release - 3.3 and I encounter strange
 phenomena with it.
 After about 19 hours it just crashes, but I can't find anything in the
 logs, no exceptions, no warnings,
 no suspicious info entries..

 I have an index-job running from 6am to 8pm every 10 minutes. After each
 job there is a commit.
 An optimize-job is done twice a day at 12:15pm and 9:15pm.

 Does anyone have an idea what could possibly be wrong or where to look
 for further debug info?

 regards and thank you
 alex

Re: Solr 3.3 crashes after ~18 hours?

2011-08-02 Thread François Schiettecatte

Assuming you are running on Linux, you might want to check /var/log/messages 
too (the location might vary), I think the kernel logs forced process 
termination there. I recall that the kernel will usually picks the process 
consuming the most memory, there may be other factors involved too.

François

On Aug 2, 2011, at 9:04 AM, wakemaster 39 wrote:

 Monitor your memory usage.  I use to encounter a problem like this before
 where nothing was in the logs and the process was just gone.
 
 Turned out my system was out odd memory and swap got used up because of
 another process which then forced the kernel to start killing off processes.
 Google OOM linux and you will find plenty of other programs and people with
 a similar problem.
 
 Cameron
 On Aug 2, 2011 6:02 AM, alexander sulz a.s...@digiconcept.net wrote:
 Hello folks,
 
 I'm using the latest stable Solr release - 3.3 and I encounter strange
 phenomena with it.
 After about 19 hours it just crashes, but I can't find anything in the
 logs, no exceptions, no warnings,
 no suspicious info entries..
 
 I have an index-job running from 6am to 8pm every 10 minutes. After each
 job there is a commit.
 An optimize-job is done twice a day at 12:15pm and 9:15pm.
 
 Does anyone have an idea what could possibly be wrong or where to look
 for further debug info?
 
 regards and thank you
 alex

Re: Different options for autocomplete/autosuggestion

2011-08-02 Thread Erick Erickson

You have to tell us more information about what not right means.
Please review:

http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Wed, Jul 27, 2011 at 6:12 AM, scorpking lehoank1...@gmail.com wrote:
 HI Bell,
 i used autocomplete in solr 3.1. same this:

  searchComponent name=autocomplete class=solr.SpellCheckComponent
    lst name=spellchecker
      str name=nameautocomplete/str
      str
 name=classnameorg.apache.solr.spelling.suggest.Suggester/str
      str
 name=lookupImplorg.apache.solr.spelling.suggest.jaspell.JaspellLookup/s
 tr
      str name=fieldautocomplete/str
      str name=buildOnCommittrue/str
 /lst

 and i make following URL*
 http://solr.pl/en/2010/11/15/solr-and-autocomplete-part-2/* to index my
 data. and had a problem. with one word, it have done very good. But when i
 typed more two words, rerults return not right. I don't know why? Can any
 one know this problem? Thanks for your help.

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Different-options-for-autocomplete-autosuggestion-tp2678899p3203032.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr Master-slave master failover without data loss

2011-08-02 Thread Erick Erickson

Not OOB. You say that the index updates, but if the data hasn't been
committed, it isn't really in the index. After the commit (which varies
time-wise depending on merges etc.) the next replication from the slave
should get the new index, regardless of whether the master has gone down
or not.

One way to handle this issue is to re-index data from some time before the
master went down, relying on the uniqueKey to replace any duplicate
documents

Best
Erick

On Wed, Jul 27, 2011 at 10:43 AM, Nagendraprasad
nagu.nutalap...@gmail.com wrote:
 Suppose master goes down immediately after the index updates, while the
 updates haven't been replicated to the slaves, data loss seems to happen.
 Does Solr have any mechanism to deal with that?

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-Master-slave-master-failover-without-data-loss-tp3203644p3203644.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: DIH + signature

2011-08-02 Thread jodehaes

Follow-up on this issue.

I eventually found the problem.

The naming scheme changed from solr 3.2 onwards.

The line as it states in the documentation:
str name=update.processordedupe/str

should now be:
str name=update.chaindedupe/str

https://issues.apache.org/jira/browse/SOLR-2105


--
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-signature-tp3218813p3218979.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: xpath expression not working

Hi abhayd,

XPathEntityProcessor does only support a subset of xpath,
like div[@id=2] but not [id=2]
Take a look to
https://issues.apache.org/jira/browse/SOLR-1437#commentauthor_12756469_verbose

I solve this problem by using xslt a preprocessor (with full xpath).

The drawback is performance wasting: See
http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html

Best regards
  Karsten

 Original-Nachricht 
 Datum: Mon, 1 Aug 2011 23:21:45 -0700 (PDT)
 Von: abhayd ajdabhol...@hotmail.com
 An: solr-user@lucene.apache.org
 Betreff: xpath expression not working

 hi 
 I have a xml doc whichi would like to index using xpath entity processor.
 add
 doc
  id1/id
  detailsxyz/details
 /doc
 doc
  id2/id
  detailsxyz2/details
 /doc
 /add
 
 if i want to just load document with id=2 how would that work? 
 
 I tried xpath expression that works with xpath tools but not in solr. 
 
 dataConfig
 dataSource type=FileDataSource /
 document
 entity name=f processor=FileListEntityProcessor
 baseDir=c:\temp fileName=promotions.xml 
 recursive=false rootEntity=false dataSource=null
 entity name=x processor=XPathEntityProcessor
 forEach=/add/doc url=${f.fileAbsolutePath} pk=id
 field column=id xpath=/add/doc/[id=2]/id/
 /entity
 /entity
 /document
 /dataConfig
 
 Any help how i can do this?
 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/xpath-expression-not-working-tp3218133p3218133.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: Store complete XML record (DIH XPathEntityProcessor)

Hi g, Hi Chantal

I had the same problem.
You can use XPathEntityProcessor but you have to insert an xsl. The drawback is
performance wasting: See
http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html

Best regards
Karsten

Original-Nachricht
Datum: Mon, 1 Aug 2011 12:17:45 +0200
Von: Chantal Ackermann chantal.ackerm...@btelligent.de
An: solr-user@lucene.apache.org solr-user@lucene.apache.org
Betreff: Re: Store complete XML record (DIH XPathEntityProcessor)

Hi g,

ok, I understand your problem, now. (Sorry for answering that late.)

I don't think PlainTextEntityProcessor can help you. It does not take a
regex. LineEntityProcessor does but your record elements probably do not
come on their own line each and you wouldn't want to depend on that,
anyway.

I guess you would be best off writing your own entity processor - maybe
by extending XPath EP if that gives you some advantage. You can of
course also implement your own importer using SolrJ and your favourite
XML parser framework - or any other programming language.

If you are looking for a config-only solution - i'm not sure that there
is one. Someone else might be able to comment on that?

Cheers,
Chantal

On Thu, 2011-07-28 at 19:17 +0200, solruser@9913 wrote:
Thanks Chantal
I am ok with the second call and I already tried using that.
Unfortunatly
It reads the whole file into a field. My file is as below example
xml
record
...
/record

record
...
/record

/xml

Now the XPATH does the 'for each /record' part. For each record I also
need
to store the raw log in there. If I use the PlainTextEntityProcessor
then
it gives me the whole file (from xml .. /xml ) and not each of the
record /record

Am I using the PlainTextEntityProcessor wrong?

THanks
g

--
View this message in context:
http://lucene.472066.n3.nabble.com/Store-complete-XML-record-DIH-XPathEntityProcessor-tp3205524p3207203.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Matching queries on a per-element basis against a multivalued field

2011-08-02 Thread Mike Sokolov


You have a few choices:

1) flatten your field structure - like your undesirable example, but 
wouldn't you want to have the document identifier as a field value also?


2) use phrase queries to make sure the key/value pairs are adjacent

3) use a join query

That's all I can think of

-Mike

On 08/01/2011 08:08 PM, Suk-Hyun Cho wrote:

I'm sure someone asked this before, but I couldn't find a previous post
regarding this.


The problem:


Let's say that I have a multivalued field called myFriends that tokenizes on
whitespaces. Basically, I'm treating it like a List of Lists (attributes of
friends):


Document A:

myFriends = [
 isCool=true SOME_JUNK_HERE gender=male bloodType=A
]

Document B:

myFriends = [
 isCool=true SOME_JUNK_HERE gender=female bloodType=O,
 isCool=false SOME_JUNK_HERE gender=male bloodType=AB
]

Now, let's say that I want to search for all the cool male friends I have.
Naively, I can query q=myFriends:isCool=true+AND+myFriends:gender=male.
However, this returns documents A and B, because the two criteria are tested
against the entire collection, rather than against individual elements.


I could work around this by not tokenizing on whitespaces and using
wildcards:


q=myFriends:isCool=true\ *\ gender=male


but this becomes painful when the query becomes more complex. What if I
wanted to find cool friends who are either type A or type O? I could do
q=myFriends:(isCool=true\ *\ bloodType=A+OR+isCool=true\ *\ bloodType=O).
And you can see that the number of criteria will just explode as queries get
more complex.


There are other methods that I've considered, such as duplicating documents
for every friend, like so:


Document A1:

myFriend = [
 isCool=true,
 gender=male,
 bloodType=A
]

Document B1:

myFriend = [
 isCool=true,
 gender=female,
 bloodType=O
]

Document B2:

myFriend = [
 isCool=false,
 gender=male,
 bloodType=AB
]

but this would be less than desirable.

I would like to hear any other ideas around solving this problem, but going
back to the original question, is there a way to match multiple criteria on
a per-item basis rather than against the entire multifield?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3217432.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: performance crossover between single index and sharding

2011-08-02 Thread Shawn Heisey


On 8/2/2011 4:44 AM, Bernd Fehling wrote:

Is there any knowledge on this list about the performance
crossover between a single index and sharding and
when to change from a single index to sharding?

E.g. if index size is larger than 150GB and num of docs is
more than 25 mio. then it is better to change from single index
to sharding and have two shards.
Or something like this...

Sure, solr might even handle 50 mio. docs but performance is going down
and a sharded system with distributed search will be faster than
a single index, or not?


The answer I've always seen here boils down to it depends on a large 
number of variables unique to every situation.  The nature of your data 
will affect things, like the number of fields, number of unique terms 
per field, etc.  If you have really complicated queries, that will slow 
things down.


Probably the greatest limiting factor is memory.  Having enough free 
memory to fit the entire index into the operating system's disk cache is 
the best thing you can do for performance.  This is memory over and 
above whatever you give to your Java heap.  If you have a 150GB index 
and you can afford machines with at least 192GB of RAM, a single index 
would perform very well, once it is warmed up.  Performance on a cold 
index would not be very good.  In a sharded scenario, you want to try 
and size each machine so that its piece fits into RAM.


Next would be disk I/O.  Any data that won't fit in the disk cache must 
be retrieved from disk, which is typically the weakest link in the 
chain.  If you can put your index on solid state disks, that's almost as 
good as having the index entirely in memory.  Performance on a cold 
index with SSD would be incredible.


Having a lot of high speed CPU available will help, but not as much as 
memory and I/O.


Index rebuild time is another consideration that might lead you to go 
distributed, as long as your data source can keep up with multiple readers.


My own index is too big to fit in RAM, even sharded.  Each of the six 
large shards is getting close to 19GB.  Each machine has 14GB of RAM 
(it's a virtual environment with three large shards per physical host) 
and has 3GB allocated to Java.  I am in the process of upgrading the 
memory, at which point it will fit, but our growth will exceed the 
maximum server memory again in the next year or so.  I have plans to 
eliminate the virtualization and have three shards in cores on each server.


I know this isn't really what you were looking for, but there are no 
simple answers to your question.


Thanks,
Shawn

How to cut off hits with score below threshold?

2011-08-02 Thread Otis Gospodnetic

Hello,

If one wanted to cut off hits whose score is below some threshold (I know, I 
know, one doesn't typically want to do this), what are the most elegant options?
I can think of 2 options, but I wonder if there are better choices:

1) custom Collector (problem: one can't specify a custom Collector via an API, 
so one would have to modify Solr source code)

2) custom SearchComponent that filters hits with score  threshold (problem: if 
hits are removed from results then too few hits will be returned to the client, 
so one has to either request more rows from Solr or re-request more hits or do 
both to avoid this problem)

Is there something better one can do?

Thanks,
Otis

Sematext is hiring Search Engineers -- http://sematext.com/about/jobs.html

Re: Matching queries on a per-element basis against a multivalued field

Hi Suk-Hyun Cho,

if myFriend is the unit of retrieval you should use this as lucene document 
with the fields isCool gender bloodType ...

if you realy want to insert all myFriends in one field like your
myFriends = [
isCool=true SOME_JUNK_HERE gender=female bloodType=O,
isCool=false SOME_JUNK_HERE gender=male bloodType=AB
]
example, you can use SpanQueries

http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/

with SpanNotQuery you can search for all isCool true and gender male where 
no other isCool is between both phrases.

Best regards
  Karsten


P.S. see in context
http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-td3217432.html

RE: Spell Check

2011-08-02 Thread Dyer, James

The most likely problem is forgetting to specify spellcheck.build=true on the 
first query since the last restart.  This builds the spell check dictionary 
used by the IndexBasedSpellChecker.  You should put this in a warming query or 
alternatively, specify build-on-commit or build-on-optimize.

It also looks like str name=queryAnalyzerFieldTypetextSpell/str should 
probably be str name=queryAnalyzerFieldTypetextSpellPhrase/str .

Finally, if you've done a build and changing the query Analyzer field type 
doesn't help, then you have to wonder if dizeagar exists somewhere in your 
data.  If the keyword exists in the spelling dictionary, Solr's spellchecker 
will not try to correct it.  See 
https://issues.apache.org/jira/browse/SOLR-2585 for a potential solution to 
this problem.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: tamanjit.bin...@yahoo.co.in [mailto:tamanjit.bin...@yahoo.co.in] 
Sent: Tuesday, August 02, 2011 12:30 AM
To: solr-user@lucene.apache.org
Subject: Spell Check

Hi All,
Facing some issue with Solr spellcheck. I got an index based dictionary
made.

My changes to *solrconfig.xml* are:

 searchComponent name=spellcheck class=solr.SpellCheckComponent

str name=queryAnalyzerFieldTypetextSpell/str

  lst name=spellchecker
str name=classnamesolr.IndexBasedSpellChecker/str
str name=namelocSpell/str
str name=fieldlocSpell/str
str name=buildOnOptimizetrue/str
str name=spellcheckIndexDir./spellchecker_loc_spell/str
   /lst
  /searchComponent

  requestHandler name=/spellCheckCompRH class=solr.SearchHandler
lst name=locSpell
str name=echoParamsexplicit/str
  
str name=spellcheck.dictionarylocSpell/str
  str name=spellcheck.onlyMorePopularfalse/str
  
  str name=spellcheck.extendedResultstrue/str
  
  str name=spellcheck.count5/str
/lst
arr name=last-components
  strspellcheck/str
/arr
  /requestHandler

I got my dictionary made to the folder spellchecker_loc_spell post an
optimize.

Now my changes to schema.xml are as follows:

New *fieldtype
*

fieldType name=textSpellPhrase class=solr.TextField
positionIncrementGap=100 stored=false multiValued=true
analyzer
tokenizer class=solr.KeywordTokenizerFactory/
filter class=solr.LowerCaseFilterFactory/
/analyzer
/fieldType


my *fields*:

field name=id type=integer indexed=true stored=true/   
   field name=locName type=string indexed=true stored=true/   
   field name=ct type=integer indexed=true stored=true/   
   field name=st type=integer indexed=true stored=true/   
   field name=ppd type=string indexed=true stored=true/   
   field name=ecd type=string indexed=true stored=true/   
   field name=city type=text indexed=true stored=true/   
   field name=state type=text indexed=true stored=true/   
  field name=locSpell type=textSpellPhrase indexed=true
stored=false/   


 defaultSearchFieldlocName/defaultSearchField

 

 copyField source=locName dest=locSpell/





Now when I send the following command

http://SolrIP/MagicBricks/Locality/spellCheckCompRH/?q=Dizeagarversion=2.2start=0rows=10indent=onspellcheck=truespellcheck.collate=truespellcheck.extendedResults=truespellcheck.count=3spellcheck.dictionary=locSpell


I get the following result::


−
response
−
lst name=responseHeader
int name=status0/int
int name=QTime1/int
/lst
result name=response numFound=0 start=0/
−
lst name=spellcheck
−
lst name=suggestions
bool name=correctlySpelledtrue/bool
/lst
/lst
/response


Which should not be the case as it is wrongly spelled. Could anyone help me
out as to why am I getting this strange result that it is
correctlySpelled=true when it is not.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Spell-Check-tp3218037p3218037.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to cut off hits with score below threshold?

Hi Otis,

is this the same question as
http://lucene.472066.n3.nabble.com/Filter-by-relevance-td1837486.html
?

If yes, perhaps something like (http://search-lucene.com/m/4AHNF17wIJW1/)
q={!frange l=0.85}query($qq)
qq=the original relevancy query
will help?

(BTW, a also would like to specify a custom Collector via API in Solr, possible 
an issue?)

Best regards
  Karsten


in context:
http://lucene.472066.n3.nabble.com/How-to-cut-off-hits-with-score-below-threshold-td3219064.html

 Original-Nachricht 
 If one wanted to cut off hits whose score is below some threshold (I know,
 I know, one doesn't typically want to do this), what are the most elegant
 options?

Re: How to cut off hits with score below threshold?

Be careful with that approach as it will return score=1.0f for all documents 
(fl=*,score). This, however, doesn't affect the outcome of the frange.

Feels like a bug though

On Tuesday 02 August 2011 16:29:16 karsten-s...@gmx.de wrote:
 Hi Otis,
 
 is this the same question as
 http://lucene.472066.n3.nabble.com/Filter-by-relevance-td1837486.html
 ?
 
 If yes, perhaps something like (http://search-lucene.com/m/4AHNF17wIJW1/)
 q={!frange l=0.85}query($qq)
 qq=the original relevancy query
 will help?
 
 (BTW, a also would like to specify a custom Collector via API in Solr,
 possible an issue?)
 
 Best regards
   Karsten
 
 
 in context:
 http://lucene.472066.n3.nabble.com/How-to-cut-off-hits-with-score-below-thr
 eshold-td3219064.html
 
  Original-Nachricht 
 
  If one wanted to cut off hits whose score is below some threshold (I
  know, I know, one doesn't typically want to do this), what are the most
  elegant options?

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

lucene/solr, raw indexing/searching

2011-08-02 Thread dhastings

Hello,
I am trying to get lucene and solr to agree on a completely Raw indexing
method. I use lucene in my indexers that write to an index on disk, and
solr to search those indexes that i create, as creating the indexes without
solr is much much faster than using the solr server.

are there settings for BOTH solr and lucene to use EXACTLY whats in the
content as opposed to interpreting what it thinks im trying to do? My
content is extremely specific and needs no interpretation or adjustment,
indexing or searching, a text field.

for example:

203.1 seems to be indexed as 2031. searching for 203.1 i can get to work
correctly, but then it wont find whats indexed using 3.1's standard
analyzer.

if i have content that is :
this is rev. 23.302

i need it indexed EXACTLY as it appears,
this is rev. 23.302

I do not want any of solr or lucenes attempts to fix my content or my
queries. rev. needs to stay rev. and not turn into rev, 23.302
needs to stay as such, and NOT turn into 23302. this is for BOTH indexing
and searching.

any hints?

right now for indexing i have:

Set nostopwords = new HashSet(); nostopwords.add(buahahahahahaha);

Analyzer an = new StandardAnalyzer(Version.LUCENE_31, nostopwords);
writer = new IndexWriter(fsDir,an,MaxFieldLength.UNLIMITED);

writer.setUseCompoundFile(false) ;

and for searching i have in my schema :

fieldType name=text class=solr.TextField positionIncrementGap=100
analyzer
tokenizer class=solr.StandardTokenizerFactory/

filter class=solr.LowerCaseFilterFactory/
filter class=solr.RemoveDuplicatesTokenFilterFactory/
/analyzer
/fieldType

Thanks. Very much appreciated.

--
View this message in context:
http://lucene.472066.n3.nabble.com/lucene-solr-raw-indexing-searching-tp3219277p3219277.html
Sent from the Solr - User mailing list archive at Nabble.com.

CoreContainer from CommonsHttpSolrServer

2011-08-02 Thread Matthias

Hi everybody,

I'm using Solr (with multiple cores) in a Webapp and access the differnt
cores using CommonsHttpSolrServer. As I would like to know, which cores are
configured and what there status is I would like to get an instance of
CoreContainer.

The site http://wiki.apache.org/solr/CoreAdmin tells me how to interact with
the CoreAdminHandler via my browser. But I would like to get the information
provided by the STATUS action in my java application. As CoreContainer
provides appropriate methods I need to get access to such an object.

What's the best way to achieve that.
Thanks in advance.

Matthias 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/CoreContainer-from-CommonsHttpSolrServer-tp3219299p3219299.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to cut off hits with score below threshold?

I've created an issue to track this funny behaviour:
https://issues.apache.org/jira/browse/SOLR-2689

On Tuesday 02 August 2011 16:46:18 Markus Jelsma wrote:
 Be careful with that approach as it will return score=1.0f for all
 documents (fl=*,score). This, however, doesn't affect the outcome of the
 frange.
 
 Feels like a bug though
 
 On Tuesday 02 August 2011 16:29:16 karsten-s...@gmx.de wrote:
  Hi Otis,
  
  is this the same question as
  http://lucene.472066.n3.nabble.com/Filter-by-relevance-td1837486.html
  ?
  
  If yes, perhaps something like (http://search-lucene.com/m/4AHNF17wIJW1/)
  q={!frange l=0.85}query($qq)
  qq=the original relevancy query
  will help?
  
  (BTW, a also would like to specify a custom Collector via API in Solr,
  possible an issue?)
  
  Best regards
  
Karsten
  
  in context:
  http://lucene.472066.n3.nabble.com/How-to-cut-off-hits-with-score-below-t
  hr eshold-td3219064.html
  
   Original-Nachricht 
  
   If one wanted to cut off hits whose score is below some threshold (I
   know, I know, one doesn't typically want to do this), what are the most
   elegant options?

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Jetty error message regarding EnvEntry in WebAppContext

2011-08-02 Thread Marian Steinbach

Hi!

I am trying to deploy Solr under Jetty 6.1.22-1ubuntu1 (installed the
jetty and libjetty-extra-java packages). However, it seems as if I can't
get the webapp configuration set right.

With this configuration...

 Configure class=org.mortbay.jetty.webapp.WebAppContext
 ...
 *Call name=addEnvEntry*
   Arg/solr/home/Arg
   Arg type=java.lang.String/opt/exptbx-solr/solr/Arg
   Arg type=java.lang.Booleantrue/Arg
 /Call
 /Configure

... I get the error:

426 [main] WARN org.mortbay.log - Config error at Call
name=addEnvEntryArg/solr/home/ArgArg
type=java.lang.String/opt/exptbx-solr/solr/ArgArg
type=java.lang.Booleantrue/Arg/Call
426 [main] ERROR org.mortbay.log - EXCEPTION
java.lang.IllegalStateException: No Method: Call
name=addEnvEntryArg/solr/home/ArgArg
type=java.lang.String/opt/exptbx-solr/solr/ArgArg
type=java.lang.Booleantrue/Arg/Call on class
org.mortbay.jetty.webapp.WebAppContext



With this configuration instead...

 Configure class=org.mortbay.jetty.webapp.WebAppContext
  ...
 *New class=org.mortbay.jetty.plus.naming.EnvEntry*
Arg/solr/home/Arg
Arg type=java.lang.String/opt/exptbx-solr/solr/Arg
Arg type=java.lang.Booleantrue/Arg
  /New
 /Configure

I get the following error:

438 [main] WARN org.mortbay.log - Config error at New
class=org.mortbay.jetty.plus.naming.EnvEntryArg/solr/home/ArgArg
type=java.lang.String/opt/exptbx-solr/solr/ArgArg
type=java.lang.Booleantrue/Arg/New
438 [main] WARN org.mortbay.log - EXCEPTION
java.lang.ClassNotFoundException: org.mortbay.jetty.plus.naming.EnvEntry


Both examples are derived from http://wiki.apache.org/solr/SolrJetty - the
second one being a user-contributed config. It seems that the second problem
occurs since I'm not using Jetty Plus. Or at least I don't have the library
in the path.

Can anyone tell me how a working configuration for Jetty 6.1.22 would have
to look like?

Thanks!

Marian

Re: Matching queries on a per-element basis against a multivalued field

Suk,

You're hitting on a well known limitation with Lucene, and the solutions
are work-arounds that may be unacceptable depending on the specifics of your
case.

Solr 4.0 (trunk)'s support for Joins is definitely an up and coming option,
as Mike pointed out.

Kersen's suggestion of using an index just for friends is very good,
although depending on the specifics of your actual needs it may not work or
be unscalable.

Mike also pointed out phrase queries, which will work, but remember to add a
proximity, e.g. isCool=true gender=male~50 You'll want to consider the
position increment gap setting in your schema. A limitation here is that
your text analysis options are limited since all the data is in the same
field. You're also limited to simple term search; no range queries.

I took a different approach for an app I built. I indexed into separate
fields (i.e. isCool, gender, bloodType) so that I could analyze each of them
appropriately. But I did have to add a filter that basically collapsed all
position offsets within a value to zero, effectively nullifying my ability
to do a phrase query for a particular value. That was acceptable to me and
it can be ameliorated with shingling. Then at search time I used Span
queries and their unique ability to positionally query over more than one
field. There were some edge conditions that were tricky to debug when I had
a null value, but it was at least fixable with a sentinal value kluge.

~ David Smiley

-
Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
--
View this message in context:
http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3219352.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: changing the root directory where solrCloud stores info inside zookeeper File system

2011-08-02 Thread yatir

Thanks A lot mark,
Since My SolrCloud code was old I tried downloading and building the newest
code from here https://svn.apache.org/repos/asf/lucene/dev/trunk/
I am using tomcat6
I manually created the sc sub-directory in my zooKeeper ensemble file-system
I used this connection String to my ZK ensemble
zook1:2181/sc,zook2:2181/sc,zook3:2181/sc
but I still get the same problem
here is the entire catalina.out log with the exception

Using CATALINA_BASE:   /opt/tomcat6
Using CATALINA_HOME:   /opt/tomcat6
Using CATALINA_TMPDIR: /opt/tomcat6/temp
Using JRE_HOME:/usr/java/default/
Using CLASSPATH:   /opt/tomcat6/bin/bootstrap.jar
Java HotSpot(TM) 64-Bit Server VM warning: Failed to reserve shared memory
(errno = 12).
Aug 2, 2011 4:28:46 AM org.apache.catalina.core.AprLifecycleListener init
INFO: The APR based Apache Tomcat Native library which allows optimal
performance in production environments was not found on the
java.library.path:
/usr/java/jdk1.6.0_21/jre/lib/amd64/server:/usr/java/jdk1.6.0_21/jre/lib/amd64:/usr/java/jdk1.6.0_21/jre/../lib/amd64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
Aug 2, 2011 4:28:46 AM org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-8983 Aug 2, 2011 4:28:46 AM
org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-8080 Aug 2, 2011 4:28:46 AM
org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 448 ms Aug 2, 2011 4:28:46 AM
org.apache.catalina.core.StandardService start
INFO: Starting service Catalina
Aug 2, 2011 4:28:46 AM org.apache.catalina.core.StandardEngine start
INFO: Starting Servlet Engine: Apache Tomcat/6.0.29 Aug 2, 2011 4:28:46 AM
org.apache.catalina.startup.HostConfig deployDescriptor
INFO: Deploying configuration descriptor solr1.xml Aug 2, 2011 4:28:46 AM
org.apache.solr.core.SolrResourceLoader locateSolrHome
INFO: Using JNDI solr.home: /home/tomcat/solrCloud1 Aug 2, 2011 4:28:46 AM
org.apache.solr.core.SolrResourceLoader init
INFO: Solr home set to '/home/tomcat/solrCloud1/'
Aug 2, 2011 4:28:46 AM org.apache.solr.servlet.SolrDispatchFilter init
INFO: SolrDispatchFilter.init()
Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader
locateSolrHome
INFO: Using JNDI solr.home: /home/tomcat/solrCloud1 Aug 2, 2011 4:28:46 AM
org.apache.solr.core.CoreContainer$Initializer initialize
INFO: looking for solr.xml: /home/tomcat/solrCloud1/solr.xml Aug 2, 2011
4:28:46 AM org.apache.solr.core.CoreContainer init
INFO: New CoreContainer 853527367
Aug 2, 2011 4:28:46 AM org.apache.solr.core.SolrResourceLoader
locateSolrHome
INFO: Using JNDI solr.home: /home/tomcat/solrCloud1 Aug 2, 2011 4:28:46 AM
org.apache.solr.core.SolrResourceLoader init
INFO: Solr home set to '/home/tomcat/solrCloud1/'
Aug 2, 2011 4:28:46 AM org.apache.solr.cloud.SolrZkServerProps getProperties
INFO: Reading configuration from: /home/tomcat/solrCloud1/zoo.cfg Aug 2,
2011 4:28:46 AM org.apache.solr.core.CoreContainer initZooKeeper
INFO: Zookeeper client=zook1:2181/sc,zook2:2181/sc,zook3:2181/sc
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:zookeeper.version=3.3.1-942149, built on 05/07/2010
17:14 GMT Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:host.name=ob1079.nydc1.outbrain.com
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:java.version=1.6.0_21 Aug 2, 2011 4:28:46 AM
org.apache.zookeeper.Environment logEnv
INFO: Client environment:java.vendor=Sun Microsystems Inc.
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:java.home=/usr/java/jdk1.6.0_21/jre
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:java.class.path=/opt/tomcat6/bin/bootstrap.jar
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client
environment:java.library.path=/usr/java/jdk1.6.0_21/jre/lib/amd64/server:/usr/java/jdk1.6.0_21/jre/lib/amd64:/usr/java/jdk1.6.0_21/jre/../lib/amd64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:java.io.tmpdir=/opt/tomcat6/temp
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:java.compiler=NA Aug 2, 2011 4:28:46 AM
org.apache.zookeeper.Environment logEnv
INFO: Client environment:os.name=Linux
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:os.arch=amd64
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:os.version=2.6.18-194.8.1.el5
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:user.name=tomcat Aug 2, 2011 4:28:46 AM
org.apache.zookeeper.Environment logEnv
INFO: Client environment:user.home=/home/tomcat
Aug 2, 2011 4:28:46 AM org.apache.zookeeper.Environment logEnv
INFO: Client environment:user.dir=/opt/tomcat6

Re: performance crossover between single index and sharding

That's a fantastic answer, Shawn.

To more directly answer Bernd's question: Bernard, shard your data once
you've done reasonable performance optimizations to your single core index
setup (see Chapter 9 of my book) and the query response time isn't meeting
your requirements in spite of this.  Solr scales pretty darned well
horizontally --  so as you shard your data more and more, the query
responses will get faster.  At some extreme point there will be diminishing
returns and a performance decrease, but I wouldn't worry about that at all
until you've got many terabytes -- I don't know how many but don't worry
about it.

~ David

-
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/performance-crossover-between-single-index-and-sharding-tp3218561p3219397.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: performance crossover between single index and sharding

Actually, i do worry about it. Would be marvelous if someone could provide 
some metrics for an index of many terabytes.

 [..] At some extreme point there will be diminishing
 returns and a performance decrease, but I wouldn't worry about that at all
 until you've got many terabytes -- I don't know how many but don't worry
 about it.
 
 ~ David
 
 -
  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/performance-crossover-between-single-in
 dex-and-sharding-tp3218561p3219397.html Sent from the Solr - User mailing
 list archive at Nabble.com.

RE: lucene/solr, raw indexing/searching

2011-08-02 Thread Craig Stires


dhastings,

my recommendation for the approaches from both sides ...

Lucene:
try on a whitespace analyzer for size

   Analyzer an = new WhitespaceAnalyzer(Version.LUCENE_31);


Solr:
in your /index/solr/conf/schema.xml

   fieldType name=text class=solr.TextField positionIncrementGap=100
 analyzer type=query
   tokenizer class=solr.WhitespaceTokenizerFactory/
...
 /analyzer
   /fieldType


-craig


-Original Message-
From: dhastings [mailto:dhasti...@wshein.com] 
Sent: Tuesday, 2 August 2011 10:14 PM
To: solr-user@lucene.apache.org
Subject: lucene/solr, raw indexing/searching

Hello,
I am trying to get lucene and solr to agree on a completely Raw indexing
method.  I use lucene in my indexers that write to an index on disk, and
solr to search those indexes that i create, as creating the indexes without
solr is much much faster than using the solr server.

are there settings for BOTH solr and lucene to use EXACTLY whats in the
content as opposed to interpreting what it thinks im trying to do?  My
content is extremely specific and needs no interpretation or adjustment,
indexing or searching, a text field.

for example:

203.1 seems to be indexed as 2031.  searching for 203.1 i can get to work
correctly, but then it wont find whats indexed using 3.1's standard
analyzer.

if i have content that is :
this is rev. 23.302

i need it indexed EXACTLY as it appears,
this is rev. 23.302

I do not want any of solr or lucenes attempts to fix my content or my
queries.  rev. needs to stay rev. and not turn into rev, 23.302
needs to stay as such, and NOT turn into 23302.  this is for BOTH indexing
and searching.  

any hints?

right now for indexing i have:

Set nostopwords = new HashSet(); nostopwords.add(buahahahahahaha);

Analyzer an = new StandardAnalyzer(Version.LUCENE_31, nostopwords);
writer  = new IndexWriter(fsDir,an,MaxFieldLength.UNLIMITED);

writer.setUseCompoundFile(false) ;


and for searching i have in my schema :


 fieldType name=text class=solr.TextField positionIncrementGap=100
   analyzer
tokenizer class=solr.StandardTokenizerFactory/
 
filter class=solr.LowerCaseFilterFactory/
filter class=solr.RemoveDuplicatesTokenFilterFactory/
  /analyzer
/fieldType


Thanks.  Very much appreciated.


--
View this message in context:
http://lucene.472066.n3.nabble.com/lucene-solr-raw-indexing-searching-tp3219
277p3219277.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: lucene/solr, raw indexing/searching

2011-08-02 Thread Jonathan Rochkind

In your solr schema.xml, are the fields you are using defined as text
fields with analyzers? It sounds like you want no analysis at all, which
probably means you don't want text fields either, you just want string
fields. That will make it impossible to search for individual tokens
though, searches will match only on complete matches of the value.

I'm not quite sure how to do what you want, it depends on exactly what
you want. What kind of searching do you expect to support? If you still
do want tokenization, you'll still want some analysis... but I'm not
quite sure how that corresponds to what you'd want to do on the lucene
end. What you're trying to do is going to be inevitably confusing, I
think. Which doesn't mean it's not possible. You might find it less
confusing if you were willing to use Solr to index though, rather than
straight lucene -- you could use Solr via the SolrJ java classes, rather
than the HTTP interface.

On 8/2/2011 11:14 AM, dhastings wrote:

for example:

203.1 seems to be indexed as 2031. searching for 203.1 i can get to work
correctly, but then it wont find whats indexed using 3.1's standard
analyzer.

if i have content that is :
this is rev. 23.302

i need it indexed EXACTLY as it appears,
this is rev. 23.302

any hints?

right now for indexing i have:

Set nostopwords = new HashSet(); nostopwords.add(buahahahahahaha);

Analyzer an = new StandardAnalyzer(Version.LUCENE_31, nostopwords);
writer = new IndexWriter(fsDir,an,MaxFieldLength.UNLIMITED);
writer.setUseCompoundFile(false) ;

and for searching i have in my schema :

fieldType name=text class=solr.TextField positionIncrementGap=100
analyzer
tokenizer class=solr.StandardTokenizerFactory/

filter class=solr.LowerCaseFilterFactory/
filter class=solr.RemoveDuplicatesTokenFilterFactory/
/analyzer
/fieldType

Thanks. Very much appreciated.

--
View this message in context:
http://lucene.472066.n3.nabble.com/lucene-solr-raw-indexing-searching-tp3219277p3219277.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Jetty error message regarding EnvEntry in WebAppContext

2011-08-02 Thread Jonathan Rochkind


On 8/2/2011 11:42 AM, Marian Steinbach wrote:

Can anyone tell me how a working configuration for Jetty 6.1.22 would have
to look like?


You know that Solr distro comes with a jetty with a Solr in it, right, 
as an example application? Even if you don't want to use it for some 
reason, that would probably be the best model to look at for a working 
jetty with solr.


Or is the problem that you want a different version of jetty?

As it happens, I just recently set up a jetty 6.1.26 for another 
project, not for solr. It was kind of a pain not being too familiar with 
java deployment or jetty.  But I did get JDNI working, by following the 
jetty instructions here: http://docs.codehaus.org/display/JETTY/JNDI  
(It was a bit confusing to figure out what they were talking about not 
being familiar with jetty, but eventually I got it, and the instructions 
were correct.)


But if I wanted to run Solr in jetty, I'd start with the jetty that is 
distributed with solr, rather than trying to build my own.

Re: Matching queries on a per-element basis against a multivalued field

2011-08-02 Thread Suk-Hyun Cho

I appreciate your replies and ideas.

SpanQuery would work, and I'll look into this further. However, what about
the original question? Is there no way to match documents on a per-element
basis against a multivalued field? If not, would it perhaps make sense to
create a feature request?

Also, regarding the join support you guys have mentioned: is it only on a
field within the same core, or is it across cores (as if cores are tables in
a database)? Joining on cores would eliminate most of the issues I'm having.
The examples I gave are simplified, but actually I have an entity A that has
entity B that has entity C, and I'm flattening out queriable fields of B and
C into the schema for A. This way, I can search for documents for the core A
that match some criteria for A, B, and/or C.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3219565.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Matching queries on a per-element basis against a multivalued field

On Aug 2, 2011, at 1:09 PM, Suk-Hyun Cho [via Lucene] wrote:

I appreciate your replies and ideas.

SpanQuery would work, and I'll look into this further. However, what about
the original question? Is there no way to match documents on a per-element
basis against a multivalued field?

Correct; there is no way. Aside from Solr 4's Join feature, everything else
suggested is a hack / work-around for a fundamental limitation.

If not, would it perhaps make sense to create a feature request?

You could but I wouldn't bother because its unlikely to get any traction as
it's a fundamental issue with Lucene and at the Solr level there is a solution
on the horizon.

Also, regarding the join support you guys have mentioned: is it only on a
field within the same core, or is it across cores (as if cores are tables in
a database)? Joining on cores would eliminate most of the issues I'm having.
The examples I gave are simplified, but actually I have an entity A that has
entity B that has entity C, and I'm flattening out queriable fields of B and
C into the schema for A. This way, I can search for documents for the core A
that match some criteria for A, B, and/or C.

The Join support works across cores. See the wiki and associated JIRA issue
for it.

~ David Smiley

-
Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
--
View this message in context:
http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3219638.html
Sent from the Solr - User mailing list archive at Nabble.com.

how to get row no. of current record

2011-08-02 Thread Ranveer


Hi,

How to know the row number of current record.

i.e : suppose we have 10 million record indexed. Currently I am on 
5th records and id of the this record is XYZ00234, how to know that 
the current record rows no is 5th.


thanks..

regards
Ranveer

RE: performance crossover between single index and sharding

2011-08-02 Thread Burton-West, Tom

Hi Markus,

Just as a data point for a very large sharded index, we have the full text of 
9.3 million books with an index size of about 6+ TB spread over 12 shards on 4 
machines. Each machine has 3 shards. The size of each shard ranges between 
475GB and 550GB.  We are definitely I/O bound. Our machines have 144GB of 
memory with about 16GB dedicated to the tomcat instance running the 3 Solr 
instances, which leaves about 120 GB (or 40GB per shard) for the OS disk cache. 
 We release a new index every morning and then warm the caches with several 
thousand queries.  I probably should add that our disk storage is a very high 
performance Isilon appliance that has over 500 drives and every block of every 
file is striped over no less than 14 different drives. (See blog for details *)

We have a very low number of queries per second (0.3-2 qps) and our modest 
response time goal is to keep 99th percentile response time for our application 
(i.e. Solr + application) under 10 seconds.

Our current performance statistics are:

average response time  300 ms
median response time   113 ms
90th percentile663 ms
95th percentile1,691 ms

We had plans to do some performance testing to determine the optimum shard size 
and optimum number of shards per machine, but that has remained on the back 
burner for a long time as other higher priority items keep pushing it down on 
the todo list.

We would be really interested to hear about the experiences of people who have 
so many shards that the overhead of distributing the queries, and 
consolidating/merging the responses becomes a serious issue.


Tom Burton-West

http://www.hathitrust.org/blogs/large-scale-search

* 
http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-50-volumes-5-million-volumes-and-beyond

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Tuesday, August 02, 2011 12:33 PM
To: solr-user@lucene.apache.org
Subject: Re: performance crossover between single index and sharding

Actually, i do worry about it. Would be marvelous if someone could provide 
some metrics for an index of many terabytes.

 [..] At some extreme point there will be diminishing
 returns and a performance decrease, but I wouldn't worry about that at all
 until you've got many terabytes -- I don't know how many but don't worry
 about it.
 
 ~ David
 
 -
  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/performance-crossover-between-single-in
 dex-and-sharding-tp3218561p3219397.html Sent from the Solr - User mailing
 list archive at Nabble.com.

Re: performance crossover between single index and sharding

2011-08-02 Thread Jonathan Rochkind

What's the reasoning  behind having three shards on one machine, instead 
of just combining those into one shard? Just curious.  I had been 
thinking the point of shards was to get them on different machines, and 
there'd be no reason to have multiple shards on one machine.


On 8/2/2011 1:59 PM, Burton-West, Tom wrote:

Hi Markus,

Just as a data point for a very large sharded index, we have the full text of 
9.3 million books with an index size of about 6+ TB spread over 12 shards on 4 
machines. Each machine has 3 shards. The size of each shard ranges between 
475GB and 550GB.  We are definitely I/O bound. Our machines have 144GB of 
memory with about 16GB dedicated to the tomcat instance running the 3 Solr 
instances, which leaves about 120 GB (or 40GB per shard) for the OS disk cache. 
 We release a new index every morning and then warm the caches with several 
thousand queries.  I probably should add that our disk storage is a very high 
performance Isilon appliance that has over 500 drives and every block of every 
file is striped over no less than 14 different drives. (See blog for details *)

We have a very low number of queries per second (0.3-2 qps) and our modest 
response time goal is to keep 99th percentile response time for our application 
(i.e. Solr + application) under 10 seconds.

Our current performance statistics are:

average response time  300 ms
median response time   113 ms
90th percentile663 ms
95th percentile1,691 ms

We had plans to do some performance testing to determine the optimum shard size 
and optimum number of shards per machine, but that has remained on the back 
burner for a long time as other higher priority items keep pushing it down on 
the todo list.

We would be really interested to hear about the experiences of people who have 
so many shards that the overhead of distributing the queries, and 
consolidating/merging the responses becomes a serious issue.


Tom Burton-West

http://www.hathitrust.org/blogs/large-scale-search

* 
http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-50-volumes-5-million-volumes-and-beyond

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Tuesday, August 02, 2011 12:33 PM
To: solr-user@lucene.apache.org
Subject: Re: performance crossover between single index and sharding

Actually, i do worry about it. Would be marvelous if someone could provide
some metrics for an index of many terabytes.


[..] At some extreme point there will be diminishing
returns and a performance decrease, but I wouldn't worry about that at all
until you've got many terabytes -- I don't know how many but don't worry
about it.

~ David

-
  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
--
View this message in context:
http://lucene.472066.n3.nabble.com/performance-crossover-between-single-in
dex-and-sharding-tp3218561p3219397.html Sent from the Solr - User mailing
list archive at Nabble.com.

Re: performance crossover between single index and sharding

Hi Tom,

Very interesting indeed! But i keep wondering why some engineers choose to 
store multiple shards of the same index on the same machine, there must be 
significant overhead. The only reason i can think of is ease of maintenance in 
moving shards to a separate physical machine.
I know that rearranging the shard topology can be a real pain in a large 
existing cluster (e.g. consistent hashing is not consistent anymore and having 
to shuffle docs to their new shard), is this the reason you choose this 
approach?

Cheers,

 Hi Markus,
 
 Just as a data point for a very large sharded index, we have the full text
 of 9.3 million books with an index size of about 6+ TB spread over 12
 shards on 4 machines. Each machine has 3 shards. The size of each shard
 ranges between 475GB and 550GB.  We are definitely I/O bound. Our machines
 have 144GB of memory with about 16GB dedicated to the tomcat instance
 running the 3 Solr instances, which leaves about 120 GB (or 40GB per
 shard) for the OS disk cache.  We release a new index every morning and
 then warm the caches with several thousand queries.  I probably should add
 that our disk storage is a very high performance Isilon appliance that has
 over 500 drives and every block of every file is striped over no less than
 14 different drives. (See blog for details *)
 
 We have a very low number of queries per second (0.3-2 qps) and our modest
 response time goal is to keep 99th percentile response time for our
 application (i.e. Solr + application) under 10 seconds.
 
 Our current performance statistics are:
 
 average response time  300 ms
 median response time   113 ms
 90th percentile663 ms
 95th percentile1,691 ms
 
 We had plans to do some performance testing to determine the optimum shard
 size and optimum number of shards per machine, but that has remained on
 the back burner for a long time as other higher priority items keep
 pushing it down on the todo list.
 
 We would be really interested to hear about the experiences of people who
 have so many shards that the overhead of distributing the queries, and
 consolidating/merging the responses becomes a serious issue.
 
 
 Tom Burton-West
 
 http://www.hathitrust.org/blogs/large-scale-search
 
 *
 http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-sea
 rch-50-volumes-5-million-volumes-and-beyond
 
 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io]
 Sent: Tuesday, August 02, 2011 12:33 PM
 To: solr-user@lucene.apache.org
 Subject: Re: performance crossover between single index and sharding
 
 Actually, i do worry about it. Would be marvelous if someone could provide
 some metrics for an index of many terabytes.
 
  [..] At some extreme point there will be diminishing
  returns and a performance decrease, but I wouldn't worry about that at
  all until you've got many terabytes -- I don't know how many but don't
  worry about it.
  
  ~ David
  
  -
  
   Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
  
  --
  View this message in context:
  http://lucene.472066.n3.nabble.com/performance-crossover-between-single-i
  n dex-and-sharding-tp3218561p3219397.html Sent from the Solr - User
  mailing list archive at Nabble.com.

RE: performance crossover between single index and sharding

2011-08-02 Thread Burton-West, Tom

Hi Jonothan and Markus,

Why 3 shards on one machine instead of one larger shard per machine?

Good question!

We made this architectural decision several years ago and I'm not remembering 
the rationale at the moment. I believe we originally made the decision due to 
some tests showing a sweetspot for I/O performance for shards with 
500,000-600,000 documents, but those tests were made before we implemented 
CommonGrams and when we were still using attached storage.  I think we also 
might have had concerns about Java OOM errors with a really large shard/index, 
but we now know that we can keep memory usage under control by tweaking the 
amount of the terms index that gets read into memory.

We should probably do some tests and revisit the question.

The reason we don't have 12 shards on 12 machines is that current performance 
is good enough that we can't justify buying 8 more machines:)

Tom

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Tuesday, August 02, 2011 2:12 PM
To: solr-user@lucene.apache.org
Subject: Re: performance crossover between single index and sharding

Hi Tom,

Very interesting indeed! But i keep wondering why some engineers choose to 
store multiple shards of the same index on the same machine, there must be 
significant overhead. The only reason i can think of is ease of maintenance in 
moving shards to a separate physical machine.
I know that rearranging the shard topology can be a real pain in a large 
existing cluster (e.g. consistent hashing is not consistent anymore and having 
to shuffle docs to their new shard), is this the reason you choose this 
approach?

Cheers,
bble.com.

Re: performance crossover between single index and sharding

2011-08-02 Thread Ken Krugler

With low qps and multi-core servers, I believe one reason to have multiple
shards on one server is to provide better parallelism for a request, and thus
reduce your response time.

-- Ken

On Aug 2, 2011, at 11:06am, Jonathan Rochkind wrote:

What's the reasoning behind having three shards on one machine, instead of
just combining those into one shard? Just curious. I had been thinking the
point of shards was to get them on different machines, and there'd be no
reason to have multiple shards on one machine.

On 8/2/2011 1:59 PM, Burton-West, Tom wrote:
Hi Markus,

Just as a data point for a very large sharded index, we have the full text
of 9.3 million books with an index size of about 6+ TB spread over 12 shards
on 4 machines. Each machine has 3 shards. The size of each shard ranges
between 475GB and 550GB. We are definitely I/O bound. Our machines have
144GB of memory with about 16GB dedicated to the tomcat instance running the
3 Solr instances, which leaves about 120 GB (or 40GB per shard) for the OS
disk cache. We release a new index every morning and then warm the caches
with several thousand queries. I probably should add that our disk storage
is a very high performance Isilon appliance that has over 500 drives and
every block of every file is striped over no less than 14 different drives.
(See blog for details *)

We have a very low number of queries per second (0.3-2 qps) and our modest
response time goal is to keep 99th percentile response time for our
application (i.e. Solr + application) under 10 seconds.

Our current performance statistics are:

average response time 300 ms
median response time 113 ms
90th percentile663 ms
95th percentile1,691 ms

We had plans to do some performance testing to determine the optimum shard
size and optimum number of shards per machine, but that has remained on the
back burner for a long time as other higher priority items keep pushing it
down on the todo list.

We would be really interested to hear about the experiences of people who
have so many shards that the overhead of distributing the queries, and
consolidating/merging the responses becomes a serious issue.

Tom Burton-West

http://www.hathitrust.org/blogs/large-scale-search

*
http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-50-volumes-5-million-volumes-and-beyond

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Tuesday, August 02, 2011 12:33 PM
To: solr-user@lucene.apache.org
Subject: Re: performance crossover between single index and sharding

Actually, i do worry about it. Would be marvelous if someone could provide
some metrics for an index of many terabytes.

[..] At some extreme point there will be diminishing
returns and a performance decrease, but I wouldn't worry about that at all
until you've got many terabytes -- I don't know how many but don't worry
about it.

~ David

-
Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
--
View this message in context:
http://lucene.472066.n3.nabble.com/performance-crossover-between-single-in
dex-and-sharding-tp3218561p3219397.html Sent from the Solr - User mailing
list archive at Nabble.com.

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions

Re: performance crossover between single index and sharding

2011-08-02 Thread Shawn Heisey


On 8/2/2011 12:06 PM, Jonathan Rochkind wrote:
What's the reasoning  behind having three shards on one machine, 
instead of just combining those into one shard? Just curious.  I had 
been thinking the point of shards was to get them on different 
machines, and there'd be no reason to have multiple shards on one 
machine.


I'd be interested in hearing Tom's answer as well, but my answer boils 
down to the time it takes to do a full index rebuild and worries about 
performance.


Because I'm in a virtualized environment, I effectively have three large 
shards on each machine even though they are logically separate.  When I 
first got involved, we had a distributed EasyAsk index on 20 separate 
low-end physical servers.  That evolved into basically the same solution 
with a smaller number of virtual machines, on a pair of very powerful 
physical hosts.  On this system, doing a full rebuild took nearly two 
days and wasn't an atomic operation.  The EasyAsk system (also based on 
Lucene) was unable to deal with more than about 4 million documents per 
machine (real or virtual).  The only way to get acceptable performance 
was distributed search.  The cost of providing redundancy was too high, 
so we didn't have any.


When we first started implementing Solr, we assumed from our previous 
experience that we'd need distributed search, especially if query volume 
were to go up.  For that reason, we continued our virtualization model, 
but with only seven shards - six large static shards and a smaller 
incremental shard to hold data less than a week old.  This is where we 
are now, and performance is MUCH better than the old solution.  The low 
shard count made redundancy affordable, so we now have that too.


At the time Solr was first implemented, we could rebuild the entire 
index in about two hours and swap it into place all at once.  Our index 
has grown enough since then that it takes a little less than three 
hours, which is still pretty quick for 60 million documents.


I did try some early tests with a single large index.  Performance was 
pretty decent once it got warmed up, but I was worried about how it 
would perform under a heavy load, and how it would cope with frequent 
updates.  I never really got very far with testing those fears, because 
the full rebuild time was unacceptable - at least 8 hours.  The source 
database can keep up with six DIH instances reindexing at once, which 
completes much quicker than a single machine grabbing the entire 
database.  I may increase the number of shards after I remove 
virtualization, but I'll need to fix a few limitations in my build system.


Thanks,
Shawn

Re: Query on multi valued field

2011-08-02 Thread Chris Hostetter


: The query is get only those documents which have multiple elements for
: that multivalued field.
: 
: I.e, doc 2 and 3  should be returned from the above set..

The only way to do something like this is to add a field when you index 
your documents that contains the number and then filter on that field 
using a range query.

With an UpdateProcessor (or a ScriptTransformer in DIH) you can automate 
counting how many values there are -- but it has to be indexed to 
search/filter on it.



-Hoss

Re: Why Slop doens't match anything?

2011-08-02 Thread Alexander Ramos Jardim

Hey dude,

Sorry for the long absence. (Need to check my personal email more times o0)

I am not using dismax. I didn't find the solution for the problem. I just
made a full-import and the problem ended. Still odd.

2011/7/27 Gora Mohanty g...@mimirtech.com

 On Wed, Jul 27, 2011 at 8:38 PM, Alexander Ramos Jardim
 alexander.ramos.jar...@gmail.com wrote:
  Hello pals,
 
  Using solr 1.4.0. Trying to understand something. When I run the query
  *fieldA:nokia
  c3*, I get 5 results. All with nokia c3, as expected. But when I run
  fieldA:nokia c3~100, I don get any result!
 
  As far as I understand the ~100 should make my query bring even more
  results as not only documents with nokia c3 in their fieldA will be
 found.
  Something like nokia blue c3 should match too. Right?
 [...]

 That does seem odd. You are not using the dismax query handler by
 any chance, are you? If so, then the query slop needs to be specified
 by adding qs=100 to the query.

 Regards,
 Gora




-- 
Alexander Ramos Jardim

Re: Matching queries on a per-element basis against a multivalued field

2011-08-02 Thread Suk-Hyun Cho

Thanks. I saw the related jira issue but didn't follow closely enough to see
the cross-core join being added later. Any idea/hint on when I can expect
Solr 4 to be released?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Matching-queries-on-a-per-element-basis-against-a-multivalued-field-tp3217432p3220091.html
Sent from the Solr - User mailing list archive at Nabble.com.

TikaEntityProcessor is filling logs

2011-08-02 Thread O. Klein

I want to use TikaEntityProcessor for URLs defined in link from the parent
entity. This field can be empty as well. While the dataimport is working OK,
the logging is filling up with exceptions in case link is null. Is there way
to prevent this?


field column=id xpath=/doc/id /
field column=text xpath=/doc/text /
field column=link xpath=/doc/link /
entity name=tika processor=TikaEntityProcessor url=${crawl.link}
dataSource=bin onError=continue format=text
field column=text / 
/entity

--
View this message in context: 
http://lucene.472066.n3.nabble.com/TikaEntityProcessor-is-filling-logs-tp3220100p3220100.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Matching queries on a per-element basis against a multivalued field