Exception using SolrJ

2011-12-15 Thread Shawn Heisey
I am seeing exceptions from some code I have written using SolrJ. I have
placed it in a pastebin:



http://pastebin.com/XnB83Jay


I am creating a MultiThreadedHttpConnectionManager object, which I use
to create an HttpClient, and that is used by all my CommonsHttpSolrServer
objects, of which there are 56 total. That's two index chains, seven
shards per chain, and two cores per shard (live and build). This accounts
for half of the objects; for each of those, there is a matching
machine-level server object for CoreAdmin. In most circumstances only 14
of those are in active use, but when a full index rebuild is needed,
almost all of them are used. I have some plans that will reduce the
number of machine-level server objects from 28 to 4, which is the actual
number of machines I'm running.



Static._mgr = new MultiThreadedHttpConnectionManager();
_mgrParams = Static._mgr.getParams();
_mgrParams.setTcpNoDelay(true);
_mgrParams.setConnectionTimeout(30000); // the 30-second connection timeout, in milliseconds
_mgrParams.setStaleCheckingEnabled(true);
Static._mgr.setParams(_mgrParams);
_mgrParams = null;
Static._client = new HttpClient(Static._mgr);


Here's the code that creates the server objects.  The setMaxRetries call is a
recent change; the problem was happening before I added it, though it
does seem to happen less often now:



  _solrServer = new CommonsHttpSolrServer(serverBaseUrl, Static._client);
  _solrCore = new CommonsHttpSolrServer(coreBaseUrl, Static._client);
  _solrServer.setMaxRetries(1);
  _solrCore.setMaxRetries(1);

The exception linked above will typically show up within a second or two
of the start of an update cycle, well before the 30-second connection
timeout I've specified in the parameters. What it's doing when this
happens is part of a delete process: specifically, it is executing a
query to count the number of matching items. If any are found, it
will follow this with the actual deleteByQuery.
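
For reference, the count-then-delete step described above looks roughly like
the following minimal SolrJ 3.x sketch; the method wrapper and the query
string are illustrative, not the actual code from the pastebin:

import java.io.IOException;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// Hedged sketch of the count-then-delete step (SolrJ 3.x API). The delete
// query string is illustrative; the real code builds it at runtime.
public class DeleteStep {
  static void deleteIfPresent(CommonsHttpSolrServer core, String deleteQuery)
      throws SolrServerException, IOException {
    SolrQuery countQuery = new SolrQuery(deleteQuery);
    countQuery.setRows(0); // only numFound is needed, no stored fields
    QueryResponse rsp = core.query(countQuery);
    if (rsp.getResults().getNumFound() > 0) {
      core.deleteByQuery(deleteQuery);
      // commit here, or defer the commit to the end of the update cycle
    }
  }
}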



The Solr servers are running a slightly modified 3.5.0, with patches
from SOLR-2906 and SOLR-1972 applied. I am not actually using the LFU
cache implemented in SOLR-2906. The same problem happened when I was
using version 3.4.0 with only SOLR-1972 applied.  The SolrJ jar comes
from the same build as the custom Solr I'm running.



It looks like something is resetting the TCP connection, but I can't
tell what, or where the problem is. Solr works fine as far as I can
tell. Can anyone help? Have I done something wrong in creating my
HttpClient or my server objects?



Thanks,

Shawn




RE: Highlighter highlighting terms which are not part of the search

2011-12-15 Thread Shyam Bhaskaran
Hi Erick,

I tried looking into our analyzers, adding each of the filters we were using
one by one and re-indexing the documents. During this testing I found that
when "solr.SynonymFilterFactory" is used on top of the latest Solr 4.0 trunk
code, there is an issue with highlighting: some unwanted terms which are not
part of the search are getting highlighted. This issue only appeared after
moving to the latest Solr 4.0 trunk; earlier, search and highlighting were
working fine. It looks like an issue with SynonymFilterFactory.

-Shyam


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, December 13, 2011 7:51 PM
To: solr-user@lucene.apache.org
Subject: Re: Highlighter highlighting terms which are not part of the search

Well, we need some more details to even guess.
Please review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick


On Mon, Dec 12, 2011 at 12:04 AM, Shyam Bhaskaran
 wrote:
> Hi
>
> We recently upgraded our Solr to the latest 4.0 trunk and we are seeing
> weird behavior with highlighting which was not seen earlier.
>
> When a search query, for example "generate test pattern", is passed, the first
> few results in the result set obtained show the highlighting properly, but in
> the later results we see terms which were not part of the search, like
> "Question", "Answer", "used", etc., being highlighted. We are using the regular
> highlighter and the termVectorHighlighter and never faced this kind of scenario
> before; edismax is used in our configuration.
>
> Can someone point to what is causing this problem and where I need to look 
> into for fixing this?
>
> -Shyam


Re: Solr sentiment analysis

2011-12-15 Thread maha
I am interested in working on sentiment analysis. Please help me.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-sentiment-analysis-tp3151415p3590952.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Solr sentiment analysis

2011-12-15 Thread Husain, Yavar
This is a generic machine learning question and is not related to Solr (which
this list is for). You could ask it on stackoverflow.com.
However, as one approach: go through the chapter on Non-Negative Matrix
Factorization in the O'Reilly book Programming Collective Intelligence. That
might help you out; it's simple and concise.

-Original Message-
From: maha [mailto:mahab...@gmail.com] 
Sent: Friday, December 16, 2011 12:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr sentiment analysis

Hi, I am doing research in sentiment analysis. Please give your valuable
suggestions on how to start my research.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-sentiment-analysis-tp3151415p3590933.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr sentiment analysis

2011-12-15 Thread maha
Hi, I am doing research in sentiment analysis. Please give your valuable
suggestions on how to start my research.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-sentiment-analysis-tp3151415p3590933.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Trim and copy a solr field

2011-12-15 Thread Swapna Vuppala
Hi Juan,

I think an UpdateProcessor is what I need. Can you please tell me more
about it and how it works?

Thanks and Regards,
Swapna.

-Original Message-
From: Juan Grande [mailto:juan.gra...@gmail.com] 
Sent: Thursday, December 15, 2011 11:43 PM
To: solr-user@lucene.apache.org
Subject: Re: Trim and copy a solr field

Hi Swapna,

Do you want to modify the *indexed* value or the *stored* value? The
analyzers modify the indexed value. To modify the stored value, the only
option that I'm aware of is to write an UpdateProcessor that changes the
document before it's indexed.

*Juan*
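
For reference, a minimal sketch of such an UpdateProcessor under the Solr 3.x
plugin API is below. The field names file_path and file_location are
illustrative, and the factory still has to be registered in an
updateRequestProcessorChain in solrconfig.xml and that chain attached to the
update handler:

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// Hedged sketch: copy everything before the last '/' of "file_path" into
// "file_location" before the document is indexed. Field names are illustrative.
public class TrimPathUpdateProcessorFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object path = doc.getFieldValue("file_path");
        if (path != null) {
          String s = path.toString();
          int slash = s.lastIndexOf('/');
          if (slash > 0) {
            doc.setField("file_location", s.substring(0, slash));
          }
        }
        super.processAdd(cmd); // hand the document on to the rest of the chain
      }
    };
  }
}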



On Tue, Dec 13, 2011 at 2:05 AM, Swapna Vuppala wrote:

> Hi Juan,
>
> Thanks for the reply. I tried using this, but I don't see any effect of
> the analyzer/filter.
>
> I tried copying my Solr field to another field of the type defined below.
> Then I indexed a couple of documents with the new schema, but I see that both
> fields have got the same value.
> I am looking at the indexed data in Luke.
>
> I am assuming that analyzers process the field value (as specified by the
> various filters etc.) and then store the modified value. Is that true? What
> else could I be missing here?
>
> Thanks and Regards,
> Swapna.
>
> -Original Message-
> From: Juan Grande [mailto:juan.gra...@gmail.com]
> Sent: Monday, December 12, 2011 11:50 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Trim and copy a solr field
>
> Hi Swapna,
>
> You could try using a copyField to a field that uses
> PatternReplaceFilterFactory:
>
>
>  
>
> replacement="$1"/>
>  
>
>
> The regular expression may not be exactly what you want, but it will give
> you an idea of how to do it. I'm pretty sure there must be some other ways
> of doing this, but this is the first that comes to my mind.
>
> *Juan*
>
>
>
> On Mon, Dec 12, 2011 at 4:46 AM, Swapna Vuppala  >wrote:
>
> > Hi,
> >
> > I have a Solr field that contains the absolute path of the file that is
> > indexed, which will be something like
> >
> file:/myserver/Folder1/SubFol1/Sub-Fol2/Test.msg.
> >
> > Am interested in indexing the location in a separate field.  I was
> looking
> > for some way to trim the field value from last occurrence of char "/", so
> > that I can get the location value, something like
> >
> file:/myserver/Folder1/SubFol1/Sub-Fol2,
> > and store it in a new field. Can you please suggest some way to achieve
> > this ?
> >
> > Thanks and Regards,
> > Swapna.
> > 
> > Electronic mail messages entering and leaving Arup  business
> > systems are scanned for acceptability of content and viruses
> >
>


Re: Solr Version Upgrade issue

2011-12-15 Thread Pawan Darira
Thanks. I re-started from scratch and at least things have started working
now. I upgraded by deploying the 3.2 war in my JBoss, and also made the
configuration changes mentioned in CHANGES.txt.

It did expect a separate directory, which was not required in 1.4.

The new problem is that it's taking very long to build the indexes, more
than an hour; it took only 10 minutes in 1.4. Can you please advise on this?

Should I attach my solrconfig.xml for reference?

On Wed, Dec 7, 2011 at 8:22 PM, Erick Erickson wrote:

> How did you upgrade? What steps did you follow? Do you have
> any custom code? Any additional <lib> entries in your
> solrconfig.xml?
>
> These details help us diagnose your problem, but it's almost certain
> that you have a mixture of jar files lying around your machine in
> a place you don't expect.
>
> Best
> Erick
>
> On Wed, Dec 7, 2011 at 1:28 AM, Pawan Darira 
> wrote:
> > I checked that. there are only latest jars. I am not able to figure out
> the
> > issue.
> >
> > On Tue, Dec 6, 2011 at 6:57 PM, Mark Miller 
> wrote:
> >
> >> Looks like you must have a mix of old and new jars.
> >>
> >> On Tuesday, December 6, 2011, Pawan Darira 
> wrote:
> >> > Hi
> >> >
> >> > I am trying to upgrade my SOLR version from 1.4 to 3.2. but it's
> giving
> >> me
> >> > below exception. I have checked solr home path & it is correct..
> Please
> >> help
> >> >
> >> > SEVERE: Could not start Solr. Check solr/home property
> >> > java.lang.NoSuchMethodError: org.apache.solr.common.SolrException.logOnce(Lorg/slf4j/Logger;Ljava/lang/String;Ljava/lang/Throwable;)V
> >> >   at org.apache.solr.core.CoreContainer.load(CoreContainer.java:321)
> >> >   at org.apache.solr.core.CoreContainer.load(CoreContainer.java:207)
> >> >   at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:130)
> >> >   at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:94)
> >> >   at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
> >> >   at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
> >> >   at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:108)
> >> >   at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3720)
> >> >   at org.apache.catalina.core.StandardContext.start(StandardContext.java:4358)
> >> >   at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:752)
> >> >   at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:732)
> >> >   at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:553)
> >> >   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> >   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >> >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> >   at java.lang.reflect.Method.invoke(Method.java:585)
> >> >   at org.apache.tomcat.util.modeler.BaseModelMBean.invoke(BaseModelMBean.java:297)
> >> >   at org.jboss.mx.server.RawDynamicInvoker.invoke(RawDynamicInvoker.java:164)
> >> >   at org.jboss.mx.server.MBeanServerImpl.invoke(MBeanServerImpl.java:659)
> >> >   at org.apache.catalina.core.StandardContext.init(StandardContext.java:5300)
> >> >   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> >   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >> >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> >   at java.lang.reflect.Method.invoke(Method.java:585)
> >> >   at org.apache.tomcat.util.modeler.BaseModelMBean.invoke(BaseModelMBean.java:297)
> >> >   at org.jboss.mx.server.RawDynamicInvoker.invoke(RawDynamicInvoker.java:164)
> >> >   at org.jboss.mx.server.MBeanServerImpl.invoke(MBeanServerImpl.java:659)
> >> >   at org.jboss.web.tomcat.service.TomcatDeployer.performDeployInternal(TomcatDeployer.java:301)
> >> >   at org.jboss.web.tomcat.service.TomcatDeployer.performDeploy(TomcatDeployer.java:104)
> >> >   at org.jboss.web.AbstractWebDeployer.start(AbstractWebDeployer.java:375)
> >> >   at org.jboss.web.WebModule.startModule(WebModule.java:83)
> >> >
> >>
> >> --
> >> - Mark
> >>
> >> http://www.lucidimagination.com
> >>
>



-- 
Thanks,
Pawan


Re: cache monitoring tools?

2011-12-15 Thread Justin Caratzas
Dmitry,

That's beyond the scope of this thread, but Munin essentially runs
"plugins", which are just scripts that output graph configuration and
values when polled by the Munin server.  It uses a plain-text protocol,
so the plugins can be written in any language.  Munin then feeds this
info into RRDtool, which draws the graphs.  There are some examples[1]
of Solr plugins that people have used to scrape the stats.jsp page.

Justin

1. http://exchange.munin-monitoring.org/plugins/search?keyword=solr
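
Dmitry's JMX route works the same way from a plugin's point of view: poll a
few MBean attributes and print them. A minimal standalone sketch follows; the
host, port, MBean name pattern and attribute names are assumptions to verify
with JConsole against your own Solr, started with remote JMX enabled and
<jmx/> in solrconfig.xml:

import java.util.Set;

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hedged sketch of a stats poller that a Munin- or Zabbix-style plugin could
// wrap. The connection URL, MBean pattern and attribute names are assumptions.
public class SolrCacheStatsPoller {
  public static void main(String[] args) throws Exception {
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url);
    try {
      MBeanServerConnection conn = connector.getMBeanServerConnection();
      // Solr registers its SolrInfoMBeans under a "solr..." domain when <jmx/> is enabled.
      Set<ObjectName> names =
          conn.queryNames(new ObjectName("solr*:type=filterCache,*"), null);
      for (ObjectName name : names) {
        System.out.println(name
            + " hitratio=" + conn.getAttribute(name, "hitratio")
            + " size=" + conn.getAttribute(name, "size"));
      }
    } finally {
      connector.close();
    }
  }
}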

Dmitry Kan  writes:

> Thanks, Justin. With zabbix I can gather jmx exposed stats from SOLR, how
> about munin, what protocol / way it uses to accumulate stats? It wasn't
> obvious from their online documentation...
>
> On Mon, Dec 12, 2011 at 4:56 PM, Justin Caratzas
> wrote:
>
>> Dmitry,
>>
>> The only added stress that munin puts on each box is the 1 request per
>> stat per 5 minutes to our admin stats handler.  Given that we get 25
>> requests per second, this doesn't make much of a difference.  We don't
>> have a sharded index (yet), as our index is only 2-3 GB, but we do have
>> slave servers with replicated
>> indexes that handle the queries, while our master handles
>> updates/commits.
>>
>> Justin
>>
>> Dmitry Kan  writes:
>>
>> > Justin, in terms of the overhead, have you noticed if Munin puts much of
>> it
>> > when used in production? In terms of the solr farm: how big is a shard's
>> > index (given you have sharded architecture).
>> >
>> > Dmitry
>> >
>> > On Sun, Dec 11, 2011 at 6:39 PM, Justin Caratzas
>> > wrote:
>> >
>> >> At my work, we use Munin and Nagios for monitoring and alerts.  Munin is
>> >> great because writing a plugin for it so simple, and with Solr's
>> >> statistics handler, we can track almost any solr stat we want.  It also
>> >> comes with included plugins for load, file system stats, processes,
>> >> etc.
>> >>
>> >> http://munin-monitoring.org/
>> >>
>> >> Justin
>> >>
>> >> Paul Libbrecht  writes:
>> >>
>> >> > Allow me to chim in and ask a generic question about monitoring tools
>> >> > for people close to developers: are any of the tools mentioned in this
>> >> > thread actually able to show graphs of loads, e.g. cache counts or CPU
>> >> > load, in parallel to a console log or to an http request log??
>> >> >
>> >> > I am working on such a tool currently but I have a bad feeling of
>> >> reinventing the wheel.
>> >> >
>> >> > thanks in advance
>> >> >
>> >> > Paul
>> >> >
>> >> >
>> >> >
>> >> > Le 8 déc. 2011 à 08:53, Dmitry Kan a écrit :
>> >> >
>> >> >> Otis, Tomás: thanks for the great links!
>> >> >>
>> >> >> 2011/12/7 Tomás Fernández Löbbe 
>> >> >>
>> >> >>> Hi Dimitry, I pointed to the wiki page to enable JMX, then you can
>> use
>> >> any
>> >> >>> tool that visualizes JMX stuff like Zabbix. See
>> >> >>>
>> >> >>>
>> >>
>> http://www.lucidimagination.com/blog/2011/10/02/monitoring-apache-solr-and-lucidworks-with-zabbix/
>> >> >>>
>> >> >>> On Wed, Dec 7, 2011 at 11:49 AM, Dmitry Kan 
>> >> wrote:
>> >> >>>
>> >>  The culprit seems to be the merger (frontend) SOLR. Talking to one
>> >> shard
>> >>  directly takes substantially less time (1-2 sec).
>> >> 
>> >>  On Wed, Dec 7, 2011 at 4:10 PM, Dmitry Kan 
>> >> wrote:
>> >> 
>> >> > Tomás: thanks. The page you gave didn't mention cache
>> specifically,
>> >> is
>> >> > there more documentation on this specifically? I have used
>> solrmeter
>> >>  tool,
>> >> > it draws the cache diagrams, is there a similar tool, but which
>> would
>> >> >>> use
>> >> > jmx directly and present the cache usage in runtime?
>> >> >
>> >> > pravesh:
>> >> > I have increased the size of filterCache, but the search hasn't
>> >> become
>> >>  any
>> >> > faster, taking almost 9 sec on avg :(
>> >> >
>> >> > name: search
>> >> > class: org.apache.solr.handler.component.SearchHandler
>> >> > version: $Revision: 1052938 $
>> >> > description: Search using components:
>> >> >
>> >> 
>> >> >>>
>> >>
>> org.apache.solr.handler.component.QueryComponent,org.apache.solr.handler.component.FacetComponent,org.apache.solr.handler.component.MoreLikeThisComponent,org.apache.solr.handler.component.HighlightComponent,org.apache.solr.handler.component.StatsComponent,org.apache.solr.handler.component.DebugComponent,
>> >> >
>> >> > stats: handlerStart : 1323255147351
>> >> > requests : 100
>> >> > errors : 3
>> >> > timeouts : 0
>> >> > totalTime : 885438
>> >> > avgTimePerRequest : 8854.38
>> >> > avgRequestsPerSecond : 0.008789442
>> >> >
>> >> > the stats (copying fieldValueCache as well here, to show term
>> >>  statistics):
>> >> >
>> >> > name: fieldValueCache
>> >> > class: org.apache.solr.search.FastLRUCache
>> >> > version: 1.0
>> >> > description: Concurrent LRU Cache(maxSize=1, initialSize=10,
>> >> > minSize=9000, acceptableSize=9500, cleanupThread=false)
>> >> > stat

Replication file become very very big

2011-12-15 Thread ZiLi
Hi all,
I've run into a very strange problem. We use a Windows server as the master,
serving 5 Windows slaves and 3 Linux slaves. It has worked normally for 2
months, but today we found that one Linux slave's index has become very, very
big (150 GB; the others are about 300 MB). We also can't find the usual index
folder under the data folder; there are just four entries:
index.20111203090855 (150 GB), index.properties, replication.properties, and
spellchecker. By the way, although this directory is 150 GB, the service is
normal and queries are very fast.
Our Linux slaves poll the index from the master every 40 minutes, and every 15
minutes our program updates the master's Solr index. We have disabled
autoCommit in solrconfig.xml. Could this have caused the problem via some very
large transaction?
Any suggestions would be appreciated.



Re: Core overhead

2011-12-15 Thread Ted Dunning
Here is a talk I did on this topic at HPTS a few years ago.

On Thu, Dec 15, 2011 at 4:28 PM, Robert Petersen  wrote:

> I see there is a lot of discussions about "micro-sharding", I'll have to
> read them.  I'm on an older version of solr and just use master index
> replicating out to a farm of slaves.  It always seemed like sharding
> causes a lot of background traffic to me when I read about it, but I
> never tried it out.  Thanks for the heads up on that topic...  :)
>
> -Original Message-
> From: Yury Kats [mailto:yuryk...@yahoo.com]
> Sent: Thursday, December 15, 2011 2:16 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Core overhead
>
> On 12/15/2011 4:46 PM, Robert Petersen wrote:
> > Sure that is possible, but doesn't that defeat the purpose of
> sharding?
> > Why distribute across one machine?  Just keep all in one index in that
> > case is my thought there...
>
> To be able to scale w/o re-indexing. Also often referred to as
> "micro-sharding".
>


RE: Is there an issue with hypens in SpellChecker with StandardTokenizer?

2011-12-15 Thread Steven A Rowe
Brandon,

Looks like SOLR-2509  fixed 
the problem - that's where OffsetAttribute was added (as you noted).

I ran my test method on branches/lucene_solr_3_5/, and I got the same failure 
there as you did, so I can confirm that Solr 3.5 has this bug, but that it will 
be fixed in Solr 3.6.

Steve

> -Original Message-
> From: Brandon Fish [mailto:brandon.j.f...@gmail.com]
> Sent: Thursday, December 15, 2011 6:16 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Is there an issue with hypens in SpellChecker with
> StandardTokenizer?
> 
> Yes the branch_3x works for me as well. The addition of the
> OffsetAttribute
> probably corrected this issue.  I will either switch to
> WhitespaceAnalyzer,
> patch my distribution or wait for 3.6 to resolve this.
> 
> Thanks.
> 
> On Thu, Dec 15, 2011 at 4:17 PM, Brandon Fish
> wrote:
> 
> > Hi Steve,
> >
> > I was using branch 3.5. I will try this on tip of branch_3x too.
> >
> > Thanks.
> >
> >
> > On Thu, Dec 15, 2011 at 4:14 PM, Steven A Rowe  wrote:
> >
> >> Hi Brandon,
> >>
> >> When I add the following to SpellingQueryConverterTest.java on the tip
> of
> >> branch_3x (will be released as Solr 3.6), the test succeeds:
> >>
> >> @Test
> >> public void testStandardAnalyzerWithHyphen() {
> >>   SpellingQueryConverter converter = new SpellingQueryConverter();
> >>  converter.init(new NamedList());
> >>  converter.setAnalyzer(new StandardAnalyzer(Version.LUCENE_35));
> >>  String original = "another-test";
> >>  Collection tokens = converter.convert(original);
> >>   assertTrue("tokens is null and it shouldn't be", tokens != null);
> >>  assertEquals("tokens Size: " + tokens.size() + " is not 2", 2,
> >> tokens.size());
> >>   assertTrue("Token offsets do not match", isOffsetCorrect(original,
> >> tokens));
> >> }
> >>
> >> What version of Solr/Lucene are you using?
> >>
> >> Steve
> >>
> >> > -Original Message-
> >> > From: Brandon Fish [mailto:brandon.j.f...@gmail.com]
> >> > Sent: Thursday, December 15, 2011 3:08 PM
> >> > To: solr-user@lucene.apache.org
> >> > Subject: Is there an issue with hypens in SpellChecker with
> >> > StandardTokenizer?
> >> >
> >> > I am getting an error using the SpellChecker component with the query
> >> > "another-test"
> >> > java.lang.StringIndexOutOfBoundsException: String index out of range:
> -7
> >> >
> >> > This appears to be related to this
> >> > issue which
> >> > has been marked as fixed. My configuration and test case that follows
> >> > appear to reproduce the error I am seeing. Both "another" and "test"
> get
> >> > changed into tokens with start and end offsets of 0 and 12.
> >> >   
> >> > 
> >> >  >> > words="stopwords.txt"/>
> >> > 
> >> >   
> >> >
> >> >  &spellcheck=true&spellcheck.collate=true
> >> >
> >> > Is this an issue with my configuration/test or is there an issue with
> >> the
> >> > SpellingQueryConverter? Is there a recommended work around such as
> the
> >> > WhitespaceTokenizer as mention in the issue comments?
> >> >
> >> > Thank you for your help.
> >> >
> >> > package org.apache.solr.spelling;
> >> > import static org.junit.Assert.assertTrue;
> >> > import java.util.Collection;
> >> > import org.apache.lucene.analysis.Token;
> >> > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> >> > import org.apache.lucene.util.Version;
> >> > import org.apache.solr.common.util.NamedList;
> >> > import org.junit.Test;
> >> > public class SimpleQueryConverterTest {
> >> >  @Test
> >> > public void testSimpleQueryConversion() {
> >> > SpellingQueryConverter converter = new SpellingQueryConverter();
> >> >  converter.init(new NamedList());
> >> > converter.setAnalyzer(new StandardAnalyzer(Version.LUCENE_35));
> >> > String original = "another-test";
> >> >  Collection tokens = converter.convert(original);
> >> > assertTrue("Token offsets do not match",
> >> > isOffsetCorrect(original, tokens));
> >> >  }
> >> > private boolean isOffsetCorrect(String s, Collection tokens) {
> >> > for (Token token : tokens) {
> >> >  int start = token.startOffset();
> >> > int end = token.endOffset();
> >> > if (!s.substring(start, end).equals(token.toString()))
> >> >  return false;
> >> > }
> >> > return true;
> >> > }
> >> > }
> >>
> >
> >


RE: Core overhead

2011-12-15 Thread Robert Petersen
I see there is a lot of discussions about "micro-sharding", I'll have to
read them.  I'm on an older version of solr and just use master index
replicating out to a farm of slaves.  It always seemed like sharding
causes a lot of background traffic to me when I read about it, but I
never tried it out.  Thanks for the heads up on that topic...  :)

-Original Message-
From: Yury Kats [mailto:yuryk...@yahoo.com] 
Sent: Thursday, December 15, 2011 2:16 PM
To: solr-user@lucene.apache.org
Subject: Re: Core overhead

On 12/15/2011 4:46 PM, Robert Petersen wrote:
> Sure that is possible, but doesn't that defeat the purpose of
sharding?
> Why distribute across one machine?  Just keep all in one index in that
> case is my thought there...

To be able to scale w/o re-indexing. Also often referred to as
"micro-sharding".


SearchComponents and ShardResponse

2011-12-15 Thread Ken Krugler
Hi all,

I feel like I must be missing something here...

I'm working on a customized version of the SearchHandler, which supports 
distributed searching in multiple *local* cores.

Assuming you want to support SearchComponents, then my handler needs to 
create/maintain a ResponseBuilder, which is passed to various SearchComponent 
methods.

The ResponseBuilder has a "finished" list of ShardRequest objects, for requests 
that have received responses from shards.

Inside the ShardRequest is a "responses" list of ShardResponse objects, which 
contain things like the SolrResponse.

The SolrResponse field in ShardResponse is private, and the method to set it is 
package private.

So it doesn't appear like there's any easy way to create the ShardResponse 
objects that the SearchComponents expect to receive inside of the 
ResponseBuilder.

If I put my custom SearchHandler class into the same package as the 
ShardResponse class, then I can call setSolrResponse().

It builds, and I can run locally. But if I deploy a jar with this code, then at 
runtime I get an illegal access exception when running under Jetty.

I can make it work by re-building the solr.war with my custom SearchHandler, 
but that's pretty painful.

Any other ideas/input?

Thanks,

-- Ken

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
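
One hedged alternative to the same-package trick described above is to call
the package-private setter reflectively from your own package. The sketch
below assumes only what is stated in this message, namely a public
ShardResponse class whose setSolrResponse(SolrResponse) method is package
private:

import java.lang.reflect.Method;

import org.apache.solr.client.solrj.SolrResponse;
import org.apache.solr.handler.component.ShardResponse;

// Hedged sketch: populate a ShardResponse without placing the custom handler
// in org.apache.solr.handler.component, by invoking the package-private
// setter through reflection.
public final class ShardResponseFiller {

  private ShardResponseFiller() {}

  public static ShardResponse wrap(SolrResponse rsp) throws Exception {
    ShardResponse srsp = new ShardResponse();
    Method setter = ShardResponse.class
        .getDeclaredMethod("setSolrResponse", SolrResponse.class);
    setter.setAccessible(true); // lifts the package-private restriction
    setter.invoke(srsp, rsp);
    return srsp;
  }
}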






Re: Is there an issue with hypens in SpellChecker with StandardTokenizer?

2011-12-15 Thread Brandon Fish
Yes the branch_3x works for me as well. The addition of the OffsetAttribute
probably corrected this issue.  I will either switch to WhitespaceAnalyzer,
patch my distribution or wait for 3.6 to resolve this.

Thanks.
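
For reference, the WhitespaceAnalyzer route in the earlier test would look
roughly like the sketch below; it assumes Lucene 3.5's
WhitespaceAnalyzer(Version) constructor and only illustrates that
"another-test" survives as a single token:

import java.util.Collection;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.util.Version;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.spelling.SpellingQueryConverter;

// Hedged sketch: with a whitespace-based analyzer the hyphenated query is not
// split, so the offset mismatch described in this thread never arises.
public class WhitespaceConverterDemo {
  public static void main(String[] args) {
    SpellingQueryConverter converter = new SpellingQueryConverter();
    converter.init(new NamedList());
    converter.setAnalyzer(new WhitespaceAnalyzer(Version.LUCENE_35));
    Collection<Token> tokens = converter.convert("another-test");
    for (Token token : tokens) {
      System.out.println(token + " [" + token.startOffset() + "," + token.endOffset() + "]");
    }
  }
}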

On Thu, Dec 15, 2011 at 4:17 PM, Brandon Fish wrote:

> Hi Steve,
>
> I was using branch 3.5. I will try this on tip of branch_3x too.
>
> Thanks.
>
>
> On Thu, Dec 15, 2011 at 4:14 PM, Steven A Rowe  wrote:
>
>> Hi Brandon,
>>
>> When I add the following to SpellingQueryConverterTest.java on the tip of
>> branch_3x (will be released as Solr 3.6), the test succeeds:
>>
>> @Test
>> public void testStandardAnalyzerWithHyphen() {
>>   SpellingQueryConverter converter = new SpellingQueryConverter();
>>  converter.init(new NamedList());
>>  converter.setAnalyzer(new StandardAnalyzer(Version.LUCENE_35));
>>  String original = "another-test";
>>  Collection tokens = converter.convert(original);
>>   assertTrue("tokens is null and it shouldn't be", tokens != null);
>>  assertEquals("tokens Size: " + tokens.size() + " is not 2", 2,
>> tokens.size());
>>   assertTrue("Token offsets do not match", isOffsetCorrect(original,
>> tokens));
>> }
>>
>> What version of Solr/Lucene are you using?
>>
>> Steve
>>
>> > -Original Message-
>> > From: Brandon Fish [mailto:brandon.j.f...@gmail.com]
>> > Sent: Thursday, December 15, 2011 3:08 PM
>> > To: solr-user@lucene.apache.org
>> > Subject: Is there an issue with hypens in SpellChecker with
>> > StandardTokenizer?
>> >
>> > I am getting an error using the SpellChecker component with the query
>> > "another-test"
>> > java.lang.StringIndexOutOfBoundsException: String index out of range: -7
>> >
>> > This appears to be related to this
>> > issue which
>> > has been marked as fixed. My configuration and test case that follows
>> > appear to reproduce the error I am seeing. Both "another" and "test" get
>> > changed into tokens with start and end offsets of 0 and 12.
>> >   
>> > 
>> > > > words="stopwords.txt"/>
>> > 
>> >   
>> >
>> >  &spellcheck=true&spellcheck.collate=true
>> >
>> > Is this an issue with my configuration/test or is there an issue with
>> the
>> > SpellingQueryConverter? Is there a recommended work around such as the
>> > WhitespaceTokenizer as mention in the issue comments?
>> >
>> > Thank you for your help.
>> >
>> > package org.apache.solr.spelling;
>> > import static org.junit.Assert.assertTrue;
>> > import java.util.Collection;
>> > import org.apache.lucene.analysis.Token;
>> > import org.apache.lucene.analysis.standard.StandardAnalyzer;
>> > import org.apache.lucene.util.Version;
>> > import org.apache.solr.common.util.NamedList;
>> > import org.junit.Test;
>> > public class SimpleQueryConverterTest {
>> >  @Test
>> > public void testSimpleQueryConversion() {
>> > SpellingQueryConverter converter = new SpellingQueryConverter();
>> >  converter.init(new NamedList());
>> > converter.setAnalyzer(new StandardAnalyzer(Version.LUCENE_35));
>> > String original = "another-test";
>> >  Collection tokens = converter.convert(original);
>> > assertTrue("Token offsets do not match",
>> > isOffsetCorrect(original, tokens));
>> >  }
>> > private boolean isOffsetCorrect(String s, Collection tokens) {
>> > for (Token token : tokens) {
>> >  int start = token.startOffset();
>> > int end = token.endOffset();
>> > if (!s.substring(start, end).equals(token.toString()))
>> >  return false;
>> > }
>> > return true;
>> > }
>> > }
>>
>
>


Call RequestHandler from QueryComponent

2011-12-15 Thread Vazquez, Maria (STM)
Hi!

I have a solrconfig.xml like:

 

  all
  0
  10
  ABC
  score desc,rating asc
  CUSTOM FQ
  2.2
  CUSTOM FL


validate
CUSTOM ABC QUERY COMPONENT
stats
debug

  

  

  all
  0
  1
  XYZ
  score desc
  CUSTOM FL
  2.2

  edismax
  1
  CUSTOM QF
  0
  1
  *:*


validate
CUSTOM XYZ QUERY COMPONENT
stats
debug

  

In ABC QUERY COMPONENT, I customize prepare() and process(). In its
process() I want to call the /XYZ request handler and include those
results in the results for ABC. Is that possible?
I know the org.apache.solr.spelling.SpellCheckCollator calls a
QueryComponent and invokes prepare and process on it, but I want to
invoke the request handler directly. It'd be silly to use SolrJ since
both handlers are in the same core.

Any suggestions?

Thanks!
Maria
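
For what it's worth, one way to do this in-process is to look the handler up
on the core and execute it against a local request. A hedged sketch against
the Solr 3.x APIs follows; the handler name /XYZ, the copied parameter, and
the response key are illustrative:

import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.core.SolrCore;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.request.LocalSolrQueryRequest;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrRequestHandler;
import org.apache.solr.response.SolrQueryResponse;

// Hedged sketch: from inside a SearchComponent's process(ResponseBuilder rb),
// run the /XYZ request handler in the same core and attach its results.
public class XyzInvoker {

  public static void addXyzResults(ResponseBuilder rb) throws Exception {
    SolrCore core = rb.req.getCore();
    SolrRequestHandler xyz = core.getRequestHandler("/XYZ");

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("q", rb.req.getParams().get("q")); // reuse the user's query

    SolrQueryRequest xyzReq = new LocalSolrQueryRequest(core, params);
    try {
      SolrQueryResponse xyzRsp = new SolrQueryResponse();
      core.execute(xyz, xyzReq, xyzRsp);
      rb.rsp.add("xyzResults", xyzRsp.getValues().get("response"));
    } finally {
      xyzReq.close(); // releases the searcher held by the local request
    }
  }
}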

 



Re: edismax doesn't obey 'pf' parameter

2011-12-15 Thread entdeveloper
I'm observing strange results with both the correct and incorrect behavior
happening depending on which field I put in the 'pf' param. I wouldn't think
this should be analyzer specific, but is it?

If I try:
http://localhost:8080/solr/collection1/select?qt=%2Fsearch&q=mickey%20mouse&debugQuery=on&defType=edismax&pf=blah_exact&qf=blah

It looks correct:
mickey mouse
mickey mouse
+((DisjunctionMaxQuery((blah:mickey))
DisjunctionMaxQuery((blah:mouse)))~2)
DisjunctionMaxQuery((blah_exact:"mickey mouse"))
+(((blah:mickey) (blah:mouse))~2)
(blah_exact:"mickey mouse")

However, if I put in the field I want, for some reason that phrase portion
of the query just completely drops off:
http://localhost:8080/solr/collection1/select?qt=%2Fsearch&q=mickey%20mouse&debugQuery=on&defType=edismax&pf=name_exact&qf=name

Results:
mickey mouse
mickey mouse
+((DisjunctionMaxQuery((name:mickey))
DisjunctionMaxQuery((name:mouse)))~2) ()
+(((name:mickey) (name:mouse))~2) ()

The name_exact field's analyzer uses KeywordTokenizer, but again, I think
this query is being formed too early in the process for that to matter at
this point.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/edismax-doesn-t-obey-pf-parameter-tp3589763p3590153.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr AutoComplete - Address Search

2011-12-15 Thread Vijay Sampath
Hi, 

  I'm trying to implement autocomplete functionality for address search.
I've used KeywordTokenizerFactory and LowerCaseFilterFactory. The problem is,
when I start typing numbers at the start (e.g. 3500 W South), I don't get any
results from Solr. Could you please offer some guidance on this?




  
   





Thanks,
Vijay 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-AutoComplete-Address-Search-tp3590112p3590112.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Is there an issue with hypens in SpellChecker with StandardTokenizer?

2011-12-15 Thread Brandon Fish
Hi Steve,

I was using branch 3.5. I will try this on tip of branch_3x too.

Thanks.

On Thu, Dec 15, 2011 at 4:14 PM, Steven A Rowe  wrote:

> Hi Brandon,
>
> When I add the following to SpellingQueryConverterTest.java on the tip of
> branch_3x (will be released as Solr 3.6), the test succeeds:
>
> @Test
> public void testStandardAnalyzerWithHyphen() {
>   SpellingQueryConverter converter = new SpellingQueryConverter();
>  converter.init(new NamedList());
>  converter.setAnalyzer(new StandardAnalyzer(Version.LUCENE_35));
>  String original = "another-test";
>  Collection tokens = converter.convert(original);
>   assertTrue("tokens is null and it shouldn't be", tokens != null);
>  assertEquals("tokens Size: " + tokens.size() + " is not 2", 2,
> tokens.size());
>   assertTrue("Token offsets do not match", isOffsetCorrect(original,
> tokens));
> }
>
> What version of Solr/Lucene are you using?
>
> Steve
>
> > -Original Message-
> > From: Brandon Fish [mailto:brandon.j.f...@gmail.com]
> > Sent: Thursday, December 15, 2011 3:08 PM
> > To: solr-user@lucene.apache.org
> > Subject: Is there an issue with hypens in SpellChecker with
> > StandardTokenizer?
> >
> > I am getting an error using the SpellChecker component with the query
> > "another-test"
> > java.lang.StringIndexOutOfBoundsException: String index out of range: -7
> >
> > This appears to be related to this
> > issue which
> > has been marked as fixed. My configuration and test case that follows
> > appear to reproduce the error I am seeing. Both "another" and "test" get
> > changed into tokens with start and end offsets of 0 and 12.
> >   
> > 
> >  > words="stopwords.txt"/>
> > 
> >   
> >
> >  &spellcheck=true&spellcheck.collate=true
> >
> > Is this an issue with my configuration/test or is there an issue with the
> > SpellingQueryConverter? Is there a recommended work around such as the
> > WhitespaceTokenizer as mention in the issue comments?
> >
> > Thank you for your help.
> >
> > package org.apache.solr.spelling;
> > import static org.junit.Assert.assertTrue;
> > import java.util.Collection;
> > import org.apache.lucene.analysis.Token;
> > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > import org.apache.lucene.util.Version;
> > import org.apache.solr.common.util.NamedList;
> > import org.junit.Test;
> > public class SimpleQueryConverterTest {
> >  @Test
> > public void testSimpleQueryConversion() {
> > SpellingQueryConverter converter = new SpellingQueryConverter();
> >  converter.init(new NamedList());
> > converter.setAnalyzer(new StandardAnalyzer(Version.LUCENE_35));
> > String original = "another-test";
> >  Collection tokens = converter.convert(original);
> > assertTrue("Token offsets do not match",
> > isOffsetCorrect(original, tokens));
> >  }
> > private boolean isOffsetCorrect(String s, Collection tokens) {
> > for (Token token : tokens) {
> >  int start = token.startOffset();
> > int end = token.endOffset();
> > if (!s.substring(start, end).equals(token.toString()))
> >  return false;
> > }
> > return true;
> > }
> > }
>


Re: Core overhead

2011-12-15 Thread Yury Kats
On 12/15/2011 4:46 PM, Robert Petersen wrote:
> Sure that is possible, but doesn't that defeat the purpose of sharding?
> Why distribute across one machine?  Just keep all in one index in that
> case is my thought there...

To be able to scale w/o re-indexing. Also often referred to as "micro-sharding".


RE: Is there an issue with hypens in SpellChecker with StandardTokenizer?

2011-12-15 Thread Steven A Rowe
Hi Brandon,

When I add the following to SpellingQueryConverterTest.java on the tip of 
branch_3x (will be released as Solr 3.6), the test succeeds:

@Test
public void testStandardAnalyzerWithHyphen() {
  SpellingQueryConverter converter = new SpellingQueryConverter();
  converter.init(new NamedList());
  converter.setAnalyzer(new StandardAnalyzer(Version.LUCENE_35));
  String original = "another-test";
  Collection<Token> tokens = converter.convert(original);
  assertTrue("tokens is null and it shouldn't be", tokens != null);
  assertEquals("tokens Size: " + tokens.size() + " is not 2", 2, tokens.size());
  assertTrue("Token offsets do not match", isOffsetCorrect(original, tokens));
}

What version of Solr/Lucene are you using?

Steve

> -Original Message-
> From: Brandon Fish [mailto:brandon.j.f...@gmail.com]
> Sent: Thursday, December 15, 2011 3:08 PM
> To: solr-user@lucene.apache.org
> Subject: Is there an issue with hypens in SpellChecker with
> StandardTokenizer?
> 
> I am getting an error using the SpellChecker component with the query
> "another-test"
> java.lang.StringIndexOutOfBoundsException: String index out of range: -7
> 
> This appears to be related to this
> issue which
> has been marked as fixed. My configuration and test case that follows
> appear to reproduce the error I am seeing. Both "another" and "test" get
> changed into tokens with start and end offsets of 0 and 12.
>   
> 
>  words="stopwords.txt"/>
> 
>   
> 
>  &spellcheck=true&spellcheck.collate=true
> 
> Is this an issue with my configuration/test or is there an issue with the
> SpellingQueryConverter? Is there a recommended work around such as the
> WhitespaceTokenizer as mention in the issue comments?
> 
> Thank you for your help.
> 
> package org.apache.solr.spelling;
> import static org.junit.Assert.assertTrue;
> import java.util.Collection;
> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.util.Version;
> import org.apache.solr.common.util.NamedList;
> import org.junit.Test;
> public class SimpleQueryConverterTest {
>  @Test
> public void testSimpleQueryConversion() {
> SpellingQueryConverter converter = new SpellingQueryConverter();
>  converter.init(new NamedList());
> converter.setAnalyzer(new StandardAnalyzer(Version.LUCENE_35));
> String original = "another-test";
>  Collection tokens = converter.convert(original);
> assertTrue("Token offsets do not match",
> isOffsetCorrect(original, tokens));
>  }
> private boolean isOffsetCorrect(String s, Collection tokens) {
> for (Token token : tokens) {
>  int start = token.startOffset();
> int end = token.endOffset();
> if (!s.substring(start, end).equals(token.toString()))
>  return false;
> }
> return true;
> }
> }


Poor performance on distributed search

2011-12-15 Thread ku3ia
Hi, all!

I have a problem with distributed search. I downloaded one shard from my
production. It has:
* ~29M docs
* 11 fields
* ~105M terms
* size of shard is: 13GB
On production there are nearly 30 such shards. I split this shard into 4
smaller shards, so now I have:
small shard1:
docs: 6.2M
terms: 27.2M
size: 2.89GB
small shard2:
docs: 6.3M
terms: 28.7M
size: 2.98GB
small shard3:
docs: 7.9M
terms: 32.8M
size: 3.60GB
small shard4:
docs: 8.2M
terms: 32.6M
size: 3.70GB

My machine confguration:
ABIT AX-78
AMD Athlon 64 X2 5200+
DDR2 Kingston 2x2G+2x1G = 6G
WDC WD2500JS (System here)
WDC WD20EARS (6 partitions = 30 GB for shards at begin of drive, and other
empty, all partitions are well aligned)
GNU/Linux Debian Squeeze
Tomcat 6.0.32 with JAVA_OPTS:
JAVA_OPTS="$JAVA_OPTS -XX:+DisableExplicitGC -server \
-XX:PermSize=512M -XX:MaxPermSize=512M -Xmx4096M -Xms4096M
-XX:NewSize=128M -XX:MaxNewSize=128M \
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled \
-XX:CMSInitiatingOccupancyFraction=50 -XX:GCTimeRatio=9
-XX:MinHeapFreeRatio=25 -XX:MaxHeapFreeRatio=25 \
-verbose:gc -XX:+PrintGCTimeStamps -Xloggc:$CATALINA_HOME/logs/gc.log"
Solr 3.5

I configured 4 cores and started Tomcat. I wrote a bash script that runs for
300 seconds and sends a query every 6 seconds, like
http://127.0.0.1:8080/solr/shard1/select/?ident=true&q=(assistants)&rows=2000&start=0&fl=*,score&qt=requestShards
where qt=requestShards points at my 4 shards. After the test I have these results:

Elapsed time: 299 secs
--- solr ---
Queries processed: 21 << number of complete response files
Queries cancelled: 29 << number of killed curls
Average QTime is: 59645.6 ms
Average RTime is: 59.7619 sec(s) << average time difference between the start
and the end of the curl. Here is the relevant part of the script:
# >>dcs=`date +%s`
# >>curl ${url} -s -H 'Content-type:text/xml; charset=utf-8' > ${F_DATADIR}/$dest.fdata
# >>dce=`date +%s`
# >>dcd=$(echo "$dce - $dcs" | bc)
Size of data-dir is: 3346766 bytes << this is response dir size

I'm using nmon to monitor R/W disk speed, and I was surprised that the read
speed of the WD20EARS drive holding the shard volumes was only about 3 MB/s
while the script was running. After this I ran a benchmark test from the disk
utility. Here are the results:
Minimum read rate: 53.2MB/s
Maximum Read rate: 126.4 MB/s
Average Read rate: 95.8 MB/s

On the other hand, I tested queries like
http://127.0.0.1:8080/solr/shard1/select/?ident=true&q=(assistants)&rows=2000&start=0&fl=*,score
and the results are:
Elapsed time: 299 secs
--- solr ---
Queries processed: 50
Queries cancelled: 0
Average QTime is: 139.76 ms
Average RTime is: 2.2 sec(s)
Size of data-dir is: 6819259 bytes

and queries like
http://127.0.0.1:8080/solr/shard1/select/?ident=true&q=(assistants)&rows=2000&start=0&fl=*,score&shards=127.0.0.1:8080/solr/shard1
and the result is:
Elapsed time: 299 secs
--- solr ---
Queries processed: 49
Queries cancelled: 1
Average QTime is: 1878.37 ms
Average RTime is: 1.95918 sec(s)
Size of data-dir is: 4274099 bytes
So we see the results are the same.

My big question is: why is the drive read speed so slow when Solr is working?
Thanks for any replies

P.S. Maybe my general problem is too many terms per shard; for example, the
query
http://127.0.0.1:8080/solr/shard1/terms?terms.fl=field1
shows:

58641
45022
36339
35637
34247
33869
28961
28147
27654
26940


Thanks.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Poor-performance-on-distributed-search-tp3590028p3590028.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Core overhead

2011-12-15 Thread Robert Petersen
Sure that is possible, but doesn't that defeat the purpose of sharding?
Why distribute across one machine?  Just keep all in one index in that
case is my thought there...

-Original Message-
From: Yury Kats [mailto:yuryk...@yahoo.com] 
Sent: Thursday, December 15, 2011 11:47 AM
To: solr-user@lucene.apache.org
Subject: Re: Core overhead

On 12/15/2011 1:41 PM, Robert Petersen wrote:
> loading.  Try it out, but make sure that the functionality you are
> actually looking for isn't sharding instead of multiple cores...  

Yes, but the way to achieve sharding is to have multiple cores.
The question then becomes -- how many cores (shards)?


Re: Using LocalParams in StatsComponent to create a price slider?

2011-12-15 Thread Chris Hostetter

: I really don't understand what you're asking, could you clarify with
: an example or two?

I *believe* the question is about wanting to exclude the effects of some 
"fq" params from the set of documents used to compute stats -- similar to 
how you can exclude tagged filters when generating facet counts...

https://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters

So i think Mark is asking if it's possible to do something like this...

http://localhost:8983/solr/select?fq={!tag=cat}cat:electronics&q=*:*&stats=true&stats.field={!ex=cat}price

...but at the moment, stats.field parsing doesn't understand local params 
at all (let alone doing filter exclusion)

If you'd like to open a Jira requesting this feature, I suspect adding 
it wouldn't be too complicated (but i'm not really familiar with the code 
so that's just a guess)

: > I'm using the StatsComponent to receive to lower and upper bounds of a
: > price field to create a "price slider".
: > If someone sets the price range to $100-$200 I have to add a filter to
: > the query. But then the lower and upper bound are calculated of the
: > filtered result.

-Hoss


Re: edismax doesn't obey 'pf' parameter

2011-12-15 Thread Chris Hostetter

: If I switch back and forth between defType=dismax and defType=edismax, the
: edismax doesn't seem to obey my pf parameter. I dug through the code a

I just tried a sample query using Solr 3.5 with the example configs+data.

This is the query i tried...

http://localhost:8983/solr/select/?debugQuery=true&defType=edismax&qf=name^5+features^3&pf=features^4&q=this+document

echoParams shows...


  name^5 features^3
  features^4
  true
  this document
  edismax


...it matched the document i expected, and the debug info showed the query 
structure i expected...


 +((DisjunctionMaxQuery((features:this^3.0 | name:this^5.0)) 
DisjunctionMaxQuery((features:document^3.0 | name:document^5.0))
   )~2
  ) 
  DisjunctionMaxQuery((features:"this document"^4.0))


: Is this a known bug or am I missing something in my configuration? My config
: is very simple:
: 
:  
: 
:   explicit
:   edismax
:   name
:   name_exact^2
:   id,name,image_url,url
:   *:*
: 
:   

what did your request look like with that config?  what did the debugQuery 
output look like?


-Hoss


edismax doesn't obey 'pf' parameter

2011-12-15 Thread entdeveloper
If I switch back and forth between defType=dismax and defType=edismax, the
edismax doesn't seem to obey my pf parameter. I dug through the code a
little bit and in the ExtendedDismaxQParserPlugin (Solr 3.4/Solr3.5), the
part that is supposed to add the phrase comes here:

Query phrase = pp.parse(userPhraseQuery.toString());

The code in the parse method tries to create a Query against a null field,
and then the phrase does not get added to the mainQuery.

Is this a known bug or am I missing something in my configuration? My config
is very simple:

 

  explicit
  edismax
  name
  name_exact^2
  id,name,image_url,url
  *:*

  


--
View this message in context: 
http://lucene.472066.n3.nabble.com/edismax-doesn-t-obey-pf-parameter-tp3589763p3589763.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Faceting with null dates

2011-12-15 Thread Chris Hostetter

First of all, we need to clarify some terminology here: there is no such 
thing as a "null date" in solr -- or for that matter, there is no such 
thing as a "null value" in any field.  Documents either have some value(s) 
for a field, or they do not have any values.

If you want to constrain your query to only documents that have a value in 
a field, you can use something like fq=field_name:[* TO *] ... if you want 
to constraint your query to only documents that do *NOT* have a value in a 
field, you can use fq=-field_name:[* TO *]

Now, having said that, like Erick, i'm a little confused by your question 
-- it's not clear if what you really want to do is:

a) change the set of documents returned in the main result list
b) change the set of documents considered when generating facet counts 
(w/o changing the main result list)
c) return an additional count of documents that are in the main result 
list, but are not in the facet counts because they do not have the field 
being faceted on.

My best guess is that you are asking about "c" based on your last 
sentence...

: get is 3 results and 7 non-null validToDate facets. And as I write this, 
: I start to wonder if this is possible at all as the facets are dependent 
: on the result set and that this might be better to handle in the 
: application layer by just extracting 10-7=3...

...subtracting the sum of all constraint counts from your range facet from 
the total number of documents found won't neccessarily tell you the number 
of documents that have no value in the field you are faceting on -- 
because documents may have values out side the range of your start/end.

Depending on what exactly it is you are looking for, you might find the 
"facet.range.other=all" param useful, as it will return things like the  
"between" counts (summing up all the docs between start->end) as well as 
the "before" and "after" counts.

But if you really just want to know "how many docs have no value for my 
validToDate field?" you can get that very explicitly and easily using 
facet.query=-validToDate:[* TO *]
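
In SolrJ terms that last suggestion is just an extra facet query on the
request; a small sketch (server construction omitted, field name taken from
this thread):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.QueryResponse;

// Hedged sketch: count documents that have no value in validToDate alongside
// a normal query, using the facet.query suggested above.
public class MissingValueCount {
  static Integer countMissing(SolrServer server) throws SolrServerException {
    SolrQuery q = new SolrQuery("*:*");
    q.setFacet(true);
    q.addFacetQuery("-validToDate:[* TO *]"); // docs with no validToDate value
    QueryResponse rsp = server.query(q);
    return rsp.getFacetQuery().get("-validToDate:[* TO *]");
  }
}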

: trueNOW/DAYS-4MONTHS1(*:*)validToDateNOW/DAY+1DAY+1MONTH
: 
:   
:  7


-Hoss


Call RequestHandler from QueryComponent

2011-12-15 Thread Maria Vazquez
Hi!

I have a solrconfig.xml like:

  

  all
  0
  10
  ABC
  score desc,rating asc
  CUSTOM FQ
  2.2
  CUSTOM FL


validate
CUSTOM ABC QUERY COMPONENT
stats
debug

  

  

  all
  0
  1
  XYZ
  score desc
  CUSTOM FL
  2.2

  edismax
  1
  CUSTOM QF
  0
  1
  *:*


validate
CUSTOM XYZ QUERY COMPONENT
stats
debug

  

In ABC QUERY COMPONENT, I customize prepare() and process(). In its
process() I want to call the /XYZ request handler and include those results
in the results for ABC. Is that possible?
I know the org.apache.solr.spelling.SpellCheckCollator calls a
QueryComponent and invokes prepare and process on it, but I want to invoke
the request handler directly. It'd be silly to use SolrJ since both handlers
are in the same core.

Any suggestions?

Thanks!
Maria




Is there an issue with hypens in SpellChecker with StandardTokenizer?

2011-12-15 Thread Brandon Fish
I am getting an error using the SpellChecker component with the query
"another-test"
java.lang.StringIndexOutOfBoundsException: String index out of range: -7

This appears to be related to this
issue which
has been marked as fixed. My configuration and test case that follows
appear to reproduce the error I am seeing. Both "another" and "test" get
changed into tokens with start and end offsets of 0 and 12.
  



  

 &spellcheck=true&spellcheck.collate=true

Is this an issue with my configuration/test or is there an issue with the
SpellingQueryConverter? Is there a recommended work around such as the
WhitespaceTokenizer as mention in the issue comments?

Thank you for your help.

package org.apache.solr.spelling;
import static org.junit.Assert.assertTrue;
import java.util.Collection;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;
import org.apache.solr.common.util.NamedList;
import org.junit.Test;
public class SimpleQueryConverterTest {
 @Test
public void testSimpleQueryConversion() {
SpellingQueryConverter converter = new SpellingQueryConverter();
 converter.init(new NamedList());
converter.setAnalyzer(new StandardAnalyzer(Version.LUCENE_35));
String original = "another-test";
 Collection<Token> tokens = converter.convert(original);
assertTrue("Token offsets do not match",
isOffsetCorrect(original, tokens));
 }
private boolean isOffsetCorrect(String s, Collection<Token> tokens) {
for (Token token : tokens) {
 int start = token.startOffset();
int end = token.endOffset();
if (!s.substring(start, end).equals(token.toString()))
 return false;
}
return true;
}
}


Re: Core overhead

2011-12-15 Thread Yury Kats
On 12/15/2011 1:41 PM, Robert Petersen wrote:
> loading.  Try it out, but make sure that the functionality you are
> actually looking for isn't sharding instead of multiple cores...  

Yes, but the way to achieve sharding is to have multiple cores.
The question then becomes -- how many cores (shards)?


Re: Core overhead

2011-12-15 Thread Robert Stewart
One other thing I did not mention is GC pauses.  If you have smaller
heap sizes, you will have fewer very long GC pauses, so that can be an
advantage of having many cores (if the cores are distributed into separate
Solr instances, as separate processes).  As a rough worst case, you can
expect about 1 second of pause per GB of heap.



On Thu, Dec 15, 2011 at 2:14 PM, Robert Stewart  wrote:
> It is true number of terms may be much more than N/10 (or even N for
> each core), but it is the number of docs per term that will really
> matter.  So you can have N terms in each core but each term has 1/10
> number of docs on avg.
>
>
>
>
> 2011/12/15 Yury Kats :
>> On 12/15/2011 1:07 PM, Robert Stewart wrote:
>>
>>> I think overall memory usage would be close to the same.
>>
>> Is this really so? I suspect that the consumed memory is in direct
>> proportion to the number of terms in the index. I also suspect that
>> if I divided 1 core with N terms into 10 smaller cores, each smaller
>> core would have much more than N/10 terms. Let's say I'm indexing
>> English texts, it's likely that all smaller cores would have almost
>> the same number of terms, close to the original N. Not so?


Re: Core overhead

2011-12-15 Thread Robert Stewart
It is true that the number of terms may be much more than N/10 (or even
close to N for each core), but it is the number of docs per term that
really matters.  So you can have N terms in each core, but each term has
on average 1/10 of the docs.




2011/12/15 Yury Kats :
> On 12/15/2011 1:07 PM, Robert Stewart wrote:
>
>> I think overall memory usage would be close to the same.
>
> Is this really so? I suspect that the consumed memory is in direct
> proportion to the number of terms in the index. I also suspect that
> if I divided 1 core with N terms into 10 smaller cores, each smaller
> core would have much more than N/10 terms. Let's say I'm indexing
> English texts, it's likely that all smaller cores would have almost
> the same number of terms, close to the original N. Not so?


RE: Core overhead

2011-12-15 Thread Robert Petersen
I am running eight cores, each core serves up different types of
searches so there is no overlap in their function.  Some cores have
millions of documents.  My search times are quite fast.  I don't see any
real slowdown from multiple cores, but you just have to have enough
memory for them. Memory simply has to be big enough to hold what you are
loading.  Try it out, but make sure that the functionality you are
actually looking for isn't sharding instead of multiple cores...  

http://wiki.apache.org/solr/DistributedSearch


-Original Message-
From: Yury Kats [mailto:yuryk...@yahoo.com] 
Sent: Thursday, December 15, 2011 10:31 AM
To: solr-user@lucene.apache.org
Subject: Re: Core overhead

On 12/15/2011 1:07 PM, Robert Stewart wrote:

> I think overall memory usage would be close to the same.

Is this really so? I suspect that the consumed memory is in direct
proportion to the number of terms in the index. I also suspect that
if I divided 1 core with N terms into 10 smaller cores, each smaller
core would have much more than N/10 terms. Let's say I'm indexing
English texts, it's likely that all smaller cores would have almost
the same number of terms, close to the original N. Not so?


Re: Core overhead

2011-12-15 Thread Yury Kats
On 12/15/2011 1:07 PM, Robert Stewart wrote:

> I think overall memory usage would be close to the same.

Is this really so? I suspect that the consumed memory is in direct
proportion to the number of terms in the index. I also suspect that
if I divided 1 core with N terms into 10 smaller cores, each smaller
core would have much more than N/10 terms. Let's say I'm indexing
English texts, it's likely that all smaller cores would have almost
the same number of terms, close to the original N. Not so?


Re: Trim and copy a solr field

2011-12-15 Thread Juan Grande
Hi Swapna,

Do you want to modify the *indexed* value or the *stored* value? The
analyzers modify the indexed value. To modify the stored value, the only
option that I'm aware of is to write an UpdateProcessor that changes the
document before it's indexed.

*Juan*
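
For example, a bare-bones sketch of such an UpdateProcessor (untested; the
field names "path" and "location" are invented, and the factory still has to
be registered in an updateRequestProcessorChain in solrconfig.xml) could look
like this:

    import java.io.IOException;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class TrimPathUpdateProcessorFactory extends UpdateRequestProcessorFactory {

        @Override
        public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                                  SolrQueryResponse rsp,
                                                  UpdateRequestProcessor next) {
            return new UpdateRequestProcessor(next) {
                @Override
                public void processAdd(AddUpdateCommand cmd) throws IOException {
                    SolrInputDocument doc = cmd.getSolrInputDocument();
                    Object path = doc.getFieldValue("path");
                    if (path != null) {
                        String s = path.toString();
                        int slash = s.lastIndexOf('/');
                        if (slash > 0) {
                            // keep everything before the last '/' in a separate stored field
                            doc.setField("location", s.substring(0, slash));
                        }
                    }
                    super.processAdd(cmd); // pass the document on down the chain
                }
            };
        }
    }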



On Tue, Dec 13, 2011 at 2:05 AM, Swapna Vuppala wrote:

> Hi Juan,
>
> Thanks for the reply. I tried using this, but I don't see any effect of
> the analyzer/filter.
>
> I tried copying my Solr field to another field of the type defined below.
> Then I indexed couple of documents with the new schema, but I see that both
> fields have got the same value.
> Am looking at the indexed data in Luke.
>
> Am assuming that analyzers process the field value (as specified by
> various filters etc) and then store the modified value. Is that true ? What
> else could I be missing here ?
>
> Thanks and Regards,
> Swapna.
>
> -Original Message-
> From: Juan Grande [mailto:juan.gra...@gmail.com]
> Sent: Monday, December 12, 2011 11:50 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Trim and copy a solr field
>
> Hi Swapna,
>
> You could try using a copyField to a field that uses
> PatternReplaceFilterFactory:
>
> [fieldType definition lost in the archive: an analyzer with a
> PatternReplaceFilterFactory whose replacement attribute is "$1"]
>
>
> The regular expression may not be exactly what you want, but it will give
> you an idea of how to do it. I'm pretty sure there must be some other ways
> of doing this, but this is the first that comes to my mind.
>
> *Juan*
>
>
>
> On Mon, Dec 12, 2011 at 4:46 AM, Swapna Vuppala  >wrote:
>
> > Hi,
> >
> > I have a Solr field that contains the absolute path of the file that is
> > indexed, which will be something like
> >
> file:/myserver/Folder1/SubFol1/Sub-Fol2/Test.msg.
> >
> > Am interested in indexing the location in a separate field.  I was
> looking
> > for some way to trim the field value from last occurrence of char "/", so
> > that I can get the location value, something like
> >
> file:/myserver/Folder1/SubFol1/Sub-Fol2,
> > and store it in a new field. Can you please suggest some way to achieve
> > this ?
> >
> > Thanks and Regards,
> > Swapna.
> > 
> > Electronic mail messages entering and leaving Arup  business
> > systems are scanned for acceptability of content and viruses
> >
>


Re: Core overhead

2011-12-15 Thread Robert Stewart
I dont have any measured data, but here are my thoughts.

I think overall memory usage would be close to the same.
Speed will be slower in general: if search speed is approximately
log(n), then 10 * log(n/10) > log(n); merging results adds overhead in
the merge step; and fetching results beyond the first page generally
requires page_size * page_number rows from each core.  Of course, if you
search many cores in parallel over many CPU cores you would mitigate
that overhead.  There are other considerations such as caching - for
example, if you are adding new documents on one core only, the other
cores get to keep their filter caches, etc. in RAM much longer than if
you are always committing to
one single large core.  And then of course if you have some client
logic to pick a sub-set of cores based on some query data (such as
only searching newer cores, etc.) then you could end up with faster
search over many cores.
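
As a rough worked instance of that approximation (my arithmetic, base-2 logs,
n = 100M split into 10 cores of 10M each):

    \log_2(10^8) \approx 26.6  \qquad  10 \cdot \log_2(10^7) \approx 10 \times 23.3 = 233

so ten cores searched one after the other do several times the per-term
lookup work of a single large core, and the gap only closes once the cores
are searched in parallel; real query cost is dominated by postings traversal,
so treat this only as a very rough indication of the overhead.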


2011/12/15 Yury Kats :
> Does anybody have an idea, or better yet, measured data,
> to see what the overhead of a core is, both in memory and speed?
>
> For example, what would be the difference between having 1 core
> with 100M documents versus having 10 cores with 10M documents?


Core overhead

2011-12-15 Thread Yury Kats
Does anybody have an idea, or better yet, measured data,
to see what the overhead of a core is, both in memory and speed?

For example, what would be the difference between having 1 core
with 100M documents versus having 10 cores with 10M documents?


Re: how to setup to archive expired documents?

2011-12-15 Thread Robert Stewart
I think managing 100 cores will be too much headache.  Also
performance of querying 100 cores will not be good (need
page_number*page_size from 100 cores, and then merge).

I would rather have around 10 SOLR instances, each holding about 10M docs.
Always search all 10 nodes.  Index using some hash(doc) to distribute
new docs among the nodes.  Run some nightly/weekly job to delete old docs
and force merge (optimize) down to some min/max number of segments.  I
think that will work ok, but I am not sure how to handle
replication/failover so that each node is redundant.  If we use SOLR
replication it will have problems with replication after optimize for
large indexes.  Seems to take a long time to move 10M doc index from
master to slave (around 100GB in our case).  Doing it once per week is
probably ok.
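
For example, the nightly purge could be as small as this SolrJ sketch (the
field name "indexed_at" and the 100-day window are invented, and whether to
follow the delete with a partial merge is a judgment call):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class PurgeOldDocs {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/core1");

            // Delete everything whose timestamp falls outside the retention window.
            server.deleteByQuery("indexed_at:[* TO NOW/DAY-100DAYS]");
            server.commit();

            // Merge down to a handful of segments instead of a full optimize, so
            // replication only has to move a few files (uses the maxSegments
            // variant of optimize(); a plain optimize() merges to one segment).
            server.optimize(true, true, 10);
        }
    }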



2011/12/15 Avni, Itamar :
> What about managing a core for each day?
>
> This way the deletion/archive is very simple. No "holes" in the index (which 
> is often when deleting document by document).
> The index done against core [today-0].
> The query is done against cores [today-0],[today-1]...[today-99]. Quite a 
> headache.
>
> Itamar
>
> -Original Message-
> From: Robert Stewart [mailto:bstewart...@gmail.com]
> Sent: יום ה 15 דצמבר 2011 16:54
> To: solr-user@lucene.apache.org
> Subject: how to setup to archive expired documents?
>
> We have a large (100M) index where we add about 1M new docs per day.
> We want to keep index at a constant size so the oldest ones are removed 
> and/or archived each day (so index contains around 100 days of data).  What 
> is the best way to do this?  We still want to keep older data in some archive 
> index, not just delete it (so is it possible to export older segments, etc. 
> into some other index?).  If we have some daily job to delete old data, I 
> assume we'd need to optimize the index to actually remove and free space, but 
> that will require very large (and slow) replication after optimize which will 
> probably not work out well for so large an index.  Is there some way to shard 
> the data or other best practice?
>
> Thanks
> Bob


RE: how to setup to archive expired documents?

2011-12-15 Thread Avni, Itamar
What about managing a core for each day?

This way the deletion/archive is very simple, and there are no "holes" in the
index (which often happens when deleting document by document).
Indexing is done against core [today-0].
The query is done against cores [today-0],[today-1]...[today-99]. Quite a
headache.

Itamar

-Original Message-
From: Robert Stewart [mailto:bstewart...@gmail.com] 
Sent: יום ה 15 דצמבר 2011 16:54
To: solr-user@lucene.apache.org
Subject: how to setup to archive expired documents?

We have a large (100M) index where we add about 1M new docs per day.
We want to keep index at a constant size so the oldest ones are removed and/or 
archived each day (so index contains around 100 days of data).  What is the 
best way to do this?  We still want to keep older data in some archive index, 
not just delete it (so is it possible to export older segments, etc. into some 
other index?).  If we have some daily job to delete old data, I assume we'd 
need to optimize the index to actually remove and free space, but that will 
require very large (and slow) replication after optimize which will probably 
not work out well for so large an index.  Is there some way to shard the data 
or other best practice?

Thanks
Bob



Re: Large RDBMS dataset

2011-12-15 Thread Mikhail Khludnev
CachedSqlEntityProcessor joins your tables fine. But be aware that it works
in the single thread only.

On Thu, Dec 15, 2011 at 12:14 PM, Finotti Simone  wrote:

> CachedSqlEntityProcessor




-- 
Sincerely yours
Mikhail Khludnev
Developer
Grid Dynamics
tel. 1-415-738-8644
Skype: mkhludnev




Re: NumericRangeQuery: what am I doing wrong?

2011-12-15 Thread Jay Luker
On Wed, Dec 14, 2011 at 5:02 PM, Chris Hostetter
 wrote:
>
> I'm a little lost in this thread ... if you are programmatically constructing
> a NumericRangeQuery object to execute in the JVM against a Solr index,
> that suggests you are writing some sort of Solr plugin (or embedding
> Solr in some way)

It's not you; it's me. I'm just doing weird things, partly, I'm sure,
due to ignorance, but sometimes out of expediency. I was experimenting
with ways to do a NumericRangeFilter, and the tests I was trying used
the Lucene api to query a Solr index, therefore I didn't have access
to the IndexSchema. Also my question might have been better directed
at the lucene-general list to avoid confusion.

Thanks,
--jay
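
For anyone else querying a Solr-built index straight through the Lucene API,
here is a minimal sketch (the field name "year", the precisionStep of 8 and
the index path are assumptions; the precisionStep has to match whatever the
Solr schema used when indexing the field, or the query will silently miss
documents):

    import java.io.File;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.NumericRangeQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class NumericRangeExample {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(
                FSDirectory.open(new File("/path/to/solr/data/index")));
            IndexSearcher searcher = new IndexSearcher(reader);

            // field, precisionStep, min, max, minInclusive, maxInclusive
            NumericRangeQuery<Integer> q =
                NumericRangeQuery.newIntRange("year", 8, 2000, 2011, true, true);

            TopDocs hits = searcher.search(q, 10);
            System.out.println("hits: " + hits.totalHits);

            searcher.close();
            reader.close();
        }
    }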


Re: Migrate Lucene 2.9 To SOLR

2011-12-15 Thread Anderson vasconcelos
OK. Thanks for the help. I am going to try the migration.



2011/12/14 Chris Hostetter 

>
> : I have a old project that use Lucene 2.9. Its possible to use the index
> : created by lucene in SOLR? May i just copy de index to data directory of
> : SOLR, or exists some mechanism to import Lucene index?
>
> you can use an index created directly with lucene libraries in Solr, but
> in order for Solr to understand that index and do anything meaningful with
> it you have to configure solr with a schema.xml file that makes sense
> given the custom code used to build that index (ie: what fields did you
> store, what fields did you index, what analyzers did you use, what fields
> did you index with term vectors, etc...)
>
>
> -Hoss
>


how to setup to archive expired documents?

2011-12-15 Thread Robert Stewart
We have a large (100M) index where we add about 1M new docs per day.
We want to keep index at a constant size so the oldest ones are
removed and/or archived each day (so index contains around 100 days of
data).  What is the best way to do this?  We still want to keep older
data in some archive index, not just delete it (so is it possible to
export older segments, etc. into some other index?).  If we have some
daily job to delete old data, I assume we'd need to optimize the index
to actually remove and free space, but that will require very large
(and slow) replication after optimize which will probably not work out
well for so large an index.  Is there some way to shard the data or
other best practice?

Thanks
Bob


Announcement of Soldash - a dashboard for multiple Solr instances

2011-12-15 Thread Alexander Valet | edelight
We use Solr quite a bit at edelight -- and love it. However, we encountered one 
minor peeve: although each individual
Solr server has its own dashboard, there's no easy way of getting a complete 
overview of an entire Solr cluster and the
status of its nodes.

Over the last weeks our own Aengus Walton developed Soldash, a dashboard for 
your entire Solr cluster.

Although still in its infancy, Soldash gives you an overview of:

- your Solr servers
- what version of Solr they're running
- what index version they have, and whether slaves are in sync with their 
master

as well as allowing you to:

- turn polling and replication on or off
- force an index fetch on a slave
- display a file list of the current index
- backup the index
- reload the index

It is worth noting that due to the set-up of our own environment, Soldash has 
been programmed to automatically presume all Solr instances have the same 
cores. This may change in future releases, depending on community reaction.

The project is open-source and hopefully some of you shall find this tool 
useful in day-to-day administration of Solr.

The newest version (0.2.2) can be downloaded at:
https://github.com/edelight/soldash/tags

Instructions on how to configure Soldash can be found at the project's homepage 
on github:
https://github.com/edelight/soldash

Feedback and suggestions are very welcome!



--
edelight GmbH, Wilhelmstr. 4a, 70182 Stuttgart

Fon: +49 (0)711-912590-14 | Fax: +49 (0)711-912590-99

Geschäftsführer: Peter Ambrozy, Tassilo Bestler
Amtsgericht Stuttgart, HRB 722861
Ust.-IdNr. DE814842587




Re: Solr Search Across Multiple Cores not working when quering on specific field

2011-12-15 Thread Erick Erickson
I suspect that the distributed searching is working just fine in both cases, but
your querying isn't doing what you expect due to differences in the analysis
chain. I'd recommend spending some time with the admin/analysis page
to see what is actually being parsed.

And be aware that wildcards from 3.5 and below do not go through any analysis,
so, for instance, iPo* will not match ipod since the case is different.

Furthermore, you may be going through a different query parser; attaching
&debugQuery=on to the request would show you this.

Best
Erick


On Thu, Dec 15, 2011 at 6:34 AM, ravicv  wrote:
> Hi I was able to do it by changing datatype of all field to textgen from
> textTight.
> I am not sure whats wrong with textTight datatype.
>
> Also can you please suggest me the best way to index huge database data.
> Currently I tried with dataimporthandler and CVS import . But both are
> giving almost similar performances.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-Search-Across-Multiple-Cores-not-working-when-quering-on-specific-field-tp3585013p3588295.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Using LocalParams in StatsComponent to create a price slider?

2011-12-15 Thread Erick Erickson
I really don't understand what you're asking, could you clarify with
an example or two?

Best
Erick

On Wed, Dec 14, 2011 at 10:36 AM, Mark Schoy  wrote:
> Hi,
>
> I'm using the StatsComponent to retrieve the lower and upper bounds of a
> price field to create a "price slider".
> If someone sets the price range to $100-$200 I have to add a filter to
> the query. But then the lower and upper bound are calculated of the
> filtered result.
>
> Is it possible to use LocalParams (like for facets) to ignore a specific 
> filter?
>
> Thanks.
>
> Mark
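
For facets this is normally done by tagging the filter and excluding the tag;
a SolrJ sketch follows (the field name "price" is invented, and whether
StatsComponent honours the same {!ex=...} exclusion depends on the Solr
version, so this may only solve the facet side of the problem):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class TaggedFilterExample {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery query = new SolrQuery("*:*");
            // Tag the user-selected price range...
            query.addFilterQuery("{!tag=pricefq}price:[100 TO 200]");
            // ...and exclude that tag when faceting, so the counts ignore the filter.
            query.addFacetField("{!ex=pricefq}price");
            // Request stats as well; exclusion support here is version-dependent.
            query.set("stats", true);
            query.set("stats.field", "price");

            QueryResponse rsp = server.query(query);
            System.out.println(rsp.getFieldStatsInfo());
        }
    }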


Re: Faceting with null dates

2011-12-15 Thread Erick Erickson
Hmmm, I'm not sure I'm following this.
"Is there a way to query the index to not give me non-null dates in return"
So you want null dates?
and:
"which gives me some unwanted non-null dates in the result set"
which seems to indicate you do NOT want null dates.

I honestly don't know what your desired outcome is, could you
clarify?

Best
Erick


On Wed, Dec 14, 2011 at 8:38 AM, kenneth hansen  wrote:
>
> hello, I have the following faceting parameters, which give me some unwanted
> non-null dates in the result set. Is there a way to query the index to not 
> give me non-null dates in return? I.e. I would like to get a result set which 
> contains only non-nulls on the validToDate, but as I am faceting on non-null 
> values on the validToDate, I would like to get the non-null values in the 
> faceting result. This response example below gives me 10 results, with 7 
> non-null validToDates. What I would like to get is 3 results and 7 non-null 
> validToDate facets. And as I write this, I start to wonder if this is 
> possible at all as the facets are dependent on the result set and that this 
> might be better to handle in the application layer by just extracting 
> 10-7=3...
> Any help would be appreciated!
> br, ken
>
>   true
>   f.validToDate.facet.range.start = NOW/DAYS-4MONTHS
>   facet.mincount = 1
>   (*:*)
>   facet.range = validToDate
>   facet.range.end = NOW/DAY+1DAY
>   facet.range.gap = +1MONTH
>
>   facet_ranges: 2011-11-14T00:00:00Z = 7
>
>


Re: Solr Join with Dismax

2011-12-15 Thread Pascal Dimassimo
Thanks Hoss!

Here it is:
https://issues.apache.org/jira/browse/SOLR-2972

On Wed, Dec 14, 2011 at 4:47 PM, Chris Hostetter
wrote:

>
> : I have been doing more tracing in the code. And I think that I
> understand a
> : bit more. The problem does not seem to be dismax+join, but
> : dismax+join+fromIndex.
>
> Correct.  join+dismax works fine as i already demonstrated...
>
> : >> Note: even with that hardcoded "lucene" bug, you can still override
> the
> : >> default by using var dereferencing to point at another param with
> its own
> : >> localparams specifying the type...
> : >>
> : >>   qf=text name
> : >>   q={!join from=manu_id_s to=id v=$qq}
> : >>   qq={!dismax}ipod
>
> ...the problem you are referring to now has nothing to do with dismax, and
> is specifically a bug in how the query is parsed when "fromIndex" is
> used (which i thought i already mentioned in this thread but i see you
> found independently)...
>
>https://issues.apache.org/jira/browse/SOLR-2824
>
> Did you file a Jira about defaulting to "lucene" instead of null so we can
> make the defType local param syntax work?  I haven't seen it in my
> email but it's really an unrelated problem so it should be tracked
> separately)
>
>
> -Hoss
>



-- 
Pascal Dimassimo

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


Re: Delta Replication in SOLR

2011-12-15 Thread Bob Stewart
Replication only copies new segment files so unless you are optimizing on 
commit it will not copy entire index.  Make sure you do not optimize your 
index.  Optimizing merges to a single segment and is not necessary.  When new 
docs are added new small segment files are created so typical replication will 
only copy a few small segments from master to slave.

On Dec 15, 2011, at 12:58 AM, mechravi25  wrote:

> We would like know whether it is possible to replicate only a certain
> documents from master to slave. More like a Delta Replication process. 
> 
> In our application, the master solr instances is used for indexing purpose
> and the slave solr is for user search request. Hence the replication has to
> happen on regular interval of time. Master solr has around 1.4 million
> document(Size : 2.7 GB) and it takes more than 900 seconds for replication.
> Even if we update few documents in the master, we have to replicate that to
> the slave to keep the slave in sync with the master; this process is taking
> too much time.
> 
> We have a field in the master SOLR which will denote the last added or
> updated time (a field defined with stored="true" default="NOW" multiValued="false"), so we wondered whether we
> can replicate the documents from master which were added/updated after the
> last Replication time of slave instance which will be available in
> replication.properties file. We don’t want all the documents from master to
> be replicated to slave. The ultimate purpose is to reduce the time taken for
> replication. 
> 
> Thanks in advance. Any pointers would be of great help. 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Delta-Replication-in-SOLR-tp3587745p3587745.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: XPath with ExtractingRequestHandler

2011-12-15 Thread Michael Kelleher

Yeah, I tried:

//xhtml:div[@class='bibliographicData']/descendant:node()

also tried

//xhtml:div[@class='bibliographicData']

Neither worked.  The DIV I need also had an ID value, and I tried both 
variations on ID as well.  Still nothing.



XPath handling for Tika seems to be pretty basic and does not seem to 
support most XPath query syntax, probably because it is using a SAX
parser; I don't know.  I guess I will have to write something custom to
get it to do what I need it to.


Thanks for the reply though.

I will post a follow up with how I fixed this.

--mike
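
One way to get more control is to drive Tika directly with its
MatchingContentHandler; a rough, untested sketch follows (the file path and
the XPath are placeholders, and Tika still only accepts a small XPath subset
here, so complex predicates like [@class='...'] will not work):

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.ToXMLContentHandler;
    import org.apache.tika.sax.XHTMLContentHandler;
    import org.apache.tika.sax.xpath.Matcher;
    import org.apache.tika.sax.xpath.MatchingContentHandler;
    import org.apache.tika.sax.xpath.XPathParser;

    public class TikaXPathExample {
        public static void main(String[] args) throws Exception {
            // The same mechanism ExtractingRequestHandler uses for its xpath parameter.
            XPathParser xpathParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
            Matcher matcher = xpathParser.parse(
                "/xhtml:html/xhtml:body/xhtml:div/descendant::node()");

            ToXMLContentHandler xml = new ToXMLContentHandler();
            MatchingContentHandler handler = new MatchingContentHandler(xml, matcher);

            InputStream in = new FileInputStream("/path/to/page.html");
            try {
                new AutoDetectParser().parse(in, handler, new Metadata(), new ParseContext());
            } finally {
                in.close();
            }
            System.out.println(xml.toString());
        }
    }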


Specifing BatchSize parameter in db-data-config.xml will improve performance?

2011-12-15 Thread ravicv
Hi

I am using Oracle Exadata as my DB. I want to index nearly 4 crore (about 40
million) rows. I have tried specifying batchsize as 1. and without specifying
batchsize, but both tests take nearly the same time.

Could anyone suggest the best way to index huge data quickly?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Specifing-BatchSize-parameter-in-db-data-config-xml-will-improve-performance-tp3588355p3588355.html
Sent from the Solr - User mailing list archive at Nabble.com.
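
As far as I know, DIH simply hands batchSize to the JDBC driver as the
statement fetch size, so the knob it controls is the same one shown in this
plain-JDBC sketch (the connection URL, credentials and query are invented):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class FetchSizeExample {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "password");
            Statement stmt = conn.createStatement();

            // This is what DIH's batchSize maps to: how many rows the driver pulls
            // from the server per round trip (Oracle's default is only 10).
            stmt.setFetchSize(5000);

            ResultSet rs = stmt.executeQuery("SELECT id, title FROM docs");
            while (rs.next()) {
                // feed rows to Solr here...
            }
            rs.close();
            stmt.close();
            conn.close();
        }
    }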


Re: Solr Search Across Multiple Cores not working when quering on specific field

2011-12-15 Thread ravicv
Hi, I was able to fix it by changing the datatype of all fields to textgen from
textTight.
I am not sure what's wrong with the textTight datatype.

Also, can you please suggest the best way to index huge database data?
Currently I have tried DataImportHandler and CSV import, but both
give almost the same performance.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Search-Across-Multiple-Cores-not-working-when-quering-on-specific-field-tp3585013p3588295.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: XPath with ExtractingRequestHandler

2011-12-15 Thread Péter Király
Hi,

maybe I am wrong, but the // should be at the beginning of the
expression, like
//xhtml:div[@class='bibliographicData']/descendant:node(),
or if you want to search the div inside body, you have to use descendant like
/xhtml:html/xhtml:body/descendant::xhtml:div[@class='bibliographicData']/descendant:node()

Péter

2011/12/14 Michael Kelleher :
> I want to restrict the HTML that is returned by Tika to basically:
>
>
>  /xhtml:html/xhtml:body//xhtml:div[@class='bibliographicData']/descendant:node()
>
>
> and it seems that the XPath class being used does not support the '//'
> syntax.
>
> Is there anyway to configure Tika to use a different XPath evaluation class?
>
>



-- 
Péter Király
eXtensible Catalog
http://eXtensibleCatalog.org
http://drupal.org/project/xc


Re: Large RDBMS dataset

2011-12-15 Thread Finotti Simone
Thank you (and all the others who spent time answering me) very much for your  
insights!

I don't know how I managed to miss CachedSqlEntityProcessor, but it seems
that's just what I need.

bye


From: Gora Mohanty [g...@mimirtech.com]
Sent: Wednesday, 14 December 2011, 16:39
To: solr-user@lucene.apache.org
Subject: Re: Large RDBMS dataset

On Wed, Dec 14, 2011 at 3:48 PM, Finotti Simone  wrote:
> Hello,
> I have a very large dataset (> 1 Mrecords) on the RDBMS which I want my Solr 
> application to pull data from.
[...]

> It works, but it takes 1'38" to parse 100 records: it means 1 rec/s! That 
> means that digesting the whole dataset would take 1 Ms (=> 12 days).

Depending on the size of the data that you are pulling from
the database, 1M records is not really that large a number.
We were doing ~75GB of stored data from ~7million records
in about 9h, including quite complicated transformers. I would
imagine that there is much room for improvement in your case
also. Some notes on this:
* If you have servers to throw at the problem, and a sensible
  way to shard your RDBMS data, use parallel indexing to
  multiple Solr cores, maybe on multiple servers, followed by
  a merge. In our experience, given enough RAM and adequate
  provisioning of database servers, indexing speed scales linearly
  with the total no. of cores.
* Replicate your database, manually if needed. Look at the load
  on a database server during the indexing process, and provision
  enough database servers to match the no. of Solr indexing servers.
* This point is leading into flamewar territory, but consider switching
   databases. From our (admittedly non-rigorous measurements),
   mysql was at least a factor of 2-3 faster than MS-SQL, with the
   same dataset.
* Look at cloud-computing. If finances permit, one should be able
  to shrink indexing times to almost any desired level. E.g., for the
  dataset that we used, I have little doubt that we could have shrunk
  the time down to less than 1h, at an affordable cost on Amazon EC2.
  Unfortunately, we have not yet had the opportunity to try this.

> The problem is that for each record in "fd", Solr makes three distinct SELECT 
> on the other three tables. Of course, this is absolutely inefficient.
>
> Is there a way to have Solr loading every record in the four tables and join 
> them when they are already loaded in memory?

For various reasons, we did not investigate this in depth,
but you could also look at Solr's CachedSqlEntityProcessor.

Regards,
Gora






Re: Sorting and searching on a field

2011-12-15 Thread pravesh
>>I have read about the option of copying this to a different field, using
one for searching by tokenizing, and one for sorting.

That would be the right way to do it. Sorting requires the field not to be
analyzed/tokenized, while searching usually does require tokenization, so a
copyField into a separate untokenized field is the standard solution.

Regds
Pravesh



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Sorting-and-searching-on-a-field-tp3584992p3587906.html
Sent from the Solr - User mailing list archive at Nabble.com.