Re: Spatial Solr (JTeam)

2010-01-04 Thread Thomas Rabaix
I have also moved the jar into the global core's lib directory, and I still
have this issue.

I am running Mac OS X Snow Leopard:
  java version 1.6.0_17
  Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-10M3025)
  Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode)


I really don't know where the issue comes from.

On Mon, Dec 28, 2009 at 2:54 PM, Mauricio Scheffer 
mauricioschef...@gmail.com wrote:

 Seems to work for me... (I mean, I don't get a NoClassDefFoundError but I
 have other issues).
 I just put spatial-solr-1.0-RC3.jar in the core's lib directory and it
 worked.

 On Wed, Dec 23, 2009 at 8:25 PM, Thomas Rabaix thomas.rab...@gmail.com
 wrote:

  Hello,
 
  I would like to set up the Spatial Solr plugin from
  http://www.jteam.nl/news/spatialsolr on Solr 1.4. However, I am getting an
  error message when Solr starts.
 
  SEVERE: java.lang.NoClassDefFoundError:
  org/apache/solr/search/QParserPlugin
 
  I guess nl.jteam.search.solrext.spatial.SpatialTierQueryParserPlugin
  extends QParserPlugin. I have checked inside the solr.war file (the one
  provided on the Solr download page), and the class is present.
 
  Do you know if the current SSP version, 1.0-RC3, is compatible with
  Solr 1.4?
 
  Thanks
 
  --
  Thomas Rabaix
 




-- 
Thomas Rabaix
http://rabaix.net


Re: Remove the deleted docs from the Solr Index

2010-01-04 Thread Shalin Shekhar Mangar
On Wed, Dec 30, 2009 at 12:10 AM, Mohamed Parvez par...@gmail.com wrote:

 Ditto. There should have been a DIH command to re-sync the index with the
 DB.


But there is such a command; it is called full-import.
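For example (assuming DIH is registered at the usual /dataimport path):

http://localhost:8983/solr/dataimport?command=full-import

With the default clean=true, a full-import first clears the index, so
documents that were deleted from the DB disappear from the index as well.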

-- 
Regards,
Shalin Shekhar Mangar.


Re: Search algorithm used in Solr

2010-01-04 Thread Shalin Shekhar Mangar
On Mon, Jan 4, 2010 at 11:39 AM, abhis...@gmail.com wrote:

 Hello everyone,

 Is there an article which explains (on a high level) the algorithm of
 search in Solr?

 How does Solr search approach compare to the inverted index technique?


Solr uses Lucene. It is the same inverted index technique at work.
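(In an inverted index, each term maps to the list of documents containing it,
roughly:

  "solr"   -> doc1, doc3
  "lucene" -> doc1, doc2

A query looks its terms up in this map instead of scanning documents.)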

-- 
Regards,
Shalin Shekhar Mangar.


Re: Optimize not having any effect on my index

2010-01-04 Thread Aleksander Stensby
Hey, I managed to run it correctly after a few restarts. I don't really know
what happened.
I can't really see what this would have had to do with the compound file
format, though. But no, I'm not using the compound file format.

Cheers and thanks for your replies,
 Aleks

On Mon, Dec 21, 2009 at 8:27 AM, gurudev suyalprav...@yahoo.com wrote:


 Hi,

 Are you using the compound file format? If yes, have you set it properly
 in solrconfig.xml? If not, change it to:

 <useCompoundFile>true</useCompoundFile> (this is 'false' by default) under
 the tags:

 <indexDefaults>...</indexDefaults>
  and <mainIndex>...</mainIndex>




 Aleksander Stensby wrote:
 
  Hey guys,
  I'm getting some strange behavior here, and I'm wondering if I'm doing
  anything wrong..
 
  I've got an unoptimized index, and I'm trying to run the following
  command:
 
 http://server:8983/solr/update?optimize=true&maxSegments=10&waitFlush=false
  Tried it first directly in the browser; it obviously took quite a bit of
  time, but once it finished I saw no difference in my index. Same number
  of files, same size, etc.
  So I tried with curl:
  curl http://server:8983/solr/update --data-binary '<optimize/>' -H
  'Content-type:text/xml; charset=utf-8'
 
  No difference here either... Am I doing anything wrong? Do I need to
  issue a commit after the optimize?
 
  Any pointers would be greatly appreciated.
 
  Cheers,
   Aleks
 
 

 --
 View this message in context:
 http://old.nabble.com/Optimize-not-having-any-effect-on-my-index-tp26843094p26870653.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Facets and distributed search

2010-01-04 Thread Aleksander Stensby
Hi everyone! I've posted a similar question earlier, but in a thread related
to facets in general, so I thought I'd repost it here as a separate thread.

I have a faceted search that is very fast when I executed the query on a
single solr server, but is significantly slower when executed in a
distributed environment.
The setback seems to be in the sharding of our data, and that puzzles me a
little bit... I can't really see why Solr is so slow at doing this.

The scenario:
Let's say we have two servers (s1 and s2).
If i query
the following:
q=threadid:33&facet=true&facet.field=author&facet.limit=-1&facet.mincount=0&rows=0
directly on either server, the response is lightning fast. (10ms)

So, in theory I could query them directly, concat the result myself and get
that done pretty fast.

But if I introduce the shards parameter, the response time booms to between
15000ms and 20000ms!
shards=s1:8983/solr,s2:8983/solr

My initial thought is that I MUST be doing something wrong here.

So I try the following:
Run the query on server s1 with the shards param shards=s1:8983/solr, and the
response time goes from sub-10ms to between 5000ms and 10000ms!
Same results if I run the query on s2, and the same if I use shards=s2:8983/solr.

Is there really that much overhead in running a distributed facet field
query with Solr? Anyone else experienced this?

On the other hand, running regular distributed queries without facets is
lightning fast... (so I can't really see that this is a network problem or
anything either). I tried running a facet query on s1 with s1 as the
shards param, and that is still as slow as if the shards param pointed
to a different server...

Any insight into this would be greatly appreciated! (Would like to avoid
having to hack together our own solution concatenating results...)

Cheers,
 Aleks


Re: Implementing Autocomplete/Query Suggest using Solr

2010-01-04 Thread Shalin Shekhar Mangar
On Wed, Dec 30, 2009 at 3:07 AM, Prasanna R plistma...@gmail.com wrote:

  I looked into the Solr/Lucene classes and found the required information.
 Am summarizing the same for the benefit of those that might refer to this
 thread in the future.

  The change I had to make was very simple - make a call to getPrefixQuery
 instead of getWildcardQuery in my custom-modified Solr dismax query parser
 class. However, this will make a fairly significant difference in terms of
 efficiency. The key difference between the lucene WildcardQuery and
 PrefixQuery lies in their respective term enumerators, specifically in the
 term comparators. The termCompare method for PrefixQuery is more
 light-weight than that of WildcardQuery and is essentially an optimization
 given that a prefix query is nothing but a specialized case of Wildcard
 query. Also, this is why the lucene query parser automatically creates a
 PrefixQuery for query terms of the form 'foo*' instead of a WildcardQuery.


I don't understand this. There is nothing that one should need to do in
Solr's code to make this work. Prefix queries are supported out of the box
in Solr.
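For example, with the standard query parser, a request like the following
(field name illustrative) is already parsed into a PrefixQuery:

q=author:pras*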


 And one final request for comment to Shalin on this topic - I am guessing
 you ensured there were no duplicate terms in the field(s) used for
 autocompletion. For our first version, I am thinking of eliminating the
 duplicates outside of the results handler that gives suggestions since
 duplicate suggestions originate only from different document IDs in our
 system and we do want the list of document IDs matched. Is there a
 better/different way of doing the same?


No, I guess not.

-- 
Regards,
Shalin Shekhar Mangar.


Re: Solr Cell - PDFs plus literal metadata - GET or POST ?

2010-01-04 Thread Shalin Shekhar Mangar
On Wed, Dec 30, 2009 at 7:49 AM, Ross tetr...@gmail.com wrote:

 Hi all

 I'm experimenting with Solr. I've successfully indexed some PDFs and
 all looks good but now I want to index some PDFs with metadata pulled
 from another source. I see this example in the docs.

 curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah"
  -F tutori...@tutorial.pdf

 I can write code to generate a script with those commands substituting
 my own literal.whatever.  My metadata could be up to a couple of KB in
 size. Is there a way of making the literal a POST variable rather than
 a GET?


With Curl? Yes, see the man page.


  Will Solr Cell accept it as a POST?


Yes, it will.
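For example, a sketch using the parameter names from the example above: since
-F already makes curl send a multipart POST, the literals can simply travel as
form fields (the "tutorial" part name is an assumption):

curl "http://localhost:8983/solr/update/extract" \
     -F "literal.id=doc4" \
     -F "literal.blah_s=Bah" \
     -F "tutorial=@tutorial.pdf"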

-- 
Regards,
Shalin Shekhar Mangar.


Re: performance question

2010-01-04 Thread Erik Hatcher


On Jan 4, 2010, at 12:04 AM, A. Steven Anderson wrote:



 dynamic fields don't make it worse ... the number of actual field names
 you sort on makes it worse.

 If you sort on 100 fields, the cost is the same regardless of whether all
 100 of those fields exist because of a single <dynamicField/> declaration,
 or 100 distinct <field/> declarations.


 Ahh...thanks for the clarification.

 So, in general, there is no *significant* performance difference with
 using dynamic fields. Correct?


Correct.  There's not even really an insignificant performance difference.
A dynamic field is the same as a regular field in practically every way on
the search side of things.


Erik



Re: Search both diacritics and non-diacritics

2010-01-04 Thread Shalin Shekhar Mangar
On Sun, Jan 3, 2010 at 6:01 AM, Lance Norskog goks...@gmail.com wrote:

 The ASCIIFoldingFilter is a superset of the ISOLatin1Filter -
 ISOLatin1 is deprecated.  Here's the Javadoc from ASCIIFoldingFilter.
 You did not mention which language you want to search.

 Unfortunately, the ASCIIFoldingFilter is not mentioned on the Solr wiki.


Thanks Lance. I've added it to the wiki at
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

-- 
Regards,
Shalin Shekhar Mangar.


Re: Configuring Solr to use RAMDirectory

2010-01-04 Thread Shalin Shekhar Mangar
On Thu, Dec 31, 2009 at 3:36 PM, dipti khullar dipti.khul...@gmail.com wrote:

 Hi

 Can somebody let me know if it's possible to configure RAMDirectory from
 solrconfig.xml? Although it's clearly mentioned in
 https://issues.apache.org/jira/browse/SOLR-465 by Mark that he has worked
 on it, I couldn't find any such property in the config file in the latest
 Solr 1.4 download.
 Maybe I am overlooking some simple property. Any help would be
 appreciated.


Note that there are things like replication which will not work if you are
using a RAMDirectory.

-- 
Regards,
Shalin Shekhar Mangar.


Re: Rules engine and Solr

2010-01-04 Thread Shalin Shekhar Mangar
On Mon, Jan 4, 2010 at 10:24 AM, Avlesh Singh avl...@gmail.com wrote:

 I have a Solr (version 1.3) powered search server running in production.
 Search is keyword driven and is supported using custom fields and tokenizers.

 I am planning to build a rules engine on top of search. The rules are
 database driven and can't be stored inside Solr indexes. These rules would
 ultimately do two things -

   1. Change the order of Lucene hits.


A Lucene FieldComparator is what you'd need. The QueryElevationComponent
uses this technique.


   2. Add/remove some results to/from the Lucene hits.


This is a bit more tricky. If you will always have a very limited number of
docs to add or remove, it may be best to change the query itself to include
or exclude them (i.e. add fq). Otherwise you'd need to write a custom
Collector (see DocSetCollector) and change SolrIndexSearcher to use it. We
are planning to modify SolrIndexSearcher to allow custom collectors soon for
field collapsing but for now you will have to modify it.


 What should be my starting point? Custom search handler?


A custom SearchComponent which extends/overrides QueryComponent will do the
job.
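For instance, a minimal sketch (the class name and package are hypothetical,
and the rule logic is elided):

import java.io.IOException;
import org.apache.solr.handler.component.QueryComponent;
import org.apache.solr.handler.component.ResponseBuilder;

public class RulesComponent extends QueryComponent {
  @Override
  public void process(ResponseBuilder rb) throws IOException {
    super.process(rb); // run the normal query first
    // rb.getResults().docList now holds the ranked hits; re-order it or
    // add/remove entries here according to the database-driven rules,
    // then write the modified list back via rb.setResults(...).
  }
}

It would then be registered in solrconfig.xml with
<searchComponent name="rules" class="com.example.RulesComponent"/> and
substituted for the standard "query" component in the handler's component list.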

-- 
Regards,
Shalin Shekhar Mangar.


Re: Invalid CRLF - StreamingUpdateSolrServer ?

2010-01-04 Thread Patrick Sauts

Thank you Yonik for your answer.

The platform encoding is fr_FR.UTF-8, so it's still UTF-8; should it
be en_US.UTF-8, I guess?


I've also tested LBHttpSolrServer (we wanted to have it as a backup
for HAProxy) and it appears not to be thread safe (what is also curious
about it is that there's no way to manage the connection pool). If
you're interested in the logs, I can send them to you.


*Will there be a Solr 1.4.1 that'll fix those problems?*

Because using a SNAPSHOT doesn't seem like a good idea to me.

I have another question, but I don't know if I should make a new post:
Can I use -Dmaster=disabled in JAVA_OPTS for a server that is both a slave
and a repeater?


Patrick.


Yonik Seeley wrote:

It could be this bug, fixed in trunk:

* SOLR-1595: StreamingUpdateSolrServer used the platform default character
  set when streaming updates, rather than using UTF-8 as the HTTP headers
  indicated, leading to an encoding mismatch. (hossman, yonik)

Could you try a recent nightly build (or build your own from trunk)
and see if it fixes it?

-Yonik
http://www.lucidimagination.com



On Thu, Dec 31, 2009 at 5:07 AM, Patrick Sauts patrick.via...@gmail.com wrote:
  

I'm using Solr 1.4 on Tomcat 5.0.28, with the StreamingUpdateSolrServer client
with 10 threads and XML communication via the POST method.

Is there a way to avoid this error (data loss)?
And is StreamingUpdateSolrServer reliable?

GRAVE: org.apache.solr.common.SolrException: Invalid CRLF
  at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:72)
  at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
  at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
  at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
  at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
  at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
  at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
  at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
  at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
  at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
  at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
  at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
  at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
  at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
  at
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
  at
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
  at
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
  at
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
  at java.lang.Thread.run(Thread.java:619)
Caused by: com.ctc.wstx.exc.WstxIOException: Invalid CRLF





  




Re: Invalid CRLF - StreamingUpdateSolrServer ?

2010-01-04 Thread Shalin Shekhar Mangar
On Mon, Jan 4, 2010 at 6:11 PM, Patrick Sauts patrick.via...@gmail.com wrote:


 I've also tested LBHttpSolrServer (we wanted to have it as a backup for
 HAProxy) and it appears not to be thread safe (what is also curious about
 it is that there's no way to manage the connection pool). If you're
 interested in the logs, I can send those to you.


What is the issue that you are facing? What is it exactly that you want to
change?

-- 
Regards,
Shalin Shekhar Mangar.


Improvising solr queries

2010-01-04 Thread dipti khullar
Hi

We have tried various configuration settings to improve the
performance of the site, which mainly uses Solr, but throughput remains at
about 4-5 requests/sec. We also ran some performance tests on Solr
1.4, but there was only a very minor improvement in performance. Currently
we are using Solr 1.3.

So our last resort is improving the queries. We are using SolrJ
(CommonsHttpSolrServer).

We are trying to tune the Solr queries used in our project.
The following sample query takes about 6 seconds to execute under normal
traffic. At peak hours this often increases to 10-15 seconds.

((sitename:XYZ OR sitename:"All Sites") AND (localeid:1237400589415) AND
((assettype:Gallery)) AND (rbcategory:"ABC XYZ") AND (startdate:[* TO
2009-12-07T23:59:00Z] AND enddate:[2009-12-07T00:00:00Z TO
*]))&rows=9&start=63&sort=date
desc&facet=true&facet.field=assettype&facet.mincount=1

Similar to this query, we have several more complex queries supporting all
major landing pages of our application.

Can anyone identify any major flaws or issues in the sample query?

Thanks
Dipti


Re: Invalid CRLF - StreamingUpdateSolrServer ?

2010-01-04 Thread Patrick Sauts
The issue was occasional null results during facet navigation or simple
search; results came back after a refresh. We tried changing the cache
to <httpCaching never304="true"/>, but saw the same behaviour.


*My implementation was:* (maybe wrong?)
LBHttpSolrServer solrServer = new LBHttpSolrServer(new HttpClient(), new
XMLResponseParser(), solrServerUrl.split(","));

solrServer.setConnectionManagerTimeout(CONNECTION_TIMEOUT);
solrServer.setConnectionTimeout(CONNECTION_TIMEOUT);
solrServer.setSoTimeout(READ_TIMEOUT);
solrServer.setAliveCheckInterval(CHECK_HEALTH_INTERVAL_MS);

*What I was suggesting:*
Since LBHttpSolrServer is a wrapper around CommonsHttpSolrServer:

CommonsHttpSolrServer search1 = new
CommonsHttpSolrServer("http://mysearch1");

search1.setConnectionTimeout(CONNECTION_TIMEOUT);
search1.setSoTimeout(READ_TIMEOUT);
search1.setConnectionManagerTimeout(solr.CONNECTION_MANAGER_TIMEOUT);
search1.setDefaultMaxConnectionsPerHost(MAX_CONNECTIONS_PER_HOST1);
search1.setMaxTotalConnections(MAX_TOTAL_CONNECTIONS1);
search1.setParser(new XMLResponseParser());

CommonsHttpSolrServer search2 = new
CommonsHttpSolrServer("http://mysearch2");

search2.setConnectionTimeout(CONNECTION_TIMEOUT);
search2.setSoTimeout(READ_TIMEOUT);
search2.setConnectionManagerTimeout(solr.CONNECTION_MANAGER_TIMEOUT);
search2.setDefaultMaxConnectionsPerHost(MAX_CONNECTIONS_PER_HOST1);
search2.setMaxTotalConnections(MAX_TOTAL_CONNECTIONS1);
search2.setParser(new XMLResponseParser());

*LBHttpSolrServer solrServers = new LBHttpSolrServer(search1, search2);*

So we can manage the parameters per server.

Thank you for your time.

Patrick.


Shalin Shekhar Mangar wrote:

On Mon, Jan 4, 2010 at 6:11 PM, Patrick Sauts patrick.via...@gmail.com wrote:

  

I've also tested LBHttpSolrServer (we wanted to have it as a backup for
HAProxy) and it appears not to be thread safe (what is also curious about
it is that there's no way to manage the connection pool). If you're
interested in the logs, I can send those to you.




What is the issue that you are facing? What is it exactly that you want to
change?

  




Re: Improvising solr queries

2010-01-04 Thread Shalin Shekhar Mangar
On Mon, Jan 4, 2010 at 6:39 PM, dipti khullar dipti.khul...@gmail.com wrote:

 We have tried various configuration settings to improve the
 performance of the site, which mainly uses Solr, but throughput remains at
 about 4-5 requests/sec. We also ran some performance tests on Solr
 1.4, but there was only a very minor improvement in performance. Currently
 we are using Solr 1.3.


That is too slow.

We need more information on your setup before we can help. What kind of
hardware are you using? Which OS/JVM? How much memory have you allocated to
the JVM?

What does your solrconfig look like? How many documents are there in your
index? What is the size of index on disk? What are the field types of the
fields you are searching on? Do you do highlighting on large fields? Can you
paste the cache section on the statistics page of your Solr dashboard
(preferably, just after a peak load)? How frequently is your index changed
(i.e. how frequently do you commit)?

I'd recommend an upgrade to Solr 1.4 anyway since it has major performance
improvements.



 So our last resort remains, improvising the queries. We are using SolrJ -
 CommonsHttpSolrServer


Actually that is one of the first things that you should look at.


 We guys are trying to tune up Solr Queries being used in our project.
 Following sample query takes about 6 secs to execute under normal traffic.
 At peak hours this often increases to 10-15 secs.

 ((sitename:XYZ OR sitename:"All Sites") AND (localeid:1237400589415) AND
 ((assettype:Gallery)) AND (rbcategory:"ABC XYZ") AND (startdate:[* TO
 2009-12-07T23:59:00Z] AND enddate:[2009-12-07T00:00:00Z TO
 *]))&rows=9&start=63&sort=date
 desc&facet=true&facet.field=assettype&facet.mincount=1

 Similar to this query, we have several more complex queries supporting all
 major landing pages of our application.

 Can anyone identify any major flaws or issues in the sample query?


Most of those AND conditions can be separate filter queries. Filter queries
can be cached separately and can therefore be re-used. See
http://wiki.apache.org/solr/FilterQueryGuidance
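Sketched against the query above (the exact split is a judgment call; also
note that a filter whose value changes on every request, like a raw
timestamp, will never be re-used from the cache, so rounding the dates to
e.g. NOW/DAY helps):

q=rbcategory:"ABC XYZ"
&fq=sitename:XYZ OR sitename:"All Sites"
&fq=localeid:1237400589415
&fq=assettype:Gallery
&fq=startdate:[* TO 2009-12-07T23:59:00Z]
&fq=enddate:[2009-12-07T00:00:00Z TO *]
&rows=9&start=63&sort=date desc
&facet=true&facet.field=assettype&facet.mincount=1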

-- 
Regards,
Shalin Shekhar Mangar.


RE: Reverse sort facet query [SOLR-1672]

2010-01-04 Thread Peter 4U


 

 Date: Sun, 3 Jan 2010 22:18:33 -0800
 From: hossman_luc...@fucit.org
 To: solr-user@lucene.apache.org
 Subject: RE: Reverse sort facet query [SOLR-1672]
 
 
 : Yes, I thought about adding some 'new syntax', but I opted for a separate
 : 'facet.sortorder' parameter, mainly because I'm not familiar enough with
 : the codebase to know what effect this might have on backward
 : compatibility. It would be easy enough to modify the patch I created to
 : do it this way.
 
 it shouldn't really affect anything -- it wouldn't really be new syntax, 
 just extending the existing sort param syntax to apply to the 
 facet.sort param. The only back compat concern is making sure we 
 continue to support true/false as aliases, and having the default order 
 match the current behavior if asc/desc aren't specified.
 
 
 -Hoss
 


Yes, agreed. The current patch doesn't touch the b/w true/false aliasing, and
any move to adding a new attr can keep all that intact.

I've been using the current patch extensively in our testing, and that's
working well. The only caveat is that the reverse sort results don't include
0-count facets (see notes in SOLR-1672), so reverse sort results start with
the first count=1. This could be confusing, as there could well be many
facets whose count is 0, and it might be expected that these be returned
first.

From my admittedly cursory look into the codebase regarding this, I believe
patching to include 0 counts could open a can of worms in terms of b/w
compat and performance, as 0 counts look to be skipped (by default). I
could be wrong, and you may know better how changes to
SimpleFacets/UnInvertedField would affect performance and compatibility.

If there is indeed a performance optimization in facet counting iteration,
it would, imo, be preferable to have the optimization rather than the
0-counts.

 

Would you like me to go ahead and amend the patch (w/o 0-counts) to define a
new 'sort' parameter?

For naming, I would propose an extension of FacetParams.FACET_SORT_COUNT
along the lines of:

public static final String FACET_SORT_COUNT_REVERSE = "count.reverse";

I can then easily modify the patch to detect/use this value to invoke the new
behaviour.

Comments? Suggestions?

 

Thanks,

Peter

 

 

 

 
  
_
Have more than one Hotmail account? Link them together to easily access both
 http://clk.atdmt.com/UKM/go/186394591/direct/01/

Re: Improvising solr queries

2010-01-04 Thread Shalin Shekhar Mangar
On Mon, Jan 4, 2010 at 7:25 PM, dipti khullar dipti.khul...@gmail.com wrote:

 Thanks Shalin.

 Following are the relevant details:

 There are 2 search servers in a virtualized VMware environment. Each has 2
 instances of Solr running on separate ports in Tomcat.
 Server 1: hosts 1 master (application 1), 1 slave (application 1)
 Server 2: hosts 1 master (application 2), 1 slave (application 1)


Have you tried a non-virtualized environment? Virtual instances are not that
great for high I/O throughput environments.


 Both servers have 4 CPUs and 4 GB RAM.

 Master
 - 4GB RAM
 - 1GB JVM Heap memory is allocated to Solr
 Slave1/Slave2:
 - 4GB RAM
 - 2GB JVM Heap memory is allocated to Solr

 Solr Details:
 apache-solr Version: 1.3.0
 Lucene - 2.4-dev

 - autocommit: 50 docs and 5 minutes
 - optimize runs on master in every 7 minutes
 - using postOptimize , we execute snapshooter on master
 - snappuller/snapinstaller on 2 slaves runs after every 10 minutes


You are committing every 5 minutes and optimizing every 7 minutes. Can you
try committing less often?
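For example, a less aggressive autoCommit section in solrconfig.xml might look
like this (the values are purely illustrative):

<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>900000</maxTime> <!-- 15 minutes, in ms -->
</autoCommit>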


 Master and Slave1 (solr1) are on a single box and Slave2 (solr2) on a
 different box. We use HAProxy to load balance query requests between the
 2 slaves. Master is only used for indexing.

 The SolrJ client which is used to query the slave Solr times out, and there
 is high CPU usage / load average. The problem is reported on the slaves for
 application 1. The SolrJ client which queries Solr over HTTP times out (10
 sec is the timeout value), though in the Solr Tomcat access log we find all
 requests got a 200 response.
 While requests are timing out, the load average of the server goes extremely
 high (10-20).
 The issue gets resolved as soon as we optimize the slave index. In the Solr
 admin, it shows only 4 requests/sec being handled, with 400 ms response time.

 I am attaching solrconfig.xml for both master and slaves.


There is no autowarming on slaves which is probably OK if you are committing
so often. But do you really need to index new documents so often?

-- 
Regards,
Shalin Shekhar Mangar.


High Availability

2010-01-04 Thread Matthew Inger
I'm kind of stuck and looking for suggestions for high availability options.  
I've figured out without much trouble how to get the master-slave replication 
working.  This eliminates any single points of failure in the application in 
terms of the application's searching capability.

I would setup a master which would create the index and several slaves to act 
as the search servers, and put them behind a load balancer to distribute the 
requests.  This would ensure that if a slave node goes down, requests would 
continue to get serviced by the other nodes that are still up.

The problem I have is that my particular application also has the capability to 
trigger index updates from the user interface.  This means that the master now 
becomes a single point of failure for the user interface.  

The basic idea of the app is that there are multiple oracle instances 
contributing to a single document.  The volume and organization of the data 
(database links, normalization, etc...) prevents any sort of fast querying via 
SQL to do querying of the documents.  The solution is to build a lucene index 
(via solr), and use that for searching.  When updates are made in the UI, we 
will also send the updates directly to the solr server as well (we don't want 
to wait some arbitrary interval for a delta query to run).  

So you can see the problem here is that if the master is down, the sending of 
the updates to the master solr server will fail, thus causing an application 
exception.

I have tried configuring multiple Solr servers which are both set up as masters 
and slaves to each other, but they keep clobbering each other's index updates and 
rolling back each other's delta updates. It seems that the replication doesn't 
take the generation # into account and check that the generation it's fetching 
is > the generation it already has before it applies it.

I thought of maybe introducing a JMS queue to send my updates to and having the 
JMS message listener set to manually acknowledge the messages only after a 
successful application of the SolrJ API calls, but that seems kind of 
contrived, and is only a band-aid.

Does anyone have any suggestions?


 
mattin...@yahoo.com
Once you start down the dark path, forever will it
dominate your destiny.  Consume you it will  - Yoda



  


Re: High Availability

2010-01-04 Thread rob

Have you looked into a basic floating IP setup?

Have the master also replicate to another hot-spare master.

Any downtime during an outage of the 'live' master would be minimal as the 
hot-spare takes up the floating IP.




On Mon 04/01/10 16:13 , Matthew Inger mattin...@yahoo.com wrote:

 I'm kind of stuck and looking for suggestions for high availability
 options.  I've figured out without much trouble how to get the
 master-slave replication working.  This eliminates any single points
 of failure in the application in terms of the application's searching
 capability.
 I would setup a master which would create the index and several
 slaves to act as the search servers, and put them behind a load
 balancer to distribute the requests.  This would ensure that if a
 slave node goes down, requests would continue to get serviced by the
 other nodes that are still up.
 The problem I have is that my particular application also has the
 capability to trigger index updates from the user interface.  This
 means that the master now becomes a single point of failure for the
 user interface.  
 The basic idea of the app is that there are multiple oracle
 instances contributing to a single document.  The volume and
 organization of the data (database links, normalization, etc...)
 prevents any sort of fast querying via SQL to do querying of the
 documents.  The solution is to build a lucene index (via solr), and
 use that for searching.  When updates are made in the UI, we will
 also send the updates directly to the solr server as well (we don't
 want to wait some arbitrary interval for a delta query to run).  
 So you can see the problem here is that if the master is down, the
 sending of the updates to the master solr server will fail, thus
 causing an application exception.
 I have tried configuring multiple solr servers which are both setup
 as masters and slaves to each other, but they keep clobbering each
 other's index updates and rolling back each other's delta updates. 
 It seems that the replication doesn't take the generation # into
 account and check that the generation it's fetching is > the
 generation it already has before it applies it.
 I thought of maybe introducing a JMS queue to send my updates to and
 having the JMS message listener set to manually acknowledge the
 messages only after a successful application of the solrj api calls,
 but that seems kind of contrived, and is only a band-aid.
 Does anyone have any suggestions?
 
 Once you start down the dark path, forever will it
 dominate your destiny.  Consume you it will  - Yoda
 
 
Message sent via Atmail Open - http://atmail.org/


Re: Facets and distributed search

2010-01-04 Thread Yonik Seeley
Something looks wrong... that type of slowdown is certainly not expected.
You should be able to see both the main query and a sub-query in the
logs... could you post an actual example?

-Yonik
http://www.lucidimagination.com


On Mon, Jan 4, 2010 at 4:15 AM, Aleksander Stensby
aleksander.sten...@integrasco.com wrote:
 Hi everyone! I've posted a similar question earlier, but in a thread related
 to facets in general, so I thought I'd repost it here as a separate thread.

 I have a faceted search that is very fast when I executed the query on a
 single solr server, but is significantly slower when executed in a
 distributed environment.
 The setback seems to be in the sharding of our data, and that puzzles me a
 little bit... I can't really see why Solr is so slow at doing this.

 The scenario:
 Let's say we have two servers (s1 and s2).
 If i query
 the following:
 q=threadid:33&facet=true&facet.field=author&facet.limit=-1&facet.mincount=0&rows=0
 directly on either server, the response is lightning fast. (10ms)

 So, in theory I could query them directly, concat the result myself and get
 that done pretty fast.

 But if I introduce the shards parameter, the response time booms to between
 15000ms and 20000ms!
 shards=s1:8983/solr,s2:8983/solr

 My initial thought is that I MUST be doing something wrong here.

 So I try the following:
 Run the query on server s1, with the shards param shards=s1:8983/solr
 response time goes from sub-10ms to between 5000ms and 10000ms!
 Same results if I run the query on s2, and the same if I use shards=s2:8983/solr.

 Is there really that much overhead in running a distributed facet field
 query with Solr? Anyone else experienced this?

 On the other hand, running regular distributed queries without facets is
 lightning fast... (so I can't really see that this is a network problem or
 anything either). I tried running a facet query on s1 with s1 as the
 shards param, and that is still as slow as if the shards param pointed
 to a different server...

 Any insight into this would be greatly appreciated! (Would like to avoid
 having to hack together our own solution concatenating results...)

 Cheers,
  Aleks



Re: High Availability

2010-01-04 Thread Matthew Inger
So when the masters switch back, does that mean we have to force a full
delta update, correct?


 
mattin...@yahoo.com
Once you start down the dark path, forever will it
dominate your destiny.  Consume you it will  - Yoda



- Original Message 
From: r...@intelcompute.com r...@intelcompute.com
To: solr-user@lucene.apache.org
Sent: Mon, January 4, 2010 11:17:40 AM
Subject: Re: High Availability


Have you looked into a basic floating IP setup?

Have the master also replicate to another hot-spare master.

Any downtime during an outage of the 'live' master would be minimal as the 
hot-spare takes up the floating IP.




On Mon 04/01/10 16:13 , Matthew Inger mattin...@yahoo.com wrote:

 I'm kind of stuck and looking for suggestions for high availability
 options.  I've figured out without much trouble how to get the
 master-slave replication working.  This eliminates any single points
 of failure in the application in terms of the application's searching
 capability.
 I would setup a master which would create the index and several
 slaves to act as the search servers, and put them behind a load
 balancer to distribute the requests.  This would ensure that if a
 slave node goes down, requests would continue to get serviced by the
 other nodes that are still up.
 The problem I have is that my particular application also has the
 capability to trigger index updates from the user interface.  This
 means that the master now becomes a single point of failure for the
 user interface.  
 The basic idea of the app is that there are multiple oracle
 instances contributing to a single document.  The volume and
 organization of the data (database links, normalization, etc...)
 prevents any sort of fast querying via SQL to do querying of the
 documents.  The solution is to build a lucene index (via solr), and
 use that for searching.  When updates are made in the UI, we will
 also send the updates directly to the solr server as well (we don't
 want to wait some arbitrary interval for a delta query to run).  
 So you can see the problem here is that if the master is down, the
 sending of the updates to the master solr server will fail, thus
 causing an application exception.
 I have tried configuring multiple solr servers which are both setup
 as masters and slaves to each other, but they keep clobbering each
 other's index updates and rolling back each other's delta updates. 
 It seems that the replication doesn't take the generation # into
 account and check that the generation it's fetching is > the
 generation it already has before it applies it.
 I thought of maybe introducing a JMS queue to send my updates to and
 having the JMS message listener set to manually acknowledge the
 messages only after a successful application of the solrj api calls,
 but that seems kind of contrived, and is only a band-aid.
 Does anyone have any suggestions?
 
 Once you start down the dark path, forever will it
 dominate your destiny.  Consume you it will  - Yoda
 
 
Message sent via Atmail Open - http://atmail.org/



  


Re: High Availability

2010-01-04 Thread rob

Even when Master 1 is alive again, it shouldn't get the floating IP until 
Master 2 actually fails.

So you'd ideally want them replicating to each other, but since only one will be 
updated/live at a time, it shouldn't cause an issue with clobbering data (?).

Just a suggestion though; I've not done it myself with Solr, only with DB servers.




On Mon 04/01/10 16:28 , Matthew Inger mattin...@yahoo.com wrote:

 So, when the masters switch back, does that mean, we have to force a
 full delta update, correct?
 
 Once you start down the dark path, forever will it
 dominate your destiny.  Consume you it will  - Yoda
 - Original Message 
 From:  
 To: 
 Sent: Mon, January 4, 2010 11:17:40 AM
 Subject: Re: High Availability
 Have you looked into a basic floating IP setup?
 Have the master also replicate to another hot-spare master.
 Any downtime during an outage of the 'live' master would be minimal
 as the hot-spare takes up the floating IP.
 On Mon 04/01/10 16:13 , Matthew Inger  wrote:
  I'm kind of stuck and looking for suggestions for high
 availability
  options.  I've figured out without much trouble how to get the
  master-slave replication working.  This eliminates any single
 points
  of failure in the application in terms of the application's
 searching
  capability.
  I would setup a master which would create the index and several
  slaves to act as the search servers, and put them behind a load
  balancer to distribute the requests.  This would ensure that if a
  slave node goes down, requests would continue to get serviced by
 the
  other nodes that are still up.
  The problem I have is that my particular application also has the
  capability to trigger index updates from the user interface.  This
  means that the master now becomes a single point of failure for
 the
  user interface.  
  The basic idea of the app is that there are multiple oracle
  instances contributing to a single document.  The volume and
  organization of the data (database links, normalization, etc...)
  prevents any sort of fast querying via SQL to do querying of the
  documents.  The solution is to build a lucene index (via solr),
 and
  use that for searching.  When updates are made in the UI, we will
  also send the updates directly to the solr server as well (we
 don't
  want to wait some arbitrary interval for a delta query to run).  
  So you can see the problem here is that if the master is down, the
  sending of the updates to the master solr server will fail, thus
  causing an application exception.
  I have tried configuring multiple solr servers which are both
 setup
  as masters and slaves to each other, but they keep clobbering each
  other's index updates and rolling back each other's delta updates.
 
  It seems that the replication doesn't take the generation # into
  account and check that the generation it's fetching is > the
  generation it already has before it applies it.
  I thought of maybe introducing a JMS queue to send my updates to
 and
  having the JMS message listener set to manually acknowledge the
  messages only after a successful application of the solrj api
 calls,
  but that seems kind of contrived, and is only a band-aid.
  Does anyone have any suggestions?
  
  Once you start down the dark path, forever will it
  dominate your destiny.  Consume you it will  - Yoda
  
  
 Message sent via Atmail Open - http://atmail.org/
 
 
Message sent via Atmail Open - http://atmail.org/


Re: High Availability

2010-01-04 Thread rob

I'm also not sure what hooks you could put in upon the IP floating to the other 
machine, to start/stop replication - if it IS an issue anyway.




On Mon 04/01/10 16:28 , Matthew Inger mattin...@yahoo.com wrote:

 So, when the masters switch back, does that mean, we have to force a
 full delta update, correct?
 
 Once you start down the dark path, forever will it
 dominate your destiny.  Consume you it will  - Yoda
 - Original Message 
 From:  
 To: 
 Sent: Mon, January 4, 2010 11:17:40 AM
 Subject: Re: High Availability
 Have you looked into a basic floating IP setup?
 Have the master also replicate to another hot-spare master.
 Any downtime during an outage of the 'live' master would be minimal
 as the hot-spare takes up the floating IP.
 On Mon 04/01/10 16:13 , Matthew Inger  wrote:
  I'm kind of stuck and looking for suggestions for high
 availability
  options.  I've figured out without much trouble how to get the
  master-slave replication working.  This eliminates any single
 points
  of failure in the application in terms of the application's
 searching
  capability.
  I would setup a master which would create the index and several
  slaves to act as the search servers, and put them behind a load
  balancer to distribute the requests.  This would ensure that if a
  slave node goes down, requests would continue to get serviced by
 the
  other nodes that are still up.
  The problem I have is that my particular application also has the
  capability to trigger index updates from the user interface.  This
  means that the master now becomes a single point of failure for
 the
  user interface.  
  The basic idea of the app is that there are multiple oracle
  instances contributing to a single document.  The volume and
  organization of the data (database links, normalization, etc...)
  prevents any sort of fast querying via SQL to do querying of the
  documents.  The solution is to build a lucene index (via solr),
 and
  use that for searching.  When updates are made in the UI, we will
  also send the updates directly to the solr server as well (we
 don't
  want to wait some arbitrary interval for a delta query to run).  
  So you can see the problem here is that if the master is down, the
  sending of the updates to the master solr server will fail, thus
  causing an application exception.
  I have tried configuring multiple solr servers which are both
 setup
  as masters and slaves to each other, but they keep clobbering each
  other's index updates and rolling back each other's delta updates.
 
  It seems that the replication doesn't take the generation # into
  account and check that the generation it's fetching is > the
  generation it already has before it applies it.
  I thought of maybe introducing a JMS queue to send my updates to
 and
  having the JMS message listener set to manually acknowledge the
  messages only after a successful application of the solrj api
 calls,
  but that seems kind of contrived, and is only a band-aid.
  Does anyone have any suggestions?
  
  Once you start down the dark path, forever will it
  dominate your destiny.  Consume you it will  - Yoda
  
  
 Message sent via Atmail Open - http://atmail.org/
 
 
Message sent via Atmail Open - http://atmail.org/


Re: Any way to modify result ranking using an integer field?

2010-01-04 Thread Ahmet Arslan
 Thanks Ahmet.
 
 Do I need to do anything to enable BoostQParserPlugin in
 Solr, or is it already enabled?

I just confirmed that it is already enabled. You can see the effect of it by
appending debugQuery=on to your search URL.
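For example (assuming an integer popularity field, as discussed):

http://localhost:8983/solr/select?q={!boost b=log(popularity)}ipod&debugQuery=on

The parsedquery section of the debug output should then show the boost(...)
wrapper around the main query.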


  


Re: Implementing Autocomplete/Query Suggest using Solr

2010-01-04 Thread Prasanna R
On Mon, Jan 4, 2010 at 1:20 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 On Wed, Dec 30, 2009 at 3:07 AM, Prasanna R plistma...@gmail.com wrote:

   I looked into the Solr/Lucene classes and found the required
 information.
  Am summarizing the same for the benefit of those that might refer to this
  thread in the future.
 
   The change I had to make was very simple - make a call to getPrefixQuery
  instead of getWildcardQuery in my custom-modified Solr dismax query
 parser
  class. However, this will make a fairly significant difference in terms
 of
  efficiency. The key difference between the lucene WildcardQuery and
  PrefixQuery lies in their respective term enumerators, specifically in
 the
  term comparators. The termCompare method for PrefixQuery is more
  light-weight than that of WildcardQuery and is essentially an
 optimization
  given that a prefix query is nothing but a specialized case of Wildcard
  query. Also, this is why the lucene query parser automatically creates a
  PrefixQuery for query terms of the form 'foo*' instead of a
 WildcardQuery.
 
 
 I don't understand this. There is nothing that one should need to do in
 Solr's code to make this work. Prefix queries are supported out of the box
 in Solr.

I am using the dismax query parser and I match on multiple fields with
different boosts. I run a prefix query on some fields in combination with a
regular field query on other fields. I do not know of any way in which one
could specify a prefix query on a particular field in a dismax query out
of the box in Solr 1.4. I had to update Solr to support additional syntax in
a dismax query that lets you choose to create a prefix query on a particular
field. As part of parsing this custom syntax, I was making a call to
getWildcardQuery, which I simply changed to getPrefixQuery.

Prasanna.


Phrase search issue with XMLPayload? Is it the better solution?

2010-01-04 Thread Shairon

I have a project that involves words extracted by OCR: each page has words,
and each word has its geometry, so a highlight can be shown to the end user.
I've been trying to represent this document structure with XML:

<document>
   <page num='1'>
      <term top='111' bottom='222' right='333' left='444'>foo</term>
      <term top='211' bottom='322' right='833' left='944'>bar</term>
      <term top='311' bottom='422' right='733' left='144'>baz</term>
      <term top='411' bottom='522' right='633' left='244'>qux</term>
   </page>
   <page num='2'>
      <term ... />
   </page>
</document>
Using the field 'fulltext_st':

<field name="fulltext_st">
&lt;document&gt;
&lt;page top='111' bottom='222' right='333' left='444' word='foo'
num='1'&gt;foo&lt;/page&gt;
&lt;page top='211' bottom='322' right='833' left='944' word='bar'
num='1'&gt;bar&lt;/page&gt;
&lt;page top='311' bottom='422' right='733' left='144' word='baz'
num='1'&gt;baz&lt;/page&gt;
&lt;page top='411' bottom='522' right='633' left='244' word='qux'
num='1'&gt;qux&lt;/page&gt;
&lt;/document&gt;
</field>
I can get all terms in my search results with their payloads,
but if I do a search using a phrase query I can't fetch any results.

Example:


search?q=foo
<lst name="fulltext_st">
<int
name="/document/page[word='foo'][num='1'][top='111'][bottom='222'][right='333'][left='444']">1</int>
</lst>


search?q=foo+bar
<lst name="fulltext_st">
<int
name="/document/page[word='foo'][num='1'][top='111'][bottom='222'][right='333'][left='444']">1</int>
<int
name="/document/page[word='bar'][num='1'][top='211'][bottom='322'][right='833'][left='944']">1</int>
</lst>

search?q="foo bar"

*nothing*


I was wondering if I could get your thoughts on whether XMLPayload supports
this sort of thing (phrase search), or is there a good solution to index a doc
with many pages and one rectangle (graphical word geometry) for each term?



thank you in advance

-- 
View this message in context: 
http://old.nabble.com/Phrase-search-issue-with-XMLPayload--Is-it-the-better-solution--tp27018815p27018815.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Any way to modify result ranking using an integer field?

2010-01-04 Thread Andy
Thank you Ahmet.

Is there any way I can configure Solr to always use {!boost b=log(popularity)} 
as the default for all queries?

I'm using Solr through django-haystack, so all the Solr queries are actually 
generated by haystack. It'd be much cleaner if I could configure Solr to always 
use BoostQParserPlugin for all queries instead of manually modifying every 
single query generated by haystack.
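For illustration, something like the following handler defaults is what I'm
hoping for - assuming dismax's additive bf boost would be an acceptable
substitute for the multiplicative {!boost} (the handler name and qf field are
just placeholders):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">text</str>
    <str name="bf">log(popularity)</str>
  </lst>
</requestHandler>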

--- On Mon, 1/4/10, Ahmet Arslan iori...@yahoo.com wrote:

From: Ahmet Arslan iori...@yahoo.com
Subject: Re: Any way to modify result ranking using an integer field?
To: solr-user@lucene.apache.org
Date: Monday, January 4, 2010, 2:33 PM

 Thanks Ahmet.
 
 Do I need to do anything to enable BoostQParserPlugin in
 Solr, or is it already enabled?

I just confirmed that it is already enabled. You can see the effect of it by 
appending debugQuery=on to your search URL.


      



  

Re: Improvising solr queries

2010-01-04 Thread Tom Hill
Hi -

Something doesn't make sense to me here:

On Mon, Jan 4, 2010 at 5:55 AM, dipti khullar dipti.khul...@gmail.com wrote:

 - optimize runs on master in every 7 minutes
 - using postOptimize , we execute snapshooter on master
 - snappuller/snapinstaller on 2 slaves runs after every 10 minutes


Why would you optimize every 7 minutes, and update the slaves every ten?
After 70 minutes you'll be doing both at the same time.

How about optimizing every ten minutes, at :00,:10, :20, :30, :40, :50 and
then pulling every ten minutes at :01, :11, :21, :31, :41, :51 (assuming
your optimize completes in one minute).

Or did I misunderstand something?


 The issue gets resolved as soon as we optimize the slave index. In the solr
 admin, it shows only 4 requests/sec being handled with 400 ms response time.


From your earlier description, it seems like you should only be distributing
an optimized index, so optimizing the slave should be a no-op. Check to see
what files you have on the slave after snappulling.

Tom


Non-leading wildcard search

2010-01-04 Thread Peter S

Hello,
There are lots of questions and answers in the forum regarding varying wildcard 
behaviour, but I haven't been able to find any
that address this particular behaviour. Perhaps someone could help?
Problem:
I have a fieldType that only goes through a KeywordTokenizer at index time, to 
ensure it stays 'verbatim' (i.e. it doesn't get split into any tokens, on 
whitespace or otherwise).
Let's say there's some data stored in this field like this:
Let's say there's some data stored in this field like this:


Something
Something Else
Something Else Altogether


When I query "Something" or "Something Else" or *thing or *omething*, I
get back the expected results.
If, however, I query Some* or S* or s*, etc., I get no results (although
this type of non-leading wildcard works fine with other fieldType schema
elements that don't use KeywordTokenizer).
Is this something to do with KeywordTokenizer?
Is there a better way to index data (preserving case) and not splitting on ws 
or stemming etc. (i.e. no WhitespaceTokenizer or similar)?
My fieldType schema looks like this (I've tried a number of other combinations
as well, including using class="solr.TextField"):

<fieldType name="text_verbatim" class="solr.StrField"
positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
 
<field name="appname" type="text_verbatim" indexed="true" stored="true"/>

I understand that wildcard queries don't go through analyzers, but why is it 
that 'tokenized' data matches on non-leading wildcard queries, whereas 
non-tokenized (or, more specifically, Keyword-tokenized) data doesn't?
The fieldType schema requires some tokenizer class, and it appears that 
KeywordTokenizer is the only one that tokenizes to a single token (i.e. the 
whole string).
I'm sure I'm missing something that is probably reasonably obvious, but having 
tried myriad combinations, I thought it prudent to ask the experts in the forum.
 
Many thanks for any insight you can provide on this.
 
Peter
 

  
_
Use Hotmail to send and receive mail from your different email accounts
http://clk.atdmt.com/UKM/go/186394592/direct/01/

Re: Non-leading wildcard search

2010-01-04 Thread Yonik Seeley
On Mon, Jan 4, 2010 at 5:38 PM, Peter S pete...@hotmail.com wrote:
 When I query "Something" or "Something Else" or *thing or *omething*,
 I get back the expected results.
 If, however, I query Some* or S* or s*, etc., I get no results (although
 this type of non-leading wildcard works fine with other fieldType schema
 elements that don't use KeywordTokenizer).

Is your query string actually in quotes?  Wildcards aren't currently
supported in quotes.
So text_verbatim:Some* should work.

-Yonik
http://www.lucidimagination.com


RE: Non-leading wildcard search

2010-01-04 Thread Peter S

Hi Yonik,

 

Thanks for your quick reply.

No, the queries themselves aren't in quotes.

 

Since I sent the initial email, I have managed to get non-leading wildcard 
queries to work with this, but by unexpected means (for me at least :-).

 

If I add a LowerCaseFilterFactory to the fieldType, queries like s* (or S*) 
work as expected.

 

So the fieldType schema element now looks like:

<fieldType name="text_verbatim" class="solr.TextField"
positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

 

I wasn't expecting this, as I would have thought this would change only the 
case behaviour, not the wildcard behaviour (or at least not just the 
non-leading wildcard behaviour). Perhaps I'm just not understanding how the 
term (singular here, as it's not tokenized) is indexed and subsequently matched.

 

What I've noticed is that with the LowerCaseFilterFactory in place, document 
queries return results with case intact, but facet queries show the results in 
lower case

(e.g. document appname=Something, facet.field appname=something). (I kind of 
expected the document appname field to be lower case as well.)

 

Does this sound like correct behaviour to you?

If it's correct, that's ok, I'll manage to work 'round it (maybe there's a way 
to map the facet field back to the document field?), but if it sounds wrong, 
perhaps it warrants further investigation.

 

Many thanks,

Peter

 


 
 Date: Mon, 4 Jan 2010 17:42:30 -0500
 Subject: Re: Non-leading wildcard search
 From: yo...@lucidimagination.com
 To: solr-user@lucene.apache.org
 
 On Mon, Jan 4, 2010 at 5:38 PM, Peter S pete...@hotmail.com wrote:
  When I query "Something" or "Something Else" or *thing or
  *omething*, I get back the expected results.
  If, however, I query Some* or S* or s*, etc., I get no results
  (although this type of non-leading wildcard works fine with other fieldType
  schema elements that don't use KeywordTokenizer).
 
 Is your query string actually in quotes? Wildcards aren't currently
 supported in quotes.
 So text_verbatim:Some* should work.
 
 -Yonik
 http://www.lucidimagination.com
  
_
View your other email accounts from your Hotmail inbox. Add them now.
http://clk.atdmt.com/UKM/go/186394592/direct/01/

RE: Non-leading wildcard search

2010-01-04 Thread Peter S

FYI:

 

I have found the root of this behaviour. It has to do with a test patch I've 
been working on for working 'round pre SOLR-219 (case insensitive wildcard 
searching).

With the test patch switched out, it works as expected. Although the case 
insensitive wildcard search reverts to pre-SOLR-219 behaviour.

 

I believe I can work 'round this by using a copyField that holds the lower-case 
text for wildcarding.
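A sketch of what I mean (the *_lc type and field names are mine):

<fieldType name="text_verbatim_lc" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="appname_lc" type="text_verbatim_lc" indexed="true" stored="false"/>
<copyField source="appname" dest="appname_lc"/>

Wildcard queries would then go against appname_lc with a lower-cased pattern,
while faceting and display keep using appname with case intact.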

 

Many thanks, Yonik for your help.

 

Peter

 


 
 From: pete...@hotmail.com
 To: solr-user@lucene.apache.org
 Subject: RE: Non-leading wildcard search
 Date: Mon, 4 Jan 2010 23:29:04 +
 
 
 Hi Yonik,
 
 
 
 Thanks for your quick reply.
 
 No, the queries themselves aren't in quotes.
 
 
 
 Since I sent the initial email, I have managed to get non-leading wildcard 
 queries to work with this, but by unexpected means (for me at least :-).
 
 
 
 If I add a LowerCaseFilterFactory to the fieldType, queries like s* (or S*) 
 work as expected.
 
 
 
 So the fieldType schema element now looks like:
 
 <fieldType name="text_verbatim" class="solr.TextField"
 positionIncrementGap="100">
 <analyzer type="index">
 <tokenizer class="solr.KeywordTokenizerFactory"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
 <analyzer type="query">
 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
 ignoreCase="true" expand="true"/>
 </analyzer>
 </fieldType>
 
 
 
 I wasn't expecting this, as I would have thought this would change only the 
 case behaviour, not the wildcard behaviour (or at least not just the 
 non-leading wildcard behaviour). Perhaps I'm just not understanding how the 
 term (singular here, as it's not tokenized) is indexed and subsequently 
 matched.
 
 
 
 What I've noticed is that with the LowerCaseFilterFactory in place, document 
 queries return results with case intact, but facet queries show the results 
 in lower case
 
 (e.g. document appname=Something, facet.field appname=something). (I kind of 
 expected the document appname field to be lower case as well.)
 
 
 
 Does this sound like correct behaviour to you?
 
 If it's correct, that's ok, I'll manage to work 'round it (maybe there's a 
 way to map the facet field back to the document field?), but if it sounds 
 wrong, perhaps it warrants further investigation.
 
 
 
 Many thanks,
 
 Peter
 
 
 
 
 
  Date: Mon, 4 Jan 2010 17:42:30 -0500
  Subject: Re: Non-leading wildcard search
  From: yo...@lucidimagination.com
  To: solr-user@lucene.apache.org
  
  On Mon, Jan 4, 2010 at 5:38 PM, Peter S pete...@hotmail.com wrote:
   When I query: "Something" or "Something Else" or "*thing" or
   "*omething*", I get back the expected results.
   If, however, I query: "Some*" or "S*" or "s*" etc., I get no results
   (although this type of non-leading wildcard works fine with other
   fieldType schema elements that don't use KeywordTokenizer).
  
  Is your query string actually in quotes? Wildcards aren't currently
  supported in quotes.
  So text_verbatim:Some* should work.
  
  -Yonik
  http://www.lucidimagination.com
 
 _
 View your other email accounts from your Hotmail inbox. Add them now.
 http://clk.atdmt.com/UKM/go/186394592/direct/01/
  
_
Add your Gmail and Yahoo! Mail email accounts into Hotmail - it's easy
http://clk.atdmt.com/UKM/go/186394592/direct/01/

Re: Indexing the latest MS Office documents

2010-01-04 Thread Peter Wolanin
You must have been searching old documentation - I think Tika 0.3+ has
support for the new MS formats. But don't take my word for it - why
don't you build Tika and try it?

-Peter

On Sun, Jan 3, 2010 at 7:00 PM, Roland Villemoes r...@alpha-solutions.dk 
wrote:
 Hi All,

 Does anyone know how to index the latest MS Office documents like .docx and
 .xlsx?

 From searching it seems like Tika only supports the earlier formats .doc and 
 .xls



 med venlig hilsen/best regards

 Roland Villemoes
 Tel: (+45) 22 69 59 62
 E-Mail: mailto:r...@alpha-solutions.dk





-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com
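
For anyone wanting to test this quickly, a minimal sketch using Tika's
AutoDetectParser (the file name is a placeholder; the program prints the
detected content type and the extracted text):

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class DocxExtractTest {
    public static void main(String[] args) throws Exception {
        // AutoDetectParser dispatches on the detected MIME type, so the
        // same code covers .doc/.docx/.xls/.xlsx alike.
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        InputStream stream = new FileInputStream("test.docx");
        try {
            parser.parse(stream, handler, metadata);
        } finally {
            stream.close();
        }
        System.out.println(metadata.get(Metadata.CONTENT_TYPE));
        System.out.println(handler.toString());
    }
}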


Re: Improvising solr queries

2010-01-04 Thread Ian Holsman

On 1/5/10 12:46 AM, Shalin Shekhar Mangar wrote:

 (sitename:"XYZ" OR sitename:"All Sites") AND (localeid:1237400589415) AND
  ((assettype:"Gallery")) AND (rbcategory:"ABC XYZ") AND
  (startdate:[* TO 2009-12-07T23:59:00Z] AND enddate:[2009-12-07T00:00:00Z TO *])
  &rows=9&start=63&sort=date desc
  &facet=true&facet.field=assettype&facet.mincount=1

  Similar to this query we have several much more complex queries supporting all
  major landing pages of our application.

  Just want to confirm whether anyone can identify any major flaws or
  issues in the sample query?


 
I'm not the expert Shalin is, but I seem to remember sorting by date was
pretty rough on CPU. (This could have been resolved since I last looked
at it.)

The other thing I'd question is the facet: it looks like you're only
retrieving a single assetType (Gallery), so you will only get a single
facet value back. If that's the case, wouldn't the rows returned (which
are part of the response) give you the same answer?


Most of those AND conditions can be separate filter queries. Filter queries
can be cached separately and can therefore be re-used. See
http://wiki.apache.org/solr/FilterQueryGuidance

   




Listing Terms by Ascending IDF value . . ?

2010-01-04 Thread Christopher Ball
Hello,

 

I am trying to get a list of highly unusual terms or phrases (for example a
TF of 1 or 2) within an entire index (essentially this would be the inverse
of how Luke gives 'top terms' on the 'Overview' tab).

 

I see how I can do this within a specific query using the Term Vector
Component (qt=tvrh).
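
(For reference, such a per-query request looks roughly like the following -- a
sketch assuming the tvrh handler from the example solrconfig, termVectors
enabled on the field, and placeholder values for the document id and field
name:

http://localhost:8983/solr/select?qt=tvrh&q=id:doc1&tv.fl=text&tv.tf_idf=true

It only reports on documents matched by the query, hence the question below.)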

 

But do I have to write my own analyzer to get a list for the complete index
in ascending order?

 

Most grateful for any thoughts or insights,

 

Christopher 

 



Re: Rules engine and Solr

2010-01-04 Thread Avlesh Singh
 Thanks for the response, Shalin. I am still in two minds over doing it
inside Solr versus outside.
I'll get back with more questions, if any.

Cheers
Avlesh

On Mon, Jan 4, 2010 at 5:11 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 On Mon, Jan 4, 2010 at 10:24 AM, Avlesh Singh avl...@gmail.com wrote:

  I have a Solr (version 1.3) powered search server running in production.
  Search is keyword driven and is supported using custom fields and tokenizers.
 
  I am planning to build a rules engine on top of search. The rules are
  database driven and can't be stored inside solr indexes. These rules
  would ultimately do two things -
 
1. Change the order of Lucene hits.
 

 A Lucene FieldComparator is what you'd need. The QueryElevationComponent
 uses this technique.


2. Add/remove some results to/from the Lucene hits.
 
 
 This is a bit more tricky. If you will always have a very limited number of
 docs to add or remove, it may be best to change the query itself to include
 or exclude them (i.e. add fq). Otherwise you'd need to write a custom
 Collector (see DocSetCollector) and change SolrIndexSearcher to use it. We
 are planning to modify SolrIndexSearcher to allow custom collectors soon
 for
 field collapsing but for now you will have to modify it.


  What should be my starting point? Custom search handler?
 
 
 A custom SearchComponent which extends/overrides QueryComponent will do the
 job.

 --
 Regards,
 Shalin Shekhar Mangar.
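
As a rough illustration of that last suggestion, a skeleton component (the
class name and the comments are hypothetical; QueryComponent and
ResponseBuilder are the actual Solr classes):

import java.io.IOException;
import org.apache.solr.handler.component.QueryComponent;
import org.apache.solr.handler.component.ResponseBuilder;

public class RulesEngineComponent extends QueryComponent {
    @Override
    public void process(ResponseBuilder rb) throws IOException {
        // Let the stock query component build the normal result set first.
        super.process(rb);
        // rb.getResults() now holds the ranked hits; a rules engine could
        // reorder them here. Adding/removing documents is the harder case
        // mentioned above -- for a small fixed set it is simpler to append
        // a filter such as fq=-id:(123 OR 456) before the query runs.
    }
}

The component would then be registered in solrconfig.xml in place of the
stock query component.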



Re: Improvising solr queries

2010-01-04 Thread Shalin Shekhar Mangar
On Tue, Jan 5, 2010 at 11:16 AM, dipti khullar dipti.khul...@gmail.com wrote:


 This assettype is variable. It can have around 6 values at a time.
 But this is true that we apply facet mostly on just one field - assettype.


Ian has a good point. You are faceting on assettype and you are also
filtering on it, so you will get only one facet value, "Gallery", with a
count equal to numFound.


  Any idea if the use of date range queries is expensive? Also if Shalin can
  put in some comments on "sorting by date was pretty rough on CPU", I can
  start analyzing the sort-by-date specific queries.


This is a range search and not a sort. I don't know if range search on dates
is especially costly compared to a range search on any other type. But I do
know that trie fields in Solr 1.4 are much faster for range searches at the
cost of more tokens in the index.
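
For reference, the Solr 1.4 example schema declares a trie-based date type
along these lines (startdate is the field from the query in this thread):

<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>
<field name="startdate" type="tdate" indexed="true" stored="true"/>

A smaller (non-zero) precisionStep indexes more tokens per value, which is the
space-for-speed trade-off mentioned above.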

With a date field, instead of using NOW, you should always try to round it
down to the coarsest interval you can use. So if it is possible to use
NOW/DAY instead of NOW, you should do that. The problem with querying on NOW
is that it is always unique and therefore the query can never be cached
(actually, it is cached but can never be hit). If you use NOW/DAY, the query
can be cached for a day.
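
Applied to the earlier query, the date clauses could use rounded date math
instead of per-request absolute timestamps (a sketch with the example field
names):

fq=startdate:[* TO NOW/DAY+1DAY] AND enddate:[NOW/DAY TO *]

NOW/DAY rounds down to midnight, so the filter string is identical for every
request within a given day and the cache entry can actually be hit.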

-- 
Regards,
Shalin Shekhar Mangar.


Re: Listing Terms by Ascending IDF value . . ?

2010-01-04 Thread Shalin Shekhar Mangar
On Tue, Jan 5, 2010 at 9:15 AM, Christopher Ball 
christopher.b...@metaheuristica.com wrote:

 Hello,

 I am trying to get a list of highly unusual terms or phrases (for example a
 TF of 1 or 2) within an entire index (essentially this would be the inverse
 of how Luke gives 'top terms' on the 'Overview' tab).

 I see how I can do this within a specific query using the Term Vector
 Component (qt=tvrh).


Did you mean TermsComponent (qt=terms)?


 But do I have to write my own analyzer to get a list for the complete index
 in ascending order?


No, you don't need a custom analyzer. But TermsComponent can only sort by
frequency in descending order or by index order (lexicographical order).

Perhaps the patch in SOLR-1672 is more suitable for your task.
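
That said, a rough approximation is possible with the stock TermsComponent by
capping document frequency instead of sorting ascending (a sketch, e.g. via
the /terms handler in the example solrconfig; "text" is a placeholder field
name, and note the counts are document frequencies, not raw term frequencies):

http://localhost:8983/solr/terms?terms=true&terms.fl=text&terms.maxcount=2&terms.limit=-1&terms.sort=index

This lists, in index order, only the terms that appear in at most two
documents.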

-- 
Regards,
Shalin Shekhar Mangar.