Re: Which Tokeniser (and/or filter)

2012-02-08 Thread Rob Brown
Apologies if things were a little vague.

Given the example snippet to index (numbered to show searches needed to
match)...

1: i am a sales-manager in here
2: using asp.net and .net daily
3: working in design.
4: using something called sage 200. and i'm fluent
5: german sausages.
6: busy AE dept earning £10,000 annually


... all with newlines in place.

able to match...

1. sales
1. sales manager
1. sales-manager
1. sales-manager
2. .net
2. asp.net
3. design
4. sage 200
6. AE
6. £10,000

But do NOT match fluent german from 4 + 5 since there's a newline
between them when indexed, but not when searched.


Don't the filters (WDF in this case) create multiple tokens? So
splitting on the period in asp.net would create tokens for all of asp,
asp., asp.net, .net and net.


Cheers,
Rob

-- 

IntelCompute
Web Design and Online Marketing

http://www.intelcompute.com


-Original Message-
From: Chris Hostetter hossman_luc...@fucit.org
Reply-to: solr-user@lucene.apache.org
To: solr-user@lucene.apache.org
Subject: Re: Which Tokeniser (and/or filter)
Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)

: This all seems a bit too much work for such a real-world scenario?

You haven't really told us what your scenario is.

You said you want to split tokens on whitespace, full-stop (aka: 
period) and comma only, but then in response to some suggestions you added 
comments about other things that you never mentioned previously...

1) evidently you don't want the . in foo.net to cause a split in tokens?
2) evidently you not only want token splits on newlines, but also 
position gaps to prevent phrases matching across newlines.

...these are kind of important details that affect suggestions people 
might give you.

can you please provide some concrete examples of the types of data you 
have, the types of queries you want them to match, and the types of 
queries you *don't* want to match?


-Hoss



Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread Ted Dunning
This is true with Lucene as it stands.  It would be much faster if there
were a specialized in-memory index such as is typically used with high
performance search engines.
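
(For reference, the closest thing stock Lucene offers here is copying an
on-disk index into the JVM heap via RAMDirectory; a minimal sketch, not
the specialized in-memory structure meant above. Class names are from
Lucene 3.x and the index path is a placeholder:)

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamIndex {
      public static void main(String[] args) throws Exception {
        // Copy the on-disk segments into heap once at startup; the copy is
        // read-only and must be rebuilt after index updates.
        RAMDirectory ramDir =
            new RAMDirectory(FSDirectory.open(new File("/path/to/index")));
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(ramDir));
        // ... run queries against searcher as usual ...
        searcher.close();
      }
    }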

On Tue, Feb 7, 2012 at 9:50 PM, Lance Norskog goks...@gmail.com wrote:

 Experience has shown that it is much faster to run Solr with a small
 amount of memory and let the rest of the ram be used by the operating
 system disk cache. That is, the OS is very good at keeping the right
 disk blocks in memory, much better than Solr.

 How much RAM is in the server and how much RAM does the JVM get? How
 big are the documents, and how large is the term index for your
 searches? How many documents do you get with each search? And, do you
 use filter queries- these are very powerful at limiting searches.

 2012/2/7 James ljatreey...@163.com:
  Is there any practice to load the index into RAM to accelerate Solr
 performance?
  The overall document count is about 100 million, and search time is around
 100ms. I am looking for some way to speed up Solr's response time.
  I have seen that some people use SSDs, but SSDs also cost a lot. I want to
 know whether there is some way to load the index files into RAM, keep the
 RAM index and the disk index synchronized, and then search on the RAM index.



 --
 Lance Norskog
 goks...@gmail.com



Re:Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread James
But Solr does not have the in-memory index, am I right?





At 2012-02-08 16:17:49, Ted Dunning ted.dunn...@gmail.com wrote:
This is true with Lucene as it stands.  It would be much faster if there
were a specialized in-memory index such as is typically used with high
performance search engines.

On Tue, Feb 7, 2012 at 9:50 PM, Lance Norskog goks...@gmail.com wrote:

 Experience has shown that it is much faster to run Solr with a small
 amount of memory and let the rest of the ram be used by the operating
 system disk cache. That is, the OS is very good at keeping the right
 disk blocks in memory, much better than Solr.

 How much RAM is in the server and how much RAM does the JVM get? How
 big are the documents, and how large is the term index for your
 searches? How many documents do you get with each search? And, do you
 use filter queries- these are very powerful at limiting searches.

 2012/2/7 James ljatreey...@163.com:
  Is there any practice to load the index into RAM to accelerate Solr
 performance?
  The overall document count is about 100 million, and search time is around
 100ms. I am looking for some way to speed up Solr's response time.
  I have seen that some people use SSDs, but SSDs also cost a lot. I want to
 know whether there is some way to load the index files into RAM, keep the
 RAM index and the disk index synchronized, and then search on the RAM index.



 --
 Lance Norskog
 goks...@gmail.com



Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread Patrick Plaatje
A start may be to use a RAM disk for that. Mount it as a normal disk and
have the index files stored there. Have a read here:

http://en.wikipedia.org/wiki/RAM_disk
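
(For illustration, a typical Linux tmpfs mount; the mount point and size
are placeholders, and solrconfig.xml's dataDir would then point at it.
Note that a RAM disk is volatile, so the on-disk copy still has to be
kept and synchronized somewhere:)

    mount -t tmpfs -o size=16g tmpfs /mnt/solr-ram
    # in solrconfig.xml: <dataDir>/mnt/solr-ram</dataDir>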

Cheers,

Patrick


2012/2/8 Ted Dunning ted.dunn...@gmail.com

 This is true with Lucene as it stands.  It would be much faster if there
 were a specialized in-memory index such as is typically used with high
 performance search engines.

 On Tue, Feb 7, 2012 at 9:50 PM, Lance Norskog goks...@gmail.com wrote:

  Experience has shown that it is much faster to run Solr with a small
  amount of memory and let the rest of the ram be used by the operating
  system disk cache. That is, the OS is very good at keeping the right
  disk blocks in memory, much better than Solr.
 
  How much RAM is in the server and how much RAM does the JVM get? How
  big are the documents, and how large is the term index for your
  searches? How many documents do you get with each search? And, do you
  use filter queries- these are very powerful at limiting searches.
 
  2012/2/7 James ljatreey...@163.com:
   Is there any practice to load the index into RAM to accelerate Solr
  performance?
   The overall document count is about 100 million, and search time is around
  100ms. I am looking for some way to speed up Solr's response time.
   I have seen that some people use SSDs, but SSDs also cost a lot. I want to
  know whether there is some way to load the index files into RAM, keep the
  RAM index and the disk index synchronized, and then search on the RAM
  index.
 
 
 
  --
  Lance Norskog
  goks...@gmail.com
 




-- 
Patrick Plaatje
Senior Consultant
http://www.nmobile.nl/


Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread Dmitry Kan
Hi,

This talk has some interesting details on setting up a Lucene index in RAM:

http://www.lucidimagination.com/devzone/events/conferences/revolution/2011/lucene-yelp


Would be great to hear your findings!

Dmitry

2012/2/8 James ljatreey...@163.com

 Is there any practice to load the index into RAM to accelerate Solr
 performance?
 The overall document count is about 100 million, and search time is around
 100ms. I am looking for some way to speed up Solr's response time.
 I have seen that some people use SSDs, but SSDs also cost a lot. I want to
 know whether there is some way to load the index files into RAM, keep the
 RAM index and the disk index synchronized, and then search on the RAM index.



Query in starting solr 3.5

2012-02-08 Thread mechravi25
Hi,

I am using Solr version 3.5. I moved the data import handler files from Solr
1.4 (which I used previously) to the new Solr. When I tried to start Solr
3.5, I got the following message in my log:

WARNING: XML parse warning in solrres:/dataimport.xml, line 2, column 95:
Include operation failed, reverting to fallback. Resource error reading file
as XML (href='solr/conf/solrconfig_master.xml'). Reason: Can't find resource
'solr/conf/solrconfig_master.xml' in classpath or
'/solr/apache-solr-3.5.0/example/multicore/core1/conf/',
cwd=/solr/apache-solr-3.5.0/example

The partial content of the dataimport file that I used in Solr 1.4 is as follows:

<xi:include href="solr/conf/solrconfig_master.xml"
xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:fallback>
<xi:include
href="/solr/apache-solr-3.5.0/example/multicore/IncludeFile/File1.xml"/>
<xi:include
href="/solr/apache-solr-3.5.0/example/multicore/IncludeFile/File2.xml"/>
<xi:include
href="/solr/apache-solr3.5.0/example/multicore/IncludeFile/File3"/>
</xi:fallback>
</xi:include>



The 3 files given in the fallback tag are present in that location. Does Solr 3.5
support fallback? Can someone please suggest a solution?



Also, I got the following warnings in my log while starting solr 3.5

WARNING: the luceneMatchVersion is not specified, defaulting to LUCENE_24
emulation. You should at some point declare and reindex to at least 3.0,
because 2.4 emulation is deprecated and will be removed in 4.0. This
parameter will be mandatory in 4.0.
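
(For what it's worth, the warning itself describes the non-patch fix:
declare the version at the top level of solrconfig.xml and reindex. A
minimal snippet, assuming Solr 3.5:)

    <luceneMatchVersion>LUCENE_35</luceneMatchVersion>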

The solution I found after googling is to apply a patch. Is there any other
option than applying this patch to overcome the warnings? Which is the
best option?


Kindly help me out.

Thanks in advance.








--
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-in-starting-solr-3-5-tp3725372p3725372.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread Andrzej Bialecki

On 08/02/2012 09:17, Ted Dunning wrote:

This is true with Lucene as it stands.  It would be much faster if there
were a specialized in-memory index such as is typically used with high
performance search engines.


This could be implemented in Lucene trunk as a Codec. The challenge 
though is to come up with the right data structures.


There has been some interesting research on optimizations for in-memory 
inverted indexes, but it usually involves changing the query evaluation 
algos as well - for reference:


http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/1202502
http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf
http://research.google.com/pubs/archive/37365.pdf

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Improving performance for SOLR geo queries?

2012-02-08 Thread Matthias Käppler
Hi Erick,

if we're not doing geo searches, we filter by location tags that we
attach to places. This is simply a hierarchical regional id, which is
simple to filter for, but much less flexible. We use that on the Web a
lot, but not on mobile, where we want to perform searches in
arbitrary radii around arbitrary positions. For those location tag
kind of queries, the average time spent in SOLR is 43msec (I'm looking
at the New Relic snapshot of the last 12 hours). I have disabled our
optimization again just yesterday, so for the bbox queries we're now
at an avg of 220ms (same time window). That's a 5 fold increase in
response time, and in peak hours it's worse than that.

I've also found a blog post from 3 years ago which outlines the inner
workings of the SOLR spatial indexing and searching:
http://www.searchworkings.org/blog/-/blogs/23842
From that it seems as if SOLR already performs a similar optimization
we had in mind during the index step, so if I understand correctly, it
doesn't even search over all records, only those that were mapped to
the grid box identified during indexing.

What I would love to see is what the suggested way is to perform a geo
query on SOLR, considering that they're so difficult to cache and
expensive to run. Is the best approach to restrict the candidate set
as much as possible using cheap filter queries, so that SOLR merely
has to do the geo search against these subsets? How does the query
planner work here? I see there's a cost attached to a filter query,
but one can only set it when cache is set to false? Are cached geo
queries executed last when there are cheaper filter queries to cut
down on documents? If you have a real world practical setup to share,
one that performs well in a production environment that serves
requests in the Millions per day, that would be great.

I'd love to contribute documentation by the way, if you knew me you'd
know I'm an avid open source contributor and actually run several open
source projects myself. But tell me, how can I possibly contribute
answers to questions I don't have an answer to? That's why I'm here,
remember :) So please, these kinds of snippy replies are not helping
anyone.

Thanks
-Matthias

On Tue, Feb 7, 2012 at 3:06 PM, Erick Erickson erickerick...@gmail.com wrote:
 So the obvious question is what is your
 performance like without the distance filters?

 Without that knowledge, we have no clue whether
 the modifications you've made had any hope of
 speeding up your response times

 As for the docs, any improvements you'd like to
 contribute would be happily received

 Best
 Erick

 2012/2/6 Matthias Käppler matth...@qype.com:
 Hi,

 we need to perform fast geo lookups on an index of ~13M places, and
 we're running into performance problems here with SOLR. We haven't done
 a lot of query optimization / SOLR tuning up until now so there's
 probably a lot of things we're missing. I was wondering if you could
 give me some feedback on the way we do things, whether they make
 sense, and especially why a supposed optimization we implemented
 recently seems to have no effect, when we actually thought it would
 help a lot.

 What we do is this: our API is built on a Rails stack and talks to
 SOLR via a Ruby wrapper. We have a few filters that almost always
 apply, which we put in filter queries. Filter cache hit rate is
 excellent, about 97%, and cache size caps at 10k filters (max size is
 32k, but it never seems to reach that many, probably because we
 replicate / delta update every few minutes). Still, geo queries are
 slow, about 250-500msec on average. We send them with cache=false, so
 as to not flood the fq cache and cause undesirable evictions.

 Now our idea was this: while the actual geo queries are poorly
 cacheable, we could clearly identify geographical regions which are
 more often queried than others (naturally, since we're a user driven
 service). Therefore, we dynamically partition Earth into a static grid
 of overlapping boxes, where the grid size (the distance of the nodes)
 depends on the maximum allowed search radius. That way, for every user
 query, we would always be able to identify a single bounding box that
 covers it. This larger bounding box (200km edge length) we would send
 to SOLR as a cached filter query, along with the actual user query
 which would still be sent uncached. Ex:

 User asks for places in 10km around 49.14839,8.5691, then what we will
 send to SOLR is something like this:

 fq={!bbox cache=false d=10 sfield=location_ll pt=49.14839,8.5691}
 fq={!bbox cache=true d=100.0 sfield=location_ll
 pt=49.4684836290799,8.31165802979391} -- this one we derive
 automatically

 That way SOLR would intersect the two filters and return the same
 results as when only looking at the smaller bounding box, but keep the
 larger box in cache and speed up subsequent geo queries in the same
 regions. Or so we thought; unfortunately this approach did not help
 query execution times get better, at all.

 

Re: URI Encoding with Solr and Weblogic

2012-02-08 Thread Elisabeth Adler

Hi,

I found a solution to it.
Adding the Weblogic Server Argument -Dfile.encoding=UTF-8 did not affect 
the encoding.


Only a change to the .war file's weblogic.xml and redeployment of the 
modified .war solved it.

I added the following to the weblogic.xml:

<charset-params>
  <input-charset>
    <resource-path>*</resource-path>
    <java-charset-name>UTF-8</java-charset-name>
  </input-charset>
</charset-params>

Would it make sense to include this in the shipped weblogic.xml file?

Best,
Elisabeth

On 07.02.2012 23:12, Elisabeth Adler wrote:

Hi,

I am trying to get Solr 3.3.0 to process Arabic search requests using its
admin interface. I have successfully managed to set it up on Tomcat
using the URIEncoding attribute but fail miserably on WebLogic 10.

Invoking the URL http://localhost:7012/solr/select/?q=تهنئة returns the
XML below:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="q">تهنئة</str>
</lst>
</lst>
<result name="response" numFound="0" start="0"/>
</response>

The search term comes back as just gibberish. Running the query through Luke or
Tomcat returns the expected result and renders the search term correctly.

I have tried to change the URI encoding and JVM default encoding by
setting the following start up arguments in WebLogic:
-Dfile.encoding=UTF-8 -Dweblogic.http.URIDecodeEncoding=UTF-8. I can see
them being set through Solr's admin interface. They don't have any
impact though.

I am running out of ideas on how to get this working. Any thoughts and
pointers are much appreciated.

Thanks,
Elisabeth



How to reindex about 10Mio. docs

2012-02-08 Thread Vadim Kisselmann
Hello folks,

I want to reindex about 10 Mio. docs from one Solr (1.4.1) to another
Solr (1.4.1).
I changed my schema.xml (field types sint to slong), so standard
replication would fail.
What is the fastest and smartest way to manage this?
This here sounds great (SolrEntityProcessor):
http://www.searchworkings.org/blog/-/blogs/importing-data-from-another-solr
But would it work with Solr 1.4.1?

Best Regards
Vadim


Re: How to reindex about 10Mio. docs

2012-02-08 Thread Ahmet Arslan
 I want to reindex about 10 Mio. docs from one Solr (1.4.1) to another
 Solr (1.4.1).
 I changed my schema.xml (field types sint to slong), so standard
 replication would fail.
 What is the fastest and smartest way to manage this?
 This here sounds great (SolrEntityProcessor):
 http://www.searchworkings.org/blog/-/blogs/importing-data-from-another-solr
 But would it work with Solr 1.4.1?

SolrEntityProcessor is not available in 1.4.1. I would dump the stored fields into a
comma-separated file, and use http://wiki.apache.org/solr/UpdateCSV to feed it
into the new Solr instance.
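
(If CSV turns out to be impractical, another option is a small SolrJ
client that pages documents out of the old core and feeds them to the new
one. A rough sketch, assuming every needed field is stored; the host names
are placeholders, and for 10M docs you would want to page on the unique
key rather than a deep start offset:)

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;
    import org.apache.solr.common.SolrInputDocument;

    public class Reindex {
      public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer src = new CommonsHttpSolrServer("http://old:8983/solr");
        CommonsHttpSolrServer dst = new CommonsHttpSolrServer("http://new:8983/solr");
        final int rows = 1000;
        for (int start = 0; ; start += rows) {
          SolrDocumentList page = src.query(
              new SolrQuery("*:*").setStart(start).setRows(rows)).getResults();
          if (page.isEmpty()) break;               // no more documents
          for (SolrDocument d : page) {
            SolrInputDocument in = new SolrInputDocument();
            for (String f : d.getFieldNames())     // copy all stored fields
              in.addField(f, d.getFieldValue(f));
            dst.add(in);
          }
        }
        dst.commit();
      }
    }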


Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread Robert Stewart
I concur with this.  As long as index segment files are cached in the OS file cache, 
performance is about as good as it gets.  Pulling segment files into RAM inside the 
JVM process may actually be slower, given Lucene's existing data structures and 
algorithms for reading segment file data.  If you have a very large index (much 
bigger than available RAM) then it will only be slow when accessing disk for 
uncached segment files.  In that case you might consider sharding the index across 
more than one server and using distributed searching (possibly SOLR cloud, 
etc.).

How large is your index in GB?  You can also try making index files smaller by 
removing indexed/stored fields you don't need, compressing large stored fields, 
etc.  Also maybe turn off storing norms, term frequencies, positions, vectors 
and stuff if you don't need them.
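
(In schema.xml those are per-field options; an illustrative line with a
made-up field name:)

    <field name="body" type="text" indexed="true" stored="false"
           omitNorms="true" omitTermFreqAndPositions="true"
           termVectors="false"/>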

On Feb 8, 2012, at 3:17 AM, Ted Dunning wrote:

 This is true with Lucene as it stands.  It would be much faster if there
 were a specialized in-memory index such as is typically used with high
 performance search engines.
 
 On Tue, Feb 7, 2012 at 9:50 PM, Lance Norskog goks...@gmail.com wrote:
 
 Experience has shown that it is much faster to run Solr with a small
 amount of memory and let the rest of the ram be used by the operating
 system disk cache. That is, the OS is very good at keeping the right
 disk blocks in memory, much better than Solr.
 
 How much RAM is in the server and how much RAM does the JVM get? How
 big are the documents, and how large is the term index for your
 searches? How many documents do you get with each search? And, do you
 use filter queries- these are very powerful at limiting searches.
 
 2012/2/7 James ljatreey...@163.com:
  Is there any practice to load the index into RAM to accelerate Solr
  performance?
  The overall document count is about 100 million, and search time is around
  100ms. I am looking for some way to speed up Solr's response time.
  I have seen that some people use SSDs, but SSDs also cost a lot. I want to
  know whether there is some way to load the index files into RAM, keep the
  RAM index and the disk index synchronized, and then search on the RAM index.
 
 
 
 --
 Lance Norskog
 goks...@gmail.com
 



Re: How to reindex about 10Mio. docs

2012-02-08 Thread Vadim Kisselmann
Hi Ahmet,
thanks for the quick response :)
I've already thought the same...
And it will be a pain to export and import this huge doc-set as CSV.
Do I have another solution?
Regards
Vadim


2012/2/8 Ahmet Arslan iori...@yahoo.com:
  I want to reindex about 10 Mio. docs from one Solr (1.4.1) to another
  Solr (1.4.1).
  I changed my schema.xml (field types sint to slong), so standard
  replication would fail.
  What is the fastest and smartest way to manage this?
  This here sounds great (SolrEntityProcessor):
 http://www.searchworkings.org/blog/-/blogs/importing-data-from-another-solr
 But would it work with Solr 1.4.1?

  SolrEntityProcessor is not available in 1.4.1. I would dump the stored fields
  into a comma-separated file, and use http://wiki.apache.org/solr/UpdateCSV to
  feed it into the new Solr instance.


usage of /etc/jetty.xml when debugging Solr in Eclipse

2012-02-08 Thread jmlucjav
Hi,

I am following
http://www.lucidimagination.com/devzone/technical-articles/setting-apache-solr-eclipse
in order to be able to debug Solr in eclipse. I got it working fine.

Now, I usually use ./etc/jetty.xml to set the logging configuration. When
starting Jetty in Eclipse I don't see any log files created, so I guessed
jetty.xml was not being used. So I added it to the RunJetty Advanced
configuration (Additional jetty.xml), but in that case something goes wrong,
as I get a 'java.net.BindException: Address already in use: JVM_Bind' error,
as if something is started twice.

So my question is: can jetty.xml be used while debugging in eclipse? If so,
how? I would like to use the same configuration I use when I am just
changing xml stuff in Solr and starting with 'java -jar start.jar'.

thanks in advance



--
View this message in context: 
http://lucene.472066.n3.nabble.com/usage-of-etc-jetty-xml-when-debugging-Solr-in-Eclipse-tp3725588p3725588.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to reindex about 10Mio. docs

2012-02-08 Thread Vadim Kisselmann
Another problem appeared ;)
How can I export my docs in CSV format?
In Solr 3.1+ I can use the query param wt=csv, but what about Solr 1.4.1?
Best Regards
Vadim


2012/2/8 Vadim Kisselmann v.kisselm...@googlemail.com:
 Hi Ahmet,
 thanks for quick response:)
 I've already thought the same...
 And it will be a pain to export and import this huge doc-set as CSV.
 Do i have an another solution?
 Regards
 Vadim


 2012/2/8 Ahmet Arslan iori...@yahoo.com:
  I want to reindex about 10 Mio. docs from one Solr (1.4.1) to another
  Solr (1.4.1).
  I changed my schema.xml (field types sint to slong), so standard
  replication would fail.
  What is the fastest and smartest way to manage this?
  This here sounds great (SolrEntityProcessor):
 http://www.searchworkings.org/blog/-/blogs/importing-data-from-another-solr
 But would it work with Solr 1.4.1?

  SolrEntityProcessor is not available in 1.4.1. I would dump the stored fields
  into a comma-separated file, and use http://wiki.apache.org/solr/UpdateCSV to
  feed it into the new Solr instance.


Custom Document Clustering and Mahout Integration

2012-02-08 Thread Selvam
Hi all,

I am trying to write a custom document clustering component that should
take all the docs in a commit and cluster them; Solr version: 3.5.0

Main Class:
public class KMeansClusteringEngine extends DocumentClusteringEngine
implements SolrEventListener

I added a newSearcher event listener, and that works as expected. But when is
the document clustering called? I have two functions of
DocumentClusteringEngine in my custom code, but when do they get called?
The wiki page says to add clustering.collection=true, but I am not sure, as my
guess is that document clustering is in no way related to search.

  public NamedList cluster(SolrParams params)
  public NamedList cluster(DocSet docSet, SolrParams solrParams)
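
(For orientation, a bare skeleton of such an engine; the comments mark
what is only an assumption about when the 3.x ClusteringComponent invokes
each entry point, which is exactly the open question here:)

    import org.apache.solr.common.params.SolrParams;
    import org.apache.solr.common.util.NamedList;
    import org.apache.solr.handler.clustering.DocumentClusteringEngine;
    import org.apache.solr.search.DocSet;

    public class SkeletonClusteringEngine extends DocumentClusteringEngine {
      // Presumably invoked for whole-collection clustering requests.
      public NamedList cluster(SolrParams params) {
        return new NamedList();
      }
      // Presumably invoked with the current result DocSet when
      // clustering.collection=true is sent on a search request.
      public NamedList cluster(DocSet docSet, SolrParams params) {
        return new NamedList();
      }
    }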


Note:
Actually I am trying to integrate Solr 3.5 with Mahout 0.5 for incremental
clustering (i.e. mapping new docs to existing clusters to avoid complete
re-clustering), basing my work on this GitHub code:
https://github.com/gsingers/ApacheCon2010/blob/master/src/main/java/com/grantingersoll/intell/clustering/KMeansClusteringEngine.java

I would love to get some support from you.

-- 
Regards,
S.Selvam
http://knackforge.com


Re: struggling with solr.WordDelimiterFilterFactory and periods . or dots

2012-02-08 Thread Erick Erickson
Hmmm, seems OK. Did you re-index after any
schema changes?

You'll learn to love admin/analysis for questions like this,
that page should show you what the actual tokenization
results are, make sure to click the verbose check boxes.

Best
Erick

On Tue, Feb 7, 2012 at 10:52 PM, geeky2 gee...@hotmail.com wrote:
 hello all,

 I am struggling with getting solr.WordDelimiterFilterFactory to behave as
 indicated in the Solr book (Smiley) on page 54.

 The example in the book reads like this:


 Here is an example exercising all options:
 WiFi-802.11b to Wi, Fi, WiFi, 802, 11, 80211, b, WiFi80211b


 Essentially, I have the same requirement with embedded periods and need to
 return a successful search on a field, even if the user does NOT enter the
 period.

 I have a field, itemNo, that can contain periods.

 Example content in the itemNo field:

 B12.0123

 When the user searches on this field, they need to be able to enter an
 itemNo without the period, and still find the item.

 Example:

 user enters: B120123 and a document should be returned with B12.0123.


 Unfortunately, the search will NOT return the appropriate document if the
 user enters B120123.

 However, the search does work if the user enters B12 0123 (a space in place
 of the period).

 Can someone help me understand what is missing from my configuration?


 This is snipped from my schema.xml file:


  <fields>
     ...
    <field name="itemNo" type="text" indexed="true" stored="true"/>
     ...
  </fields>


    <fieldType name="text" class="solr.TextField"
 positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
 ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
 words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
 generateWordParts="1" generateNumberParts="1" catenateWords="1"
 catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
 protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
 words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
 generateWordParts="1" generateNumberParts="1" catenateWords="1"
 catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
 protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>




 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/struggling-with-solr-WordDelimiterFilterFactory-and-periods-or-dots-tp3724822p3724822.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Which Tokeniser (and/or filter)

2012-02-08 Thread Erick Erickson
Yes, WDDF creates multiple tokens. But that has
nothing to do with the multiValued suggestion.

You can get exactly what you want by
1) setting multiValued="true" in your schema file and re-indexing. Say
positionIncrementGap is set to 100
2) When you index, add the field for each sentence, so your doc
  looks something like:
 <doc>
    <field name="sentences">i am a sales-manager in here</field>
    <field name="sentences">using asp.net and .net daily</field>
    ...
  </doc>
3) search like "sales manager"~100
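
(In schema.xml terms, 1) is just the multiValued flag plus the gap on the
field type; the names below are from the example above:)

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      ...
    </fieldType>
    <field name="sentences" type="text" indexed="true" stored="true"
           multiValued="true"/>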

Best
Erick

On Wed, Feb 8, 2012 at 3:05 AM, Rob Brown r...@intelcompute.com wrote:
 Apologies if things were a little vague.

 Given the example snippet to index (numbered to show searches needed to
 match)...

 1: i am a sales-manager in here
 2: using asp.net and .net daily
 3: working in design.
 4: using something called sage 200. and i'm fluent
 5: german sausages.
 6: busy AE dept earning £10,000 annually


 ... all with newlines in place.

 able to match...

 1. sales
 1. sales manager
 1. sales-manager
 1. sales-manager
 2. .net
 2. asp.net
 3. design
 4. sage 200
 6. AE
 6. £10,000

 But do NOT match fluent german from 4 + 5 since there's a newline
 between them when indexed, but not when searched.


 Don't the filters (WDF in this case) create multiple tokens? So
 splitting on the period in asp.net would create tokens for all of asp,
 asp., asp.net, .net and net.


 Cheers,
 Rob

 --

 IntelCompute
 Web Design and Online Marketing

 http://www.intelcompute.com


 -Original Message-
 From: Chris Hostetter hossman_luc...@fucit.org
 Reply-to: solr-user@lucene.apache.org
 To: solr-user@lucene.apache.org
 Subject: Re: Which Tokeniser (and/or filter)
 Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)

 : This all seems a bit too much work for such a real-world scenario?

 You haven't really told us what your scenario is.

 You said you want to split tokens on whitespace, full-stop (aka:
 period) and comma only, but then in response to some suggestions you added
 comments about other things that you never mentioned previously...

 1) evidently you don't want the . in foo.net to cause a split in tokens?
 2) evidently you not only want token splits on newlines, but also
 position gaps to prevent phrases matching across newlines.

 ...these are kind of important details that affect suggestions people
 might give you.

 can you please provide some concrete examples of the types of data you
 have, the types of queries you want them to match, and the types of
 queries you *don't* want to match?


 -Hoss



Re: Fields not indexed?

2012-02-08 Thread Dmitry Kan
What does your schema for the fields look like?

On Wed, Feb 8, 2012 at 2:41 PM, Radu Toev radut...@gmail.com wrote:

 Hi,

 I am really new to Solr so I apologize if the question is a little off.
 I was playing with DataImportHandler and tried to index a table in a MS SQL
 database.
 I configured my datasource with the necessary parameters and added three
 fields with column(uppercase) and name:

    <field column="ID" name="machineId" />
    <field column="SERIAL" name="machineSerial"/>
    <field column="IVK" name="machineIvk"/>

 The full-import command seems to have completed successfully and I see that
 the number of documents processed is the same as the number of entries in
 my table.
 However when I try to run a *:* query from the admin console I only get
 responses in the form:

   <doc>
  <float name="score">1.0</float>
  <str name="id">1</str>
   </doc>

 I'm not sure how to get to the bottom of this.
 Thanks.




-- 
Regards,

Dmitry Kan


Re: Fields not indexed?

2012-02-08 Thread Radu Toev
The schema.xml is the default file that comes with Solr 3.5, didn't change
anything there.

On Wed, Feb 8, 2012 at 2:45 PM, Dmitry Kan dmitry@gmail.com wrote:

 What does your schema for the fields look like?

 On Wed, Feb 8, 2012 at 2:41 PM, Radu Toev radut...@gmail.com wrote:

  Hi,
 
  I am really new to Solr so I apologize if the question is a little off.
  I was playing with DataImportHandler and tried to index a table in a MS
 SQL
  database.
  I configured my datasource with the necessary parameters and added three
  fields with column(uppercase) and name:
 
 <field column="ID" name="machineId" />
 <field column="SERIAL" name="machineSerial"/>
 <field column="IVK" name="machineIvk"/>
 
  The full-import command seems to have completed successfully and I see
 that
  the number of documents processed is the same as the number of entries in
  my table.
  However when I try to run a *:* query from the admin console I only get
  responses in the form:
 
   <doc>
   <float name="score">1.0</float>
   <str name="id">1</str>
   </doc>
 
  I'm not sure how to get to the bottom of this.
  Thanks.
 



 --
 Regards,

 Dmitry Kan



Re: Fields not indexed?

2012-02-08 Thread Dmitry Kan
well, you should add these fields in schema.xml, otherwise solr won't know
them.
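
(Concretely, something like the following under the fields section; the
types here are a guess, pick whatever matches your data:)

    <field name="machineId" type="string" indexed="true" stored="true"/>
    <field name="machineSerial" type="string" indexed="true" stored="true"/>
    <field name="machineIvk" type="string" indexed="true" stored="true"/>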

On Wed, Feb 8, 2012 at 2:48 PM, Radu Toev radut...@gmail.com wrote:

 The schema.xml is the default file that comes with Solr 3.5, didn't change
 anything there.

 On Wed, Feb 8, 2012 at 2:45 PM, Dmitry Kan dmitry@gmail.com wrote:

  What does your schema for the fields look like?
 
  On Wed, Feb 8, 2012 at 2:41 PM, Radu Toev radut...@gmail.com wrote:
 
   Hi,
  
   I am really new to Solr so I apologize if the question is a little off.
   I was playing with DataImportHandler and tried to index a table in a MS
  SQL
   database.
   I configured my datasource with the necessary parameters and added
 three
   fields with column(uppercase) and name:
  
  <field column="ID" name="machineId" />
  <field column="SERIAL" name="machineSerial"/>
  <field column="IVK" name="machineIvk"/>
  
   The full-import command seems to have completed successfully and I see
  that
   the number of documents processed is the same as the number of entries
 in
   my table.
   However when I try to run a *:* query from the admin console I only get
   responses in the form:
  
  <doc>
 <float name="score">1.0</float>
 <str name="id">1</str>
  </doc>
  
   I'm not sure how to get to the bottom of this.
   Thanks.
  
 
 
 
  --
  Regards,
 
  Dmitry Kan
 




-- 
Regards,

Dmitry Kan


Re: Fields not indexed?

2012-02-08 Thread Radu Toev
I just realized that as I pushed the send button :P
Thanks, I'll have a look.

On Wed, Feb 8, 2012 at 2:58 PM, Dmitry Kan dmitry@gmail.com wrote:

 well, you should add these fields in schema.xml, otherwise solr won't know
 them.

 On Wed, Feb 8, 2012 at 2:48 PM, Radu Toev radut...@gmail.com wrote:

  The schema.xml is the default file that comes with Solr 3.5, didn't
 change
  anything there.
 
  On Wed, Feb 8, 2012 at 2:45 PM, Dmitry Kan dmitry@gmail.com wrote:
 
    What does your schema for the fields look like?
  
   On Wed, Feb 8, 2012 at 2:41 PM, Radu Toev radut...@gmail.com wrote:
  
Hi,
   
I am really new to Solr so I apologize if the question is a little
 off.
I was playing with DataImportHandler and tried to index a table in a
 MS
   SQL
database.
I configured my datasource with the necessary parameters and added
  three
fields with column(uppercase) and name:
   
    <field column="ID" name="machineId" />
    <field column="SERIAL" name="machineSerial"/>
    <field column="IVK" name="machineIvk"/>
   
The full-import command seems to have completed successfully and I
 see
   that
the number of documents processed is the same as the number of
 entries
  in
my table.
However when I try to run a *:* query from the admin console I only
 get
responses in the form:
   
   <doc>
  <float name="score">1.0</float>
  <str name="id">1</str>
   </doc>
   
I'm not sure how to get to the bottom of this.
Thanks.
   
  
  
  
   --
   Regards,
  
   Dmitry Kan
  
 



 --
 Regards,

 Dmitry Kan



Re: usage of /etc/jetty.xml when debugging Solr in Eclipse

2012-02-08 Thread Bernd Fehling

Hi,

run-jetty-run issue #9:
...
In the VM Arguments of your launch configuration set
-Drjrxml=./jetty.xml

If jetty.xml is in the root of your project it will be used (you can also use a
fully qualified path name).

The UI port, context and WebApp dir are ignored, since you can define them in 
jetty.xml

Note: You still have to specify a valid WebApp dir because there are other
checks that the plugin performs.
...


Or you can start Solr with Jetty as usual and then connect Eclipse
to the running process.
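
(For the attach route, start Jetty with the standard JPDA flags and add a
Remote Java Application debug configuration in Eclipse; the port is
arbitrary:)

    java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8000 -jar start.jar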


Regards


On 08.02.2012 12:24, jmlucjav wrote:

Hi,

I am following
http://www.lucidimagination.com/devzone/technical-articles/setting-apache-solr-eclipse
in order to be able to debug Solr in eclipse. I got it working fine.

Now, I usually use ./etc/jetty.xml to set logging configuration. When
starting jetty in eclipse I dont see any log files created, so I guessed
jetty.xml is not being used. So I added it to RunJetty Advanced
configuration (Additional jetty.xml), but in that case something goes wrong,
as I get a 'java.net.BindException: Address already in use: JVM_Bind' error,
like if something is started twice.

So my question is: can jetty.xml be used while debugging in eclipse? If so,
how? I would like to use the same configuration I use when I am just
changing xml stuff in Solr and starting with 'java -jar start.jar'.

thank in advance



Re: struggling with solr.WordDelimiterFilterFactory and periods . or dots

2012-02-08 Thread geeky2
hello,

thank you for the reply.

Yes - I did re-index after the changes to the schema.

Also - thank you for the direction on using the analyzer - but I am not sure
if I am interpreting the feedback from the analyzer correctly.

Here is what I did:

In the Field value (Index) box I placed this: BP2.1UAA

In the Field value (Query) box I placed this: BP21UAA

Then after hitting the Analyze button I see the following:

Under Index Analyzer for: 

org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1,
generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_33,
generateWordParts=1, catenateAll=1, catenateNumbers=1}

i see 

position    1    2    3    4
term text   BP   2    1    UAA
                      21   BP21UAA

Under Query Analyzer for:

org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1,
generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_33,
generateWordParts=1, catenateAll=1, catenateNumbers=1}

i see 

position    1    2    3
term text   BP   21   UAA
                      BP21UAA

The above information leads me to believe that I should have BP21UAA as an
indexed term generated from the BP2.1UAA value coming from the database.

Also, the query analysis leads me to believe that I should find a document
when I search on BP21UAA in the itemNo field.

Do I have this correct?

Am I missing something here?

I am still unable to get a hit when I search on BP21UAA in the itemNo field.

thank you,
mark

--
View this message in context: 
http://lucene.472066.n3.nabble.com/struggling-with-solr-WordDelimiterFilterFactory-and-periods-or-dots-tp3724822p3726021.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Which Tokeniser (and/or filter)

2012-02-08 Thread Robert Brown
Thanks Erick,

I didn't get confused with multiple tokens vs multiValued  :)

Before I go ahead and re-index 4m docs, and believe me I'm using the
analysis page like a mad-man!

What do I need to configure to have the following both indexed with and
without the dots...

.net
sales manager.
£12.50

Currently...

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="1"
splitOnCaseChange="1"
splitOnNumerics="1"
types="wdftypes.txt"
/>

with nothing specific in wdftypes.txt for full-stops.

Should there also be any difference when quoting my searches?

The analysis page seems to just drop the quotes, but surely actual
calls don't do this?



---

IntelCompute
Web Design  Local Online Marketing

http://www.intelcompute.com


On Wed, 8 Feb 2012 07:38:42 -0500, Erick Erickson
erickerick...@gmail.com wrote:
 Yes, WDDF creates multiple tokens. But that has
 nothing to do with the multiValued suggestion.
 
 You can get exactly what you want by
  1) setting multiValued="true" in your schema file and re-indexing. Say
  positionIncrementGap is set to 100
  2) When you index, add the field for each sentence, so your doc
    looks something like:
   <doc>
      <field name="sentences">i am a sales-manager in here</field>
      <field name="sentences">using asp.net and .net daily</field>
      ...
    </doc>
  3) search like "sales manager"~100
 
 Best
 Erick
 
 On Wed, Feb 8, 2012 at 3:05 AM, Rob Brown r...@intelcompute.com wrote:
 Apologies if things were a little vague.

 Given the example snippet to index (numbered to show searches needed to
 match)...

 1: i am a sales-manager in here
 2: using asp.net and .net daily
 3: working in design.
 4: using something called sage 200. and i'm fluent
 5: german sausages.
 6: busy AE dept earning £10,000 annually


 ... all with newlines in place.

 able to match...

 1. sales
 1. sales manager
 1. sales-manager
 1. sales-manager
 2. .net
 2. asp.net
 3. design
 4. sage 200
 6. AE
 6. £10,000

 But do NOT match fluent german from 4 + 5 since there's a newline
 between them when indexed, but not when searched.


  Don't the filters (WDF in this case) create multiple tokens? So
  splitting on the period in asp.net would create tokens for all of asp,
  asp., asp.net, .net and net.


 Cheers,
 Rob

 --

 IntelCompute
 Web Design and Online Marketing

 http://www.intelcompute.com


 -Original Message-
 From: Chris Hostetter hossman_luc...@fucit.org
 Reply-to: solr-user@lucene.apache.org
 To: solr-user@lucene.apache.org
 Subject: Re: Which Tokeniser (and/or filter)
 Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)

 : This all seems a bit too much work for such a real-world scenario?

 You haven't really told us what your scenario is.

 You said you want to split tokens on whitespace, full-stop (aka:
 period) and comma only, but then in response to some suggestions you added
 comments about other things that you never mentioned previously...

 1) evidently you don't want the . in foo.net to cause a split in tokens?
 2) evidently you not only want token splits on newlines, but also
 position gaps to prevent phrases matching across newlines.

 ...these are kind of important details that affect suggestions people
 might give you.

 can you please provide some concrete examples of the types of data you
 have, the types of queries you want them to match, and the types of
 queries you *don't* want to match?


 -Hoss




Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread Ted Dunning
Add this as well:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.155.5030

On Wed, Feb 8, 2012 at 1:56 AM, Andrzej Bialecki a...@getopt.org wrote:

 On 08/02/2012 09:17, Ted Dunning wrote:

 This is true with Lucene as it stands.  It would be much faster if there
 were a specialized in-memory index such as is typically used with high
 performance search engines.


 This could be implemented in Lucene trunk as a Codec. The challenge though
 is to come up with the right data structures.

 There has been some interesting research on optimizations for in-memory
 inverted indexes, but it usually involves changing the query evaluation
 algos as well - for reference:

 http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/1202502
 http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf
 http://research.google.com/pubs/archive/37365.pdf

 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




How to identify the field with highest score in dismax

2012-02-08 Thread crisfromnova
Hi,

According to the Solr documentation, the dismax score is calculated with the
formula:
(score of matching clause with the highest score) + ((tie parameter) *
(scores of any other matching clauses)).

Is there a way to identify the field on which the matching clause score is
the highest?

For example I suppose that I have the following document : 

<doc>
  <str name="Name">Ford Mustang Coupe Cabrio</str>
  <str name="Details">Ford Mustang is a great car</str>
</doc>

and the following dismax query :

defType=dismax&qf=Name^10+Details^1&q=Ford+Mustang+Ford+Mustang

and receive the document with the score 5.6.
Is there a way to find out whether the score comes from the match on the Name
field or from the match on the Details field?
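
(One way to see this, for reference: add debugQuery=on to the request and
read the per-document explain section of the response, which lists each
matching clause, field by field, with its score contribution:)

    defType=dismax&qf=Name^10+Details^1&q=Ford+Mustang+Ford+Mustang&debugQuery=on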

Thanks in advance!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-identify-the-field-with-highest-score-in-dismax-tp3726297p3726297.html
Sent from the Solr - User mailing list archive at Nabble.com.


Sorting solrdocumentlist object after querying

2012-02-08 Thread Kashif Khan
Hi all,

I want to sort a SolrDocumentList after it has been queried and obtained
from QueryResponse.getResults(). The reason is that I have a SolrDocumentList
obtained after querying via QueryResponse.getResults(), and I have added a
few docs to it. Now I want to sort this SolrDocumentList on the same fields I
queried on before I modified it.

Please advise on any alternatives; sample code will be appreciated a lot if
this is not possible. It is an emergency.
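
(If sorting on the client is acceptable, SolrDocumentList is a
java.util.List, so it can be sorted in place. A sketch assuming a single
sort field whose stored values implement Comparable; the field name is a
placeholder:)

    import java.util.Collections;
    import java.util.Comparator;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;

    public class DocSorter {
      // Sorts in place, ascending on the given stored field; nulls last.
      @SuppressWarnings("unchecked")
      public static void sortByField(SolrDocumentList docs, final String field) {
        Collections.sort(docs, new Comparator<SolrDocument>() {
          public int compare(SolrDocument a, SolrDocument b) {
            Comparable va = (Comparable) a.getFieldValue(field);
            Comparable vb = (Comparable) b.getFieldValue(field);
            if (va == null) return (vb == null) ? 0 : 1;
            if (vb == null) return -1;
            return va.compareTo(vb);
          }
        });
      }
    }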



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Sorting-solrdocumentlist-object-after-querying-tp3726303p3726303.html
Sent from the Solr - User mailing list archive at Nabble.com.


Wildcard ? issue?

2012-02-08 Thread Dalius Sidlauskas

Sorry for the inaccurate title.

I have 3 fields (dc_title, dc_title_unicode, dc_title_unicode_full)
containing the same value:


<title xmlns="http://www.tei-c.org/ns/1.0">cal.lígraf</title>

and these fields are configured accordingly:

<fieldType name="xml" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="xml_unicode" class="solr.TextField"
positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<fieldType name="xml_unicode_full" class="solr.TextField"
positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

And finally my search configuration:

<requestHandler name="dictionary" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">all</str>
    <str name="defType">edismax</str>
    <str name="mm">2&lt;-25%</str>
    <str name="qf">dc_title_unicode_full^2 dc_title_unicode^2 dc_title</str>
    <int name="rows">10</int>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">1</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

I am trying to match the field with various search phrases (that are
valid). Here are the results:



#   search phrase   match?  Comment
1   cal.lígra?  yes 
2   cal.ligra?  no  Changed í to i
3   cal.ligraf  yes 
4   calligra?   no  


The problem is attempt #2, which fails to match the data. #3 works, replacing ?
with f.


One more thing: if * is used instead of ?, other data such as
cal.lígrafia is matched, but not cal.lígraf...


Also I have spotted some logic mismatch in the debug parsedQuery field:

cal·lígraf:  +DisjunctionMaxQuery((dc_title:calligraf^2.0 |
dc_title_unicode:cal·lígraf^3.0 | dc_title_unicode_full:cal·lígraf^3.0))
cal·lígra?:  +DisjunctionMaxQuery((dc_title:cal·lígra?^2.0 |
dc_title_unicode:cal·lígra?^3.0 | dc_title_unicode_full:cal·lígra?^3.0))

Should the second be calligra? instead?

Environment:
Tomcat 7.0.25 (request encoding UTF-8)
Solr 3.5.0
Java 7 Oracle
Ubuntu 11.10

--
Regards!
Dalius Sidlauskas



Re: Wildcard ? issue?

2012-02-08 Thread Dalius Sidlauskas
If you cannot read this mail easily, check this ticket: 
https://issues.apache.org/jira/browse/SOLR-3106 This is a copy.


Regards!
Dalius Sidlauskas


On 08/02/12 15:44, Dalius Sidlauskas wrote:

Sorry for inaccurate title.

I have a 3 fields (dc_title, dc_title_unicode, dc_unicode_full) 
containing same value:


title xmlns=http://www.tei-c.org/ns/1.0;cal.lígraf/title

and these fields are configured accordingly:

fieldType name=xml  class=solr.TextField  
positionIncrementGap=100

analyzer type=index
charFilter class=solr.HTMLStripCharFilterFactory/
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.ICUFoldingFilterFactory/
/analyzer
analyzer type=query
tokenizer class=solr.StandardTokenizerFactory/
filter class=solr.ICUFoldingFilterFactory/
/analyzer
/fieldType

fieldType name=xml_unicode  class=solr.TextField  
positionIncrementGap=100

analyzer type=index
charFilter class=solr.HTMLStripCharFilterFactory/
tokenizer class=solr.StandardTokenizerFactory/
/analyzer
analyzer type=query
tokenizer class=solr.WhitespaceTokenizerFactory/
/analyzer
/fieldType

fieldType name=xml_unicode_full  class=solr.TextField  
positionIncrementGap=100

analyzer type=index
charFilter class=solr.HTMLStripCharFilterFactory/
tokenizer class=solr.WhitespaceTokenizerFactory/
/analyzer
analyzer type=query
tokenizer class=solr.WhitespaceTokenizerFactory/
/analyzer
/fieldType

And finally my search configuration:

requestHandler name=dictionary  class=solr.SearchHandler
lst name=defaults
str name=echoParamsall/str
str name=defTypeedismax/str
str name=mm2lt;-25%/str
str name=qfdc_title_unicode_full^2 dc_title_unicode^2 dc_title/str
int  name=rows10/int
str name=spellcheck.onlyMorePopulartrue/str
str name=spellcheck.extendedResultsfalse/str
str name=spellcheck.count1/str
/lst
arr name=last-components
strspellcheck/str
/arr
/requestHandler

I am trying to match the field with various search phrases (that are 
valid). There are results:



# search phrase match? Comment
1 cal.lígra? yes
2 cal.ligra? no Changed í to i
3 cal.ligraf yes
4 calligra? no


The problem is the #2 attempt to match a data. The #3 works replacing 
? with f.


One more thing. If * is used insted of ? other data is matched as 
cal.lígrafia but not cal.lígraf...


Also I have spotted some logic missmatch in debug parsedQuery field:
*
cal·lígraf:* +DisjunctionMaxQuery((dc_title:*calligraf*^2.0 | 
dc_title_unicode:cal·lígraf^3.0 | dc_title_unicode_full:cal·lígraf^3.0))
*cal·lígra?:*+DisjunctionMaxQuery((dc_title:*cal·lígra?*^2.0 | 
dc_title_unicode:cal·lígra?^3.0 | dc_title_unicode_full:cal·lígra?^3.0))


Should the second be *calligra?* insted?*

*Environment:
Tomcat 7.0.25 (request encoding UTF-8)
Solr 3.5.0
Java 7 Oracle
Ubuntu 11.10



Re: Wildcard ? issue?

2012-02-08 Thread Sethi, Parampreet
Hi Dalius,

If not already tried, check http://localhost:8983/solr/admin/analysis.jsp
(enable verbose output for both Field Value index and query for details)
for your queries and see what all filters/tokenizers are being applied.

Hope it helps!

-param

On 2/8/12 10:48 AM, Dalius Sidlauskas dalius.sidlaus...@semantico.com
wrote:

If you can not read this mail easily check this ticket:
https://issues.apache.org/jira/browse/SOLR-3106 This is a copy.

Regards!
Dalius Sidlauskas


On 08/02/12 15:44, Dalius Sidlauskas wrote:
 Sorry for inaccurate title.

 I have a 3 fields (dc_title, dc_title_unicode, dc_unicode_full)
 containing same value:

 title xmlns=http://www.tei-c.org/ns/1.0;cal.lígraf/title

 and these fields are configured accordingly:

 fieldType name=xml  class=solr.TextField
 positionIncrementGap=100
 analyzer type=index
 charFilter class=solr.HTMLStripCharFilterFactory/
 tokenizer class=solr.StandardTokenizerFactory/
 filter class=solr.ICUFoldingFilterFactory/
 /analyzer
 analyzer type=query
 tokenizer class=solr.StandardTokenizerFactory/
 filter class=solr.ICUFoldingFilterFactory/
 /analyzer
 /fieldType

 fieldType name=xml_unicode  class=solr.TextField
 positionIncrementGap=100
 analyzer type=index
 charFilter class=solr.HTMLStripCharFilterFactory/
 tokenizer class=solr.StandardTokenizerFactory/
 /analyzer
 analyzer type=query
 tokenizer class=solr.WhitespaceTokenizerFactory/
 /analyzer
 /fieldType

 fieldType name=xml_unicode_full  class=solr.TextField
 positionIncrementGap=100
 analyzer type=index
 charFilter class=solr.HTMLStripCharFilterFactory/
 tokenizer class=solr.WhitespaceTokenizerFactory/
 /analyzer
 analyzer type=query
 tokenizer class=solr.WhitespaceTokenizerFactory/
 /analyzer
 /fieldType

 And finally my search configuration:

 requestHandler name=dictionary  class=solr.SearchHandler
 lst name=defaults
 str name=echoParamsall/str
 str name=defTypeedismax/str
 str name=mm2lt;-25%/str
 str name=qfdc_title_unicode_full^2 dc_title_unicode^2 dc_title/str
 int  name=rows10/int
 str name=spellcheck.onlyMorePopulartrue/str
 str name=spellcheck.extendedResultsfalse/str
 str name=spellcheck.count1/str
 /lst
 arr name=last-components
 strspellcheck/str
 /arr
 /requestHandler

 I am trying to match the field with various search phrases (that are
 valid). There are results:


 # search phrase match? Comment
 1 cal.lígra? yes
 2 cal.ligra? no Changed í to i
 3 cal.ligraf yes
 4 calligra? no


 The problem is the #2 attempt to match a data. The #3 works replacing
 ? with f.

 One more thing. If * is used insted of ? other data is matched as
 cal.lígrafia but not cal.lígraf...

 Also I have spotted some logic missmatch in debug parsedQuery field:
 *
 cal·lígraf:* +DisjunctionMaxQuery((dc_title:*calligraf*^2.0 |
 dc_title_unicode:cal·lígraf^3.0 | dc_title_unicode_full:cal·lígraf^3.0))
 *cal·lígra?:*+DisjunctionMaxQuery((dc_title:*cal·lígra?*^2.0 |
 dc_title_unicode:cal·lígra?^3.0 | dc_title_unicode_full:cal·lígra?^3.0))

 Should the second be *calligra?* insted?*

 *Environment:
 Tomcat 7.0.25 (request encoding UTF-8)
 Solr 3.5.0
 Java 7 Oracle
 Ubuntu 11.10




Re: Wildcard ? issue?

2012-02-08 Thread Dalius Sidlauskas
I have already tried this and it did not help, because it does not
highlight matches if a wildcard is used. The field configuration turns the
data into:


dc_title: calligraf
dc_title_unicode: cal·lígraf
dc_title_unicode_full: cal·lígraf

Debug parsedquery says:

[Search for cal·ligraf]

+DisjunctionMaxQuery((dc_title:calligraf |
dc_title_unicode:cal·ligraf^2.0 | dc_title_unicode_full:cal·ligraf^2.0))


[Search for cal·ligra?]

+DisjunctionMaxQuery((dc_title:cal·ligra? |
dc_title_unicode:cal·ligra?^2.0 | dc_title_unicode_full:cal·ligra?^2.0))


Why is the dc_title field handled differently? The analysis looks fine:


 Index Analyzer


   org.apache.solr.analysis.HTMLStripCharFilterFactory
   {luceneMatchVersion=LUCENE_34}

text    cal·lígraf


   org.apache.solr.analysis.PatternReplaceCharFilterFactory
   {replacement=, pattern=-, maxBlockChars=1,
   luceneMatchVersion=LUCENE_34, blockDelimiters=}

text    cal·lígraf


   org.apache.solr.analysis.WhitespaceTokenizerFactory
   {luceneMatchVersion=LUCENE_34}

position    1
term text   cal·lígraf
startOffset 43
endOffset   53


   org.apache.solr.analysis.ICUFoldingFilterFactory
   {luceneMatchVersion=LUCENE_34}

position    1
term text   calligraf
startOffset 43
endOffset   53


 Query Analyzer


   org.apache.solr.analysis.WhitespaceTokenizerFactory
   {luceneMatchVersion=LUCENE_34}

position    1
term text   cal·ligra?
startOffset 0
endOffset   10


   org.apache.solr.analysis.ICUFoldingFilterFactory
   {luceneMatchVersion=LUCENE_34}

position    1
term text   calligra?
startOffset 0
endOffset   10


Is this a Solr or Lucene bug?

Regards!
Dalius Sidlauskas


On 08/02/12 16:03, Sethi, Parampreet wrote:

Hi Dalius,

If not already tried, Check http://localhost:8983/solr/admin/analysis.jsp
(enable verbose output for both Field Value index and query for details)
for your queries and see what all filters/tokenizers are being applied.

Hope it helps!

-param

On 2/8/12 10:48 AM, Dalius Sidlauskas dalius.sidlaus...@semantico.com
wrote:


If you can not read this mail easily check this ticket:
https://issues.apache.org/jira/browse/SOLR-3106 This is a copy.

Regards!
Dalius Sidlauskas


On 08/02/12 15:44, Dalius Sidlauskas wrote:

Sorry for inaccurate title.

I have a 3 fields (dc_title, dc_title_unicode, dc_unicode_full)
containing same value:

title xmlns=http://www.tei-c.org/ns/1.0;cal.lígraf/title

and these fields are configured accordingly:

<fieldType name="xml" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
</fieldType>

<fieldType name="xml_unicode" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>

<fieldType name="xml_unicode_full" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>

And finally my search configuration:

<requestHandler name="dictionary" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">all</str>
<str name="defType">edismax</str>
<str name="mm">2&lt;-25%</str>
<str name="qf">dc_title_unicode_full^2 dc_title_unicode^2 dc_title</str>
<int name="rows">10</int>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.extendedResults">false</str>
<str name="spellcheck.count">1</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>

I am trying to match the field with various search phrases (that are
valid). Here are the results:


#   search phrase   match?   Comment
1   cal.lígra?      yes
2   cal.ligra?      no       (changed í to i)
3   cal.ligraf      yes
4   calligra?       no


The problem is attempt #2, which fails to match the data. Attempt #3
works once ? is replaced with f.

One more thing: if * is used instead of ?, other data such as
cal.lígrafia is matched, but not cal.lígraf...

Also I have spotted some logic mismatch in the debug parsedQuery field:

cal·lígraf:  +DisjunctionMaxQuery((dc_title:calligraf^2.0 |
dc_title_unicode:cal·lígraf^3.0 | dc_title_unicode_full:cal·lígraf^3.0))
cal·lígra?:  +DisjunctionMaxQuery((dc_title:cal·lígra?^2.0 |
dc_title_unicode:cal·lígra?^3.0 | dc_title_unicode_full:cal·lígra?^3.0))

Should the second be calligra? instead?

Environment:
Tomcat 7.0.25 (request encoding UTF-8)
Solr 3.5.0
Java 7 Oracle
Ubuntu 11.10



Re: struggling with solr.WordDelimiterFilterFactory and periods . or dots

2012-02-08 Thread Erick Erickson
Hmmm, that all looks correct; from the output you pasted I'd expect
you to be finding the doc.

So next thing: add debugQuery=on to your query and look at
the debug information after the list of documents, particularly
the parsedQuery bit. Are you searching against the fields you
think you are? If you don't specify a field, Solr uses the default
defined in schema.xml.

Next, look at your actual index using either Luke or the TermsComponent
to see what's actually *in* your index rather than what you *think* is
there. I can't tell you how many times I've made the wrong assumptions.

My guess would be that you aren't searching the fields you think you are...

Best
Erick

On Wed, Feb 8, 2012 at 9:06 AM, geeky2 gee...@hotmail.com wrote:
 hello,

 thank you for the reply.

 yes - i did re-index after the changes to the schema.

 also - thank you for the direction on using the analyzer - but i am not sure
 if i am interpreting the feedback from the analyzer correctly.

 here is what i did:

 in the Field value (Index) box - i placed this: BP2.1UAA

 in the Field value (Query) box - i placed this: BP21UAA

 then after hitting the Analyze button - i see the following:

 Under Index Analyzer for:

 org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1,
 generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_33,
 generateWordParts=1, catenateAll=1, catenateNumbers=1}

 i see

 position        1       2       3       4
 term text       BP      2       1       UAA
 21      BP21UAA

 Under Query Analyzer for:

 org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1,
 generateNumberParts=1, catenateWords=1, luceneMatchVersion=LUCENE_33,
 generateWordParts=1, catenateAll=1, catenateNumbers=1}

 i see

 position        1       2       3
 term text       BP      21      UAA
 BP21UAA

 the above information leads me to believe that i should have BP21UAA as an
 indexed term generated from the BP2.1UAA value coming from the database.

 also - the query analysis lead me to believe that i should find a document
 when i search on BP21UAA in the itemNo field

 do i have this correct

 am i missing something here?

 i am still unable to get a hit when i search on BP21UAA in the itemNo field.

 thank you,
 mark

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/struggling-with-solr-WordDelimiterFilterFactory-and-periods-or-dots-tp3724822p3726021.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Which Tokeniser (and/or filter)

2012-02-08 Thread Erick Erickson
You'll probably have to index them in separate fields to
get what you want. The question is always whether it's
worth it: is the use-case really well served by having a
variant that keeps dots and things? But that's always more
a question for your product manager
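
If you do go that route, the schema sketch is roughly this (field and
type names invented): one copy run through WordDelimiterFilter as you
have it now, and one kept near-verbatim on a plain whitespace tokenizer:

<field name="cv_text" type="text_wdf" indexed="true" stored="true"/>
<field name="cv_text_verbatim" type="text_ws" indexed="true" stored="false"/>
<copyField source="cv_text" dest="cv_text_verbatim"/>

Then query both via edismax qf, so ".net" and "£12.50" can match the
verbatim field while "net" still matches the analyzed one.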

Best
Erick

On Wed, Feb 8, 2012 at 9:23 AM, Robert Brown r...@intelcompute.com wrote:
 Thanks Erick,

 I didn't get confused with multiple tokens vs multiValued  :)

 Before I go ahead and re-index 4m docs, and believe me I'm using the
 analysis page like a mad-man!

 What do I need to configure to have the following both indexed with and
 without the dots...

 .net
 sales manager.
 £12.50

 Currently...

 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
 <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="1"
        catenateNumbers="1"
        catenateAll="1"
        splitOnCaseChange="1"
        splitOnNumerics="1"
        types="wdftypes.txt"
 />

 with nothing specific in wdftypes.txt for full-stops.
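
 (I did spot WDF's preserveOriginal="1" flag, which keeps the unsplit
 token -- ".net", "£12.50" -- alongside the generated parts. Would adding
 it, something like the sketch below, be the way to go?

 <filter class="solr.WordDelimiterFilterFactory"
        preserveOriginal="1"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="1"
        catenateNumbers="1"
        catenateAll="1"
        splitOnCaseChange="1"
        splitOnNumerics="1"
        types="wdftypes.txt"
 />
 )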

 Should there also be any difference when quoting my searches?

 The analysis page seems to just drop the quotes, but surely actual
 calls don't do this?



 ---

 IntelCompute
 Web Design  Local Online Marketing

 http://www.intelcompute.com


 On Wed, 8 Feb 2012 07:38:42 -0500, Erick Erickson
 erickerick...@gmail.com wrote:
 Yes, WDDF creates multiple tokens. But that has
 nothing to do with the multiValued suggestion.

 You can get exactly what you want by
 1> setting multiValued="true" in your schema file and re-indexing. Say
 positionIncrementGap is set to 100
 2> When you index, add the field for each sentence, so your doc
       looks something like:
      <doc>
         <field name="sentences">i am a sales-manager in here</field>
        <field name="sentences">using asp.net and .net daily</field>
          .
       </doc>
 3> search like "sales manager"~100

 Best
 Erick

 On Wed, Feb 8, 2012 at 3:05 AM, Rob Brown r...@intelcompute.com wrote:
 Apologies if things were a little vague.

 Given the example snippet to index (numbered to show searches needed to
 match)...

 1: i am a sales-manager in here
 2: using asp.net and .net daily
 3: working in design.
 4: using something called sage 200. and i'm fluent
 5: german sausages.
 6: busy AE dept earning £10,000 annually


 ... all with newlines in place.

 able to match...

 1. sales
 1. sales manager
 1. sales-manager
 1. sales-manager
 2. .net
 2. asp.net
 3. design
 4. sage 200
 6. AE
 6. £10,000

 But do NOT match fluent german from 4 + 5 since there's a newline
 between them when indexed, but not when searched.


 Do the filters (wdf in this case) not create multiple tokens, so if
 splitting on period in asp.net would create tokens for all of asp,
 asp., asp.net, .net, net.


 Cheers,
 Rob

 --

 IntelCompute
 Web Design and Online Marketing

 http://www.intelcompute.com


 -Original Message-
 From: Chris Hostetter hossman_luc...@fucit.org
 Reply-to: solr-user@lucene.apache.org
 To: solr-user@lucene.apache.org
 Subject: Re: Which Tokeniser (and/or filter)
 Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)

 : This all seems a bit too much work for such a real-world scenario?

 You haven't really told us what your scenerio is.

 You said you want to split tokens on whitespace, full-stop (aka:
 period) and comma only, but then in response to some suggestions you added
 comments other things that you never mentioned previously...

 1) evidently you don't want the . in foo.net to cause a split in tokens?
 2) evidently you not only want token splits on newlines, but also
 positition gaps to prevent phrases matching across newlines.

 ...these are kind of important details that affect suggestions people
 might give you.

 can you please provide some concrete examples of hte types of data you
 have, the types of queries you want them to match, and the types of
 queries you *don't* want to match?


 -Hoss




Re: Which Tokeniser (and/or filter)

2012-02-08 Thread Robert Brown
Attempting to re-produce legacy behaviour (i know!) of simple SQL
substring searching, with and without phrases.

I feel simply NGram'ing 4m CV's may be pushing it?


---

IntelCompute
Web Design  Local Online Marketing

http://www.intelcompute.com


On Wed, 8 Feb 2012 11:27:24 -0500, Erick Erickson
erickerick...@gmail.com wrote:
 You'll probably have to index them in separate fields to
 get what you want. The question is always whether it's
 worth it: is the use-case really well served by having a
 variant that keeps dots and things? But that's always more
 a question for your product manager
 
 Best
 Erick
 
 On Wed, Feb 8, 2012 at 9:23 AM, Robert Brown r...@intelcompute.com wrote:
 Thanks Erick,

 I didn't get confused with multiple tokens vs multiValued  :)

 Before I go ahead and re-index 4m docs, and believe me I'm using the
 analysis page like a mad-man!

 What do I need to configure to have the following both indexed with and
 without the dots...

 .net
 sales manager.
 £12.50

 Currently...

 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
 <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="1"
        catenateNumbers="1"
        catenateAll="1"
        splitOnCaseChange="1"
        splitOnNumerics="1"
        types="wdftypes.txt"
 />

 with nothing specific in wdftypes.txt for full-stops.

 Should there also be any difference when quoting my searches?

 The analysis page seems to just drop the quotes, but surely actual
 calls don't do this?



 ---

 IntelCompute
 Web Design  Local Online Marketing

 http://www.intelcompute.com


 On Wed, 8 Feb 2012 07:38:42 -0500, Erick Erickson
 erickerick...@gmail.com wrote:
 Yes, WDDF creates multiple tokens. But that has
 nothing to do with the multiValued suggestion.

 You can get exactly what you want by
 1> setting multiValued="true" in your schema file and re-indexing. Say
 positionIncrementGap is set to 100
 2> When you index, add the field for each sentence, so your doc
       looks something like:
      <doc>
         <field name="sentences">i am a sales-manager in here</field>
        <field name="sentences">using asp.net and .net daily</field>
          .
       </doc>
 3> search like "sales manager"~100

 Best
 Erick

 On Wed, Feb 8, 2012 at 3:05 AM, Rob Brown r...@intelcompute.com wrote:
 Apologies if things were a little vague.

 Given the example snippet to index (numbered to show searches needed to
 match)...

 1: i am a sales-manager in here
 2: using asp.net and .net daily
 3: working in design.
 4: using something called sage 200. and i'm fluent
 5: german sausages.
 6: busy AE dept earning £10,000 annually


 ... all with newlines in place.

 able to match...

 1. sales
 1. sales manager
 1. sales-manager
 1. sales-manager
 2. .net
 2. asp.net
 3. design
 4. sage 200
 6. AE
 6. £10,000

 But do NOT match fluent german from 4 + 5 since there's a newline
 between them when indexed, but not when searched.


 Do the filters (wdf in this case) not create multiple tokens, so if
 splitting on period in asp.net would create tokens for all of asp,
 asp., asp.net, .net, net.


 Cheers,
 Rob

 --

 IntelCompute
 Web Design and Online Marketing

 http://www.intelcompute.com


 -Original Message-
 From: Chris Hostetter hossman_luc...@fucit.org
 Reply-to: solr-user@lucene.apache.org
 To: solr-user@lucene.apache.org
 Subject: Re: Which Tokeniser (and/or filter)
 Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)

 : This all seems a bit too much work for such a real-world scenario?

 You haven't really told us what your scenerio is.

 You said you want to split tokens on whitespace, full-stop (aka:
 period) and comma only, but then in response to some suggestions you added
 comments other things that you never mentioned previously...

 1) evidently you don't want the . in foo.net to cause a split in tokens?
 2) evidently you not only want token splits on newlines, but also
 positition gaps to prevent phrases matching across newlines.

 ...these are kind of important details that affect suggestions people
 might give you.

 can you please provide some concrete examples of hte types of data you
 have, the types of queries you want them to match, and the types of
 queries you *don't* want to match?


 -Hoss





Re: struggling with solr.WordDelimiterFilterFactory and periods . or dots

2012-02-08 Thread geeky2
hello,

thanks for sticking with me on this ...very frustrating 

ok - i did perform the query with the debug parms using two scenarios:

1) a successful search (where i insert the period / dot) into the itemNo
field, and the search returns a document.

itemNo:BP2.1UAA

http://hfsthssolr1.intra.searshc.com:8180/solrpartscat/core1/select/?q=itemNo%3ABP2.1UAA&version=2.2&start=0&rows=10&indent=on&debugQuery=on

results from debug

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
    <str name="indent">on</str>
    <str name="rows">10</str>
    <str name="version">2.2</str>
    <str name="debugQuery">on</str>
    <str name="start">0</str>
    <str name="q">itemNo:BP2.1UAA</str>
  </lst>
</lst>
<result name="response" numFound="1" start="0">
  <doc>
    <arr name="brand"><str>PHILIPS</str></arr>
    <str name="groupId">0333500</str>
    <str name="id">0333500,1549  ,BP2.1UAA   </str>
    <str name="itemDesc">PLASMA TELEVISION</str>
    <str name="itemNo">BP2.1UAA   </str>
    <int name="itemType">2</int>
    <arr name="model"><str>BP2.1UAA   </str></arr>
    <arr name="productType"><str>Plasma Television^</str></arr>
    <int name="rankNo">0</int>
    <str name="supplierId">1549  </str>
  </doc>
</result>
<lst name="debug">
  <str name="rawquerystring">itemNo:BP2.1UAA</str>
  <str name="querystring">itemNo:BP2.1UAA</str>
  <str name="parsedquery">MultiPhraseQuery(itemNo:"bp 2 (1 21) (uaa bp21uaa)")</str>
  <str name="parsedquery_toString">itemNo:"bp 2 (1 21) (uaa bp21uaa)"</str>
  <lst name="explain">
    <str name="0333500,1549  ,BP2.1UAA   ">
22.539911 = (MATCH) weight(itemNo:"bp 2 (1 21) (uaa bp21uaa)" in 134993), product of:
  0.9994 = queryWeight(itemNo:"bp 2 (1 21) (uaa bp21uaa)"), product of:
    45.079826 = idf(itemNo: bp=829 2=29303 1=43943 21=6716 uaa=32 bp21uaa=1)
    0.02218287 = queryNorm
  22.539913 = (MATCH) fieldWeight(itemNo:"bp 2 (1 21) (uaa bp21uaa)" in 134993), product of:
    1.0 = tf(phraseFreq=1.0)
    45.079826 = idf(itemNo: bp=829 2=29303 1=43943 21=6716 uaa=32 bp21uaa=1)
    0.5 = fieldNorm(field=itemNo, doc=134993)
    </str>
  </lst>
  <str name="QParser">LuceneQParser</str>
  <lst name="timing">
    <double name="time">1.0</double>
    <lst name="prepare">
      <double name="time">0.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.FacetComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.StatsComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.DebugComponent">
        <double name="time">0.0</double>
      </lst>
    </lst>
    <lst name="process">
      <double name="time">1.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent">
        <double name="time">1.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.FacetComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.StatsComponent">
        <double name="time">0.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.DebugComponent">
        <double name="time">0.0</double>
      </lst>
    </lst>
  </lst>
</lst>
</response>







2) a NON-successful search (where i do NOT insert a period / dot) into the
itemNo field, and the search does NOT return a document

 itemNo:BP21UAA

http://hfsthssolr1.intra.searshc.com:8180/solrpartscat/core1/select/?q=itemNo%3ABP21UAA&version=2.2&start=0&rows=10&indent=on&debugQuery=on

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
    <str name="indent">on</str>
    <str name="rows">10</str>
    <str name="version">2.2</str>
    <str name="debugQuery">on</str>
    <str name="start">0</str>
    <str name="q">itemNo:BP21UAA</str>
  </lst>
</lst>
<result name="response" numFound="0" start="0"/>
<lst name="debug">
  <str name="rawquerystring">itemNo:BP21UAA</str>
  <str name="querystring">itemNo:BP21UAA</str>
  <str name="parsedquery">MultiPhraseQuery(itemNo:"bp 21 (uaa bp21uaa)")</str>
  <str name="parsedquery_toString">itemNo:"bp 21 (uaa bp21uaa)"</str>
  <lst name="explain"/>
  <str name="QParser">LuceneQParser</str>
  <lst name="timing">
    <double name="time">1.0</double>
    <lst name="prepare">
      <double name="time">1.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent">
        <double name="time">1.0</double>
      </lst>
      <lst name="org.apache.solr.handler.component.FacetComponent">
        <double name="time">0.0</double>
      </lst>
      <lst

Re: Wildcard ? issue?

2012-02-08 Thread Ahmet Arslan
 I have already tried this and it did
 not helped because it does not 
 highlight matches if wild-card is used. The field
 configuration turns 
 data to:

This writeup should explain your scenario :
http://wiki.apache.org/solr/MultitermQueryAnalysis
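
The gist: wildcard/prefix/fuzzy terms normally bypass the query
analyzer, which is why your í never gets folded. On trunk (slated for
3.6, I believe) you can declare the analysis applied to such terms
explicitly -- a sketch against the field type from your ticket, so
treat the names as illustrative:

<fieldType name="xml" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <!-- applied only to wildcard/prefix/fuzzy terms -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>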


Re: solr cloud concepts

2012-02-08 Thread Mark Miller

On Feb 8, 2012, at 10:31 AM, Adeel Qureshi wrote:

 I have been using solr for a while and have recently started getting into
 solrcloud .. i am a bit confused with some of the concepts ..
 
 1. what exactly is the relationship between a collection and the core ..
 can a core has multiple collections in it .. in this case all collections
 within this core will have the same schema .. and i am assuming all
 instances of collections within the core can be deployed on different solr
 nodes to achieve distributed search ..
 or is it the other way around where a collection can have multiple cores

Currently, a core basically equals a replica of the index.

So you might have a collection called collection1 - lets say it's 2 shards and 
each shard has a single replica:

Collection1
shard1 replica1
shard1 replica2
shard2 replica1
shard2 replica2

Each of those replicas is a core. So a collection has multiple cores basically. 
Also, each of those cores can be on a different machine. So yes, you have 
distributed indexing and distributed search.

 
 2. at some places it has been pointed out that solrcloud doesnt actually
 supports replication .. but in the solrcloud wiki the second example is
 supposed to be for replication .. so does solrcloud at this point supports
 automatic replication where as you add more servers it automatically uses
 the additional servers as replicas

SolrCloud doesn't support the old style Solr replication concept. It does 
however, handle replication - it's just all pretty much automatic and behind 
the scenes - eg all the information about Solr replication in the wiki 
documentation for previous versions of Solr is really not applicable. We now 
achieve replica copies by sending documents to each shard one document at a 
time so that we can support near realtime search. The old style replication is 
only used in recovery, or when you start a new replica machine and it has to 
'catchup' to the other replicas.

 
 I have a few more questions but I wanted to get these basic ones out of the
 way first .. I would appreciate any response.

Fire away.

 
 Thanks
 Adeel

- Mark Miller
lucidimagination.com













Re: Sorting solrdocumentlist object after querying

2012-02-08 Thread Ahmet Arslan
 I want to sort a SolrDocumentList after it has been queried
 and obtained
 from the QueryResponse.getResults(). The reason is i have a
 SolrDocumentList
 obtained after querying using QueryResponse.getResults() and
 i have added
 few docs to it. Now i want to sort this SolrDocumentList
 based on the same
 fields i did the querying before i modified this
 SolrDocumentList.

QueryResponse.getResults() will return "rows" many documents. Can't you sort them
(plus your injected documents) on your own?
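
Something along these lines should work (a sketch -- it assumes a
stored "price" field to sort on; swap in whatever fields your original
sort used):

import java.util.Collections;
import java.util.Comparator;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

SolrDocumentList docs = queryResponse.getResults();
docs.add(extraDoc); // the documents you injected after the query
// SolrDocumentList is an ArrayList<SolrDocument>, so it sorts in place:
Collections.sort(docs, new Comparator<SolrDocument>() {
    public int compare(SolrDocument a, SolrDocument b) {
        // assumes "price" comes back as a Float on both documents
        Float pa = (Float) a.getFieldValue("price");
        Float pb = (Float) b.getFieldValue("price");
        return pa.compareTo(pb);
    }
});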


Re: solr cloud concepts

2012-02-08 Thread Bruno Dumon
Hi Adeel,

I just started looking into SolrCloud and had some of the same questions.

I wrote a blog with the understanding I gained so far, maybe it will help
you:

http://outerthought.org/blog/491-ot.html

Regards,

Bruno.

On Wed, Feb 8, 2012 at 4:31 PM, Adeel Qureshi adeelmahm...@gmail.comwrote:

 I have been using solr for a while and have recently started getting into
 solrcloud .. i am a bit confused with some of the concepts ..

 1. what exactly is the relationship between a collection and the core ..
 can a core has multiple collections in it .. in this case all collections
 within this core will have the same schema .. and i am assuming all
 instances of collections within the core can be deployed on different solr
 nodes to achieve distributed search ..
 or is it the other way around where a collection can have multiple cores

 2. at some places it has been pointed out that solrcloud doesnt actually
 supports replication .. but in the solrcloud wiki the second example is
 supposed to be for replication .. so does solrcloud at this point supports
 automatic replication where as you add more servers it automatically uses
 the additional servers as replicas

 I have a few more questions but I wanted to get these basic ones out of the
 way first .. I would appreciate any response.

 Thanks
 Adeel




-- 
Bruno Dumon
Outerthought
http://outerthought.org/


Re: How to reindex about 10Mio. docs

2012-02-08 Thread Otis Gospodnetic
Vadim,

Would using xslt output help?
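
Solr 1.4 ships the XSLTResponseWriter, so you could emit CSV without
upgrading -- a sketch (csv.xsl is a stylesheet you would write yourself
and drop into conf/xslt/, and the field list here is made up):

http://localhost:8983/solr/select?q=*:*&rows=100000&fl=id,title&wt=xslt&tr=csv.xsl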

Otis 

Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html




 From: Vadim Kisselmann v.kisselm...@googlemail.com
To: solr-user@lucene.apache.org 
Sent: Wednesday, February 8, 2012 7:09 AM
Subject: Re: How to reindex about 10Mio. docs
 
Another problem appeared ;)
how can i export my docs in csv-format?
In Solr 3.1+ i can use the query-param wt=csv, but in Solr 1.4.1?
Best Regards
Vadim


2012/2/8 Vadim Kisselmann v.kisselm...@googlemail.com:
 Hi Ahmet,
 thanks for quick response:)
 I've already thought the same...
 And it will be a pain to export and import this huge doc-set as CSV.
 Do i have an another solution?
 Regards
 Vadim


 2012/2/8 Ahmet Arslan iori...@yahoo.com:
 i want to reindex about 10Mio. Docs. from one Solr(1.4.1) to
 another
 Solr(1.4.1).
 I changed my schema.xml (field types sint to slong),
 standard
 replication would fail.
 what is the fastest and smartest way to manage this?
 this here sound great (EntityProcessor):
 http://www.searchworkings.org/blog/-/blogs/importing-data-from-another-solr
 But would it work with Solr 1.4.1?

 SolrEntityProcessor is not available in 1.4.1. I would dump the stored fields 
 into a comma-separated file, and use http://wiki.apache.org/solr/UpdateCSV to 
 feed it into the new solr instance.




Re: Using UUID for uniqueId

2012-02-08 Thread François Schiettecatte
Anderson

I would say that this is highly unlikely, but you would need to pay attention 
to how they are generated, this would be a good place to start:

http://en.wikipedia.org/wiki/Universally_unique_identifier
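
For what it's worth, the stock Java generator produces version 4
(random) UUIDs, for which cross-machine collisions are not a practical
concern -- a quick sketch:

import java.util.UUID;

public class MakeId {
    public static void main(String[] args) {
        // 122 random bits; safe to generate independently on many shards
        System.out.println(UUID.randomUUID().toString());
    }
}

It is the time/MAC-based (version 1) generators that need the care
described in that article.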

Cheers

François

On Feb 8, 2012, at 1:31 PM, Anderson vasconcelos wrote:

 HI all
 
 If i use the UUID like a uniqueId in the future if i break my index in
 shards, i will have problems? The UUID generation could generate the same
 UUID in differents machines?
 
 Thanks



Thank you all

2012-02-08 Thread Tim Hibbs
All,

It appears my attempt at using solr for the application I support is
about to fail. I'm personally and professionally disappointed, but I
wanted to say Many Thanks to those of you who have provided so much
help to so many on this list. In the right hands and in the right
environments, it has so much potential. You all have shown the
collective knowledge and cooperation it takes to bring that potential to
fruition.

I wish I'd been able to pick up on the right details of the toolset to
be able to make this work.

Best of luck to you all!

Tim Hibbs

On 2/7/2012 2:53 PM, Tim Hibbs wrote:
 Hi, all...
 
 I have a small problem retrieving the full set of query responses I need
 and would appreciate any help.
 
 I have a query string as follows:
 
 +((Title:sales) (+Title:sales) (TOC:sales) (+TOC:sales)
 (Keywords:sales) (+Keywords:sales) (text:sales) (+text:sales)
 (sales)) +(RepType:WRO Revenue Services) +(ContentType:SOP
 ContentType:Key Concept) -(Topics:Backup)
 
 The query is intended to be:
 
 MUST have at least one of:
 - exact phrase in field Title
 - all of the phrase words in field Title
 - exact phrase in field TOC
 - all of the phrase words in field TOC
 - exact phrase in field Keywords
 - all of the phrase words in field Keywords
 - exact phrase in field text
 - all of the phrase words in field text
 - any of the phrase words in field text
 
 MUST have WRO Revenue Services in field RepType
 MUST have at least one of:
 - SOP in field ContentType
 - Key Concept in field ContentType
 MUST NOT have Backup in field Topics
 
 It's almost working, but it misses a couple of items that contain a
 single occurrence of the word "sale" in an indexed field. The indexed
 field containing that single occurrence is named UrlContent.
 
 schema.xml
 
 UrlContent is defined as:
 <field name="UrlContent" type="text" indexed="true" stored="false"
 required="false" omitNorms="false"/>
 
 Copyfields are as follows:
   <copyField source="Title" dest="text"/>
   <copyField source="Keywords" dest="text"/>
   <copyField source="TOC" dest="text"/>
   <copyField source="Overview" dest="text"/>
   <copyField source="UrlContent" dest="text"/>
 
 Thanks,
 Tim Hibbs


solr/tomcat performance.

2012-02-08 Thread adm1n
Hi,

I'm running solr+tomcat with the following configuration:
I have 16 slaves, which are being queried by aggregator, while aggregator
being queried by the users.
My slaveUrls variable in solr.xml (on the aggregator) looks like:
<property name="slaveUrls"
value="host01/slave01,host02/slave02,host03/slave03,...,host16/slave16" />
I'm running it on a linux machine (not dedicated, there are some other 'heavy'
processes) with 16 quad CPUs and 66GB RAM.

I ran some tests and saw that when I did 400 concurrent requests to the
aggregator, the host stopped responding until I restarted tomcat. I tried
to 'play' with tomcat's/java configuration a little, but it didn't help me
much and the main issue was memory usage and timeouts. Currently I'm using
the following settings:
Java:
-Xms256m -Xmx8192m
I tried to tweak -XX:MinHeapFreeRatio setting, but from what I could see no
memory was returned to OS.
Tomcat:
<Executor name="HTTPThreadPool" namePrefix="HTTPThread-"
maxThreads="8000"
minSpareThreads="4000"/>

<Connector executor="HTTPThreadPool" port="8080" protocol="HTTP/1.1"
redirectPort="8443" URIEncoding="UTF-8"
maxHttpHeaderSize="8388608"
enableLookups="false"
acceptCount="100"
connectionTimeout="1" />

Assuming I'll have ~1000 requests/second made to the aggregator, across how many
aggregators should I balance the load? Or maybe I can achieve better
performance just by tweaking the current system?


Any help/advise will be appreciated,
Thanks!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-tomcat-performance-tp3727199p3727199.html
Sent from the Solr - User mailing list archive at Nabble.com.


Index Start Question

2012-02-08 Thread Hoffman, Chase
Please forgive me if this is a dumb question.  I've never dealt with SOLR 
before, and I'm being asked to determine from the logs when a SOLR index is 
kicked off (it is a Windows server).  The TOMCAT service runs continually, so 
no love there.  In parsing the logs, I think 
org.apache.solr.core.SolrResourceLoader init is the indicator, since 
org.apache.solr.core.SolrCore execute seems to occur even when I know an 
index has not been started.

Any advice you could give me would be wonderful.

Best,

--Chase

Chase Hoffman
Infrastructure Systems Administrator, Performance Technologies
The Advisory Board Company
512-681-2190 direct | 512-609-1150 fax
hoffm...@advisory.commailto:hoffm...@advisory.com | 
www.advisory.comhttp://www.advisory.com



SolrCloud is in trunk.

2012-02-08 Thread Mark Miller
For those that are interested and have not noticed, the latest work on 
SolrCloud and distributed indexing is now in trunk.

SolrCloud is our name for a new set of distributed capabilities that improve 
upon the old style distributed search and index based replication.

It provides for high availability and fault tolerance while allowing for near 
realtime search and an interface that matches what you are used to with 
previous versions of Solr.

We are looking to release this in the next 4.0 release, and any feedback early 
users can provide will be very useful. So if you have an interest in these 
types of features, please take the latest trunk build for a spin and provide 
some feedback. 

There is still a lot more planned, so feel free to chime in on what you would 
like to see - this is essentially the end of stage one. 

You can read more about what we have done on the wiki: 
http://wiki.apache.org/solr/SolrCloud

Also, a couple blog posts I recently saw pop up:

http://blog.sematext.com/2012/02/01/solrcloud-distributed-realtime-search
http://outerthought.org/blog/491-ot.html

I'll contribute my own blog post as well when I get a chance, but there should 
be a fair amount of info there to get you started if you are interested. 

Thanks,

- Mark Miller
lucidimagination.com













Re: Using UUID for uniqueId

2012-02-08 Thread Anderson vasconcelos
Thanks
2012/2/8 François Schiettecatte fschietteca...@gmail.com

 Anderson

 I would say that this is highly unlikely, but you would need to pay
 attention to how they are generated, this would be a good place to start:

http://en.wikipedia.org/wiki/Universally_unique_identifier

 Cheers

 François

 On Feb 8, 2012, at 1:31 PM, Anderson vasconcelos wrote:

  HI all
 
  If i use the UUID like a uniqueId in the future if i break my index in
  shards, i will have problems? The UUID generation could generate the same
  UUID in differents machines?
 
  Thanks




Re: SolrCloud is in trunk.

2012-02-08 Thread darren

Good job on this work. A monumental effort.

On Wed, 8 Feb 2012 16:41:13 -0500, Mark Miller markrmil...@gmail.com
wrote:
 For those that are interested and have not noticed, the latest work on
 SolrCloud and distributed indexing is now in trunk.
 
 SolrCloud is our name for a new set of distributed capabilities that
 improve upon the old style distributed search and index based
replication.
 
 It provides for high availability and fault tolerance while allowing for
 near realtime search and an interface that matches what you are used to
 with previous versions of Solr.
 
 We are looking to release this in the next 4.0 release, and any feedback
 early users can provide will be very useful. So if you have an interest
in
 these types of features, please take the latest trunk build for a spin
and
 provide some feedback. 
 
 There is still a lot more planned, so feel free to chime in on what you
 would like to see - this is essentially the end of stage one. 
 
 You can read more about what we have done on the wiki:
 http://wiki.apache.org/solr/SolrCloud
 
 Also, a couple blog posts I recently saw pop up:
 

http://blog.sematext.com/2012/02/01/solrcloud-distributed-realtime-search
 http://outerthought.org/blog/491-ot.html
 
 I'll contribute my own blog post as well when I get a chance, but there
 should be a fair amount of info there to get you started if you are
 interested. 
 
 Thanks,
 
 - Mark Miller
 lucidimagination.com


Re: Improving performance for SOLR geo queries?

2012-02-08 Thread Ryan McKinley
Hi Matthias-

I'm trying to understand how you have your data indexed so we can give
reasonable direction.

What field type are you using for your locations?  Is it using the
solr spatial field types?  What do you see when you look at the debug
information from debugQuery=true?

From my experience, there is no single best practice for spatial
queries -- it will depend on your data density and distribution.

You may also want to look at:
http://code.google.com/p/lucene-spatial-playground/
but note this is off lucene trunk -- the geohash queries are super fast though
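
On the caching question from the earlier mail: since 3.4 you can keep a
geo clause out of the filterCache and run it after your cheap cached
filters using the cache/cost local params -- a sketch (field name and
point are made up):

fq={!geofilt cache=false cost=100 sfield=store pt=53.55,9.99 d=20}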

ryan




2012/2/8 Matthias Käppler matth...@qype.com:
 Hi Erick,

 if we're not doing geo searches, we filter by location tags that we
 attach to places. This is simply a hierachical regional id, which is
 simple to filter for, but much less flexible. We use that on Web a
 lot, but not on mobile, where we want to perform searches in
 arbitrary radii around arbitrary positions. For those location tag
 kind of queries, the average time spent in SOLR is 43msec (I'm looking
 at the New Relic snapshot of the last 12 hours). I have disabled our
 optimization again just yesterday, so for the bbox queries we're now
 at an avg of 220ms (same time window). That's a 5 fold increase in
 response time, and in peak hours it's worse than that.

 I've also found a blog post from 3 years ago which outlines the inner
 workings of the SOLR spatial indexing and searching:
 http://www.searchworkings.org/blog/-/blogs/23842
 From that it seems as if SOLR already performs a similar optimization
 we had in mind during the index step, so if I understand correctly, it
 doesn't even search over all records, only those that were mapped to
 the grid box identified during indexing.

 What I would love to see is what the suggested way is to perform a geo
 query on SOLR, considering that they're so difficult to cache and
 expensive to run. Is the best approach to restrict the candidate set
 as much as possible using cheap filter queries, so that SOLR merely
 has to do the geo search against these subsets? How does the query
 planner work here? I see there's a cost attached to a filter query,
 but one can only set it when cache is set to false? Are cached geo
 queries executed last when there are cheaper filter queries to cut
 down on documents? If you have a real world practical setup to share,
 one that performs well in a production environment that serves
 requests in the Millions per day, that would be great.

 I'd love to contribute documentation by the way, if you knew me you'd
 know I'm an avid open source contributor and actually run several open
 source projects myself. But tell me, how can I possibly contribute
 answer to questions I don't have an answer to? That's why I'm here,
 remember :) So please, these kinds of snippy replies are not helping
 anyone.

 Thanks
 -Matthias

 On Tue, Feb 7, 2012 at 3:06 PM, Erick Erickson erickerick...@gmail.com 
 wrote:
 So the obvious question is what is your
 performance like without the distance filters?

 Without that knowledge, we have no clue whether
 the modifications you've made had any hope of
 speeding up your response times

 As for the docs, any improvements you'd like to
 contribute would be happily received

 Best
 Erick

 2012/2/6 Matthias Käppler matth...@qype.com:
 Hi,

 we need to perform fast geo lookups on an index of ~13M places, and
 were running into performance problems here with SOLR. We haven't done
 a lot of query optimization / SOLR tuning up until now so there's
 probably a lot of things we're missing. I was wondering if you could
 give me some feedback on the way we do things, whether they make
 sense, and especially why a supposed optimization we implemented
 recently seems to have no effect, when we actually thought it would
 help a lot.

 What we do is this: our API is built on a Rails stack and talks to
 SOLR via a Ruby wrapper. We have a few filters that almost always
 apply, which we put in filter queries. Filter cache hit rate is
 excellent, about 97%, and cache size caps at 10k filters (max size is
 32k, but it never seems to reach that many, probably because we
 replicate / delta update every few minutes). Still, geo queries are
 slow, about 250-500msec on average. We send them with cache=false, so
 as to not flood the fq cache and cause undesirable evictions.

 Now our idea was this: while the actual geo queries are poorly
 cacheable, we could clearly identify geographical regions which are
 more often queried than others (naturally, since we're a user driven
 service). Therefore, we dynamically partition Earth into a static grid
 of overlapping boxes, where the grid size (the distance of the nodes)
 depends on the maximum allowed search radius. That way, for every user
 query, we would always be able to identify a single bounding box that
 covers it. This larger bounding box (200km edge length) we would send
 to SOLR as a cached filter query, along with the actual user query
 

Re: solr cloud concepts

2012-02-08 Thread Adeel Qureshi
okay so after reading Bruno's blog post .. lets add slice to the mix as
well .. so we have got collections, cores, shards, partitions and slices :)
..

The whole point with cores is to be able to have different schemas on the
same solr server instance. So how does that change with collections .. may
be an example might help .. if I want to setup a solrcloud cluster with 2
cores (different schema) .. with each core having 2 shards (i m assuming
shards are really partitions here, across multiple nodes in the cluster) ..
with one shard being the replica..


On Wed, Feb 8, 2012 at 11:35 AM, Mark Miller markrmil...@gmail.com wrote:


 On Feb 8, 2012, at 10:31 AM, Adeel Qureshi wrote:

  I have been using solr for a while and have recently started getting into
  solrcloud .. i am a bit confused with some of the concepts ..
 
  1. what exactly is the relationship between a collection and the core ..
  can a core has multiple collections in it .. in this case all collections
  within this core will have the same schema .. and i am assuming all
  instances of collections within the core can be deployed on different
 solr
  nodes to achieve distributed search ..
  or is it the other way around where a collection can have multiple cores

 Currently, a core basically equals a replica of the index.

 So you might have a collection called collection1 - lets say it's 2 shards
 and each shard has a single replica:

 Collection1
 shard1 replica1
 shard1 replica2
 shard2 replica1
 shard2 replica2

 Each of those replicas is a core. So a collection has multiple cores
 basically. Also, each of those cores can be on a different machine. So yes,
 you have distributed indexing and distributed search.

 
  2. at some places it has been pointed out that solrcloud doesnt actually
  supports replication .. but in the solrcloud wiki the second example is
  supposed to be for replication .. so does solrcloud at this point
 supports
  automatic replication where as you add more servers it automatically uses
  the additional servers as replicas

 SolrCloud doesn't support the old style Solr replication concept. It does
 however, handle replication - it's just all pretty much automatic and
 behind the scenes - eg all the information about Solr replication in the
 wiki documentation for previous versions of Solr is really not applicable.
 We now achieve replica copies by sending documents to each shard one
 document at a time so that we can support near realtime search. The old
 style replication is only used in recovery, or when you start a new replica
 machine and it has to 'catchup' to the other replicas.

 
  I have a few more questions but I wanted to get these basic ones out of
 the
  way first .. I would appreciate any response.

 Fire away.

 
  Thanks
  Adeel

 - Mark Miller
 lucidimagination.com














linking documents in solr

2012-02-08 Thread T Vinod Gupta
hi,
I have a question around documents linking in solr and want to know if its
possible. lets say i have a set of blogs and their authors that i want to
index seperately. is it possible to link a document describing a blog to
another document describing an author? if yes, can i search for blogs with
filters on attributes of the author? if yes, if i update an attribute of an
author (by its id), then will the search results reflect the updated
attribute(s)?

thanks


Re: solr cloud concepts

2012-02-08 Thread Mark Miller

On Feb 8, 2012, at 5:26 PM, Adeel Qureshi wrote:

 okay so after reading Bruno's blog post .. lets add slice to the mix as
 well .. so we have got collections, cores, shards, partitions and slices :)
 ..

Yeah - heh - this has bugged me, but we have not really all come down on 
agreement of terminology here. I was a fan of using shard for each node and 
slice for partition. Another couple of committers wanted partitions rather than 
slice. Another says slice in code, shard for both in terminology and use 
context...

I'd even go for shards as partitions and replicas for every node in a shard. 
But those fine points are still settling ;)

 
 The whole point with cores is to be able to have different schemas on the
 same solr server instance. So how does that changes with collections .. may
 be an example might help .. if I want to setup a solrcloud cluster with 2
 cores (different schema) .. with each core having 2 shards (i m assuming
 shards are really partitions here, across multiple nodes in the cluster) ..
 with one shard being the replica..

So this would mean you want to create 2 collections. Think of a collection as a 
bunch of SolrCores that all share the same schema and config. 

So you would start up 2 nodes set to one collection and with numShards=1 that 
will give you one shard hosted by two identical SolrCores, giving you a 
replication factor. The full index will be in each of the two SolrCores.

Then if you start another two nodes and specify a different collection name, 
you will get the same thing, but distinct from your first collection (although, 
if both collections have compatible schema/config you can still search across 
them).
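
Concretely, along the lines of the SolrCloud wiki examples (paths and
ports assumed from the stock Jetty setup):

# node 1: bootstraps the config set and runs an embedded ZooKeeper
cd example
java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf \
     -DzkRun -DnumShards=1 -jar start.jar

# node 2: points at ZooKeeper and comes up as the second replica of shard1
cd example2
java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar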

 
 
 On Wed, Feb 8, 2012 at 11:35 AM, Mark Miller markrmil...@gmail.com wrote:
 
 
 On Feb 8, 2012, at 10:31 AM, Adeel Qureshi wrote:
 
 I have been using solr for a while and have recently started getting into
 solrcloud .. i am a bit confused with some of the concepts ..
 
 1. what exactly is the relationship between a collection and the core ..
 can a core has multiple collections in it .. in this case all collections
 within this core will have the same schema .. and i am assuming all
 instances of collections within the core can be deployed on different
 solr
 nodes to achieve distributed search ..
 or is it the other way around where a collection can have multiple cores
 
 Currently, a core basically equals a replica of the index.
 
 So you might have a collection called collection1 - lets say it's 2 shards
 and each shard has a single replica:
 
 Collection1
 shard1 replica1
 shard1 replica2
 shard2 replica1
 shard2 replica2
 
 Each of those replicas is a core. So a collection has multiple cores
 basically. Also, each of those cores can be on a different machine. So yes,
 you have distributed indexing and distributed search.
 
 
 2. at some places it has been pointed out that solrcloud doesnt actually
 supports replication .. but in the solrcloud wiki the second example is
 supposed to be for replication .. so does solrcloud at this point
 supports
 automatic replication where as you add more servers it automatically uses
 the additional servers as replicas
 
 SolrCloud doesn't support the old style Solr replication concept. It does
 however, handle replication - it's just all pretty much automatic and
 behind the scenes - eg all the information about Solr replication in the
 wiki documentation for previous versions of Solr is really not applicable.
 We now achieve replica copies by sending documents to each shard one
 document at a time so that we can support near realtime search. The old
 style replication is only used in recovery, or when you start a new replica
 machine and it has to 'catchup' to the other replicas.
 
 
 I have a few more questions but I wanted to get these basic ones out of
 the
 way first .. I would appreciate any response.
 
 Fire away.
 
 
 Thanks
 Adeel
 
 - Mark Miller
 lucidimagination.com
 
 
 
 
 
 
 
 
 
 
 
 

- Mark Miller
lucidimagination.com













Re: Improving performance for SOLR geo queries?

2012-02-08 Thread Nicolas Flacco
I compared locallucene to spatial search and saw a performance
degradation, even using geohash queries, though perhaps I indexed things
wrong? Locallucene across 6 machines handles 150 queries per second fine,
but using geofilt and geohash I got lots of timeouts even when I was doing
only 50 queries per second. Has anybody done a formal comparison of
locallucene with spatial search and latlontype, pointtype and geohash?

On 2/8/12 2:20 PM, Ryan McKinley ryan...@gmail.com wrote:

Hi Matthias-

I'm trying to understand how you have your data indexed so we can give
reasonable direction.

What field type are you using for your locations?  Is it using the
solr spatial field types?  What do you see when you look at the debug
information from debugQuery=true?

From my experience, there is no single best practice for spatial
queries -- it will depend on your data density and distribution.

You may also want to look at:
http://code.google.com/p/lucene-spatial-playground/
but note this is off lucene trunk -- the geohash queries are super fast
though

ryan




2012/2/8 Matthias Käppler matth...@qype.com:
 Hi Erick,

 if we're not doing geo searches, we filter by location tags that we
 attach to places. This is simply a hierachical regional id, which is
 simple to filter for, but much less flexible. We use that on Web a
 lot, but not on mobile, where we want to perform searches in
 arbitrary radii around arbitrary positions. For those location tag
 kind of queries, the average time spent in SOLR is 43msec (I'm looking
 at the New Relic snapshot of the last 12 hours). I have disabled our
 optimization again just yesterday, so for the bbox queries we're now
 at an avg of 220ms (same time window). That's a 5 fold increase in
 response time, and in peak hours it's worse than that.

 I've also found a blog post from 3 years ago which outlines the inner
 workings of the SOLR spatial indexing and searching:
 http://www.searchworkings.org/blog/-/blogs/23842
 From that it seems as if SOLR already performs a similar optimization
 we had in mind during the index step, so if I understand correctly, it
 doesn't even search over all records, only those that were mapped to
 the grid box identified during indexing.

 What I would love to see is what the suggested way is to perform a geo
 query on SOLR, considering that they're so difficult to cache and
 expensive to run. Is the best approach to restrict the candidate set
 as much as possible using cheap filter queries, so that SOLR merely
 has to do the geo search against these subsets? How does the query
 planner work here? I see there's a cost attached to a filter query,
 but one can only set it when cache is set to false? Are cached geo
 queries executed last when there are cheaper filter queries to cut
 down on documents? If you have a real world practical setup to share,
 one that performs well in a production environment that serves
 requests in the Millions per day, that would be great.

 I'd love to contribute documentation by the way, if you knew me you'd
 know I'm an avid open source contributor and actually run several open
 source projects myself. But tell me, how can I possibly contribute
 answer to questions I don't have an answer to? That's why I'm here,
 remember :) So please, these kinds of snippy replies are not helping
 anyone.

 Thanks
 -Matthias

 On Tue, Feb 7, 2012 at 3:06 PM, Erick Erickson
erickerick...@gmail.com wrote:
 So the obvious question is what is your
 performance like without the distance filters?

 Without that knowledge, we have no clue whether
 the modifications you've made had any hope of
 speeding up your response times

 As for the docs, any improvements you'd like to
 contribute would be happily received

 Best
 Erick

 2012/2/6 Matthias Käppler matth...@qype.com:
 Hi,

 we need to perform fast geo lookups on an index of ~13M places, and
 were running into performance problems here with SOLR. We haven't done
 a lot of query optimization / SOLR tuning up until now so there's
 probably a lot of things we're missing. I was wondering if you could
 give me some feedback on the way we do things, whether they make
 sense, and especially why a supposed optimization we implemented
 recently seems to have no effect, when we actually thought it would
 help a lot.

 What we do is this: our API is built on a Rails stack and talks to
 SOLR via a Ruby wrapper. We have a few filters that almost always
 apply, which we put in filter queries. Filter cache hit rate is
 excellent, about 97%, and cache size caps at 10k filters (max size is
 32k, but it never seems to reach that many, probably because we
 replicate / delta update every few minutes). Still, geo queries are
 slow, about 250-500msec on average. We send them with cache=false, so
 as to not flood the fq cache and cause undesirable evictions.

 Now our idea was this: while the actual geo queries are poorly
 cacheable, we could clearly identify geographical regions which are
 more often 

Re: usage of /etc/jetty.xml when debugging Solr in Eclipse

2012-02-08 Thread jmlucjav
yes, I am using https://github.com/alexwinston/RunJettyRun, which is apparently
a fork of the original project, created out of the need to be able to use a
jetty.xml.

So I am already setting an additional jetty.xml, this can be done in the Run
configuration, no need to use -D param. But as I mentioned solr does not
start cleanly if I do that.

So I wanted to understand what role /etc/jetty.xml plays:
- when solr is started via 'java -jar start.jar'
- when started with RunJettyRun in eclipse.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/usage-of-etc-jetty-xml-when-debugging-Solr-in-Eclipse-tp3725588p3728008.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr cloud concepts

2012-02-08 Thread Jamie Johnson
Mark,
is the recommendation now to have each solr instance be a separate core in
solr cloud? I had thought that the core name was by default the collection
name? Or are you saying that although they have the same name they are
separate because they are in different JVMs?

On Wednesday, February 8, 2012, Mark Miller markrmil...@gmail.com wrote:

 On Feb 8, 2012, at 10:31 AM, Adeel Qureshi wrote:

 I have been using solr for a while and have recently started getting into
 solrcloud .. i am a bit confused with some of the concepts ..

 1. what exactly is the relationship between a collection and the core ..
 can a core has multiple collections in it .. in this case all collections
 within this core will have the same schema .. and i am assuming all
 instances of collections within the core can be deployed on different
solr
 nodes to achieve distributed search ..
 or is it the other way around where a collection can have multiple cores

 Currently, a core basically equals a replica of the index.

 So you might have a collection called collection1 - lets say it's 2
shards and each shard has a single replica:

 Collection1
 shard1 replica1
 shard1 replica2
 shard2 replica1
 shard2 replica2

 Each of those replicas is a core. So a collection has multiple cores
basically. Also, each of those cores can be on a different machine. So yes,
you have distributed indexing and distributed search.


 2. at some places it has been pointed out that solrcloud doesnt actually
 supports replication .. but in the solrcloud wiki the second example is
 supposed to be for replication .. so does solrcloud at this point
supports
 automatic replication where as you add more servers it automatically uses
 the additional servers as replicas

 SolrCloud doesn't support the old style Solr replication concept. It does
however, handle replication - it's just all pretty much automatic and
behind the scenes - eg all the information about Solr replication in the
wiki documentation for previous versions of Solr is really not applicable.
We now achieve replica copies by sending documents to each shard one
document at a time so that we can support near realtime search. The old
style replication is only used in recovery, or when you start a new replica
machine and it has to 'catchup' to the other replicas.


 I have a few more questions but I wanted to get these basic ones out of
the
 way first .. I would appreciate any response.

 Fire away.


 Thanks
 Adeel

 - Mark Miller
 lucidimagination.com














Re: solr cloud concepts

2012-02-08 Thread Mark Miller

On Feb 8, 2012, at 9:36 PM, Jamie Johnson wrote:

 Mark,
 is the recommendation now to have each solr instance be a separate core in
 solr cloud? I had thought that the core name was by default the collection
 name? Or are you saying that although they have the same name they are
 separate because they are in different JVMs?

By default, the collection name is set to the core name. This is really just 
for convenience when you are getting started. It gives you a default collection 
name of collection1 because the default SolrCore name is collection1, and each 
SolrCore on each instance is addressable as /solr/collection1.

You can certainly have core names be whatever you want and explicitly pass its 
collection. In that case, the url for each would be different - though I think 
there is an open JIRA issue about making that nicer - so that you can look up 
the right core even if you pass the collection name or something.

- Mark Miller
lucidimagination.com













Re: multiple cores in a single instance vs multiple instances with single core

2012-02-08 Thread Mark Miller

On Feb 8, 2012, at 9:52 PM, Jamie Johnson wrote:

 In solr cloud what is a better approach / use of resources having multiple
 cores on a single instance or multiple instances with a single core? What
 are the benefits and drawbacks of each?


It depends I suppose. If you are talking about on a single machine, I'd prefer 
using multiple cores over multiple Solr instances. I think it's just easier to 
manage. You have to be sensible about that though - if all the replicas for a 
shard are on the same machine, in the same instance, as different cores, you 
don't have a lot of room for error - if that box goes down, goodbye. But you 
can certainly mix and match instances and cores.

One interesting thing you can do is a poor man's micro-sharding - put a few 
shards per machine - then later when you add more nodes to your cluster, you 
can bring up a core on one of the new machines, it will catch up, then you 
could unload that core on the original machine and replicas. Then start up any 
other new nodes to add replicas for the moved shard. Roughly and/or something 
like that anyway - I haven't thought it through thoroughly, but Yonik has 
brought it up before, and it seems pretty easily doable.
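
(The bring-up and unload steps there are just CoreAdmin calls -- sketch
URLs, with host and core names invented:

http://newhost:8983/solr/admin/cores?action=CREATE&name=shard2_replica3&collection=collection1&shard=shard2
http://oldhost:8983/solr/admin/cores?action=UNLOAD&core=shard2_replica1
)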

- Mark Miller
lucidimagination.com













Re: multiple cores in a single instance vs multiple instances with single core

2012-02-08 Thread Jamie Johnson
Thanks Mark, in regards to failover I completely agree, I am wondering more
about performance and memory usage if the indexes are large and wondering
if the separate Java instances under heavy load would be more or less
performant.  Currently we deploy a single core per instance but deploy
multiple instances per machine
On Wednesday, February 8, 2012, Mark Miller markrmil...@gmail.com wrote:

 On Feb 8, 2012, at 9:52 PM, Jamie Johnson wrote:

 In solr cloud what is a better approach / use of resources having
multiple
 cores on a single instance or multiple instances with a single core? What
 are the benefits and drawbacks of each?


 It depends, I suppose. If you are talking about a single machine, I'd
prefer using multiple cores over multiple Solr instances - I think it's just
easier to manage. You have to be sensible about it though: if all the
replicas for a shard are on the same machine, in the same instance, as
different cores, you don't have a lot of room for error - if that box goes
down, goodbye. But you can certainly mix and match instances and cores.

 One interesting thing you can do is a poor man's micro-sharding - put a
few shards per machine. Later, when you add more nodes to your
cluster, you can bring up a core on one of the new machines, let it catch
up, and then unload the corresponding core on the original machine (and its
replicas). Then start up any other new nodes to add replicas for the moved
shard. Roughly something like that anyway - I haven't thought it through
thoroughly, but Yonik has brought it up before, and it seems pretty easily
doable.

 - Mark Miller
 lucidimagination.com














Re: solr cloud concepts

2012-02-08 Thread Adeel Qureshi
Thanks for the explanation. It makes sense but I am hoping that you can
clarify things a bit more ..

so now it sounds like in solrcloud the concept of cores has changed a bit
.. as you explained, for me to have 2 cores with different schemas I
will need 2 different collections .. and one good thing about solrcores was
that you could create new ones with the coreadmin api or http calls ..
creating new collections is not that automated, right?

secondly, if collections represent what a solrcore used to be, then
once i have a collection .. why would i ever want to add multiple cores to
it .. i mean i am trying to think of a reason why it would make sense to do
that.

Thanks


On Wed, Feb 8, 2012 at 4:41 PM, Mark Miller markrmil...@gmail.com wrote:


 On Feb 8, 2012, at 5:26 PM, Adeel Qureshi wrote:

  okay so after reading Bruno's blog post .. lets add slice to the mix as
  well .. so we have got collections, cores, shards, partitions and slices
 :)
  ..

 Yeah - heh - this has bugged me, but we have not all really come to an
agreement on terminology here. I was a fan of using shard for each node and
slice for a partition. Another couple of committers wanted partition rather
than slice. Another says slice in the code, and shard for both in terminology
and in context of use...

 I'd even go for shards as partitions and replicas for every node in a
 shard. But those fine points are still settling ;)

 
  The whole point with cores is to be able to have different schemas on the
   same solr server instance. So how does that change with collections ..
   maybe an example might help .. if I want to setup a solrcloud cluster with 2
  cores (different schema) .. with each core having 2 shards (i m assuming
  shards are really partitions here, across multiple nodes in the cluster)
 ..
  with one shard being the replica..

 So this would mean you want to create 2 collections. Think of a collection
 as a bunch of SolrCores that all share the same schema and config.

 So you would start up 2 nodes set to one collection and with numShards=1 -
that will give you one shard hosted by two identical SolrCores, giving you a
replication factor of two. The full index will be in each of the two SolrCores.

 Then if you start another two nodes and specify a different collection
name, you will get the same thing, but distinct from your first collection
(although, if both collections have compatible schema/config you can still
search across them).
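(As a side note, a sketch of what that cross-collection search could look
like from SolrJ - the collection names are hypothetical, and it assumes the
collection request parameter for querying several collections at once:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;

public class CrossCollectionSearch {
  public static void main(String[] args) throws Exception {
    CloudSolrServer server = new CloudSolrServer("localhost:2181");
    server.setDefaultCollection("collection1");
    SolrQuery q = new SolrQuery("*:*");
    // both collections need compatible schemas for this to make sense
    q.set("collection", "collection1,collection2");
    System.out.println(server.query(q).getResults().getNumFound());
    server.shutdown();
  }
}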

 
 
  On Wed, Feb 8, 2012 at 11:35 AM, Mark Miller markrmil...@gmail.com
 wrote:
 
 
  On Feb 8, 2012, at 10:31 AM, Adeel Qureshi wrote:
 
  I have been using solr for a while and have recently started getting
 into
  solrcloud .. i am a bit confused with some of the concepts ..
 
  1. what exactly is the relationship between a collection and the core ..
  can a core have multiple collections in it .. in this case all
 collections
  within this core will have the same schema .. and i am assuming all
  instances of collections within the core can be deployed on different
  solr
  nodes to achieve distributed search ..
  or is it the other way around where a collection can have multiple
 cores
 
  Currently, a core basically equals a replica of the index.
 
  So you might have a collection called collection1 - lets say it's 2
 shards
  and each shard has a single replica:
 
  Collection1
  shard1 replica1
  shard1 replica2
  shard2 replica1
  shard2 replica2
 
  Each of those replicas is a core. So a collection has multiple cores
  basically. Also, each of those cores can be on a different machine. So
 yes,
  you have distributed indexing and distributed search.
 
 
  2. in some places it has been pointed out that solrcloud doesn't actually
  support replication .. but in the solrcloud wiki the second example is
  supposed to be for replication .. so does solrcloud at this point support
  automatic replication, where as you add more servers it automatically uses
  the additional servers as replicas
 
  SolrCloud doesn't support the old style Solr replication concept. It
 does
  however, handle replication - it's just all pretty much automatic and
  behind the scenes - eg all the information about Solr replication in the
  wiki documentation for previous versions of Solr is really not
 applicable.
  We now achieve replica copies by sending documents to each shard one
  document at a time so that we can support near realtime search. The old
  style replication is only used in recovery, or when you start a new
 replica
  machine and it has to 'catchup' to the other replicas.
 
 
  I have a few more questions but I wanted to get these basic ones out of
  the
  way first .. I would appreciate any response.
 
  Fire away.
 
 
  Thanks
  Adeel
 
  - Mark Miller
  lucidimagination.com
 
 
 
 
 
 
 
 
 
 
 
 

 - Mark Miller
 lucidimagination.com














Re: Sorting solrdocumentlist object after querying

2012-02-08 Thread Kashif Khan
No, that sorting is based on multiple fields. Basically i want to sort them
like a GROUP BY statement in SQL, based on a few fields, with many loops to
go through. The problem is that i have, say, 1,000,000 solr docs after
injecting my few solr docs, and then i want to group these solr docs by some
fields and take 20 records for paging. So i need some shortcut for that.
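For the client-side sort itself, a minimal sketch - SolrDocumentList extends
ArrayList, so it can be sorted in place; groupField and price are
hypothetical field names here:

import java.util.Collections;
import java.util.Comparator;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class ClientSideSort {
  @SuppressWarnings("unchecked")
  public static void sortForGrouping(SolrDocumentList docs) {
    Collections.sort(docs, new Comparator<SolrDocument>() {
      public int compare(SolrDocument a, SolrDocument b) {
        // primary group-by field, then a secondary tie-breaker
        int c = ((Comparable<Object>) a.getFieldValue("groupField"))
            .compareTo(b.getFieldValue("groupField"));
        return c != 0 ? c
            : ((Comparable<Object>) a.getFieldValue("price"))
                .compareTo(b.getFieldValue("price"));
      }
    });
  }
}

After the sort, docs.subList(0, Math.min(20, docs.size())) gives the first
page - but sorting 1,000,000 docs client-side just to show 20 is exactly the
cost i am trying to avoid, hence the question.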
--
Kashif Khan.
http://www.kashifkhan.in



On Wed, Feb 8, 2012 at 11:07 PM, iorixxx [via Lucene] 
ml-node+s472066n3726788...@n3.nabble.com wrote:

  I want to sort a SolrDocumentList after it has been queried
  and obtained
  from the QueryResponse.getResults(). The reason is i have a
  SolrDocumentList
  obtained after querying using QueryResponse.getResults() and
  i have added
  few docs to it. Now i want to sort this SolrDocumentList
  based on the same
  fields i did the querying before i modified this
  SolrDocumentList.

 QueryResponse.getResults() will return 'rows'-many documents. Can't you sort
 them (plus your injected documents) on your own?






How do i do group by in solr with multiple shards?

2012-02-08 Thread Kashif Khan
Hi all,

I have tried group by in solr with multiple shards but it does not work.
Basically i want to do a simple GROUP BY, like the SQL statement, in solr
with multiple shards. Please suggest how i can do this, as it does not seem
to be supported out of the box by solr currently.
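(for reference, this is the kind of grouped request i mean - the field
collapsing parameters, where category is a hypothetical field and the shard
urls are placeholders; whether the grouping works across shards depends on
the solr version:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupedQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://host1:8983/solr");
    SolrQuery q = new SolrQuery("*:*");
    q.set("group", "true");           // enable result grouping
    q.set("group.field", "category"); // the GROUP BY field
    q.set("group.limit", 1);          // rows to keep per group
    q.set("shards", "host1:8983/solr,host2:8983/solr");
    QueryResponse rsp = server.query(q);
    System.out.println(rsp.getResponse());
  }
}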

Thanks & regards,
Kashif Khan



Re: How to identify the field with highest score in dismax

2012-02-08 Thread Mikhail Khludnev
Hello,

Have you tried specifying debugQuery=on and looking into the explain section?
It's not really performant, but I propose starting from it anyway.
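A minimal SolrJ sketch of turning that on, with the field names from the
example below (the SolrServer construction is assumed):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ExplainDebug {
  public static void explain(SolrServer server) throws Exception {
    SolrQuery q = new SolrQuery("Ford Mustang Ford Mustang");
    q.set("defType", "dismax");
    q.set("qf", "Name^10 Details^1");
    q.set("debugQuery", "true");
    QueryResponse rsp = server.query(q);
    // the per-field, per-clause score breakdown sits under the "explain" key
    System.out.println(rsp.getDebugMap().get("explain"));
  }
}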

Regards

On Wed, Feb 8, 2012 at 7:32 PM, crisfromnova crisfromn...@gmail.com wrote:

 Hi,

 According to the solr documentation, the dismax score is calculated with the
 formula:
 (score of matching clause with the highest score) + ( (tie parameter) *
 (scores of any other matching clauses) ).

 Is there a way to identify the field on which the matching clause score is
 the highest?

 For example I suppose that I have the following document :

 <doc>
  <str name="Name">Ford Mustang Coupe Cabrio</str>
  <str name="Details">Ford Mustang is a great car</str>
 </doc>

 and the following dismax query :

 defType=dismax&qf=Name^10+Details^1&q=Ford+Mustang+Ford+Mustang

 and receive the document with the score 5.6.
 Is there a way to find out whether the score comes from the match on the Name
 field or from the match on the Details field?

 Thanks in advance!







-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com