Re: [External] Re: Query all fields

2012-10-24 Thread Muwonge Ronald
On Thu, Oct 25, 2012 at 1:34 AM, Greene, Daniel [USA]
 wrote:
> Another option you'll find out there is to use a 'copy field ' to copy the 
> contents of multiple fields into a single indexed field for "universal " 
> searching...
>
Told with precision
>
>
> - Reply message -
> From: "Ahmet Arslan" 
> To: "solr-user@lucene.apache.org" 
> Subject: [External] Re: Query all fields
> Date: Wed, Oct 24, 2012 6:26 pm
>
>
>
>
>> Looking at the Solr tutorial I see
>> queries like:
>>
>> q=video&fl=name,id (return only name and id fields)
>>
>> Does that query all fields for the word video?
>
> No. The query is executed on the default search field. If you add &debugQuery=on to 
> your URL you can see which field is queried.
>
>> Is there something specific setup in the solr tutorial that
>> allows you
>> to query across all fields?
>
> With http://wiki.apache.org/solr/ExtendedDisMax you can do that. You just 
> need to supply names of fields that you want to search.
>
> defType=edismax&qf=description,title,name,etc.


SolrJ missing CollectionAdmin API to create new collections dynamically

2012-10-24 Thread Markus.Mirsberger

Hi,

I can't find a good way to create a new Collection with SolrJ.
I need to create my Collections dynamically, and at the moment the only 
way I see is to call the CollectionAdmin API with an HTTP call directly to 
any of my Solr servers.


I don't like this, because I think it is better to communicate only 
through the CloudSolrServer connected to the ZooKeeper servers, so my 
application doesn't need to know anything about the Solr servers behind it.


Is there a better way to do this? Maybe through the ZkStateReader inside 
the CloudSolrServer instance?


Thanks and regards,
Markus
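
For reference, a minimal sketch of the HTTP workaround described above, calling the Collections API directly from Java. The host, port, collection name and shard count are placeholders, not values from this thread; depending on what configs are in ZooKeeper, a collection.configName parameter may also be needed:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class CreateCollectionExample {
    public static void main(String[] args) throws Exception {
        // Collections API CREATE call against any one Solr node (placeholder values).
        URL url = new URL("http://localhost:8983/solr/admin/collections"
                + "?action=CREATE&name=mycollection&numShards=2");
        // Reading the stream executes the request; Solr returns an XML status response.
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}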


org.apache.lucene.queryparser.classic.ParseException - a Bug?

2012-10-24 Thread deniz
Hi all,

I was trying to provide spatial search via the SolrJ client, but when I try to
run it I get:


org.apache.solr.common.SolrException:
org.apache.lucene.queryparser.classic.ParseException: Expected identifier at
pos 9 str='{!geofilt+sfield=store}'


I have tried to do the same search in the browser and via a URL request from Java,
and there was no problem with those... but via SolrJ I keep getting the
error above...

Below is the code that I have used to reproduce the error:



import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.URL;
import java.net.URLConnection;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;


public class RunMeFast {

    public static void main(String[] args) {
        try {
            HttpSolrServer server =
                    new HttpSolrServer("http://localhost:8983/solr/testcore");
            SolrQuery solrQuery = new SolrQuery();

            solrQuery.set("q", "*:*");
            solrQuery.set("d", "1000");
            solrQuery.set("fl", "*,_dist_:geodist()");
            solrQuery.set("sfield", "store");
            solrQuery.set("pt", "47,+8");
            solrQuery.set("fq", "{!geofilt+sfield=store}");
            QueryResponse response = server.query(solrQuery);
            long totalDocs = response.getResults().getNumFound();
            SolrDocumentList docList = response.getResults();
            System.out.println(docList.toString());
        } catch (Exception e) {
            e.printStackTrace();
        }

        try {
            URL url = new URL("http://localhost:8983/solr/testcore/select?q=*:*&d=1000&fl=*,_dist_:geodist()&sfield=store&pt=47,+8&fq={!geofilt+sfield=store}");
            URLConnection conn = url.openConnection();
            conn.setDoOutput(true);
            OutputStreamWriter wr = new OutputStreamWriter(conn.getOutputStream());
            wr.flush();

            // Get the response
            BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            while ((line = rd.readLine()) != null) {
                System.out.println(line);
            }
            wr.close();
            rd.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}




And the output is :


SOLRJ Call:

Oct 25, 2012 11:54:24 AM org.apache.solr.client.solrj.impl.HttpClientUtil
createClient
INFO: Creating new http client,
config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
org.apache.solr.common.SolrException:
org.apache.lucene.queryparser.classic.ParseException: Expected identifier at
pos 9 str='{!geofilt+sfield=store}'
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
at
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:90)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
at RunMeFast.main(RunMeFast.java:29)

URL Request:


correct doc list




I am using Solr 4.0...

So is this a bug, or simply a mistake? Can anyone help me?
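
For comparison, a sketch of how the same filter might be passed via SolrJ with a literal space instead of '+', on the assumption that SolrJ URL-encodes parameter values itself, so values should be supplied unencoded; this is an illustration, not a confirmed answer from the thread:

// Drop-in replacement for the pt/fq lines in the SolrJ block above (sketch only).
solrQuery.set("pt", "47,8");                    // unencoded, comma-separated point
solrQuery.set("fq", "{!geofilt sfield=store}"); // literal space, not '+'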








-
Zeki ama calismiyor... Calissa yapar...
--
View this message in context: 
http://lucene.472066.n3.nabble.com/org-apache-lucene-queryparser-classic-ParseException-a-Bug-tp4015763.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: MMapDirectory, demand paging, lazy evaluation, ramfs and the much maligned RAMDirectory (oh my!)

2012-10-24 Thread Shawn Heisey

On 10/24/2012 6:29 PM, Aaron Daubman wrote:

Let me be clear that that I am not interested in RAMDirectory.
However, I would like to better understand the oft-recommended and
currently-default MMapDirectory, and what the tradeoffs would be, when
using a 64-bit linux server dedicated to this single solr instance,
with plenty (more than 2x index size) of RAM, of storing the index
files on SSDs versus on a ramfs mount.

I understand that using the default MMapDirectory will allow caching
of the index in-memory, however, my understanding is that mmaped files
are demand-paged (lazy evaluated), meaning that only after a block is
read from disk will it be paged into memory - is this correct? is it
actually block-by-block (page size by page size?) - any pointers to
decent documentation on this regardless of the effectiveness of the
approach would be appreciated...


You are correct that the data must have just been accessed to be in the 
disk cache. This does, however, include writes -- so any data that gets 
indexed will be in the cache because it has just been written.  I do 
believe that it is read in one page block at a time, and I believe that 
the blocks are 4k in size.



My concern with using MMapDirectory for an index stored on disk (even
SSDs), if my understanding is correct, is that there is still a large
startup cost to MMapDirectory, as it may take many queries before even
most of a 20G index has been loaded into memory, and there may yet
still be "dark corners" that only come up in edge-case queries that
cause QTime spikes should these queries ever occur.

I would like to ensure that, at startup, no query will incur
disk-seek/read penalties.

Is the "right" way to achieve this to copy the index to a ramfs (NOT
ramdisk) mount and then continue to use MMapDirectory in Solr to read
the index? I am under the impression that when using ramfs (rather
than ramdisk, for which this would not work) a file mmaped on a ramfs
mount will actually share the same address space, and so would not
incur the typical double-ram overhead of mmaping a file in memory just
to have yet another copy of the file created in a second memory
location. Is this correct? If not, would you please point me to
documentation stating otherwise (I haven't found much documentation
either way).


I am not familiar with any "double-ram overhead" from using mmap.  It 
should be extraordinarily efficient, so much so that even when your 
index won't fit in RAM, performance is typically still excellent.  Using 
an SSD instead of a spinning disk will increase performance across the 
board, until enough of the index is cached in RAM, after which it won't 
make a lot of difference.


My parting thoughts, with a general note to the masses: Do not try this 
if you are not absolutely sure your index will fit in memory!  It will 
tend to cause WAY more problems than it will solve for most people with 
large indexes.


If you actually do have considerably more RAM than your index size, and 
you know that the index will never grow to where it might not fit, you 
can use a simple trick to get it all cached, even before running 
queries.  Just read the entire contents of the index, discarding 
everything you read.  There are two main OS variants to consider here, 
and both can be scripted, as noted below.  Run the command twice to see 
the difference that caching makes for the second run.  Note that an SSD 
would speed the first run of these commands up considerably:


*NIX (may work on a mac too):
cat /path/to/index/files/* > /dev/null

Windows:
type C:\Path\To\Index\Files\* > NUL

Thanks,
Shawn



Re: MMapDirectory, demand paging, lazy evaluation, ramfs and the much maligned RAMDirectory (oh my!)

2012-10-24 Thread Mark Miller
Was going to say the same thing. It's also usually a good idea to reduce paging 
(e.g. setting swappiness to 0 on Linux).

- Mark
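
For reference, a typical way to do that on Linux (standard sysctl usage, matching the value suggested above):

# take effect immediately
sysctl -w vm.swappiness=0

# persist across reboots: add this line to /etc/sysctl.conf
vm.swappiness = 0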

On Oct 24, 2012, at 9:36 PM, François Schiettecatte  
wrote:

> Aaron
> 
> The best way to make sure the index is cached by the OS is to just cat it on 
> startup:
> 
>   cat `find /path/to/solr/index` > /dev/null
> 
> Just make sure your index is smaller than RAM otherwise data will be rotated 
> out.
> 
> Memory mapping is built on the virtual memory system, and I suspect that 
> ramfs is too, so I doubt very much that copying your index to ramfs will help 
> at all. Sidebar - a while ago I did a bunch of testing copying indices to 
> shared memory (/dev/shm in this case) and there was no advantage compared to 
> just accessing indices on disc when using memory mapping once the system got 
> to a steady state.
> 
> There has been a lot written about this topic on the list. Basically it comes 
> down to using MMapDirectory (which is the default), making sure your index is 
> smaller than your RAM, and allocating just enough memory to the Java VM. That 
> last part requires some benchmarking because it is so workload dependent.
> 
> Best regards
> 
> François
> 
> On Oct 24, 2012, at 8:29 PM, Aaron Daubman  wrote:
> 
>> Greetings,
>> 
>> Most times I've seen the topic of storing one's index in memory, it
>> seems the asker was referring (or understood to be referring) to the
>> (in)famous "not intended to work with huge indexes" Solr RAMDirectory.
>> 
>> Let me be clear that that I am not interested in RAMDirectory.
>> However, I would like to better understand the oft-recommended and
>> currently-default MMapDirectory, and what the tradeoffs would be, when
>> using a 64-bit linux server dedicated to this single solr instance,
>> with plenty (more than 2x index size) of RAM, of storing the index
>> files on SSDs versus on a ramfs mount.
>> 
>> I understand that using the default MMapDirectory will allow caching
>> of the index in-memory, however, my understanding is that mmaped files
>> are demand-paged (lazy evaluated), meaning that only after a block is
>> read from disk will it be paged into memory - is this correct? is it
>> actually block-by-block (page size by page size?) - any pointers to
>> decent documentation on this regardless of the effectiveness of the
>> approach would be appreciated...
>> 
>> My concern with using MMapDirectory for an index stored on disk (even
>> SSDs), if my understanding is correct, is that there is still a large
>> startup cost to MMapDirectory, as it may take many queries before even
>> most of a 20G index has been loaded into memory, and there may yet
>> still be "dark corners" that only come up in edge-case queries that
>> cause QTime spikes should these queries ever occur.
>> 
>> I would like to ensure that, at startup, no query will incur
>> disk-seek/read penalties.
>> 
>> Is the "right" way to achieve this to copy the index to a ramfs (NOT
>> ramdisk) mount and then continue to use MMapDirectory in Solr to read
>> the index? I am under the impression that when using ramfs (rather
>> than ramdisk, for which this would not work) a file mmaped on a ramfs
>> mount will actually share the same address space, and so would not
>> incur the typical double-ram overhead of mmaping a file in memory just
>> to have yet another copy of the file created in a second memory
>> location. Is this correct? If not, would you please point me to
>> documentation stating otherwise (I haven't found much documentation
>> either way).
>> 
>> Finally, given the desire to be quick at startup with a large index
>> that will still easily fit within a system's memory, am I thinking
>> about this wrong or are there other better approaches?
>> 
>> Thanks, as always,
>>Aaron
> 



Re: MMapDirectory, demand paging, lazy evaluation, ramfs and the much maligned RAMDirectory (oh my!)

2012-10-24 Thread François Schiettecatte
Aaron

The best way to make sure the index is cached by the OS is to just cat it on 
startup:

cat `find /path/to/solr/index` > /dev/null

Just make sure your index is smaller than RAM otherwise data will be rotated 
out.

Memory mapping is built on the virtual memory system, and I suspect that ramfs 
is too, so I doubt very much that copying your index to ramfs will help at all. 
Sidebar - a while ago I did a bunch of testing copying indices to shared memory 
(/dev/shm in this case) and there was no advantage compared to just accessing 
indices on disc when using memory mapping once the system got to a steady state.

There has been a lot written about this topic on the list. Basically it comes 
down to using MMapDirectory (which is the default), making sure your index is 
smaller than your RAM, and allocating just enough memory to the Java VM. That 
last part requires some benchmarking because it is so workload dependent.

Best regards

François

On Oct 24, 2012, at 8:29 PM, Aaron Daubman  wrote:

> Greetings,
> 
> Most times I've seen the topic of storing one's index in memory, it
> seems the asker was referring (or understood to be referring) to the
> (in)famous "not intended to work with huge indexes" Solr RAMDirectory.
> 
> Let me be clear that that I am not interested in RAMDirectory.
> However, I would like to better understand the oft-recommended and
> currently-default MMapDirectory, and what the tradeoffs would be, when
> using a 64-bit linux server dedicated to this single solr instance,
> with plenty (more than 2x index size) of RAM, of storing the index
> files on SSDs versus on a ramfs mount.
> 
> I understand that using the default MMapDirectory will allow caching
> of the index in-memory, however, my understanding is that mmaped files
> are demand-paged (lazy evaluated), meaning that only after a block is
> read from disk will it be paged into memory - is this correct? is it
> actually block-by-block (page size by page size?) - any pointers to
> decent documentation on this regardless of the effectiveness of the
> approach would be appreciated...
> 
> My concern with using MMapDirectory for an index stored on disk (even
> SSDs), if my understanding is correct, is that there is still a large
> startup cost to MMapDirectory, as it may take many queries before even
> most of a 20G index has been loaded into memory, and there may yet
> still be "dark corners" that only come up in edge-case queries that
> cause QTime spikes should these queries ever occur.
> 
> I would like to ensure that, at startup, no query will incur
> disk-seek/read penalties.
> 
> Is the "right" way to achieve this to copy the index to a ramfs (NOT
> ramdisk) mount and then continue to use MMapDirectory in Solr to read
> the index? I am under the impression that when using ramfs (rather
> than ramdisk, for which this would not work) a file mmaped on a ramfs
> mount will actually share the same address space, and so would not
> incur the typical double-ram overhead of mmaping a file in memory just
> to have yet another copy of the file created in a second memory
> location. Is this correct? If not, would you please point me to
> documentation stating otherwise (I haven't found much documentation
> either way).
> 
> Finally, given the desire to be quick at startup with a large index
> that will still easily fit within a system's memory, am I thinking
> about this wrong or are there other better approaches?
> 
> Thanks, as always,
> Aaron



UnsupportedOperationException: ExternalFileField

2012-10-24 Thread Carrie Coy
(Solr4) I'm getting the following error trying to use ExternalFileField 
to maintain an inStock flag.   Any idea what I'm doing wrong?


schema.xml:
 
 indexed="false" class="solr.ExternalFileField" valType="float"/>


-rw-r--r-- 1 tomcat tomcat 100434 Oct 24 20:07 external_inStock:
YM0600=1
YM0544=1
YM0505=1

solrconfig.xml:
 if(inStock,10,1)


SEVERE: null:java.lang.UnsupportedOperationException
at 
org.apache.solr.schema.ExternalFileField.write(ExternalFileField.java:85)
at 
org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:130)
at 
org.apache.solr.response.JSONWriter.writeSolrDocument(JSONResponseWriter.java:355)
at 
org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:275)
at 
org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:172)
at 
org.apache.solr.response.JSONWriter.writeNamedListAsMapMangled(JSONResponseWriter.java:154)
at 
org.apache.solr.response.PHPWriter.writeNamedList(PHPResponseWriter.java:54)
at 
org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:91)
at 
org.apache.solr.response.PHPResponseWriter.write(PHPResponseWriter.java:36)
at 
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:411)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at 
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)

at java.lang.Thread.run(Thread.java:662)


Re: Improving performance for use-case where large (200) number of phrase queries are used?

2012-10-24 Thread Aaron Daubman
Hi Peter,

Thanks for the recommendation - I believe we are thinking along the
same lines, but wanted to check to make sure. Are you suggesting
something different than my #5 (below) or are we essentially
suggesting the same thing?

On Wed, Oct 24, 2012 at 1:20 PM, Peter Keegan  wrote:
> Could you index your 'phrase tags' as single tokens? Then your phrase
> queries become simple TermQuerys.

>>
>> 5) *This is my current favorite*: stop tokenizing/analyzing these
>> terms and just use KeywordTokenizer. Most of these phrases are
>> pre-vetted, and it may be possible to clean/process any others before
>> creating the docs. My main worry here is that, currently, if I
>> understand correctly, a document with the phrase "brazilian pop" would
>> still be returned as a match to a seed document containing only the
>> phrase "brazilian" (not the other way around, but that is not
>> necessary), however, with KeywordTokenizer, this would no longer be
>> the case. If I switched from the current dubious tokenize/stem/etc...
>> and just used Keyword, would this allow queries like "this used to be
>> a long phrase query" to match documents that have "this used to be a
>> long phrase query" as one of the multivalued values in the field
>> without having to pull term positions? (and thus significantly speed
>> up performance).
>>

Thanks again,
 Aaron
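
A minimal schema.xml sketch of option #5 above — indexing each multivalued phrase value as a single token with KeywordTokenizer; the type and field names are illustrative, not taken from this thread:

<fieldType name="phrase_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
<field name="tags" type="phrase_exact" indexed="true" stored="true" multiValued="true"/>

With this, a query value only matches when it equals an entire phrase value, which is exactly the trade-off discussed above.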


Re: Newbie - Setting up date and distance facets

2012-10-24 Thread Chris Hostetter

: Thank you for the reply. The facet range gap looks good but it is too far down
: the line to be of use, I wish it was implemented though.
: 
: What I want is really a more simple question
: 
: http://wiki.apache.org/solr/SimpleFacetParameters#facet.range
: 
: Is it correct that to add facets on date and distance i should be looking at
: "facet range"?

range faceting is appropriate when you want Solr to determine a set of N 
non-overlapping ranges of fixed size based on some lower & upper bounds 
you provide (ie: "counts per day every day for the past month" or "price 
ranges of $50 each from $0-$1000").

If you have a specific, finite set of ranges you want to facet on --
regardless of whether they overlap -- then you can use facet.query...

?q=...
&facet=true
&facet.query={!key=today}date_field:[NOW/DAY TO NOW/DAY+1DAY]
&facet.query={!key=yesterday}date_field:[NOW-1DAY/DAY TO NOW/DAY]
&facet.query={!key=so_far_this_month}date_field:[NOW/MONTH TO NOW/DAY+1DAY]
&etc...



-Hoss


Re: [External] Re: Query all fields

2012-10-24 Thread Greene, Daniel [USA]
Another option you'll find out there is to use a 'copy field' to copy the 
contents of multiple fields into a single indexed field for "universal" 
searching...
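
A minimal schema.xml sketch of that copyField approach (the field names here are placeholders):

<field name="text_all" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="name" dest="text_all"/>
<copyField source="description" dest="text_all"/>

Queries can then use df=text_all (or make it the default search field) so a plain q=video hits every copied field.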



- Reply message -
From: "Ahmet Arslan" 
To: "solr-user@lucene.apache.org" 
Subject: [External] Re: Query all fields
Date: Wed, Oct 24, 2012 6:26 pm




> Looking at the Solr tutorial I see
> queries like:
>
> q=video&fl=name,id (return only name and id fields)
>
> Does that query all fields for the word video?

No. The query is executed on the default search field. If you add &debugQuery=on to your 
URL you can see which field is queried.

> Is there something specific setup in the solr tutorial that
> allows you
> to query across all fields?

With http://wiki.apache.org/solr/ExtendedDisMax you can do that. You just need 
to supply names of fields that you want to search.

defType=edismax&qf=description,title,name,etc.


Re: Query all fields

2012-10-24 Thread Billy Newman
Makes sense, thanks!

Billy

Sent from my iPhone

On Oct 24, 2012, at 4:25 PM, Ahmet Arslan  wrote:

> 
>> Looking at the Solr tutorial I see
>> queries like:
>> 
>> q=video&fl=name,id (return only name and id fields)
>> 
>> Does that query all fields for the word video? 
> 
> No. The query is executed on the default search field. If you add &debugQuery=on to 
> your URL you can see which field is queried.
> 
>> Is there something specific setup in the solr tutorial that
>> allows you
>> to query across all fields?
> 
> With http://wiki.apache.org/solr/ExtendedDisMax you can do that. You just 
> need to supply names of fields that you want to search.
> 
> defType=edismax&qf=description,title,name,etc.


Re: Query all fields

2012-10-24 Thread Ahmet Arslan

> Looking at the Solr tutorial I see
> queries like:
> 
> q=video&fl=name,id (return only name and id fields)
> 
> Does that query all fields for the word video?  

No. The query is executed on the default search field. If you add &debugQuery=on to your 
URL you can see which field is queried.

> Is there something specific setup in the solr tutorial that
> allows you
> to query across all fields?

With http://wiki.apache.org/solr/ExtendedDisMax you can do that. You just need 
to supply names of fields that you want to search.

defType=edismax&qf=description,title,name,etc.


Seeking Use Cases: Boosting & Biasing to affect search scores

2012-10-24 Thread Chris Hostetter


Hey folks, I'm giving a talk at ApacheCon in two weeks about how domain 
knowledge and/or knowledge of your user base can be used to boost/bias the 
scores of documents in Solr search results.  Simple examples being things 
like: using function queries to boost by numeric fields like date or 
popularity; or customizing the tf and lengthNorm functions in your 
Similarity based on what "good" documents look like.
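
As one small concrete sketch of the first kind of example (function-query boosting by a date field), using standard recip/ms functions with edismax; the field name and constants below are hypothetical:

&defType=edismax
&q=...
&boost=recip(ms(NOW/HOUR,publish_date),3.16e-11,1,1)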


  http://www.apachecon.eu/schedule/presentation/16/

I would love to be able to highlight some examples beyond things I've 
personally worked on, so I wanted to reach out to the entire solr-user 
list to see if anyone has interesting examples they'd like to share about 
ways they have tweaked the scoring of documents based on knowledge 
of their users/documents.


Most of the "classes" of customizations i can think of involve...

 * Customized Similarity
 * Boost queries or functions
 * ExternalFileField
 * QueryElevationComponent

...but don't be shy about sharing a technique you've used that didn't 
involve any of those things.


You can feel free to respond to this message publicly with as much or as little 
detail as you're willing to share about the application, users, and 
customizations you used.  Or if you'd like to share an interesting use 
case anonymously (ie: w/o divulging your name or company) you can send it 
to me privately.


Thanks.

-Hoss


Re: Search field affecting relevance

2012-10-24 Thread Maxim Kuleshov
Sorry guys, I initially simplified the criteria. Actually, I use
EDisMaxQueryParser with about a dozen fields configured with
different boost values.
If any of them is matched, it's OK to return such a document.

But I also have one field (at the moment) that I would like to use
only to "boost" the resulting score. So, if a document is matched only by
the value of this field, it should not be returned.

I suppose a boost query of the form "bq=field2:foobar^2" should do the
trick. I can also use a boost function to apply more complex
rules.

2012/10/24 Otis Gospodnetic :
> Hi,
>
> This is core lucene/solr functionality. +field1:foo field2:foo makes a
> match in field1 required.
>
> Otis
> --
> Performance Monitoring - http://sematext.com/spm
> On Oct 24, 2012 4:39 AM, "Maxim Kuleshov"  wrote:
>
>> Hi,
>>
>> For example, we have documents with two fields - field1 and field2.
>> Both fields are indexed and both are used in search.
>>
>> Is there a way to return only documents that are matched by field1, but
>> taking into account that if field2 is matched, relevance should be
>> higher? In other words, if document "A" is matched by field1 and
>> field2, its relevance should be higher than document "B" matched only
>> by field1 and document "C" where only field2 is matched should not be
>> returned at all?
>>
>> Could you please help with outlining the general approach how to
>> achieve this? Either it's core lucene feature, or solr post processing
>> logic or something else?
>>


Re: Search field affecting relevance

2012-10-24 Thread Otis Gospodnetic
Hi,

This is core lucene/solr functionality. +field1:foo field2:foo makes a
match in field1 required.

Otis
--
Performance Monitoring - http://sematext.com/spm
On Oct 24, 2012 4:39 AM, "Maxim Kuleshov"  wrote:

> Hi,
>
> For example, we have documents with two fields - field1 and field2.
> Both fields are indexed and both are used in search.
>
> Is there a way to return only documents that are matched by field1, but
> taking into account that if field2 is matched, relevance should be
> higher? In other words, if document "A" is matched by field1 and
> field2, its relevance should be higher than document "B" matched only
> by field1 and document "C" where only field2 is matched should not be
> returned at all?
>
> Could you please help with outlining the general approach how to
> achieve this? Either it's core lucene feature, or solr post processing
> logic or something else?
>


Re: is it possible to index

2012-10-24 Thread Erick Erickson
Do note that you _can_ join across cores. BUT the join
capability is a fairly restricted use-case. And even if it was
performant, it's not like a DB join, you can only return info
from a single kind of doc. That is, if you were joining between
customer documents and vendor documents, you could only
get info back from _either_ the customer doc _or_ the
vendor doc. There's nothing like what you automatically think
about in DB terms where you return data from both types of
docs.

And just to confuse matters further, you can do joins within the
same core with different types of documents.

But I'd try denormalizing first.

FWIW,
Erick

On Wed, Oct 24, 2012 at 3:13 PM, Marcelo Elias Del Valle
 wrote:
> This is gold info for me! Thanks!
>
> 2012/10/24 Martin Koch 
>
>> In my experience, about as fast as you can push the new data :) Depending
>> on the size of your records, this should be a matter of seconds.
>>
>> /Martin Koch
>>
>> On Wed, Oct 24, 2012 at 9:01 PM, Marcelo Elias Del Valle <
>> mvall...@gmail.com
>> > wrote:
>>
>> > Erick,
>> >
>> >  Thanks for the help, it sure helps a lot to read that, as it gives
>> me
>> > more confidence I am not crazy about what I am thinking.
>> >  The only problem I see by de-normalizing data as you said is that if
>> > any relation between customer and vendor changes, I will have to update
>> the
>> > index for all the vendors. I could have about 10 000 customers per
>> vendor.
>> >  Anyway, by what you're saying, it's more common than I was
>> imagining,
>> > right? I wonder how long solr will take to reindex 1 records when
>> this
>> > happens.
>> >
>> > Thanks,
>> > Marcelo Valle.
>> >
>> > 2012/10/24 Erick Erickson 
>> >
>> > > One, take off your RDBMS cap ...
>> > >
>> > > DB folks regularly reject the idea of de-normalizing data
>> > > to make best use of Solr, but that's what I would explore
>> > > first. Yes, this repeats the, in your case, vendor information
>> > > perhaps many times, but try that first, even though that
>> > > causes you to update multiple customers whenever a vendor
>> > > changes. You haven't specified how many customers and vendors
>> > > you're talking abou there, but unless the total number of documents
>> > > (where each document is a customer+vendor combination)
>> > > is multiple tens of millions, you probably will be fine.
>> > >
>> > > You can get a list of just customers by using grouping where you
>> > > group on customer, although that may not be the most efficient. You
>> > > could index a field, call it "cust_filter" that was set to true for the
>> > > first
>> > > customer/vendor you indexed and false (or just left out) for all the
>> > > rest and q=blahblah&fq=cust_filter:true.
>> > >
>> > > Hope that helps
>> > > Erick
>> > >
>> > > On Wed, Oct 24, 2012 at 12:01 PM, Marcelo Elias Del Valle
>> > >  wrote:
>> > > > Hello,
>> > > >
>> > > > I am new to Solr and I have a scenario where I want to use it,
>> but
>> > I
>> > > > might be misunderstanding some concepts. I will explain what I want
>> > here,
>> > > > if someone has a solution for this, I would gladly accept the help.
>> > > > I have a core indexing customers. I have another core indexing
>> > > vendors.
>> > > > Both are related to each other.
>> > > > Here is what I want to do in my application: I want to find all
>> the
>> > > > customers that follow some criteria and them find the vendors related
>> > to
>> > > > them.
>> > > >
>> > > > My first option was to to have just vendor core and in for each
>> > > > document in vendor core I would have all the customers related to it.
>> > > > However, I would write the same customer several times to the index,
>> as
>> > > > more than one vendor could be related to the same customer. Besides,
>> I
>> > > > wonder how would I write a query to list just the different
>> customers.
>> > > > Another problem is that I update customers in a different frequency I
>> > > > update vendors, but have vendor + customers in a single document
>> would
>> > > obly
>> > > > me to do the full update.
>> > > >
>> > > > Does anyone have a good solution for this I am not being able to
>> > > see? I
>> > > > might be missing some basic concept here...
>> > > >
>> > > > Thanks,
>> > > > --
>> > > > Marcelo Elias Del Valle
>> > > > http://mvalle.com - @mvallebr
>> > >
>> >
>> >
>> >
>> > --
>> > Marcelo Elias Del Valle
>> > http://mvalle.com - @mvallebr
>> >
>>
>
>
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr


Re: is it possible to index

2012-10-24 Thread Marcelo Elias Del Valle
This is gold info for me! Thanks!

2012/10/24 Martin Koch 

> In my experience, about as fast as you can push the new data :) Depending
> on the size of your records, this should be a matter of seconds.
>
> /Martin Koch
>
> On Wed, Oct 24, 2012 at 9:01 PM, Marcelo Elias Del Valle <
> mvall...@gmail.com
> > wrote:
>
> > Erick,
> >
> >  Thanks for the help, it sure helps a lot to read that, as it gives
> me
> > more confidence I am not crazy about what I am thinking.
> >  The only problem I see by de-normalizing data as you said is that if
> > any relation between customer and vendor changes, I will have to update
> the
> > index for all the vendors. I could have about 10 000 customers per
> vendor.
> >  Anyway, by what you're saying, it's more common than I was
> imagining,
> > right? I wonder how long solr will take to reindex 1 records when
> this
> > happens.
> >
> > Thanks,
> > Marcelo Valle.
> >
> > 2012/10/24 Erick Erickson 
> >
> > > One, take off your RDBMS cap ...
> > >
> > > DB folks regularly reject the idea of de-normalizing data
> > > to make best use of Solr, but that's what I would explore
> > > first. Yes, this repeats the, in your case, vendor information
> > > perhaps many times, but try that first, even though that
> > > causes you to update multiple customers whenever a vendor
> > > changes. You haven't specified how many customers and vendors
> > > you're talking abou there, but unless the total number of documents
> > > (where each document is a customer+vendor combination)
> > > is multiple tens of millions, you probably will be fine.
> > >
> > > You can get a list of just customers by using grouping where you
> > > group on customer, although that may not be the most efficient. You
> > > could index a field, call it "cust_filter" that was set to true for the
> > > first
> > > customer/vendor you indexed and false (or just left out) for all the
> > > rest and q=blahblah&fq=cust_filter:true.
> > >
> > > Hope that helps
> > > Erick
> > >
> > > On Wed, Oct 24, 2012 at 12:01 PM, Marcelo Elias Del Valle
> > >  wrote:
> > > > Hello,
> > > >
> > > > I am new to Solr and I have a scenario where I want to use it,
> but
> > I
> > > > might be misunderstanding some concepts. I will explain what I want
> > here,
> > > > if someone has a solution for this, I would gladly accept the help.
> > > > I have a core indexing customers. I have another core indexing
> > > vendors.
> > > > Both are related to each other.
> > > > Here is what I want to do in my application: I want to find all
> the
> > > > customers that follow some criteria and them find the vendors related
> > to
> > > > them.
> > > >
> > > > My first option was to to have just vendor core and in for each
> > > > document in vendor core I would have all the customers related to it.
> > > > However, I would write the same customer several times to the index,
> as
> > > > more than one vendor could be related to the same customer. Besides,
> I
> > > > wonder how would I write a query to list just the different
> customers.
> > > > Another problem is that I update customers in a different frequency I
> > > > update vendors, but have vendor + customers in a single document
> would
> > > obly
> > > > me to do the full update.
> > > >
> > > > Does anyone have a good solution for this I am not being able to
> > > see? I
> > > > might be missing some basic concept here...
> > > >
> > > > Thanks,
> > > > --
> > > > Marcelo Elias Del Valle
> > > > http://mvalle.com - @mvallebr
> > >
> >
> >
> >
> > --
> > Marcelo Elias Del Valle
> > http://mvalle.com - @mvallebr
> >
>



-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: is it possible to index

2012-10-24 Thread Martin Koch
In my experience, about as fast as you can push the new data :) Depending
on the size of your records, this should be a matter of seconds.

/Martin Koch

On Wed, Oct 24, 2012 at 9:01 PM, Marcelo Elias Del Valle  wrote:

> Erick,
>
>  Thanks for the help, it sure helps a lot to read that, as it gives me
> more confidence I am not crazy about what I am thinking.
>  The only problem I see by de-normalizing data as you said is that if
> any relation between customer and vendor changes, I will have to update the
> index for all the vendors. I could have about 10 000 customers per vendor.
>  Anyway, by what you're saying, it's more common than I was imagining,
> right? I wonder how long solr will take to reindex 1 records when this
> happens.
>
> Thanks,
> Marcelo Valle.
>
> 2012/10/24 Erick Erickson 
>
> > One, take off your RDBMS cap ...
> >
> > DB folks regularly reject the idea of de-normalizing data
> > to make best use of Solr, but that's what I would explore
> > first. Yes, this repeats the, in your case, vendor information
> > perhaps many times, but try that first, even though that
> > causes you to update multiple customers whenever a vendor
> > changes. You haven't specified how many customers and vendors
> > you're talking abou there, but unless the total number of documents
> > (where each document is a customer+vendor combination)
> > is multiple tens of millions, you probably will be fine.
> >
> > You can get a list of just customers by using grouping where you
> > group on customer, although that may not be the most efficient. You
> > could index a field, call it "cust_filter" that was set to true for the
> > first
> > customer/vendor you indexed and false (or just left out) for all the
> > rest and q=blahblah&fq=cust_filter:true.
> >
> > Hope that helps
> > Erick
> >
> > On Wed, Oct 24, 2012 at 12:01 PM, Marcelo Elias Del Valle
> >  wrote:
> > > Hello,
> > >
> > > I am new to Solr and I have a scenario where I want to use it, but
> I
> > > might be misunderstanding some concepts. I will explain what I want
> here,
> > > if someone has a solution for this, I would gladly accept the help.
> > > I have a core indexing customers. I have another core indexing
> > vendors.
> > > Both are related to each other.
> > > Here is what I want to do in my application: I want to find all the
> > > customers that follow some criteria and them find the vendors related
> to
> > > them.
> > >
> > > My first option was to to have just vendor core and in for each
> > > document in vendor core I would have all the customers related to it.
> > > However, I would write the same customer several times to the index, as
> > > more than one vendor could be related to the same customer. Besides, I
> > > wonder how would I write a query to list just the different customers.
> > > Another problem is that I update customers in a different frequency I
> > > update vendors, but have vendor + customers in a single document would
> > obly
> > > me to do the full update.
> > >
> > > Does anyone have a good solution for this I am not being able to
> > see? I
> > > might be missing some basic concept here...
> > >
> > > Thanks,
> > > --
> > > Marcelo Elias Del Valle
> > > http://mvalle.com - @mvallebr
> >
>
>
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>


Re: is it possible to index

2012-10-24 Thread Marcelo Elias Del Valle
Erick,

 Thanks for the help, it sure helps a lot to read that, as it gives me
more confidence that I am not crazy in what I am thinking.
 The only problem I see with de-normalizing data as you said is that if
any relation between a customer and a vendor changes, I will have to update the
index for all the vendors. I could have about 10 000 customers per vendor.
 Anyway, by what you're saying, it's more common than I was imagining,
right? I wonder how long Solr will take to reindex 1 records when this
happens.

Thanks,
Marcelo Valle.

2012/10/24 Erick Erickson 

> One, take off your RDBMS cap ...
>
> DB folks regularly reject the idea of de-normalizing data
> to make best use of Solr, but that's what I would explore
> first. Yes, this repeats the, in your case, vendor information
> perhaps many times, but try that first, even though that
> causes you to update multiple customers whenever a vendor
> changes. You haven't specified how many customers and vendors
> you're talking abou there, but unless the total number of documents
> (where each document is a customer+vendor combination)
> is multiple tens of millions, you probably will be fine.
>
> You can get a list of just customers by using grouping where you
> group on customer, although that may not be the most efficient. You
> could index a field, call it "cust_filter" that was set to true for the
> first
> customer/vendor you indexed and false (or just left out) for all the
> rest and q=blahblah&fq=cust_filter:true.
>
> Hope that helps
> Erick
>
> On Wed, Oct 24, 2012 at 12:01 PM, Marcelo Elias Del Valle
>  wrote:
> > Hello,
> >
> > I am new to Solr and I have a scenario where I want to use it, but I
> > might be misunderstanding some concepts. I will explain what I want here,
> > if someone has a solution for this, I would gladly accept the help.
> > I have a core indexing customers. I have another core indexing
> vendors.
> > Both are related to each other.
> > Here is what I want to do in my application: I want to find all the
> > customers that follow some criteria and them find the vendors related to
> > them.
> >
> > My first option was to to have just vendor core and in for each
> > document in vendor core I would have all the customers related to it.
> > However, I would write the same customer several times to the index, as
> > more than one vendor could be related to the same customer. Besides, I
> > wonder how would I write a query to list just the different customers.
> > Another problem is that I update customers in a different frequency I
> > update vendors, but have vendor + customers in a single document would
> obly
> > me to do the full update.
> >
> > Does anyone have a good solution for this I am not being able to
> see? I
> > might be missing some basic concept here...
> >
> > Thanks,
> > --
> > Marcelo Elias Del Valle
> > http://mvalle.com - @mvallebr
>



-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: SolrCloud - loop in recovery mode

2012-10-24 Thread Mark Miller
On a quick search, I didn't happen to see an open JIRA for this type of thing. 
Could you file one?

- Mark

On Oct 24, 2012, at 11:35 AM, AlexeyK  wrote:

> The situation can be replayed on solr 4 (solrcloud):
> 1. Define the warmup query
> 2. Add spell checker configuration to the /select search handler
> 3. Set spellcheck.collation = true
> 
> The server will get stuck in the init phase due to a deadlock.
> Is there a bug open for this?
> Actually you cannot get collated spell check results together with a query
> result.
> 
> The workaround is one of the following:
> 1. don't use warmup
> 2. don't use collation
> 3. don't define spell check for /select, but define a distinct handler and
> call it specifically
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SolrCloud-loop-in-recovery-mode-tp4015330p4015622.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: is it possible to index

2012-10-24 Thread Erick Erickson
One, take off your RDBMS cap ...

DB folks regularly reject the idea of de-normalizing data
to make best use of Solr, but that's what I would explore
first. Yes, this repeats the, in your case, vendor information
perhaps many times, but try that first, even though that
causes you to update multiple customers whenever a vendor
changes. You haven't specified how many customers and vendors
you're talking about here, but unless the total number of documents
(where each document is a customer+vendor combination)
is multiple tens of millions, you probably will be fine.

You can get a list of just customers by using grouping where you
group on customer, although that may not be the most efficient. You
could index a field, call it "cust_filter" that was set to true for the first
customer/vendor you indexed and false (or just left out) for all the
rest and q=blahblah&fq=cust_filter:true.

Hope that helps
Erick
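
A sketch of the two query shapes described above, with placeholder field names — grouping to collapse results to one group per customer, or the flag-field trick to keep a single document per customer:

# one group per distinct customer
q=...&group=true&group.field=customer_id&group.ngroups=true

# or keep only the "representative" customer/vendor doc via the flag field
q=...&fq=cust_filter:true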

On Wed, Oct 24, 2012 at 12:01 PM, Marcelo Elias Del Valle
 wrote:
> Hello,
>
> I am new to Solr and I have a scenario where I want to use it, but I
> might be misunderstanding some concepts. I will explain what I want here,
> if someone has a solution for this, I would gladly accept the help.
> I have a core indexing customers. I have another core indexing vendors.
> Both are related to each other.
> Here is what I want to do in my application: I want to find all the
> customers that follow some criteria and then find the vendors related to
> them.
>
> My first option was to have just the vendor core, and for each
> document in the vendor core I would have all the customers related to it.
> However, I would write the same customer several times to the index, as
> more than one vendor could be related to the same customer. Besides, I
> wonder how I would write a query to list just the different customers.
> Another problem is that I update customers at a different frequency than I
> update vendors, but having vendor + customers in a single document would oblige
> me to do a full update.
>
> Does anyone have a good solution for this that I am not able to see? I
> might be missing some basic concept here...
>
> Thanks,
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr


Re: Improving performance for use-case where large (200) number of phrase queries are used?

2012-10-24 Thread Aaron Daubman
Thanks for the ideas - some followup questions in-line below:


> * use shingles e.g. to turn two-word phrases into single terms (how
> long is your average phrase?).

Would this be different than what I was calling "common grams"? (other
than shingling every two words, rather than just common ones?)


> * in addition to the above, maybe for phrases with > 2 terms, consider
> just a boolean conjunction of the shingled phrases instead of a "real"
> phrase query: e.g. "more like this" -> (more_like AND like_this). This
> would have some false positives.

This would definitely help, but, IIRC, we moved to phrase queries due
to too many false positives; it would be an interesting experiment to
see how many false positives were left when shingling and then just
doing conjunctive queries.


> * use a more aggressive stopwords list for your "MorePhrasesLikeThis".
> * reduce this number 200, and instead work harder to prune out which
> phrases are the "most descriptive" from the seed document, e.g. based
> on some heuristics like their frequency or location within that seed
> document, so your query isnt so massive.

This is something I've been asking for (perform some sort of PCA /
feature selection on the actual terms used) but is of questionable
value and hard to do "right" so hasn't happened yet (it's not clear
that there will be terms that are very common that are not also very
descriptive, so the extent to which this would help is unknown).

Thanks again for the ideas!
 Aaron


Re: Improving performance for use-case where large (200) number of phrase queries are used?

2012-10-24 Thread Peter Keegan
Could you index your 'phrase tags' as single tokens? Then your phrase
queries become simple TermQuerys.

On Wed, Oct 24, 2012 at 12:26 PM, Robert Muir  wrote:

> On Wed, Oct 24, 2012 at 11:09 AM, Aaron Daubman  wrote:
> > Greetings,
> >
> > We have a solr instance in use that gets some perhaps atypical queries
> > and suffers from poor (>2 second) QTimes.
> >
> > Documents (~2,350,000) in this instance are mainly comprised of
> > various "descriptive fields", such as multi-word (phrase) tags - an
> > average document contains 200-400 phrases like this across several
> > different multi-valued field types.
> >
> > A custom QueryComponent has been built that functions somewhat like a
> > very specific MoreLikeThis. A seed document is specified via the
> > incoming query, its terms are retrieved, boosted both by query
> > parameters as well as fields within the document that specify term
> > weighting, sorted by this custom boosting, and then a second query is
> > crafted by taking the top 200 (sorted by the custom boosting)
> > resulting field values paired with their fields and searching for
> > documents matching these 200 values.
>
> a few more ideas:
> * use shingles e.g. to turn two-word phrases into single terms (how
> long is your average phrase?).
> * in addition to the above, maybe for phrases with > 2 terms, consider
> just a boolean conjunction of the shingled phrases instead of a "real"
> phrase query: e.g. "more like this" -> (more_like AND like_this). This
> would have some false positives.
> * use a more aggressive stopwords list for your "MorePhrasesLikeThis".
> * reduce this number 200, and instead work harder to prune out which
> phrases are the "most descriptive" from the seed document, e.g. based
> on some heuristics like their frequency or location within that seed
> document, so your query isnt so massive.
>


Re: solr 4.0 missing SolrPluginUtils addOrReplaceResults

2012-10-24 Thread varun srivastava
Hi Solr-Users,
 Does anyone have a workaround for SolrPluginUtils.addOrReplaceResults in Solr
4.0? It should be easy to migrate the code from the 3.6 branch to the
4.0 SolrPluginUtils. Is there any specific reason why this method was
dropped in 4.0?

Thanks
Varun

On Tue, Oct 23, 2012 at 11:14 AM, varun srivastava
wrote:

> Hi,
>  What is the replacement for SolrPluginUtils.addOrReplaceResults in solr
> 4.0 ?
>
> Thanks
> Varun
>


Re: Monitor Deleted Event

2012-10-24 Thread Amit Nithian
Since Lucene is a library, there isn't much support for this, since
in theory the client application issuing the delete could also then do
something else upon delete. Solr, on the other hand, being a layer (a
server layer) sitting on top of Lucene, is a natural place for such hooks
to be configured.

Since here you can intercept the delete event, you can do what you
wish with it (i.e. in your case maybe send a notification event to
another Solr server to add a record).
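
For concreteness, a minimal sketch of such a hook as an UpdateRequestProcessor that intercepts deletes; the class name and the action taken inside processDelete are illustrative, not something from this thread:

import java.io.IOException;

import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.DeleteUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class DeleteNotifierFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processDelete(DeleteUpdateCommand cmd) throws IOException {
        // React to the delete here, e.g. notify another Solr server
        // (cmd.getId() for delete-by-id, cmd.getQuery() for delete-by-query).
        super.processDelete(cmd);
      }
    };
  }
}

The factory would then be referenced from an updateRequestProcessorChain in solrconfig.xml so Solr invokes it on every delete.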

On Wed, Oct 24, 2012 at 9:19 AM, jefferyyuan  wrote:
> Thanks very much :)
>
> This is what I am looking for.
> And I also wonder whether this some thing as DeleteEvent in Solr or Lucene?
>
> Is there a way to do this in Lucene? - Not familiar with Lucene yet :)
> As I may choose to do this in lower level...
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Monitor-Deleted-Event-tp4015624p4015641.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Improving performance for use-case where large (200) number of phrase queries are used?

2012-10-24 Thread Robert Muir
On Wed, Oct 24, 2012 at 11:09 AM, Aaron Daubman  wrote:
> Greetings,
>
> We have a solr instance in use that gets some perhaps atypical queries
> and suffers from poor (>2 second) QTimes.
>
> Documents (~2,350,000) in this instance are mainly comprised of
> various "descriptive fields", such as multi-word (phrase) tags - an
> average document contains 200-400 phrases like this across several
> different multi-valued field types.
>
> A custom QueryComponent has been built that functions somewhat like a
> very specific MoreLikeThis. A seed document is specified via the
> incoming query, its terms are retrieved, boosted both by query
> parameters as well as fields within the document that specify term
> weighting, sorted by this custom boosting, and then a second query is
> crafted by taking the top 200 (sorted by the custom boosting)
> resulting field values paired with their fields and searching for
> documents matching these 200 values.

a few more ideas:
* use shingles e.g. to turn two-word phrases into single terms (how
long is your average phrase?).
* in addition to the above, maybe for phrases with > 2 terms, consider
just a boolean conjunction of the shingled phrases instead of a "real"
phrase query: e.g. "more like this" -> (more_like AND like_this). This
would have some false positives.
* use a more aggressive stopwords list for your "MorePhrasesLikeThis".
* reduce this number 200, and instead work harder to prune out which
phrases are the "most descriptive" from the seed document, e.g. based
on some heuristics like their frequency or location within that seed
document, so your query isn't so massive.
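
A schema.xml sketch of the shingling idea; the type name and analyzer chain are illustrative, not from this thread:

<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- pairs of adjacent words become single terms; keep outputUnigrams="true"
         if single-word queries must still match -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
  </analyzer>
</fieldType>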


Re: SolrJ CloudSolrServer throws ClassCastException

2012-10-24 Thread Steve Rowe
Hi Kevin,

Solrj 4.0.0 is on Maven Central now, and has been since Oct. 11th: 



Steve
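
For reference, the corresponding Maven dependency snippet (standard coordinates for SolrJ, version per the note above):

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-solrj</artifactId>
  <version>4.0.0</version>
</dependency>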
 
On Oct 24, 2012, at 11:21 AM, Kevin Osborn  wrote:

> Thanks for that idea. The problem was that my Solr server was on 4.0.0, but
> the latest version for SolrJ on Maven is 4.0.0-Beta. I downgraded my server
> to 4.0.0-beta and it worked.
> 
> -Kevin
> 
> On Wed, Oct 24, 2012 at 6:03 AM, Mark Miller  wrote:
> 
>> Did up upgrade your Solr instance from the beta or alpha to 4 at some
>> point?
>> 
>> - Mark
>> 
>> On Wed, Oct 24, 2012 at 1:14 AM, Kevin Osborn 
>> wrote:
>>> It looks like this is where the problem lies. Here is the JSON that SolrJ
>>> is receiving from Zookeeper:
>>> 
>>> "data":"{\\"manufacturer\\":{\\n\\"shard1\\":{\\n
>>> \\"range\\":\\"8000-\\",\\n
>>> \\"replicas\\":{\\"myhost:5270_solr_manufacturer\\":{\\n
>>> \\"shard\\":\\"shard1\\",\\n  \\"roles\\":null,\\n
>>> \\"state\\":\\"active\\",\\n  \\"core\\":\\"manufacturer\\",\\n
>>>   \\"collection\\":\\"manufacturer\\",\\n
>>> \\"node_name\\":\\"phx2-ccs-apl-dev-wax1.cnet.com:5270_solr\\",\\n
>>> \\"base_url\\":\\"http://myhost:5270/solr\\",\\n
>>> \\"leader\\":\\"true\\"}}},\\n\\"shard2\\":{\\n
>>> \\"range\\":\\"0-7fff\\",\\n
>>> \\"replicas\\":{\\"myhost:5275_solr_manufacturer\\":{\\n
>>> \\"shard\\":\\"shard2\\",\\n  \\"roles\\":null,\\n
>>> \\"state\\":\\"active\\",\\n  \\"core\\":\\"manufacturer\\",\\n
>>>   \\"collection\\":\\"manufacturer\\",\\n
>>> \\"node_name\\":\\"myhost:5275_solr\\",\\n  \\"base_url\\":\\"
>>> http://myhost:5275/solr\\",\\n
>>> \\"leader\\":\\"true\\"}"}},{"data":{
>>> 
>>> Where SolrJ is expecting the shard Name, it is actually getting "range"
>> as
>>> the shard name and "8000-" as the value. Any ideas? Did I
>>> configure something wrong?
>>> 
>>> 
>>> On Tue, Oct 23, 2012 at 5:17 PM, Kevin Osborn 
>> wrote:
>>> 
 I am getting a ClassCastException when i call Solr. My code is pretty
 simple.
 
 SolrServer mySolrServer = new CloudSolrServer(zookeeperHost);
 ((CloudSolrServer)mySolrServer).setDefaultCollection("manufacturer")
 ((CloudSolrServer)mySolrServer).connect()
 
 
 The actual error is thrown on line 300 of ClusterState.java:
 new ZkNodeProps(sliceMap.get(shardName))
 
 It is trying to convert a String to a Map which causes the
 ClassCastException.
 
 My zookeepHost string is simply  "myHost:6200". My SolrCloud has 2
>> shards
 over a single collection. And two instances are running. I also tried an
 external Zookeeper with the same results.
 
 
 --
 *KEVIN OSBORN*
 LEAD SOFTWARE ENGINEER
 CNET Content Solutions
 OFFICE 949.399.8714
 CELL 949.310.4677  SKYPE osbornk
 5 Park Plaza, Suite 600, Irvine, CA 92614
 [image: CNET Content Solutions]
 
 
 
>>> 
>>> 
>>> --
>>> *KEVIN OSBORN*
>>> LEAD SOFTWARE ENGINEER
>>> CNET Content Solutions
>>> OFFICE 949.399.8714
>>> CELL 949.310.4677  SKYPE osbornk
>>> 5 Park Plaza, Suite 600, Irvine, CA 92614
>>> [image: CNET Content Solutions]
>> 
>> 
>> 
>> --
>> - Mark
>> 
> 
> 
> 
> -- 
> *KEVIN OSBORN*
> LEAD SOFTWARE ENGINEER
> CNET Content Solutions
> OFFICE 949.399.8714
> CELL 949.310.4677  SKYPE osbornk
> 5 Park Plaza, Suite 600, Irvine, CA 92614
> [image: CNET Content Solutions]



Re: SolrJ CloudSolrServer throws ClassCastException

2012-10-24 Thread Jack Krupansky
Maven Central looks up to date now for SolrJ, with all three of 4.0.0-ALPHA, 
4.0.0-BETA, and 4.0.0.


The latter is dated 11-Oct-2012.

See:
http://search.maven.org/#browse%7C1147257723
http://search.maven.org/#browse%7C-591611598

-- Jack Krupansky

-Original Message- 
From: Kevin Osborn

Sent: Wednesday, October 24, 2012 11:21 AM
To: solr-user@lucene.apache.org ; markrmil...@gmail.com
Subject: Re: SolrJ CloudSolrServer throws ClassCastException

Thanks for that idea. The problem was that my Solr server was on 4.0.0, but
the latest version for SolrJ on Maven is 4.0.0-Beta. I downgraded my server
to 4.0.0-beta and it worked.

-Kevin

On Wed, Oct 24, 2012 at 6:03 AM, Mark Miller  wrote:


Did up upgrade your Solr instance from the beta or alpha to 4 at some
point?

- Mark

On Wed, Oct 24, 2012 at 1:14 AM, Kevin Osborn 
wrote:
> It looks like this is where the problem lies. Here is the JSON that 
> SolrJ

> is receiving from Zookeeper:
>
> "data":"{\\"manufacturer\\":{\\n\\"shard1\\":{\\n
>  \\"range\\":\\"8000-\\",\\n
>  \\"replicas\\":{\\"myhost:5270_solr_manufacturer\\":{\\n
>  \\"shard\\":\\"shard1\\",\\n  \\"roles\\":null,\\n
>  \\"state\\":\\"active\\",\\n  \\"core\\":\\"manufacturer\\",\\n
>\\"collection\\":\\"manufacturer\\",\\n
>  \\"node_name\\":\\"phx2-ccs-apl-dev-wax1.cnet.com:5270_solr\\",\\n
>  \\"base_url\\":\\"http://myhost:5270/solr\\",\\n
>  \\"leader\\":\\"true\\"}}},\\n\\"shard2\\":{\\n
>  \\"range\\":\\"0-7fff\\",\\n
>  \\"replicas\\":{\\"myhost:5275_solr_manufacturer\\":{\\n
>  \\"shard\\":\\"shard2\\",\\n  \\"roles\\":null,\\n
>  \\"state\\":\\"active\\",\\n  \\"core\\":\\"manufacturer\\",\\n
>\\"collection\\":\\"manufacturer\\",\\n
>  \\"node_name\\":\\"myhost:5275_solr\\",\\n  \\"base_url\\":\\"
> http://myhost:5275/solr\\",\\n
>  \\"leader\\":\\"true\\"}"}},{"data":{
>
> Where SolrJ is expecting the shard Name, it is actually getting "range"
as
> the shard name and "8000-" as the value. Any ideas? Did I
> configure something wrong?
>
>
> On Tue, Oct 23, 2012 at 5:17 PM, Kevin Osborn 
wrote:
>
>> I am getting a ClassCastException when i call Solr. My code is pretty
>> simple.
>>
>> SolrServer mySolrServer = new CloudSolrServer(zookeeperHost);
>> ((CloudSolrServer)mySolrServer).setDefaultCollection("manufacturer")
>> ((CloudSolrServer)mySolrServer).connect()
>>
>>
>> The actual error is thrown on line 300 of ClusterState.java:
>> new ZkNodeProps(sliceMap.get(shardName))
>>
>> It is trying to convert a String to a Map which causes the
>> ClassCastException.
>>
>> My zookeepHost string is simply  "myHost:6200". My SolrCloud has 2
shards
>> over a single collection. And two instances are running. I also tried 
>> an

>> external Zookeeper with the same results.
>>
>>
>> --
>> *KEVIN OSBORN*
>> LEAD SOFTWARE ENGINEER
>> CNET Content Solutions
>> OFFICE 949.399.8714
>> CELL 949.310.4677  SKYPE osbornk
>> 5 Park Plaza, Suite 600, Irvine, CA 92614
>> [image: CNET Content Solutions]
>>
>>
>>
>
>
> --
> *KEVIN OSBORN*
> LEAD SOFTWARE ENGINEER
> CNET Content Solutions
> OFFICE 949.399.8714
> CELL 949.310.4677  SKYPE osbornk
> 5 Park Plaza, Suite 600, Irvine, CA 92614
> [image: CNET Content Solutions]



--
- Mark





--
*KEVIN OSBORN*
LEAD SOFTWARE ENGINEER
CNET Content Solutions
OFFICE 949.399.8714
CELL 949.310.4677  SKYPE osbornk
5 Park Plaza, Suite 600, Irvine, CA 92614
[image: CNET Content Solutions] 



Re: How to import a part of index from main Solr server(based on a query) to another Solr server and then do incremental import at intervals later(the updated index)?

2012-10-24 Thread jefferyyuan
Hi, all:

Sorry for the late response: )
Thanks for your reply.

I think Solr replication may not help in my case: the central server
stores all docs for all users (1000+), and on each client I only want to
copy the index of that user's docs created or changed in the last 2 weeks
(for example). After the first import, I make a delta-import each day to
pull the changed or deleted docs from the remote central server.

In my current implementation, I use DataImportHandler and
SolrEntityProcessor. In short:

I wrote a new request handler, ImportLocalCacheHandler, at the url /importcache.
For the first import, I call
/importcachequery?command=full-import&from:jeffery&first_index_time={first_index_time}
In my ImportLocalCacheHandler, I build a query such as
query=from:jeffery&last_modified:{first_index_time TO NOW}, and then call
/dataimport?command=full-import&query:{previous_query}.
After it succeeds, I save last_index_time to a property file.

For the delta-import, I call /importcachequery?command=delta-import
In my ImportLocalCacheHandler, I build a query like
"from:jeffery&last_modified:{last_index_time TO NOW}"
and call /dataimport?command=full-import&clean=false&query={previous_query}.

This imports the docs created or changed between last_index_time and
NOW.

But now I am trying to figure out how to remove docs from the local cache
server that have already been deleted on the remote central server but still
exist in the local cache.
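
One approach I'm considering is to compare ids between the two servers and
delete locally whatever no longer exists remotely. A rough SolrJ sketch
(untested; the URLs, the "from" field and the "id" field are just placeholders
for my setup):

import java.util.HashSet;
import java.util.Set;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class PurgeDeletedDocs {
    public static void main(String[] args) throws Exception {
        SolrServer remote = new HttpSolrServer("http://central:8983/solr/docs");
        SolrServer local = new HttpSolrServer("http://localhost:8983/solr/cache");

        // collect the ids that still exist on the central server for this user
        SolrQuery q = new SolrQuery("from:jeffery");
        q.setFields("id");
        q.setRows(Integer.MAX_VALUE); // acceptable for a small per-user doc count
        Set<String> remoteIds = new HashSet<String>();
        for (SolrDocument d : remote.query(q).getResults()) {
            remoteIds.add((String) d.getFieldValue("id"));
        }

        // delete every locally cached doc whose id no longer exists remotely
        for (SolrDocument d : local.query(q).getResults()) {
            String id = (String) d.getFieldValue("id");
            if (!remoteIds.contains(id)) {
                local.deleteById(id);
            }
        }
        local.commit();
    }
}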



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-import-a-part-of-index-from-main-Solr-server-based-on-a-query-to-another-Solr-server-and-then-tp4013479p4015633.html
Sent from the Solr - User mailing list archive at Nabble.com.


is it possible to index

2012-10-24 Thread Marcelo Elias Del Valle
Hello,

I am new to Solr and I have a scenario where I want to use it, but I
might be misunderstanding some concepts. I will explain what I want here;
if someone has a solution for this, I would gladly accept the help.
I have a core indexing customers. I have another core indexing vendors.
Both are related to each other.
Here is what I want to do in my application: I want to find all the
customers that match some criteria and then find the vendors related to
them.

My first option was to have just the vendor core, and for each
document in the vendor core I would store all the customers related to it.
However, I would write the same customer several times to the index, as
more than one vendor could be related to the same customer. Besides, I
wonder how I would write a query to list just the distinct customers.
Another problem is that I update customers at a different frequency than I
update vendors, but having vendor + customers in a single document would
oblige me to do a full update.

Does anyone have a good solution for this that I am not able to see? I
might be missing some basic concept here...

Thanks,
-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: Monitor Deleted Event

2012-10-24 Thread Amit Nithian
I'm not 100% sure about this, but it looks like update processors may help:
http://wiki.apache.org/solr/UpdateRequestProcessor

It looks like you can plug in custom code to execute when certain
actions happen, so it sounds like this is what you are looking for.
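
I haven't tested this, but something along these lines might work (a sketch
against the Solr 4.x update-processor API; the factory would still have to be
registered in an updateRequestProcessorChain in solrconfig.xml, and the class
name is just a placeholder):

import java.io.IOException;

import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.DeleteUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class LogDeletesProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
            SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processDelete(DeleteUpdateCommand cmd) throws IOException {
                // cmd.getId() is set for delete-by-id, cmd.getQuery() for delete-by-query;
                // record {contentid, deletedAt} in the other Solr server or database here
                System.out.println("deleted: " + cmd.getId() + " / " + cmd.getQuery()
                        + " at " + System.currentTimeMillis());
                super.processDelete(cmd);
            }
        };
    }
}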

Cheers
Amit

On Wed, Oct 24, 2012 at 8:43 AM, jefferyyuan  wrote:
> When some docs are deleted from Solr server, I want to execute some code -
> for example, add an record such as {contentid, deletedat} to another solr
> server or database.
>
> How can I do this through Solr or Lucene?
>
> Thanks for any reply and help :)
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Monitor-Deleted-Event-tp4015624.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr - Use Regex in the Query Phrase

2012-10-24 Thread Daisy
Ok, now I have apache-solr-4.0.0 on Windows, and I am able to use the
ComplexPhraseQuery plugin as mentioned above.
So I can search for, for example, "art(.*?)le" or "he sa*". Thanks for all
the help.

What if I want to search a phrase like that: "he (.*?) that"
in sentences like:

he said that
he is always saying that
he mentioned before that
...etc

Also, I would like the phrase to be considered as one item for the phrase
frequency, i.e. I want to tell Solr not to consider the tf of "he" alone and the
tf of "that" alone, but rather the frequency of the phrase "he (.*?) that".
Maybe that is not a feature in Solr.
Anyway, how could I execute a query like the one above?





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Use-Regex-in-the-Query-Phrase-tp4015335p4015628.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: help with solritas config

2012-10-24 Thread Marcelo Elias Del Valle
Hello again,

Sorry, I took some time to process everything... I looked at some more
documentation and realized I am confusing documents with cores. Actually, I
was expecting to be able to have a USER core and a CITY core and be able to
relate them somehow. Thanks for the clarification. I will write another
email with my current doubts, but I think I understand the joins now; they
just were not what I was expecting. Maybe I am misunderstanding some
concepts...


Best regards,
Marcelo.

2012/10/24 Marcelo Elias Del Valle 

> Shawn,
>
> First of all, thanks a lot for your answer, it was very useful.
> By the content of your email, it seemed to me the /browser is just
> something as a solr admin interface, so now I am confused. I am already
> using SOLRJ in my application and I am currently able to perform a query
> like follows, for example:
>  http://localhost:8983/solr/user/select?q=*%3A*&wt=xml
>
>  However, I want to use JOINs. I was trying to use the /browse handler
> because I was following the quick start on this page:
>  http://wiki.apache.org/solr/Join
>  In this wiki page, I couldn't see an example of query joining two
> documents.
>
>  If I understood correctly, I don't need /browse, right? I could
> perform a query joining user document with city document, for instance,
> without relying on /browse?
>  Do I need to configure anything to be able to use joins? Or the
> plugin comes already installed in solr 4?
>  Sorry for the amount of questions. ;-)
>
> Thanks,
> Marcelo Valle.
>
>
> 2012/10/24 Shawn Heisey 
>
>> On 10/24/2012 8:05 AM, Marcelo Elias Del Valle wrote:
>>
>>> I saw there is some documentation in solr wiki for SearchHandler and
>>> VelocityResponseWriter, which I am trying to digest. However, I saw there
>>> are some configuration fields that aren't there, like this QF field. I am
>>> not sure on how to customize... Should I use only my custom fields there?
>>>
>>
>> Marcelo, the /browse handler that comes with the Solr example is just
>> that -- an example.  It's not intended for production use without a lot of
>> customization, and definitely not intended to be directly available to
>> 'regular' people or the Internet.  I'm not saying it can't be a useful
>> tool, but nothing in Solr is hardened against abuse, so it should not be
>> directly exposed to attack.
>>
>> Also, the /browse handler configuration is highly tied in with the
>> schema.xml in the example.  If you change the schema, you'll probably have
>> to also perform surgery on the SolrItas config, which will likely require
>> an understanding of Velocity.  For real help with Velocity, you'd need to
>> consult other resources.  Here's some stuff that I was able to find:
>>
>> http://velocity.apache.org/**engine/releases/velocity-1.5/**overview.html
>> http://velocity.apache.org/**engine/releases/velocity-1.5/**
>> user-guide.html
>> http://velocity.apache.org/**engine/releases/velocity-1.5/**
>> developer-guide.html
>>
>> If you choose to customize the Velocity config and have questions about
>> the Solr pieces of the puzzle, then this list can probably give you the
>> answers you need.
>>
>> Generally speaking, rather than use /browse, you'll want to access Solr
>> directly from whatever application you have that faces your users, either
>> using constructed URLs and a standard http library, or a solr-specific
>> library that gives you the tools to tell it what you want and handles the
>> URL construction for you.  For Java, that would be SolrJ.  There are also
>> solr libraries for other languages.
>>
>> Thanks,
>> Shawn
>>
>>
>
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>



-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Monitor Deleted Event

2012-10-24 Thread jefferyyuan
When some docs are deleted from the Solr server, I want to execute some code -
for example, add a record such as {contentid, deletedat} to another Solr
server or database.

How can I do this through Solr or Lucene?

Thanks for any reply and help :)



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Monitor-Deleted-Event-tp4015624.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud - loop in recovery mode

2012-10-24 Thread AlexeyK
The situation can be reproduced on Solr 4 (SolrCloud):
1. Define the warmup query
2. Add spell checker configuration to the /select search handler
3. Set spellcheck.collation = true

The server will get stuck in the init phase due to a deadlock.
Is there a bug open for this?
Actually you cannot get collated spell check results together with a query
result.

The workaround is one of the following:
1. don't use warmup
2. don't use collation
3. don't define spell check for /select, but define a distinct handler and
call it specifically



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-loop-in-recovery-mode-tp4015330p4015622.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: help with solritas config

2012-10-24 Thread Marcelo Elias Del Valle
Shawn,

First of all, thanks a lot for your answer, it was very useful.
By the content of your email, it seemed to me that /browse is just
something like a Solr admin interface, so now I am confused. I am already
using SolrJ in my application and I am currently able to perform a query
like the following, for example:
 http://localhost:8983/solr/user/select?q=*%3A*&wt=xml

 However, I want to use JOINs. I was trying to use the /browse handler
because I was following the quick start on this page:
 http://wiki.apache.org/solr/Join
 On that wiki page, I couldn't see an example of a query joining two
documents.

 If I understood correctly, I don't need /browse, right? I could
perform a query joining the user document with the city document, for instance,
without relying on /browse?
 Do I need to configure anything to be able to use joins? Or does the
plugin come already installed in Solr 4?
 Sorry for the amount of questions. ;-)
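
 From what I could piece together, the join syntax would look roughly like
this (a guess on my part; "user", "city_id", "id" and "population" are made-up
names, and it assumes both document types are indexed in the same core):

 generic form:
   q={!join from=<fieldInFromDocs> to=<fieldInReturnedDocs>}<query over the "from" docs>
 for example:
   http://localhost:8983/solr/user/select?q={!join from=id to=city_id}population:[100000 TO *]
 which should return the user documents whose city_id matches the id of a
 city document with population >= 100000.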

Thanks,
Marcelo Valle.

2012/10/24 Shawn Heisey 

> On 10/24/2012 8:05 AM, Marcelo Elias Del Valle wrote:
>
>> I saw there is some documentation in solr wiki for SearchHandler and
>> VelocityResponseWriter, which I am trying to digest. However, I saw there
>> are some configuration fields that aren't there, like this QF field. I am
>> not sure on how to customize... Should I use only my custom fields there?
>>
>
> Marcelo, the /browse handler that comes with the Solr example is just that
> -- an example.  It's not intended for production use without a lot of
> customization, and definitely not intended to be directly available to
> 'regular' people or the Internet.  I'm not saying it can't be a useful
> tool, but nothing in Solr is hardened against abuse, so it should not be
> directly exposed to attack.
>
> Also, the /browse handler configuration is highly tied in with the
> schema.xml in the example.  If you change the schema, you'll probably have
> to also perform surgery on the SolrItas config, which will likely require
> an understanding of Velocity.  For real help with Velocity, you'd need to
> consult other resources.  Here's some stuff that I was able to find:
>
> http://velocity.apache.org/**engine/releases/velocity-1.5/**overview.html
> http://velocity.apache.org/**engine/releases/velocity-1.5/**
> user-guide.html
> http://velocity.apache.org/**engine/releases/velocity-1.5/**
> developer-guide.html
>
> If you choose to customize the Velocity config and have questions about
> the Solr pieces of the puzzle, then this list can probably give you the
> answers you need.
>
> Generally speaking, rather than use /browse, you'll want to access Solr
> directly from whatever application you have that faces your users, either
> using constructed URLs and a standard http library, or a solr-specific
> library that gives you the tools to tell it what you want and handles the
> URL construction for you.  For Java, that would be SolrJ.  There are also
> solr libraries for other languages.
>
> Thanks,
> Shawn
>
>


-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


SolrJ and clustering with Carrot

2012-10-24 Thread DanP
Hi, 

At the moment, I run a Solr query in my Java program, using SolrJ, and get a 
QueryResponse object.

But now I've just started using Carrot to do results clustering when I run a 
search in Solr, and although I can see that the clusters are now part of the
response, the QueryResponse class doesn't have any methods for me to 
get those clusters. 

How do people use clustering with SolrJ? Am I missing something? Should the
QueryResponse be modified/extended?
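
For now I'm pulling them out of the raw NamedList myself, roughly like this
(a sketch that assumes the clustering output sits under a top-level "clusters"
element; the exact value types are a guess on my part):

import java.util.List;

import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.NamedList;

public class ClusterExtractor {
    @SuppressWarnings("unchecked")
    public static void printClusters(QueryResponse response) {
        NamedList<Object> raw = response.getResponse();
        // "clusters" appears to be the section name the clustering component uses
        List<NamedList<Object>> clusters = (List<NamedList<Object>>) raw.get("clusters");
        if (clusters == null) {
            return;
        }
        for (NamedList<Object> cluster : clusters) {
            Object labels = cluster.get("labels"); // cluster label(s)
            Object docs = cluster.get("docs");     // ids of documents assigned to this cluster
            System.out.println(labels + " -> " + docs);
        }
    }
}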

Thanks, 

DanP





--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrJ-and-clustering-with-Carrot-tp4015618.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrJ CloudSolrServer throws ClassCastException

2012-10-24 Thread Kevin Osborn
Thanks for that idea. The problem was that my Solr server was on 4.0.0, but
the latest version for SolrJ on Maven is 4.0.0-Beta. I downgraded my server
to 4.0.0-beta and it worked.

-Kevin

On Wed, Oct 24, 2012 at 6:03 AM, Mark Miller  wrote:

> Did up upgrade your Solr instance from the beta or alpha to 4 at some
> point?
>
> - Mark
>
> On Wed, Oct 24, 2012 at 1:14 AM, Kevin Osborn 
> wrote:
> > It looks like this is where the problem lies. Here is the JSON that SolrJ
> > is receiving from Zookeeper:
> >
> > "data":"{\\"manufacturer\\":{\\n\\"shard1\\":{\\n
> >  \\"range\\":\\"8000-\\",\\n
> >  \\"replicas\\":{\\"myhost:5270_solr_manufacturer\\":{\\n
> >  \\"shard\\":\\"shard1\\",\\n  \\"roles\\":null,\\n
> >  \\"state\\":\\"active\\",\\n  \\"core\\":\\"manufacturer\\",\\n
> >\\"collection\\":\\"manufacturer\\",\\n
> >  \\"node_name\\":\\"phx2-ccs-apl-dev-wax1.cnet.com:5270_solr\\",\\n
> >  \\"base_url\\":\\"http://myhost:5270/solr\\",\\n
> >  \\"leader\\":\\"true\\"}}},\\n\\"shard2\\":{\\n
> >  \\"range\\":\\"0-7fff\\",\\n
> >  \\"replicas\\":{\\"myhost:5275_solr_manufacturer\\":{\\n
> >  \\"shard\\":\\"shard2\\",\\n  \\"roles\\":null,\\n
> >  \\"state\\":\\"active\\",\\n  \\"core\\":\\"manufacturer\\",\\n
> >\\"collection\\":\\"manufacturer\\",\\n
> >  \\"node_name\\":\\"myhost:5275_solr\\",\\n  \\"base_url\\":\\"
> > http://myhost:5275/solr\\",\\n
> >  \\"leader\\":\\"true\\"}"}},{"data":{
> >
> > Where SolrJ is expecting the shard Name, it is actually getting "range"
> as
> > the shard name and "8000-" as the value. Any ideas? Did I
> > configure something wrong?
> >
> >
> > On Tue, Oct 23, 2012 at 5:17 PM, Kevin Osborn 
> wrote:
> >
> >> I am getting a ClassCastException when i call Solr. My code is pretty
> >> simple.
> >>
> >> SolrServer mySolrServer = new CloudSolrServer(zookeeperHost);
> >> ((CloudSolrServer)mySolrServer).setDefaultCollection("manufacturer")
> >> ((CloudSolrServer)mySolrServer).connect()
> >>
> >>
> >> The actual error is thrown on line 300 of ClusterState.java:
> >> new ZkNodeProps(sliceMap.get(shardName))
> >>
> >> It is trying to convert a String to a Map which causes the
> >> ClassCastException.
> >>
> >> My zookeepHost string is simply  "myHost:6200". My SolrCloud has 2
> shards
> >> over a single collection. And two instances are running. I also tried an
> >> external Zookeeper with the same results.
> >>
> >>
> >> --
> >> *KEVIN OSBORN*
> >> LEAD SOFTWARE ENGINEER
> >> CNET Content Solutions
> >> OFFICE 949.399.8714
> >> CELL 949.310.4677  SKYPE osbornk
> >> 5 Park Plaza, Suite 600, Irvine, CA 92614
> >> [image: CNET Content Solutions]
> >>
> >>
> >>
> >
> >
> > --
> > *KEVIN OSBORN*
> > LEAD SOFTWARE ENGINEER
> > CNET Content Solutions
> > OFFICE 949.399.8714
> > CELL 949.310.4677  SKYPE osbornk
> > 5 Park Plaza, Suite 600, Irvine, CA 92614
> > [image: CNET Content Solutions]
>
>
>
> --
> - Mark
>



-- 
*KEVIN OSBORN*
LEAD SOFTWARE ENGINEER
CNET Content Solutions
OFFICE 949.399.8714
CELL 949.310.4677  SKYPE osbornk
5 Park Plaza, Suite 600, Irvine, CA 92614
[image: CNET Content Solutions]


Re: Search field affecting relevance

2012-10-24 Thread Jack Krupansky

First, documents are not matched "by field", but "by field value".

So, make sure q.op=OR, mm=0%, and query:

+field1:x field2:x^20

This means that "x" MUST be present in field1 of each document, but IF "x" 
happens to be present in field2 of the same document the score will be 
boosted (by however much you want.) But "x" is NOT required to be present in 
field2 of a document.


You could also specify field2 separately in a boost query:

q=field1:x&bq=field2:x^20
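
With SolrJ, that second form might look roughly like this (just a sketch; the
field names and the term "x" are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BoostQueryExample {
    public static QueryResponse search(SolrServer server) throws SolrServerException {
        SolrQuery query = new SolrQuery();
        query.set("defType", "edismax");
        query.set("q", "field1:x");      // x must match field1
        query.set("bq", "field2:x^20");  // a match in field2 only boosts, it is not required
        return server.query(query);
    }
}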

-- Jack Krupansky

-Original Message- 
From: Maxim Kuleshov

Sent: Wednesday, October 24, 2012 4:38 AM
To: solr-user@lucene.apache.org
Subject: Search field affecting relevance

Hi,

For example, we have documents with two fields - field1 and field2.
Both fields are indexed and both are used in search.

Is there a way to return only documents that are matched by field1, but
taking into account that if field2 is also matched the relevance should be
higher? In other words, if document "A" is matched by field1 and
field2, its relevance should be higher than document "B" matched only
by field1, and document "C" where only field2 is matched should not be
returned at all.

Could you please help with outlining the general approach to
achieve this? Is it a core Lucene feature, Solr post-processing
logic, or something else? 



Improving performance for use-case where large (200) number of phrase queries are used?

2012-10-24 Thread Aaron Daubman
Greetings,

We have a solr instance in use that gets some perhaps atypical queries
and suffers from poor (>2 second) QTimes.

Documents (~2,350,000) in this instance are mainly comprised of
various "descriptive fields", such as multi-word (phrase) tags - an
average document contains 200-400 phrases like this across several
different multi-valued field types.

A custom QueryComponent has been built that functions somewhat like a
very specific MoreLikeThis. A seed document is specified via the
incoming query, its terms are retrieved, boosted both by query
parameters as well as fields within the document that specify term
weighting, sorted by this custom boosting, and then a second query is
crafted by taking the top 200 (sorted by the custom boosting)
resulting field values paired with their fields and searching for
documents matching these 200 values.

For many searches, 25-50% of the documents match the query of 200
terms (so 600,000 to 1,200,000).

After doing some profiling, it seems that a majority of the QTime
comes from dealing with phrases and resulting term positions, since a
majority of the search terms are actually multi-word tokenized
phrases. (processing is dominated by ExactPhraseScorer on down,
particularly: SegmentTermPositions, readVInt)

I have thought of a few ways to improve performance for this use case,
and am looking for feedback as to which seems best, as well as any
insight into other ways to approach this problem that I haven't
considered (or things to look into to help better understand the slow
QTimes more fully):

1) Shard the index - since there is no key to really specify which
shard queries would go to, this would only be of benefit if scoring is
done in parallel. Is there documentation I have so far missed that
describes distributed searching for this case? (I haven't found
anything that really describes the differences in scoring for
distributed vs. non-distributed indices, aside from the warnings that
IDF doesn't work - which I don't think we really care about).

2) Implement "Common Grams" as described here:
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
It's not clear how many individual words in the phrases being used
are, in fact, common, but given that 25-50% of the documents in the
index match many queries, it seems this may be of value

3) Try and make mm (minimum terms should match) work for the custom
query. I haven't been able to figure out how exactly this parameter
works, but, my thinking is along the lines of "if only 2 of those 200
terms match a document, it doesn't need to get scored". What I don't
currently understand is at what point failing the mm requirement
short-circuits - e.g. does the doc still get scored? If it does
short-circuit prior to scoring, this may help somewhat, although it's
not clear this would still prevent the many many gets against term
positions that is still killing QTime

4) Set a dynamic number (rather than the currently fixed 200) of terms
based on the custom boosting/weighting value - e.g. only use terms
whose calculated value is above some threshold. I'm not keen on this
since some documents may be dominated by many weak terms and not have
any great ones, so it might break for those (finding the "sweet spot"
cutoff would not be straightforward).

5) *This is my current favorite*: stop tokenizing/analyzing these
terms and just use KeywordTokenizer. Most of these phrases are
pre-vetted, and it may be possible to clean/process any others before
creating the docs. My main worry here is that, currently, if I
understand correctly, a document with the phrase "brazilian pop" would
still be returned as a match to a seed document containing only the
phrase "brazilian" (not the other way around, but that is not
necessary), however, with KeywordTokenizer, this would no longer be
the case. If I switched from the current dubious tokenize/stem/etc...
and just used Keyword, would this allow queries like "this used to be
a long phrase query" to match documents that have "this used to be a
long phrase query" as one of the multivalued values in the field
without having to pull term positions? (and thus significantly speed
up performance).

Thanks,
 Aaron


Re: help with solritas config

2012-10-24 Thread Shawn Heisey

On 10/24/2012 8:05 AM, Marcelo Elias Del Valle wrote:

I saw there is some documentation in solr wiki for SearchHandler and
VelocityResponseWriter, which I am trying to digest. However, I saw there
are some configuration fields that aren't there, like this QF field. I am
not sure on how to customize... Should I use only my custom fields there?


Marcelo, the /browse handler that comes with the Solr example is just 
that -- an example.  It's not intended for production use without a lot 
of customization, and definitely not intended to be directly available 
to 'regular' people or the Internet.  I'm not saying it can't be a 
useful tool, but nothing in Solr is hardened against abuse, so it should 
not be directly exposed to attack.


Also, the /browse handler configuration is highly tied in with the 
schema.xml in the example.  If you change the schema, you'll probably 
have to also perform surgery on the SolrItas config, which will likely 
require an understanding of Velocity.  For real help with Velocity, 
you'd need to consult other resources.  Here's some stuff that I was 
able to find:


http://velocity.apache.org/engine/releases/velocity-1.5/overview.html
http://velocity.apache.org/engine/releases/velocity-1.5/user-guide.html
http://velocity.apache.org/engine/releases/velocity-1.5/developer-guide.html

If you choose to customize the Velocity config and have questions about 
the Solr pieces of the puzzle, then this list can probably give you the 
answers you need.


Generally speaking, rather than use /browse, you'll want to access Solr 
directly from whatever application you have that faces your users, 
either using constructed URLs and a standard http library, or a 
solr-specific library that gives you the tools to tell it what you want 
and handles the URL construction for you.  For Java, that would be 
SolrJ.  There are also solr libraries for other languages.


Thanks,
Shawn



Re: Occasional Solr performance issues

2012-10-24 Thread Walter Underwood
Please consider never running "optimize". That should be called "force merge". 

wunder

On Oct 24, 2012, at 3:28 AM, Dotan Cohen wrote:

> On Tue, Oct 23, 2012 at 3:07 PM, Erick Erickson  
> wrote:
>> Maybe you've been looking at it but one thing that I didn't see on a fast
>> scan was that maybe the commit bit is the problem. When you commit,
>> eventually the segments will be merged and a new searcher will be opened
>> (this is true even if you're NOT optimizing). So you're effectively 
>> committing
>> every 1-2 seconds, creating many segments which get merged, but more
>> importantly opening new searchers (which you are getting since you pasted
>> the message: Overlapping onDeckSearchers=2).
>> 
>> You could pinpoint this by NOT committing explicitly, just set your 
>> autocommit
>> parameters (or specify commitWithin in your indexing program, which is
>> preferred). Try setting it at a minute or so and see if your problem goes 
>> away
>> perhaps?
>> 
>> The NRT stuff happens on soft commits, so you have that option to have the
>> documents immediately available for search.
>> 
> 
> 
> Thanks, Erick. I'll play around with different configurations. So far
> just removing the periodic optimize command worked wonders. I'll see
> how much it helps or hurts to run that daily or more or less frequent.
> 
> 
> -- 
> Dotan Cohen
> 
> http://gibberish.co.il
> http://what-is-what.com






Re: Solr 4.0.0 - index version and generation not changed after delete by query on master

2012-10-24 Thread Bill Au
Sorry, I had copied/pasted the wrong link before.  Here is the correct one:

https://issues.apache.org/jira/browse/SOLR-3986

Bill

On Wed, Oct 24, 2012 at 10:26 AM, Bill Au  wrote:

> I just filed a bug with all the details:
>
> https://issues.apache.org/jira/browse/SOLR-3681
>
> Bill
>
>
> On Tue, Oct 23, 2012 at 2:47 PM, Chris Hostetter  > wrote:
>
>> : Just discovered that the replication admin REST API reports the correct
>> : index version and generation:
>> :
>> : http://master_host:port/solr/replication?command=indexversion
>> :
>> : So is this a bug in the admin UI?
>>
>> Ya gotta be specific Bill: where in the admin UI do you think it's
>> displaying the incorrect information?
>>
>> The Admin UI just adds pretty markup to information fetched from the
>> admin handlers using javascript, so if there is a problem it's either in
>> the admin handlers, or in the javascript possibly caching the olds values.
>>
>> Off the cuff, this reminds me of...
>>
>> https://issues.apache.org/jira/browse/SOLR-3681
>>
>> The root confusion there was that /admin/replication explicitly shows data
>> about the commit point available for replication -- not the current commit
>> point being "searched" on the master.
>>
>> So if you are seeing a disconnect, then perhaps it's just that same
>> descrepency? -- allthough if you are *only* seeing a disconnect after a
>> deleteByQuery (and not after document adds, or a deleteById) then that
>> does smell fishy, and makes me wonder if there is a code path where the
>> "userData" for the commits aren't being set properly.
>>
>> Can you file a bug with a unit test to reproduce?  or at the very list a
>> set of specific commands to run against the solr example including what
>> request handler URLs to hit (so there's no risk of confusion about the ui
>> javascript behavior) to see the problem?
>>
>>
>> -Hoss
>>
>
>


Re: uniqueKey not enforced

2012-10-24 Thread Jack Krupansky
Just do a query on one of the keys that appears to be duplicated and see if 
the "duplicates" are also returned.


Also, look at all of the field values for the documents with "duplicated" 
keys - are they identical as well, or are there differences in specific 
field values? That might highlight when the duplication of keys occurred.


-- Jack Krupansky

-Original Message- 
From: Robert Krüger

Sent: Wednesday, October 24, 2012 3:25 AM
To: solr-user@lucene.apache.org
Subject: Re: uniqueKey not enforced

On Tue, Oct 23, 2012 at 2:37 PM, Erick Erickson  
wrote:

From left field:

Try looking at your admin/schema browser page for the ID in question.
That actually
gets stuff out of your index (the actual indexed terms). See if you
have two values

I'm running embedded, so I don't have that. However I have a simple UI
for performing queries and the duplicate records are displayed issuing
a "*:*" query.

for that ID. In which case you _might_ have spaces before or after the 
value
somehow. I notice your comment says something about "computed", so... 
Since

String types are totally unanalyzed, spaces would count.
No, the way the id is computed can not lead to leading or trailing 
whitespace.




you can also use the TermsComponent to see what's there, see:
http://wiki.apache.org/solr/TermsComponent

I'll take a look.



Best
Erick


Thanks,

Robert 



Re: Solr 4.0.0 - index version and generation not changed after delete by query on master

2012-10-24 Thread Bill Au
I just filed a bug with all the details:

https://issues.apache.org/jira/browse/SOLR-3681

Bill

On Tue, Oct 23, 2012 at 2:47 PM, Chris Hostetter
wrote:

> : Just discovered that the replication admin REST API reports the correct
> : index version and generation:
> :
> : http://master_host:port/solr/replication?command=indexversion
> :
> : So is this a bug in the admin UI?
>
> Ya gotta be specific Bill: where in the admin UI do you think it's
> displaying the incorrect information?
>
> The Admin UI just adds pretty markup to information fetched from the
> admin handlers using javascript, so if there is a problem it's either in
> the admin handlers, or in the javascript possibly caching the olds values.
>
> Off the cuff, this reminds me of...
>
> https://issues.apache.org/jira/browse/SOLR-3681
>
> The root confusion there was that /admin/replication explicitly shows data
> about the commit point available for replication -- not the current commit
> point being "searched" on the master.
>
> So if you are seeing a disconnect, then perhaps it's just that same
> descrepency? -- allthough if you are *only* seeing a disconnect after a
> deleteByQuery (and not after document adds, or a deleteById) then that
> does smell fishy, and makes me wonder if there is a code path where the
> "userData" for the commits aren't being set properly.
>
> Can you file a bug with a unit test to reproduce?  or at the very list a
> set of specific commands to run against the solr example including what
> request handler URLs to hit (so there's no risk of confusion about the ui
> javascript behavior) to see the problem?
>
>
> -Hoss
>


Re: SolrCloud - loop in recovery mode

2012-10-24 Thread AlexeyK
It is actually connected to this:
https://gist.github.com/2880527

Once you have collation = true + a warmup query, the init gets stuck on a wait.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-loop-in-recovery-mode-tp4015330p4015593.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud - loop in recovery mode

2012-10-24 Thread AlexeyK
After a little bit of investigation, it comes down to the searcher warmup
never completing.
I see the main thread waiting for the searcher. The warmup query handler is
stuck in another thread on the very same lock in getSearcher(), and no
notify() is called.
If I set useColdSearcher = true, this obviously doesn't happen and the
application starts normally.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-loop-in-recovery-mode-tp4015330p4015581.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: uniqueKey not enforced

2012-10-24 Thread Robert Krüger
On Wed, Oct 24, 2012 at 2:03 PM, Erick Erickson  wrote:
> Robert:
>
> But you do have an index somewhere, so the alternative for
> looking at it low-level would be
> 1> get a copy of Luke and point it at your index. Very useful tool

I will do that, next time I have that condition. Unfortunately I
didn't back up the index files when that happened.

Thanks for the advice!

Robert


Re: SolrJ CloudSolrServer throws ClassCastException

2012-10-24 Thread Mark Miller
Did up upgrade your Solr instance from the beta or alpha to 4 at some point?

- Mark

On Wed, Oct 24, 2012 at 1:14 AM, Kevin Osborn  wrote:
> It looks like this is where the problem lies. Here is the JSON that SolrJ
> is receiving from Zookeeper:
>
> "data":"{\\"manufacturer\\":{\\n\\"shard1\\":{\\n
>  \\"range\\":\\"8000-\\",\\n
>  \\"replicas\\":{\\"myhost:5270_solr_manufacturer\\":{\\n
>  \\"shard\\":\\"shard1\\",\\n  \\"roles\\":null,\\n
>  \\"state\\":\\"active\\",\\n  \\"core\\":\\"manufacturer\\",\\n
>\\"collection\\":\\"manufacturer\\",\\n
>  \\"node_name\\":\\"phx2-ccs-apl-dev-wax1.cnet.com:5270_solr\\",\\n
>  \\"base_url\\":\\"http://myhost:5270/solr\\",\\n
>  \\"leader\\":\\"true\\"}}},\\n\\"shard2\\":{\\n
>  \\"range\\":\\"0-7fff\\",\\n
>  \\"replicas\\":{\\"myhost:5275_solr_manufacturer\\":{\\n
>  \\"shard\\":\\"shard2\\",\\n  \\"roles\\":null,\\n
>  \\"state\\":\\"active\\",\\n  \\"core\\":\\"manufacturer\\",\\n
>\\"collection\\":\\"manufacturer\\",\\n
>  \\"node_name\\":\\"myhost:5275_solr\\",\\n  \\"base_url\\":\\"
> http://myhost:5275/solr\\",\\n
>  \\"leader\\":\\"true\\"}"}},{"data":{
>
> Where SolrJ is expecting the shard Name, it is actually getting "range" as
> the shard name and "8000-" as the value. Any ideas? Did I
> configure something wrong?
>
>
> On Tue, Oct 23, 2012 at 5:17 PM, Kevin Osborn  wrote:
>
>> I am getting a ClassCastException when i call Solr. My code is pretty
>> simple.
>>
>> SolrServer mySolrServer = new CloudSolrServer(zookeeperHost);
>> ((CloudSolrServer)mySolrServer).setDefaultCollection("manufacturer")
>> ((CloudSolrServer)mySolrServer).connect()
>>
>>
>> The actual error is thrown on line 300 of ClusterState.java:
>> new ZkNodeProps(sliceMap.get(shardName))
>>
>> It is trying to convert a String to a Map which causes the
>> ClassCastException.
>>
>> My zookeepHost string is simply  "myHost:6200". My SolrCloud has 2 shards
>> over a single collection. And two instances are running. I also tried an
>> external Zookeeper with the same results.
>>
>>
>> --
>> *KEVIN OSBORN*
>> LEAD SOFTWARE ENGINEER
>> CNET Content Solutions
>> OFFICE 949.399.8714
>> CELL 949.310.4677  SKYPE osbornk
>> 5 Park Plaza, Suite 600, Irvine, CA 92614
>> [image: CNET Content Solutions]
>>
>>
>>
>
>
> --
> *KEVIN OSBORN*
> LEAD SOFTWARE ENGINEER
> CNET Content Solutions
> OFFICE 949.399.8714
> CELL 949.310.4677  SKYPE osbornk
> 5 Park Plaza, Suite 600, Irvine, CA 92614
> [image: CNET Content Solutions]



-- 
- Mark


Re: SolrCloud - loop in recovery mode

2012-10-24 Thread AlexeyK
I have only just started learning the new features, so chances are it's some
misconfiguration.
I removed the collection2 from the setup and indexed some files.
Now there is another pattern that gets the init stuck, and it involves the
overseer polling the queue:

Oct 24, 2012 2:18:52 PM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener sending requests to Searcher@5f00498c
main{StandardDirectoryReader(segments_2:3 _0(4.0.0.2):C8)}
Oct 24, 2012 2:19:04 PM org.apache.zookeeper.server.ZooKeeperServer expire
INFO: Expiring session 0x13a92a39200, timeout of 15000ms exceeded
Oct 24, 2012 2:19:04 PM org.apache.zookeeper.server.PrepRequestProcessor
pRequest
INFO: Processed session termination for sessionid: 0x13a92a39200
Oct 24, 2012 2:19:04 PM org.apache.zookeeper.server.PrepRequestProcessor
pRequest
INFO: Got user-level KeeperException when processing
sessionid:0x13a92b5f199 type:delete cxid:0x282 zxid:0xfffe
txntype:unknown reqpath:n/a Error Path:/overseer_elect/leader
Error:KeeperErrorCode = NoNode for /overseer_elect/leader
Oct 24, 2012 2:19:04 PM org.apache.solr.common.cloud.SolrZkClient makePath
INFO: makePath: /overseer_elect/leader
Oct 24, 2012 2:19:04 PM org.apache.solr.cloud.Overseer start
INFO: Overseer (id=88544452827217920-akudinov-pc:8080_solr-n_01)
starting
Oct 24, 2012 2:19:04 PM org.apache.zookeeper.server.PrepRequestProcessor
pRequest
INFO: Got user-level KeeperException when processing
sessionid:0x13a92b5f199 type:create cxid:0x287 zxid:0xfffe
txntype:unknown reqpath:n/a Error Path:/overseer Error:KeeperErrorCode =
NodeExists for /overseer
Oct 24, 2012 2:19:04 PM org.apache.zookeeper.server.PrepRequestProcessor
pRequest
INFO: Got user-level KeeperException when processing
sessionid:0x13a92b5f199 type:create cxid:0x288 zxid:0xfffe
txntype:unknown reqpath:n/a Error Path:/overseer Error:KeeperErrorCode =
NodeExists for /overseer
Oct 24, 2012 2:19:04 PM org.apache.zookeeper.server.PrepRequestProcessor
pRequest
INFO: Got user-level KeeperException when processing
sessionid:0x13a92b5f199 type:create cxid:0x289 zxid:0xfffe
txntype:unknown reqpath:n/a Error Path:/overseer Error:KeeperErrorCode =
NodeExists for /overseer
Oct 24, 2012 2:19:04 PM org.apache.zookeeper.server.PrepRequestProcessor
pRequest
INFO: Got user-level KeeperException when processing
sessionid:0x13a92b5f199 type:create cxid:0x28a zxid:0xfffe
txntype:unknown reqpath:n/a Error Path:/overseer Error:KeeperErrorCode =
NodeExists for /overseer
Oct 24, 2012 2:19:04 PM org.apache.solr.cloud.OverseerCollectionProcessor
run
INFO: Process current queue of collection creations
Oct 24, 2012 2:19:04 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater
run
INFO: Starting to work on the main queue


Can you give me a clue about what's happening here?

Now my setup is:
collection1
2 shards
4 cores

There are several documents in both shards, automatically distributed by
solrcloud.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-loop-in-recovery-mode-tp4015330p4015574.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: uniqueKey not enforced

2012-10-24 Thread Erick Erickson
Robert:

But you do have an index somewhere, so the alternative for
looking at it low-level would be
1> get a copy of Luke and point it at your index. Very useful tool
2> just copy all your conf and data files somewhere and run the Jetty
 instance of Solr on that...

FWIW,
Erick

On Wed, Oct 24, 2012 at 3:25 AM, Robert Krüger  wrote:
> On Tue, Oct 23, 2012 at 2:37 PM, Erick Erickson  
> wrote:
>> From left field:
>>
>> Try looking at your admin/schema browser page for the ID in question.
>> That actually
>> gets stuff out of your index (the actual indexed terms). See if you
>> have two values
> I'm running embedded, so I don't have that. However I have a simple UI
> for performing queries and the duplicate records are displayed issuing
> a "*:*" query.
>
>> for that ID. In which case you _might_ have spaces before or after the value
>> somehow. I notice your comment says something about "computed", so... Since
>> String types are totally unanalyzed, spaces would count.
> No, the way the id is computed can not lead to leading or trailing whitespace.
>
>>
>> you can also use the TermsComponent to see what's there, see:
>> http://wiki.apache.org/solr/TermsComponent
> I'll take a look.
>
>>
>> Best
>> Erick
>
> Thanks,
>
> Robert


Re: solr 4.1 compression

2012-10-24 Thread Radim Kolar

I found this ticket: https://issues.apache.org/jira/browse/SOLR-3927
Is compression currently in the Lucene 4.1 branch only and not yet in the Solr
4.1 branch?


RE: Failure to open existing log file (non fatal)

2012-10-24 Thread Markus Jelsma

-Original message-
> From:Mark Miller 
> Sent: Wed 24-Oct-2012 01:33
> To: solr-user@lucene.apache.org
> Subject: Re: Failure to open existing log file (non fatal)
> 
> Why the process died, I cannot say. Seems like the world of guesses is
> just too large :) If there is nothing in the logs, it's likely a the
> OS level? But if there are no dump files or evidence of it in system
> logs, I don't even know where to start.

Me neither; perhaps it was cosmic radiation. I'll keep digging.

> 
> All I can help with is that the exception is an expected possibility
> after a Solr crash (on the next startup).

Yes, the node recovered itself in the minutes after the exception and continued 
to run fine. At least other users can now google for the error and ignore it 
for now.

Thanks!

> 
> - Mark
> 
> On Tue, Oct 23, 2012 at 6:48 PM, Markus Jelsma
>  wrote:
> > Hi,
> >
> > I checked the logs and it confirms the error is not fatal, it was logged 
> > just a few seconds before it was restarted. The node runs fine after it was 
> > restarted but logged this non fatal error replayed the log twice. This 
> > leaves the question why it died, there is no log of it dying anywhere. We 
> > don't recover rsyslogd so it was running all the time and there is no 
> > report of an OOM-killer there.
> >
> > Any more thoughts to share?
> >
> > Thanks
> > Markus
> >
> >
> > -Original message-
> >> From:Chris Hostetter 
> >> Sent: Wed 24-Oct-2012 00:38
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Failure to open existing log file (non fatal)
> >>
> >>
> >> : Perhaps we can improve the appearance of this - but it's expected to 
> >> happen in crash cases.
> >>
> >> in case it wasn't clear: there's no indication that this "Failure to open
> >> existing log file" *caused* any sort of crash -- it comes from the
> >> initialization code of the UpdateHandler when a SolrCore is being created
> >> as part of Solr startup.
> >>
> >> So this is an error that was definitely logged after tomcat was restarted.
> >>
> >> : > It seem the error is more fatal than the error tells me, the indexing
> >> : error and the exception happened within a few seconds of eachother. Any
> >> : ideas? Existing issue? File bug?
> >>
> >>
> >> -Hoss
> >>
> 
> 
> 
> -- 
> - Mark
> 


Re: Occasional Solr performance issues

2012-10-24 Thread Dotan Cohen
On Tue, Oct 23, 2012 at 3:07 PM, Erick Erickson  wrote:
> Maybe you've been looking at it but one thing that I didn't see on a fast
> scan was that maybe the commit bit is the problem. When you commit,
> eventually the segments will be merged and a new searcher will be opened
> (this is true even if you're NOT optimizing). So you're effectively committing
> every 1-2 seconds, creating many segments which get merged, but more
> importantly opening new searchers (which you are getting since you pasted
> the message: Overlapping onDeckSearchers=2).
>
> You could pinpoint this by NOT committing explicitly, just set your autocommit
> parameters (or specify commitWithin in your indexing program, which is
> preferred). Try setting it at a minute or so and see if your problem goes away
> perhaps?
>
> The NRT stuff happens on soft commits, so you have that option to have the
> documents immediately available for search.
>


Thanks, Erick. I'll play around with different configurations. So far
just removing the periodic optimize command worked wonders. I'll see
how much it helps or hurts to run that daily, or more or less frequently.
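
For reference, this is roughly what commitWithin looks like from SolrJ (a
sketch; the core URL, the field names and the 60-second interval are just
placeholders):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("text", "hello commitWithin");
        // ask Solr to make this document searchable within 60 seconds,
        // instead of issuing an explicit commit after every add
        server.add(doc, 60000);
    }
}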


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: DIH nested entities don't work

2012-10-24 Thread mroosendaal
Hi,

The views are very specific and every column contains relevant information,
thus the '*'.

But a strange thing happened: I ran the data-import again and for some
reason products had features but still no synonyms. The only thing I changed
was to use processor="SqlEntityProcessor " cacheImpl="SortedMapBackedCache"
to speed up the import. Still no descriptions or songtitle.

Then I updated the data-config to import more data from other views, and then
fewer products had features and no other data was loaded.

I'll try some different combinations of just products and features and work
from there. It does seem to fetch everything but simply not index it for
subentities.

But if a field is empty, can the DIH handle that, or is it only a problem when
you tag a field as 'required'?

Cheers,
Maarten



--
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tp4015514p4015535.html
Sent from the Solr - User mailing list archive at Nabble.com.


Search field affecting relevance

2012-10-24 Thread Maxim Kuleshov
Hi,

For example, we have documents with two fields - field1 and field2.
Both fields are indexed and both are used in search.

Is there a way to return only documents that are matched by field1, but
taking into account that if field2 is also matched the relevance should be
higher? In other words, if document "A" is matched by field1 and
field2, its relevance should be higher than document "B" matched only
by field1, and document "C" where only field2 is matched should not be
returned at all.

Could you please help with outlining the general approach to
achieve this? Is it a core Lucene feature, Solr post-processing
logic, or something else?


Re: DIH nested entities don't work

2012-10-24 Thread Gora Mohanty
On 24 October 2012 13:03, mroosendaal  wrote:
> Hi,
>
> Here's the relevant part of my schema:
[...]
>
>  pdt_id
> ...
>
> the data is read from into another searchengine fine but i'll try the select
> queries individually. The field-definitions need some tweaking.
[...]

As you are not specifically defining field entries inside
the entities, the names of the columns from the SELECT
statements must match the names of fields in the Solr
DIH configuration file (case does not matter). E.g.,
if you expect pdt_description to be filled, one of the SELECTs
must fetch a column of that name. Please see
http://wiki.apache.org/solr/DataImportHandler#Full_Import_Example ,
if you have not already come across it.

Also, a nested entity will be skipped if the SELECT for the
outer one fails.

Regards,
Gora


Re: DIH nested entities don't work

2012-10-24 Thread mroosendaal
Hi,

Here's the relevant part of my schema:

[the pasted XML snippet was stripped by the list archive; only the value
"pdt_id" (presumably the uniqueKey) survived]
...

The data is read into another search engine fine, but I'll try the select
queries individually. The field definitions need some tweaking.

As for the songtitle, that was a typo.

Cheers,
Maarten



--
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tp4015514p4015524.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: uniqueKey not enforced

2012-10-24 Thread Robert Krüger
On Tue, Oct 23, 2012 at 2:37 PM, Erick Erickson  wrote:
> From left field:
>
> Try looking at your admin/schema browser page for the ID in question.
> That actually
> gets stuff out of your index (the actual indexed terms). See if you
> have two values
I'm running embedded, so I don't have that. However, I have a simple UI
for performing queries, and the duplicate records are displayed when issuing
a "*:*" query.

> for that ID. In which case you _might_ have spaces before or after the value
> somehow. I notice your comment says something about "computed", so... Since
> String types are totally unanalyzed, spaces would count.
No, the way the id is computed cannot lead to leading or trailing whitespace.

>
> you can also use the TermsComponent to see what's there, see:
> http://wiki.apache.org/solr/TermsComponent
I'll take a look.

>
> Best
> Erick

Thanks,

Robert