Using FieldCache in SolrIndexSearcher for distributed id retrieval

2013-01-29 Thread Michael Ryan
Following up from a post I made back in 2011...

> I am a user of Solr 3.2 and I make use of the distributed search capabilities 
> of Solr using
> a fairly simple architecture of a coordinator + some shards.
> 
> Correct me if I am wrong:  In a standard distributed search with 
> QueryComponent, the first
> query sent to the shards asks for fl=myUniqueKey or fl=myUniqueKey,score.  
> When the response
> is being generated to send back to the coordinator, 
> SolrIndexSearcher.doc(int i, Set<String> fields) is called for each 
> document.  As I understand it, this will read each 
> document from
> the index _on disk_ and retrieve the myUniqueKey field value for each 
> document.
> 
> My idea is to have a FieldCache for the myUniqueKey field in 
> SolrIndexSearcher (or somewhere
> else?) that would be used in cases where the only field that needs to be 
> retrieved is myUniqueKey.
>  Is this something that would improve performance?
> 
> In our actual setup, we are using an extended version of QueryComponent that 
> queries for a
> couple other fields besides myUniqueKey in the initial query to the shards, 
> and it asks for a
> lot of rows when doing so, many more than what the user ends up getting back 
> when they see
> the results.  (The reasons for this are complicated and aren't related much 
> to this question.)
>  We already maintain FieldCaches for the fields that we are asking for, but 
> for other purposes.
>  Would it make sense to utilize these FieldCaches in SolrIndexSearcher?  Is 
> this something
> that anyone else has done before?

We did end up doing this inside of the SolrIndexSearcher.doc() method. 
Basically, I check whether the fields Set contains only fields that I am 
willing to serve from the FieldCache, and if so, build up the Document from 
the data inside the FieldCache. It looks like this...

if (fieldNamesToRetrieveFromFieldCache.containsAll(fields)) {
  // Every requested field can be served from the FieldCache,
  // so skip the stored-fields read from disk entirely.
  d = new Document();
  if (fields.contains("myUniqueKeyField")) {
    long value = FieldCache.DEFAULT.getLongs(reader, "myUniqueKeyField")[i];
    if (value != 0) {  // 0 is the FieldCache default for "no value"
      d.add(new NumericField("myUniqueKeyField", Field.Store.YES, true).setLongValue(value));
    }
  }
  if (fields.contains("someOtherField")) {
    long value = FieldCache.DEFAULT.getLongs(reader, "someOtherField")[i];
    if (value != 0) {
      d.add(new NumericField("someOtherField", Field.Store.YES, true).setLongValue(value));
    }
  }
}

I don't have a more generalized patch that makes it easily configurable, but 
the idea is fairly simple.

We have had good results from this. For a system of n shards, this reduces the 
average number of docs to retrieve from disk per shard from rows to rows/n. For 
requests with a large rows parameter (e.g., 1000) and many shards, this makes a 
noticeable difference in response time. Obviously this isn't the typical Solr 
use case, so your mileage may vary. 

-Michael


Re: small QTime but slow results to user

2013-01-29 Thread S L
I'm just writing to close the loop on this issue.

I moved my servlet to a beefier server with lots of RAM. I also cleaned up
the data to make the index somewhat smaller. And, I turned off all the
caches since my application doesn't benefit very much from caching. My
application is now quite zippy, returning 500 results in an average time of
half a second of real time, and a qtime of .25 seconds. 

I did have to play a bit with the Java stack size for Tomcat. 160k was
leading to lots of StackOverflowErrors. I upped -Xss to 256k and haven't
seen a single error since.
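
For anyone making the same change, it amounts to something like the following
in Tomcat's bin/setenv.sh (a sketch; file location and baseline options depend
on your installation):

  # raise the per-thread stack size to avoid StackOverflowErrors
  JAVA_OPTS="$JAVA_OPTS -Xss256k"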

I do see some disk activity but the disk is never pegged and performance is
great.

Thank you everyone for steering me in the right direction. 





Re: How to migrate SolrCloud shards to different servers?

2013-01-29 Thread Timothy Potter
Just one suggestion, instead of stopping zk and removing zoo_data,
better to use Solr's zkcli.sh script from cloud-scripts to clear out
data, e.g.

zkcli.sh -zkhost localhost:9983 -cmd clear /solr

The paths I clear when I want a full clean-up are:

/configs/CONFIG_NAME
/collections/COLLECTION_NAME
/clusterstate.json
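
That is, one clear per path, using the same command form as above:

zkcli.sh -zkhost localhost:9983 -cmd clear /configs/CONFIG_NAME
zkcli.sh -zkhost localhost:9983 -cmd clear /collections/COLLECTION_NAME
zkcli.sh -zkhost localhost:9983 -cmd clear /clusterstate.json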



On Tue, Jan 29, 2013 at 4:14 PM, Mingfeng Yang  wrote:
> An experiment found that stopping all shards, removing the zoo_data (assuming your
> zookeeper is used only for this particular SolrCloud; otherwise, be cautious),
> and then starting the instances in order works fine.
>
> Ming
>
>
>
> On Sat, Jan 26, 2013 at 5:31 AM, Per Steffensen  wrote:
>
>> Hi
>>
>> We have actually tested this and found that the following will do it
>> * Shutdown all Solr nodes - make sure ZKs are still running
>> * For each replica (shard-instance) move its data-folder to the new server
>> (if they are not already available to it through some shared storage)
>> * For each replica (shard-instance) also move solr.xmls
>> * Extract clusterstate.json from ZK into a file. Modify that file so that
>> hosts/IPs and ports are correct according to new setup. Replace
>> clusterstate.json in ZK with the modified content of the clusterstate.json
>> file
>> * Start new Solr nodes
>>
>> Good luck!
>>
>> Regards, Per Steffensen
>>
>>
>>
>> On 1/26/13 6:56 AM, Mingfeng Yang wrote:
>>
>>> Hi Mark,
>>>
>>> When I did testing with SolrCloud, I found the following.
>>>
>>> 1. I started 4 shards on the same host on port 8983, 8973, 8963, and 8953.
>>> 2. Index some data.
>>> 3. Shutdown all 4 shards.
>>> 4. Started 4 shards again, all pointing to the same data directory and use
>>> the same configuration, except that now we use different ports 8983, 8973,
>>>   7633 and 7648.
>>> 5. Now Solr has problems loading all the cores properly.
>>>
>>> Therefore, I had the impression that ZooKeeper may have a memory of which
>>> hosts correspond to which shards. If I change the host info, it may get
>>> confused.  I could not find any related documentation or discussion about
>>> this issue.
>>>
>>> Thanks,
>>> Ming
>>>
>>>
>>>
>>>
>>> On Fri, Jan 25, 2013 at 5:52 PM, Mark Miller 
>>> wrote:
>>>
>>>  You could do it that way.

 I'm not sure why you are worried about the leaders. That shouldn't
 matter.

 You could also start up new Solrs on the new machines as replicas of the
 cores you want to move - then once they are active, unload the cores on
 the
 old machine, stop the Solr instances and remove the stuff left on the
 filesystem.

 - Mark

 On Jan 25, 2013, at 7:42 PM, Mingfeng Yang 
 wrote:

  Right now I have an index with four shards on a single EC2 server, each
> running on different ports.  Now I'd like to migrate three shards
> to independent servers.
>
> What should I do to safely accomplish this process?
>
> Can I just
> 1. shutdown all four solr instances.
> 2. copy three shards (indexes) to different servers.
> 3. launch 4 solr instances on 4 different servers, each with -zKhost
> specified, pointing to the zookeeper servers.
>
> In my impression, zookeeper remembers which shards are leaders.  What I
> plan to do above could not elect the three new servers as leaders.  If so,
> what's the correct way to do it?
>
> Thanks,
> Ming
>


>>


Re: edismax, qf, multiterm analyzer bug?

2013-01-29 Thread Ahmet Arslan
> Looks like a bug to me.

Thanks Jack for the reply. I created SOLR-4382 for this.



Re: How to migrate SolrCloud shards to different servers?

2013-01-29 Thread Mingfeng Yang
An experiment found that stopping all shards, removing the zoo_data (assuming your
zookeeper is used only for this particular SolrCloud; otherwise, be cautious),
and then starting the instances in order works fine.

Ming



On Sat, Jan 26, 2013 at 5:31 AM, Per Steffensen  wrote:

> Hi
>
> We have actually tested this and found that the following will do it
> * Shutdown all Solr nodes - make sure ZKs are still running
> * For each replica (shard-instance) move its data-folder to the new server
> (if they are not already available to it through some shared storage)
> * For each replica (shard-instance) also move solr.xmls
> * Extract clusterstate.json from ZK into a file. Modify that file so that
> hosts/IPs and ports are correct according to new setup. Replace
> clusterstate.json in ZK with the modified content of the clusterstate.json
> file
> * Start new Solr nodes
>
> Good luck!
>
> Regards, Per Steffensen
>
>
>
> On 1/26/13 6:56 AM, Mingfeng Yang wrote:
>
>> Hi Mark,
>>
>> When I did testing with SolrCloud, I found the following.
>>
>> 1. I started 4 shards on the same host on port 8983, 8973, 8963, and 8953.
>> 2. Index some data.
>> 3. Shutdown all 4 shards.
>> 4. Started 4 shards again, all pointing to the same data directory and use
>> the same configuration, except that now we use different ports 8983, 8973,
>>   7633 and 7648.
>> 5. Now Solr has problems loading all the cores properly.
>>
>> Therefore, I had the impression that ZooKeeper may have a memory of which
>> hosts correspond to which shards. If I change the host info, it may get
>> confused.  I could not find any related documentation or discussion about
>> this issue.
>>
>> Thanks,
>> Ming
>>
>>
>>
>>
>> On Fri, Jan 25, 2013 at 5:52 PM, Mark Miller 
>> wrote:
>>
>>  You could do it that way.
>>>
>>> I'm not sure why you are worried about the leaders. That shouldn't
>>> matter.
>>>
>>> You could also start up new Solrs on the new machines as replicas of the
>>> cores you want to move - then once they are active, unload the cores on
>>> the
>>> old machine, stop the Solr instances and remove the stuff left on the
>>> filesystem.
>>>
>>> - Mark
>>>
>>> On Jan 25, 2013, at 7:42 PM, Mingfeng Yang 
>>> wrote:
>>>
>>>  Right now I have an index with four shards on a single EC2 server, each
 running on different ports.  Now I'd like to migrate three shards
 to independent servers.

 What should I do to safely accomplish this process?

 Can I just
 1. shutdown all four solr instances.
 2. copy three shards (indexes) to different servers.
 3. launch 4 solr instances on 4 different servers, each with -zKhost
 specified, pointing to the zookeeper servers.

 In my impression, zookeeper remembers which shards are leaders.  What I
 plan to do above could not elect the three new servers as leaders.  If so,
 what's the correct way to do it?

 Thanks,
 Ming

>>>
>>>
>


RE: queryResultCache *very* low hit ratio

2013-01-29 Thread Petersen, Robert
Hi Shawn,

My Solr services power product search for a large retail web site with over 
fourteen million unique products, so I suspect the main reason for the low 
hit rate is many unique user queries.  We're expanding our product count and 
product type categories every day, as fast as we can.

Thanks!
Robi

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Tuesday, January 29, 2013 2:24 PM
To: solr-user@lucene.apache.org
Subject: Re: queryResultCache *very* low hit ratio

On 1/29/2013 1:36 PM, Petersen, Robert wrote:
> My queryResultCache hitratio has been trending down lately and is now at 
> 0.01%, and also it's warmup time was almost a minute.  I have lowered the 
> autowarm count dramatically since there are no hits anyway.  I also wanted to 
> lower my autowarm counts across the board because I am about to expand the 
> warmup queries in my newSearcher config section.  Would I be better just 
> turning off this cache completely?  I don't really want to increase its size 
> because I've found that by keeping my cache sizes limited keeps me from 
> getting OOM exceptions across my slave farm.

A low hit ratio on this cache means quite simply that most of your queries (q 
parameter) are unique.

Often this is the result of including unique identifiers within the query text, 
or using the NOW variable in queries against a date field, because NOW changes 
every millisecond.  By using rounding (NOW/HOUR,
NOW/DAY) you can fix the latter.

Sometimes it's caused by an unexpected and very very active query source.  If 
your developers see your Solr service as an unlimited resource, they might 
write programs that bombard the server with unique queries.  If that's what is 
happening, you might need another copy of your solr infrastructure that's for 
internal use only.

Sometimes it's just because your users are entering a lot of unique searches, 
or not visiting multiple pages of results.

If you're not seeing any value from the cache, turning it off might be sensible 
so it doesn't use memory.

Thanks,
Shawn





RE: queryResultCache *very* low hit ratio

2013-01-29 Thread Petersen, Robert
Thanks Yonik,

I'm cooking up some static warming queries right now, based upon our commonly 
issued queries.  I've already been noticing occasional long running queries.  
Our web farm times out a search after twenty seconds and issues an exception.  
I see a few of these every day and am trying to combat them with better warm up 
queries.  My current static warm up queries are too simple I suspect.  They 
don't replicate any of our typically issued filter queries nor function queries.
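
For reference, a static warming entry in the newSearcher section of
solrconfig.xml looks roughly like this (the query, filter, and sort values
here are made up):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">laptop</str>
      <str name="fq">category:electronics</str>
      <str name="sort">price asc</str>
    </lst>
  </arr>
</listener>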

Thanks
Robi

-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Tuesday, January 29, 2013 2:46 PM
To: solr-user@lucene.apache.org
Subject: Re: queryResultCache *very* low hit ratio

One other thing that some auto-warming of the query result cache can achieve is 
loading FieldCache entries for sorting / function queries so real user queries 
don't experience increased latency.  If you remove all auto-warming of the 
query result cache, you may want to add static warming entries for these fields.

-Yonik
http://lucidworks.com


On Tue, Jan 29, 2013 at 3:36 PM, Petersen, Robert  wrote:
> Hi solr users,
>
> My queryResultCache hitratio has been trending down lately and is now at 
> 0.01%, and also it's warmup time was almost a minute.  I have lowered the 
> autowarm count dramatically since there are no hits anyway.  I also wanted to 
> lower my autowarm counts across the board because I am about to expand the 
> warmup queries in my newSearcher config section.  Would I be better just 
> turning off this cache completely?  I don't really want to increase its size 
> because I've found that by keeping my cache sizes limited keeps me from 
> getting OOM exceptions across my slave farm.
>
> Thanks,
>
> Robert (Robi) Petersen
> Senior Software Engineer
> Search Department
>




Re: replicateOnStartup not finding commits after SOLR-3911?

2013-01-29 Thread Gregg Donovan
Thanks, Mark -- that fixed the issue for us. I created
https://issues.apache.org/jira/browse/SOLR-4380 to track it.

On Tue, Jan 29, 2013 at 4:06 PM, Mark Miller  wrote:
>
> On Jan 29, 2013, at 3:50 PM, Gregg Donovan  wrote:
>
>>  should we
>> just try uncommenting that line in ReplicationHandler?
>
> Please try. I'd file a JIRA issue in any case. I can probably take a closer 
> look.
>
> - Mark


Re: queryResultCache *very* low hit ratio

2013-01-29 Thread Yonik Seeley
One other thing that some auto-warming of the query result cache can
achieve is loading FieldCache entries for sorting / function queries
so real user queries don't experience increased latency.  If you
remove all auto-warming of the query result cache, you may want to add
static warming entries for these fields.

-Yonik
http://lucidworks.com


On Tue, Jan 29, 2013 at 3:36 PM, Petersen, Robert  wrote:
> Hi solr users,
>
> My queryResultCache hitratio has been trending down lately and is now at 
> 0.01%, and also it's warmup time was almost a minute.  I have lowered the 
> autowarm count dramatically since there are no hits anyway.  I also wanted to 
> lower my autowarm counts across the board because I am about to expand the 
> warmup queries in my newSearcher config section.  Would I be better just 
> turning off this cache completely?  I don't really want to increase its size 
> because I've found that by keeping my cache sizes limited keeps me from 
> getting OOM exceptions across my slave farm.
>
> Thanks,
>
> Robert (Robi) Petersen
> Senior Software Engineer
> Search Department
>


Re: queryResultCache *very* low hit ratio

2013-01-29 Thread Shawn Heisey

On 1/29/2013 1:36 PM, Petersen, Robert wrote:

My queryResultCache hitratio has been trending down lately and is now at 0.01%, 
and also it's warmup time was almost a minute.  I have lowered the autowarm 
count dramatically since there are no hits anyway.  I also wanted to lower my 
autowarm counts across the board because I am about to expand the warmup 
queries in my newSearcher config section.  Would I be better just turning off 
this cache completely?  I don't really want to increase its size because I've 
found that by keeping my cache sizes limited keeps me from getting OOM 
exceptions across my slave farm.


A low hit ratio on this cache means quite simply that most of your 
queries (q parameter) are unique.


Often this is the result of including unique identifiers within the 
query text, or using the NOW variable in queries against a date field, 
because NOW changes every millisecond.  By using rounding (NOW/HOUR, 
NOW/DAY) you can fix the latter.
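
For example, assuming a date field named timestamp, a filter written as

fq=timestamp:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]

maps to the same cache entry all day, whereas the same range written with a 
bare NOW creates a new, uncacheable entry on every request.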


Sometimes it's caused by an unexpected and very very active query 
source.  If your developers see your Solr service as an unlimited 
resource, they might write programs that bombard the server with unique 
queries.  If that's what is happening, you might need another copy of 
your solr infrastructure that's for internal use only.


Sometimes it's just because your users are entering a lot of unique 
searches, or not visiting multiple pages of results.


If you're not seeing any value from the cache, turning it off might be 
sensible so it doesn't use memory.


Thanks,
Shawn



RE: Solr Faceting with Name Values

2013-01-29 Thread O. Olson
Thank you Robi for the information. I will be looking into this esp. the
implementation. Having to join the names together and then split them later
is something I have to discuss with my team. 

O. O.



Petersen, Robert wrote
> Hi O.O
> 
> 1.  Yes faceting on field function_s would return all the facet values in
> the search results with their counts.
> 2.  You would probably have to join the names together with a special
> character and then split them later in the UI.  
> 3.  I'm sure there is a way to query the index for all defined fields. 
> The admin schema browser page does this exact thing.
> 
> Resources for further exploration:
> http://wiki.apache.org/solr/SolrFacetingOverview
> http://wiki.apache.org/solr/SimpleFacetParameters
> http://searchhub.org/2009/09/02/faceted-search-with-solr/
> http://wiki.apache.org/solr/HierarchicalFaceting
> http://lucidworks.lucidimagination.com/display/solr/Faceting
> 
> Have fun!
> Robi







Re: Multiple-fields multilingual indexing - Query expansion for multilingual fields

2013-01-29 Thread Alexandre Rafalovitch
On Tue, Jan 29, 2013 at 4:39 PM, Eduard Moraru  wrote:

> Now, what worries me a bit is the fact that I have a copyField set up from
> "title_*" to "title_ml" to do what I have mentioned above.
>

copyField is not recursive, nor chained. Even if some people wished it was
(chained). So, I think you are safe from the infinite loops with that.

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


Re: In DIH, does column vs. name depend on data source?

2013-01-29 Thread Alexandre Rafalovitch
And I have just confirmed that this is indeed the case (unless I lost my
mind).

This must be causing great confusion for anybody trying to piece together
examples from multiple places in the DIH wiki.

The example I had just now was:
1) I had this in XpathEntityProcessor example:
  

2) I copy this to JDBC example:
  
The import fails due to date format not matching the field. But I have
DateFormatTransformer and it just worked for the other example.

3) Here is the correct line _for JDBC_:
  
Notice how the _name_ is what should match to the schema.xml. But of course
in the example 2 above, it worked against the DB, but then
DateFormatTransformer was looking for "DATE" rather than "date" field.
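
To spell out the difference with illustrative attribute values (the xpath here
is made up; the DATE/date names are from the example above):

<!-- XPathEntityProcessor: "column" is the Solr field, "xpath" is the source -->
<field column="date" xpath="/rss/channel/item/date" />

<!-- JdbcDataSource: "column" is the DB column, "name" is the Solr field -->
<field column="DATE" name="date" />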

And this is an obvious setup, imagine if this was more subtle. I bet
some people just give up on DIH handler because of this.

I'll file an issue, but I am surprised nobody got caught by this before.
Plus, the wiki needs to be seriously updated.

Regards,
   Alex. (wearing my ex-Tech Support hat)

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Mon, Jan 21, 2013 at 10:24 AM, Alexandre Rafalovitch
wrote:

> Hello,
>
> When I look at XPathEntityProcessor, it seems that "column" refers to a
> field registered in schema.xml and "xpath" to the source content.
>
> When I look at JdbcDataSource, it seems that "column" refers to the source
> content (in the database) and "name " to a field registered in schema.xml
>
> When I look at Transformers, which one assumes can apply to both, they use
> "column" and "sourceColName" to refer to names that may or may not be
> registered in schema.xml (the critical architecture diagram image seem to
> be missing).
>
> Given that this is all within one Wiki page and the examples
> cross-reference each other a bit, this is awfully confusing.
>
> Does anybody have a straight insight into this?
>
> Thank you,
>Alex.
>


Re: indexing Text file in solr

2013-01-29 Thread Jan Høydahl
If you're lucky, the file has a format suitable for the CSV update handler 
http://wiki.apache.org/solr/UpdateCSV
Note that if your file does not containt unique ID, you can generate those 
http://wiki.apache.org/solr/UniqueKey
You can then use CURL or http://wiki.apache.org/solr/post.jar to index it
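
For example, a one-tweet-per-line file can be treated as single-column CSV,
something along these lines (field and file names are made up, and you would
still need a unique id as noted above):

curl 'http://localhost:8983/solr/update/csv?commit=true&header=false&fieldnames=tweet' \
  --data-binary @tweets.txt -H 'Content-type:text/plain; charset=utf-8'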

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

27. jan. 2013 kl. 11:23 skrev hadyelsahar :

> i have a large Arabic Text File that contains Tweets each line contains one
> tweet , that i want to index in solr such that each line of this document
> should be indexed in a separate solr document
> 
> what i tried so far :
> 
> i know how to SQL databse records in solr
> i know how to change solr schema to fit the data and working with Data
> import handler
> i know how the queries used to index data in solr
> what i want is :
> 
> know how to index text file in solr in order that each line is considered a
> solr document
> 
> 
> 



Re: Solr Data Config Queries per Field

2013-01-29 Thread O. Olson
Gora Mohanty-3 wrote
> Yes, things should function as you describe, and no you should not
> need any change in your schema from changing the DIH configuration
> file. Please take a look at
> http://wiki.apache.org/solr/SolrFacetingOverview#Facet_Indexing for
> how best to define faceting fields. Also, see this tutorial on faceted
> search with Solr:
> http://searchhub.org/2009/09/02/faceted-search-with-solr/
> 
> Regards,
> Gora

Thank you Gora. I implemented it the way you suggested, and it worked
perfectly!
O. O.







Traditional replication behind SolrCloud

2013-01-29 Thread Mingfeng Yang
Our application of Solr is somewhat non-typical.  We constantly feed Solr
with lots of documents grabbed from the internet, and NRT searching is not
required.  A typical search will return millions of results, and query
response needs to be as fast as possible.

Since in a SolrCloud environment indexing requests are constantly distributed
to all leaders and replicas, I think that may impact query
performance, since the replicas are doing indexing and searching at the same
time.  I am thinking about setting up traditional replication behind each
shard of SolrCloud, and setting the replication interval to a few minutes, to
minimize the impact of indexing on system resources.

Or is there already some way to enforce traditional type of replication in
the replicas of SolrCloud?

Thanks,
Ming


Re: Multiple-fields multilingual indexing - Query expansion for multilingual fields

2013-01-29 Thread Eduard Moraru
Hi Alex,

On Wed, Jan 23, 2013 at 7:47 PM, Alexandre Rafalovitch
wrote:

> On Wed, Jan 23, 2013 at 12:23 PM, Eduard Moraru  >wrote:
>
> > The only small but workable problem I have now is the same as
> > https://issues.apache.org/jira/browse/SOLR-3598. When you are creating
> an
> > alias for the field "who", you can't include the actual field in the list
> > of alias like "f.who.qf=who,what,where" because you`ll get an "alias
> loop"
> > exception.
> >
>
> But why do you need 'title' field at all? I can see it is 'generic'
> formatting, but how useful can that be if you are actively multilingual?
>

I find it useful when you index a (XY) language that is not configured in
schema.xml. When this happens, text_ml (the name that I`ve come up with for
the actual text_general type field) indexes the lightly analyzed content
and then I can query it because it's included in the
"f.title.qf=title_ml,title_en,title_fr,..." alias that I have set up. The
title_XY field that is attempted to be indexed when the language is not
configured gets ignored by a dynamic field that catches unknown fields (as
per the example in the example schema.xml).

>
> But if you need it, can't it be just title_generic in the schema. You can
> probably use Request Update Processors to change the field name if you
> can't rename it in the client/source.
>
> And if you are worried about the client getting the field names, I believe
> you can alias them on the way out as well, using a different parameter.
>

Yes, I have come to the conclusion that "text_ml" is a good choice and
semantically OK in my schema.xml.

Now, what worries me a bit is the fact that I have a copyField set up from
"title_*" to "title_ml" to do what I have mentioned above. My worry is that
the copyField might also cause "title_ml" (since it matches the wildcard
pattern) to be copied (redundantly) into itself. I have not yet tested
this in practice, but I hope it is not an issue (hoping that Solr is smart
enough).
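
That is, with a rule like the following (a sketch of my setup), the source
pattern also matches the destination field itself:

<copyField source="title_*" dest="title_ml"/>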

Thanks,
Eduard


> Regards,
>Alex.
>
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
>


Re: replicateOnStartup not finding commits after SOLR-3911?

2013-01-29 Thread Mark Miller

On Jan 29, 2013, at 3:50 PM, Gregg Donovan  wrote:

>  should we
> just try uncommenting that line in ReplicationHandler?

Please try. I'd file a JIRA issue in any case. I can probably take a closer 
look.

- Mark

Re: DIH datasource configuration

2013-01-29 Thread Gora Mohanty
On 30 January 2013 01:52, Lapera-Valenzuela, Elizabeth [Primerica]
 wrote:
> Is there a way to pass in password and user to datasource in db-config
> xml file?  Thanks.

Do you mean something beyond what is covered in the Solr
DIH Wiki page:
http://wiki.apache.org/solr/DataImportHandler#Configuring_DataSources
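
In the simplest case they are plain attributes on the dataSource element, e.g.
(driver, url, and credentials here are placeholders):

<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/dbname"
            user="db_user" password="db_pass"/>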

Regards,
Gora


replicateOnStartup not finding commits after SOLR-3911?

2013-01-29 Thread Gregg Donovan
In the process of upgrading to 4.1 from 3.6, I've noticed that our
master servers do not show any commit points available until after a
new commit happens. So, for static indexes, replication doesn't happen
and for dynamic indexes, we have to wait until an incremental update
of master for slaves to see any commits.

Tracing through the code, it looks like the change that may have
affected us was part of SOLR-3911 [1], specifically commenting out the
initialization of the newIndexWriter in the replicateAfterStartup
block [2]:

// TODO: perhaps this is no longer necessary then?
// core.getUpdateHandler().newIndexWriter(true);

I'm guessing this is commented out because it is assumed that
indexCommitPoint was going to be set by that block, but when a slave
requests commits, that goes back to
core.getDeletionPolicy().getCommits() to fetch the list of commits. If
no indexWriter has been initialized, then, as far as I can tell,
IndexDeletionPolicyWrapper#onInit will not have been called and there
will be no commits available.

Is there something in the code or configuration that we may be missing
that should be initializing the commits for replication or should we
just try uncommenting that line in ReplicationHandler?

Thanks!

--Gregg

Gregg Donovan
Senior Software Engineer, Etsy.com
gr...@etsy.com


[1]
https://issues.apache.org/jira/browse/SOLR-3911
https://issues.apache.org/jira/secure/attachment/12548596/SOLR-3911.patch

[2]
http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/solr/core/src/java/org/apache/solr/handler/ReplicationHandler.java?annotate=1420992&diff_format=h&pathrev=1420992#l880


queryResultCache *very* low hit ratio

2013-01-29 Thread Petersen, Robert
Hi solr users,

My queryResultCache hitratio has been trending down lately and is now at 0.01%, 
and also it's warmup time was almost a minute.  I have lowered the autowarm 
count dramatically since there are no hits anyway.  I also wanted to lower my 
autowarm counts across the board because I am about to expand the warmup 
queries in my newSearcher config section.  Would I be better just turning off 
this cache completely?  I don't really want to increase its size because I've 
found that by keeping my cache sizes limited keeps me from getting OOM 
exceptions across my slave farm.

Thanks,

Robert (Robi) Petersen
Senior Software Engineer
Search Department



Re: DIH datasource configuration

2013-01-29 Thread Lapera-Valenzuela, Elizabeth [Primerica]
Is there a way to pass in password and user to datasource in db-config
xml file?  Thanks.



RE: Solr Faceting with Name Values

2013-01-29 Thread Petersen, Robert
Hi O.O

1.  Yes faceting on field function_s would return all the facet values in the 
search results with their counts.
2.  You would probably have to join the names together with a special character 
and then split them later in the UI.  
3.  I'm sure there is a way to query the index for all defined fields.  The 
admin schema browser page does this exact thing.
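
(If I remember right, the schema browser is backed by the LukeRequestHandler,
so something like http://localhost:8983/solr/admin/luke?numTerms=0 (or
/solr/CORE_NAME/admin/luke on a multicore setup) should list the fields
actually in the index, dynamic ones included.)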

Resources for further exploration:
http://wiki.apache.org/solr/SolrFacetingOverview
http://wiki.apache.org/solr/SimpleFacetParameters
http://searchhub.org/2009/09/02/faceted-search-with-solr/
http://wiki.apache.org/solr/HierarchicalFaceting
http://lucidworks.lucidimagination.com/display/solr/Faceting

Have fun!
Robi


-Original Message-
From: O. Olson [mailto:olson_...@yahoo.it] 
Sent: Monday, January 28, 2013 3:11 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr Faceting with Name Values

Thank you Robi. Your idea seems good but I have a few questions: 

1.  From your description, I would create a field “Function_s” with the 
value
“Scanner” and “Function_s” with the value “Printer” for my two Products.
This seems good. Is it possible for you to give me a query for this dynamic field? 
E.g., could I do something like: 

&facet=true&facet.field=Function_s

I would like this to tell me how many of the products are Scanners and how many 
of the products are Printers.

2.  Many of my Attribute Names have spaces e.g. “PC Connection”, or even
brackets and slashes e.g. “Scan Speed (ppm)”. Would there be a problem putting 
these in a dynamic field name?

3.  Is it possible to query for the possible list of dynamic fieldnames? I
might need this when creating a list of attributes.


Thanks again Robi.
O. O.

--

Petersen, Robert wrote
> Hi O.O.,
> 
> You don't need to add them all into the schema.  You can use the 
> wildcard fields like <dynamicField name="*_s" type="string" indexed="true" stored="true" /> to hold them.  You can then have the 
> attribute name be the part of the wildcard and the attribute value be 
> the field contents. So you could have fields like Function_s:Scanner 
> etc and then you could ask for facets which are relevant based upon 
> query or category.
> 
> That would be a much more straightforward approach and much easier to 
> facet on.  Hope that helps a little bit.
> 
> -Robi








RE: Solr load balancer

2013-01-29 Thread Phil Hoy
Hi Erick,

Thanks, I have read the blogs you cited and I found them very interesting, and 
we have tuned the jvm accordingly but still we get the odd longish gc pause. 

That said we perhaps have an unusual setup; we index a lot of small documents 
using servers with ssd's and 128 GB RAM in a sharded set up with replicas and 
our queries rely heavily on query filters and faceting with minimal free-text 
style searching. For that reason we rely heavily on the filter cache to improve 
query latency, therefore we assign a large percentage of available ram to the 
jvm hosting solr. 

Anyhow we are happy with the current configuration and performance profile, 
aside from the odd gc pause that is, and as we have index replicas it seems to 
me that we should be able to cope, hence my willingness to tweak how the load 
balancer behaves.

Thanks,
Phil



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 20 January 2013 15:56
To: solr-user@lucene.apache.org
Subject: Re: Solr load balancer

Hmmm, the first thing I'd look at is why you are having long GC pauses. Here's 
a great place to start:

http://www.lucidimagination.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/
and:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

I've wondered about a similar approach, but by firing off the same query to 
multiple nodes in your cluster, you'll be effectively doubling (at least) the 
load on your system. Leading to more memory issues perhaps in a "non-virtuous 
cycle".

FWIW,
Erick

On Fri, Jan 18, 2013 at 5:41 AM, Phil Hoy  wrote:
> Hi,
>
> I would like to experiment with some custom load balancers to help with query 
> latency in the face of long gc pauses and the odd time-consuming query that 
> we need to be able to support. At the moment setting the socket timeout via 
> the HttpShardHandlerFactory does help, but of course it can only be set to a 
> length of time as long as the most time consuming query we are likely to 
> receive.
>
> For example perhaps a load balancer that sends multiple queries concurrently 
> to all/some replicas and only keeps the first response might be effective. Or 
> maybe a load balancer which takes account of the frequency of timeouts would 
> be able to recognize zombies more effectively.
>
> To use alternative load balancer implementations cleanly and without having 
> to hack solr directly, I would need to be able to make the existing 
> LBHttpSolrServer and HttpShardHandlerFactory more amenable to extension, I 
> can then override the default load balancer using solr's plugin mechanism.
>
> So my question is, if I made a patch to make the load balancer more 
> pluggable, is this something that would be acceptable and if so what do I do 
> next?
>
> Phil
>

Re: overlap function query

2013-01-29 Thread Mikhail Khludnev
Daniel,

You can start from here:
http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/search/similarities/Similarity.html#coord%28int,%20int%29
but it requires a deep understanding of Lucene internals.
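
A minimal sketch of plugging in a custom coord(), assuming Lucene/Solr 4.0 and
a class name of my own invention. One caveat: coord()'s denominator is the
number of query terms, not the number of terms in the field, so your exact
ratio would additionally need index-time norms or term vectors:

import org.apache.lucene.search.similarities.DefaultSimilarity;

public class OverlapSimilarity extends DefaultSimilarity {
  @Override
  public float coord(int overlap, int maxOverlap) {
    // fraction of query terms that matched this document
    // (this is in fact what DefaultSimilarity already returns)
    return (float) overlap / maxOverlap;
  }
}

It can then be registered globally in schema.xml with
<similarity class="OverlapSimilarity"/>.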



On Tue, Jan 29, 2013 at 2:12 PM, Daniel Rosher  wrote:

> Hi,
>
> I'm wondering if there exists or if someone has implemented something like
> the following as a function query:
>
> overlap(query,field) = number of matching terms in field/number of terms in
> field
>
> e.g. with three docs having these tokens (e.g. A B C) in a field D
> 1:A B B
> 2:A B
> 3:A
>
> The overlap would be for these queries (-- highlights possibly highest
> scoring doc):
>
> Q:A
> 1:1/3
> 2:1/2
> 3:1/1 --
>
> Q:A B
> 1:2/3
> 2:2/2 --
> 3:1/1
>
> Q:A B C
> 1:2/3
> 2:2/2 --
> 3:1/1
>
> The objective to to pick the most likely doc using the overlap to boost the
> score.
>
> Cheers,
> Dan
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: MERGING SPATIAL SEARCH QUERY

2013-01-29 Thread Smiley, David W.
Hi Jaspreet.

Your post is confusing.  You're using spatial, so you say, yet your
question suggests you have yet to use it.  If your documents are
associated with a city, then you should index the lat-lon location of that
city in your documents.  It's denormalized like this.

~ David

On 1/23/13 11:17 PM, "jaspreet_sethi5875" 
wrote:

>Hi,
>
>I have just implemented spatial search on solr4.0 and its working fine.
>In my index i have many fields including "city" and "latlong". I need to
>do
>below 2 tasks:
>1.) Fetch latlong of certain city.
>2.) Use the above fetched latlong to show the results based on certain
>conditions.
>
>Can i do these both tasks in a single solr query? If yes then How
>Any help will be really appreciable.
>
>Thanks
>Jaspreet Sethi 
>
>
>



Re: Problem with migration from solr 3.5 with SOLR-2155 usage to solr 4.0

2013-01-29 Thread Smiley, David W.
The wiki is open to everyone.  If you do edit it, please try to keep it
organized.  

On 1/24/13 9:41 AM, "Viacheslav Davidovich"
 wrote:

>Hi David,
>
>thank you for your answer.
>
>After update to this field type and change the SOLR query I receive
>required behavior.
>
>Also could you update the WIKI page after the words "it needs to be in
>WEB-INF/lib in Solr's war file, basically" also add the maven artifact
>code like this?
>
><dependency>
>  <groupId>com.vividsolutions</groupId>
>  <artifactId>jts</artifactId>
>  <version>1.13</version>
></dependency>
>
>I think this may help for users used maven.
>
>WBR Viacheslav.
>
>On 23.01.2013, at 19:24, Smiley, David W. wrote:
>
>> Viacheslav,
>> 
>> 
>> SOLR-2155 is only compatible with Solr 3.  However the technology it is
>> based on lives on in Lucene/Solr 4 in the
>> "SpatialRecursivePrefixTreeFieldType" field type.  In the example schema
>> it's registered under the name "location_rpt".  For more information on
>> how to use this field type, see: SpatialRecursivePrefixTreeFieldType
>> 
>> ~ David Smiley
>> 
>> On 1/23/13 11:11 AM, "Viacheslav Davidovich"
>>  wrote:
>> 
>>> Hi, 
>>> 
>>> With Solr 3.5 I use SOLR-2155 plugin to filter the documents by
>>>distance
>>> as described in
>>> http://wiki.apache.org/solr/SpatialSearch#Advanced_Spatial_Search and
>>> this solution perfectly filter the multiValued data defined in
>>>schema.xml
>>> like
>>> 
>>> >> length="12" />
>>> 
>>> >> multiValued="true"/>
>>> 
>>> the query looks like this with Solr 3.5:  q=*:*&fq={!geofilt}&sfield=
>>> location_data&pt=45.15,-93.85&d=50&sort=geodist() asc
>>> 
>>> As SOLR-2155 plugin not compatible with solr 4.0 I try to change the
>>> field definition to next:
>>> 
>>> <fieldType name="location" class="solr.LatLonType"
>>> subFieldSuffix="_coordinate" />
>>> 
>>> <field name="location_data" type="location" indexed="true" stored="true"
>>> multiValued="true"/>
>>> 
>>> <dynamicField name="*_coordinate" type="tdouble" indexed="true"
>>> stored="false" />
>>> 
>>> But in this case after geofilt by location_data execution the correct
>>> values returns only if the field have 1 value, if more them 1 value
>>> stored in index required documents returns only when all the location
>>> points are matched.
>>> 
>>> Have anybody experience or any ideas how to receive the same behavior
>>>in
>>> solr4.0 as this was in solr3.5 with SOLR-2155 plugin usage?
>>> 
>>> Is this possible at all or I need to refactor the document structure
>>>and
>>> field definition to store only 1 location value per document?
>>> 
>>> WBR Viacheslav.
>>> 
>> 
>> 
>



Re: Solr Data Config Queries per Field

2013-01-29 Thread Gora Mohanty
On 29 January 2013 23:34, O. Olson  wrote:
[...]
> Thank you. Good call Gora, I forgot to mention about the query. I am trying
> to query something like the following in the URL for the Example:
> http://localhost:8983/solr/db/select
>
> ?q=&facet=true&facet.field=Category1
>
> I expect the above query to give me the counts for the products that satisfy
> the  in Category1. For example given my  I get: Hardware (21),
> Software (3), Office Supplies (10). These are Category1 values.  Lets then
> say a user selects Hardware. I think I would do something like:
>
>
> ?q=&facet=true&fq=Category1:Hardware&facet.field=Category2
>
> I assume this would be give me the list of Category 2 values e.g. Printers
> (7), Fax Machines (11), LCD Monitors (3) (7 + 11 + 3 = 21).
>
> You suggest I create separate entities for each Category Level. Would this
> affect my schema? i.e. would the above queries work??

Yes, things should function as you describe, and no you should not
need any change in your schema from changing the DIH configuration
file. Please take a look at
http://wiki.apache.org/solr/SolrFacetingOverview#Facet_Indexing for
how best to define faceting fields. Also, see this tutorial on faceted
search with Solr: http://searchhub.org/2009/09/02/faceted-search-with-solr/

Regards,
Gora


Re: Solr Data Config Queries per Field

2013-01-29 Thread O. Olson
Gora Mohanty-3 wrote
> On 29 January 2013 22:42, O. Olson <

> olson_ord@

> > wrote:
> [...]
>> SQL Database Schema:
>>
>> Table: Prod_Table
>> Column 1: SKU  <- ID/Primary Key
>> Column 2: Title
>>
>> Table: Cat_Table
>> Column 1: SKU <- Foreign Key
>> Column 2: CategoryLevel
>> Column 3: CategoryName
>>
>> Where CategoryLevel is 1, I would like to save the value to Category1
>> field,
>> where CategoryLevel is 2, I would like to save this to the Category2
>> field
>> etc.
> [...]
> 
> It is not very clear from your description, nor from your example,
> what you want saved to the Category1, Category2,... fields, and
> how you expect your user searches to function. You seem to imply
> that the categories are hierarchical, but there is no relationship in
> the database to define this hierarchy.
> 
> For a given product SKU, do you want the multi-valued Category1
> field to contain all CategoryName values from Cat_Table that have
> CategoryLevel = 1 and SKU matching the product SKU, and so on
> for the other categories? If so, this should do it:
> 
>  
> 
>  
> 
>  
> 
>  
> 
>  
> 
>  
> 
>  
> 
>  
> 
>  
> 
>  
> 
>  
> 
> 
> Regards,
> Gora

Thank you. Good call Gora, I forgot to mention about the query. I am trying
to query something like the following in the URL for the Example:
http://localhost:8983/solr/db/select

?q=&facet=true&facet.field=Category1

I expect the above query to give me the counts for the products that satisfy
the  in Category1. For example given my  I get: Hardware (21),
Software (3), Office Supplies (10). These are Category1 values.  Lets then
say a user selects Hardware. I think I would do something like: 


?q=&facet=true&fq=Category1:Hardware&facet.field=Category2

I assume this would be give me the list of Category 2 values e.g. Printers
(7), Fax Machines (11), LCD Monitors (3) (7 + 11 + 3 = 21). 

You suggest I create separate entities for each Category Level. Would this
affect my schema? i.e. would the above queries work??

Thanks again Gora,
O. O.









Re: Solr Data Config Queries per Field

2013-01-29 Thread Gora Mohanty
On 29 January 2013 22:42, O. Olson  wrote:
[...]
> SQL Database Schema:
>
> Table: Prod_Table
> Column 1: SKU  <- ID/Primary Key
> Column 2: Title
>
> Table: Cat_Table
> Column 1: SKU <- Foreign Key
> Column 2: CategoryLevel
> Column 3: CategoryName
>
> Where CategoryLevel is 1, I would like to save the value to Category1 field,
> where CategoryLevel is 2, I would like to save this to the Category2 field
> etc.
[...]

It is not very clear from your description, nor from your example,
what you want saved to the Category1, Category2,... fields, and
how you expect your user searches to function. You seem to imply
that the categories are hierarchical, but there is no relationship in
the database to define this hierarchy.

For a given product SKU, do you want the multi-valued Category1
field to contain all CategoryName values from Cat_Table that have
CategoryLevel = 1 and SKU matching the product SKU, and so on
for the other categories? If so, this should do it:

<entity name="Product" query="SELECT SKU, Title FROM Prod_Table">

  <entity name="Cat1"
    query="SELECT CategoryName AS Category1 FROM Cat_Table
           WHERE SKU='${Product.SKU}' AND CategoryLevel=1"/>

  <entity name="Cat2"
    query="SELECT CategoryName AS Category2 FROM Cat_Table
           WHERE SKU='${Product.SKU}' AND CategoryLevel=2"/>

  <entity name="Cat3"
    query="SELECT CategoryName AS Category3 FROM Cat_Table
           WHERE SKU='${Product.SKU}' AND CategoryLevel=3"/>

</entity>

Regards,
Gora


Solr Data Config Queries per Field

2013-01-29 Thread O. Olson
Hi,

I am new to Solr, and I am using the DataImportHandler to Query a SQL
Server and populate Solr. I specify the SQL Query in the db-data-config.xml
file. Each SQL Query seems to be associated with an entity. Is it possible
to have a query per field? I think it would be easier to explain this using
an example: 

I have products that are classified in a hierarchy of Categories. A single
product can be in multiple Categories. I want to provide the user the
ability to drill down i.e. first select the top level category Category1,
next select the next level category Category2 etc. Since a single product
can be in multiple Categories, all of these i.e. Category1, Category2,
Category3 etc. are multi-valued.


SQL Database Schema:

Table: Prod_Table
Column 1: SKU  <- ID/Primary Key
Column 2: Title 

Table: Cat_Table
Column 1: SKU <- Foreign Key
Column 2: CategoryLevel
Column 3: CategoryName

Where CategoryLevel is 1, I would like to save the value to Category1 field,
where CategoryLevel is 2, I would like to save this to the Category2 field
etc. My db-data-config.xml looks like:









 Query: "SELECT CategoryName from CAT_TABLE where SKU='${Product.SKU}' AND CategoryLevel=2"

 Query: "SELECT CategoryName from CAT_TABLE where SKU='${Product.SKU}' AND CategoryLevel=3"






How do I populate Category2 and Category3??

Thank you for all your help.
O. O.






RE: thanks for solr 4.1

2013-01-29 Thread Pires, Guilherme
Subscribed! Just integrating solr 4.1 in a corporate GIS architecture as we 
speak.
Thanks!

Guilherme Pires 

-Original Message-
From: Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de] 
Sent: Tuesday, 29 January 2013 15:34
To: solr-user@lucene.apache.org
Subject: thanks for solr 4.1

Now this must be said, thanks for solr 4.1 (and lucene 4.1)!

Great improvements compared to 4.0.

After building the first 4.1 index I thought the index was broken, but had no 
error messages anywhere.
Why did I think it was damaged?
The index size went down from 167 GB (solr 4.0) to 115 GB (solr 4.1)!!!

Will now move the new 4.1 index to testing stage and after it passes all 
testing it goes online.
Can't wait to see the new stats.

Regards,
Bernd




thanks for solr 4.1

2013-01-29 Thread Bernd Fehling
Now this must be said, thanks for solr 4.1 (and lucene 4.1)!

Great improvements compared to 4.0.

After building the first 4.1 index I thought the index was broken, but had no 
error messages anywhere.
Why did I think it was damaged?
The index size went down from 167 GB (solr 4.0) to 115 GB (solr 4.1)!!!

Will now move the new 4.1 index to testing stage and after it passes all 
testing it goes online.
Can't wait to see the new stats.

Regards,
Bernd


Re: edismax, qf, multiterm analyzer bug?

2013-01-29 Thread Jack Krupansky

Looks like a bug to me.

Actually, when I try it with the Solr 4.0 example, I get:

O*t*v*h
O*t*v*h
(+DisjunctionMaxQuery((sku:otvh)))/no_coord
+(sku:otvh)

For:
curl 
"http://localhost:8983/solr/collection1/select?q=O*t*v*h&wt=xml&debugQuery=on&defType=edismax&qf=sku%20doesNotExit&indent=true";


-- Jack Krupansky

-Original Message- 
From: Ahmet Arslan

Sent: Monday, January 28, 2013 12:43 PM
To: solr-user@lucene.apache.org
Subject: edismax, qf, multiterm analyzer bug?

Hello,

If I fire a wildcard query o*t*v*h using edismax and add a non-existent 
field to the qf parameter, I get this phrase query at the end.


http://localhost:8983/solr/collection1/select?q=O*t*v*h&wt=xml&debugQuery=on&defType=edismax&qf=sku%20doesNotExit

parsedquery = (+DisjunctionMaxQuery((sku:"o t v h")))/no_coord
parsedquery_toString = +(sku:"o t v h")

Existing field(s) works as expected :

http://localhost:8983/solr/collection1/select?q=O*t*v*h&wt=xml&debugQuery=on&defType=edismax&qf=sku

yields

(+DisjunctionMaxQuery((sku:o*t*v*h)))/no_coord
+(sku:o*t*v*h)


Why am I including a field that does not exist?

Actually this is a distributed search.

Core A: field1 field2
Core B: field2 field3

To query all fields I use qf=field1 field2 
field3&shards=coreA,coreB&defType=edismax  (is there a better way?)


As a workaround I enabled the following dynamic field:

  

I will open an issue, but I would like to get your opinions first.

Ahmet 



Re: How to disable compression on stored fields in Solr 4.1?

2013-01-29 Thread Shawn Heisey

On 1/29/2013 7:40 AM, Shawn Heisey wrote:

I don't think there's a way to turn off the stored field compression in
the 4.1 index format, but I think there is something else you can do
right now - switch to the 4.0 index format.  To do this, you need a
postingsFormat value of "Lucene40" on some or all of your stored fields
in schema.xml and an option in solrconfig.xml.


It seems that I was wrong about this.  Apparently postingsFormat does 
not affect the stored fields format.  Apologies for offering incorrect 
information!  See the comments on SOLR-4375 for more details.


Thanks,
Shawn




Re: Issue with mutiple records in full text search

2013-01-29 Thread Jack Krupansky
I don't know if there is a highlighter option to highlight all hits in a 
document, as opposed to just the snippet for the first. If not, you could 
write your own highlighter search component to do that. But it may be 
possible to use pieces of code that are already there.
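
One existing knob worth checking first: the standard highlighter's hl.snippets
parameter raises the number of snippets returned per field per document (the
default is 1), e.g. hl=true&hl.fl=ContentSearch&hl.snippets=10.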


-- Jack Krupansky

-Original Message- 
From: Soumyanayan Kar

Sent: Tuesday, January 29, 2013 9:38 AM
To: solr-user@lucene.apache.org
Subject: RE: Issue with mutiple records in full text search

Thanks Jack for the explanation.

But let's say my requirement needs me to return all occurrences of the
search term, along with the text snippet around them, for each document under
the search scope; how do we go about achieving that with Solr?

Thanks & Regards,

Soumya.



-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: 29 January 2013 08:00 PM
To: solr-user@lucene.apache.org
Subject: Re: Issue with mutiple records in full text search

The number of "hits" of a term in a Solr document impacts the score, but
still only counts as one "hit" in the numFound count. Solr doesn't track
"hits" for individual term occurrences, except that you could check the
"term frequency" of a specific term in a specific document if you wanted,
using a function query - tf(field,term) - which can also be included in the
&fl field list.

To be clear - Solr has no concept of "records", just documents and fields.

-- Jack Krupansky

-Original Message-
From: Soumyanayan Kar
Sent: Tuesday, January 29, 2013 9:01 AM
To: solr-user@lucene.apache.org
Subject: Issue with mutiple records in full text search

Hi,



We are trying to use solr for a text based search solution in a web
application. The documents that are getting indexed are essentially text
based files like *.txt, *.pdf, etc. We are using the Tika extraction plugin
to extract the text content from the files and storing it using a
"text_general" type field in the solr schema file.  Relevant part of the
schema file:





   

   

   

   

   

   

   

   

   

   

   

   

   

   





   



MediaId





We are using a .net based solution and using the solrnet client to
communicate with Solr.



The content field is supposed to store the text content of the file and the
ContentSearch field will be used for executing the search.

While the documents are getting indexed properly, when executing a search we
get only the first occurrence of the search term returned for each
document.

For example, if we have a.txt and b.pdf which are indexed, and the search
term "case" exists in both the documents multiple times(a.txt - 7 hits,
b.pdf - 10 hits), when executing a search for "case" against both the
documents, we are getting two records returned which are the first
occurrences of the search term in the respective docs, while this should
return 17 hits.



Used Luke to test the index records but cannot find anything apparently
wrong.

Is this something to do with the type(text_general) of the search field or
the way we are loading the entire content of the file into one index
document?



Soumya.





Thanks & Regards,



Soumya.







Re: How to disable compression on stored fields in Solr 4.1?

2013-01-29 Thread Shawn Heisey

On 1/29/2013 4:57 AM, Artyom wrote:

I guess, I have to "write a new codec that uses a stored fields format which
does not compress stored fields such as Lucene40StoredFieldsFormat"
http://blog.jpountz.net/post/35667727458/stored-fields-compression-in-lucene-4-1

What is the purpose of the <luceneMatchVersion> tag then? Does it affect
anything at all, or are LUCENE_40 and LUCENE_41 just the same in Solr 4.1?


The luceneMatchVersion causes some analysis components to behave like 
they did in older versions.  Usually you'd want to use this when you 
rely on older behavior that was not working correctly and got fixed in 
the newer version that you are now running.  The index format is unaffected.


I don't think there's a way to turn off the stored field compression in 
the 4.1 index format, but I think there is something else you can do 
right now - switch to the 4.0 index format.  To do this, you need a 
postingsFormat value of "Lucene40" on some or all of your stored fields 
in schema.xml and an option in solrconfig.xml.


<codecFactory class="solr.SchemaCodecFactory"/>

http://wiki.apache.org/solr/SchemaXml#Data_Types

Because I could not find an existing issue in Jira to add a compression 
config knob, I made one:


https://issues.apache.org/jira/browse/SOLR-4375

Thanks,
Shawn



RE: Issue with mutiple records in full text search

2013-01-29 Thread Soumyanayan Kar
Thanks Jack for the explanation.

But let's say my requirement needs me to return all occurrences of the
search term, along with the text snippet around them, for each document under
the search scope; how do we go about achieving that with Solr?

Thanks & Regards,

Soumya.



-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: 29 January 2013 08:00 PM
To: solr-user@lucene.apache.org
Subject: Re: Issue with multiple records in full text search

The number of "hits" of a term in a Solr document impacts the score, but
still only counts as one "hit" in the numFound count. Solr doesn't track
"hits" for individual term occurrences, except that you could check the
"term frequency" of a specific term in a specific document if you wanted,
using a function query - tf(field,term) - which can also be included in the
&fl field list.

To be clear - Solr has no concept of "records", just documents and fields.

-- Jack Krupansky

-Original Message-
From: Soumyanayan Kar
Sent: Tuesday, January 29, 2013 9:01 AM
To: solr-user@lucene.apache.org
Subject: Issue with multiple records in full text search

Hi,



We are trying to use Solr for a text-based search solution in a web
application. The documents being indexed are essentially text-based files
like *.txt, *.pdf, etc. We are using the Tika extraction plugin to extract
the text content from the files and store it in a "text_general" type field
in the Solr schema. Relevant part of the schema file:

[schema.xml field definitions stripped by the mail archive; only the
uniqueKey value, MediaId, survives]

We are using a .NET-based solution with the SolrNet client to communicate
with Solr.



The content field is supposed to store the text content of the file and the
ContentSearch field will be used for executing the search.

While the documents are getting indexed properly, when executing a search we
get only the first occurrence of the search term returned for each document.

For example, if a.txt and b.pdf are indexed and the search term "case" exists
in both documents multiple times (a.txt - 7 hits, b.pdf - 10 hits), then when
executing a search for "case" against both documents we get two records
returned, which are the first occurrences of the search term in the
respective docs, whereas this should return 17 hits.



We used Luke to inspect the index but could not find anything obviously
wrong.

Is this something to do with the type (text_general) of the search field, or
with the way we are loading the entire content of the file into one index
document?



Soumya.





Thanks & Regards,



Soumya.








Re: Issue with multiple records in full text search

2013-01-29 Thread Jack Krupansky
The number of "hits" of a term in a Solr document impacts the score, but 
still only counts as one "hit" in the numFound count. Solr doesn't track 
"hits" for individual term occurrences, except that you could check the 
"term frequency" of a specific term in a specific document if you wanted, 
using a function query - tf(field,term) - which can also be included in the 
&fl field list.
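
For example, a request along these lines (the core URL, field, and term are
illustrative, not from the original message) would return the per-document
term frequency as a pseudo-field:

  http://localhost:8983/solr/select?q=ContentSearch:case&fl=MediaId,score,tf(ContentSearch,'case')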


To be clear - Solr has no concept of "records", just documents and fields.

-- Jack Krupansky

-Original Message- 
From: Soumyanayan Kar

Sent: Tuesday, January 29, 2013 9:01 AM
To: solr-user@lucene.apache.org
Subject: Issue with multiple records in full text search

Hi,



We are trying to use Solr for a text-based search solution in a web
application. The documents being indexed are essentially text-based files
like *.txt, *.pdf, etc. We are using the Tika extraction plugin to extract
the text content from the files and store it in a "text_general" type field
in the Solr schema. Relevant part of the schema file:

[schema.xml field definitions stripped by the mail archive; only the
uniqueKey value, MediaId, survives]

We are using a .NET-based solution with the SolrNet client to communicate
with Solr.



The content field is supposed to store the text content of the file and the
ContentSearch field will be used for executing the search.

While the documents are getting indexed properly, when executing a search we
get only the first occurrence of the search term returned for each document.

For example, if a.txt and b.pdf are indexed and the search term "case" exists
in both documents multiple times (a.txt - 7 hits, b.pdf - 10 hits), then when
executing a search for "case" against both documents we get two records
returned, which are the first occurrences of the search term in the
respective docs, whereas this should return 17 hits.



We used Luke to inspect the index but could not find anything obviously
wrong.

Is this something to do with the type (text_general) of the search field, or
with the way we are loading the entire content of the file into one index
document?



Soumya.





Thanks & Regards,



Soumya.







Issue with multiple records in full text search

2013-01-29 Thread Soumyanayan Kar
Hi,

 

We are trying to use Solr for a text-based search solution in a web
application. The documents being indexed are essentially text-based files
like *.txt, *.pdf, etc. We are using the Tika extraction plugin to extract
the text content from the files and store it in a "text_general" type field
in the Solr schema. Relevant part of the schema file:

[schema.xml field definitions stripped by the mail archive; only the
uniqueKey value, MediaId, survives]

We are using a .NET-based solution with the SolrNet client to communicate
with Solr.

 

The content field is supposed to store the text content of the file and the
ContentSearch field will be used for executing the search.

While the documents are getting indexed properly, when executing a search we
get only the first occurrence of the search term returned for each document.

For example, if a.txt and b.pdf are indexed and the search term "case" exists
in both documents multiple times (a.txt - 7 hits, b.pdf - 10 hits), then when
executing a search for "case" against both documents we get two records
returned, which are the first occurrences of the search term in the
respective docs, whereas this should return 17 hits.

 

We used Luke to inspect the index but could not find anything obviously
wrong.

Is this something to do with the type (text_general) of the search field, or
with the way we are loading the entire content of the file into one index
document?

 

Soumya.

 

 

Thanks & Regards,

 

Soumya.

 

 



Re: web app :Returning document ID From Solr search

2013-01-29 Thread Michael Della Bitta
If your metadata requirements aren't too heavy, you could store all
the title, author, etc. info in Solr along with the index of the full
text of the document. Then when you submitted a query to Solr, you
could retrieve back the list of information you'd need to display a
page of search results, and use the start and rows params of a Solr
query to page through them.

If you're working in JSP and Servlets, you should definitely be using
the SolrJ library to talk to Solr.
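
For illustration, a minimal SolrJ 4.x sketch of that paged query (the core
URL, query, field names, and page size are assumptions, not from the
original message):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;

  public class PagedSearch {
      public static void main(String[] args) throws Exception {
          SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
          SolrQuery q = new SolrQuery("fulltext:lucene");
          q.setFields("id", "title", "author"); // metadata stored alongside the full text
          q.setStart(0);                        // offset of the first result on this page
          q.setRows(10);                        // page size
          QueryResponse rsp = solr.query(q);
          for (SolrDocument doc : rsp.getResults()) {
              System.out.println(doc.getFieldValue("id") + " | " + doc.getFieldValue("title"));
          }
      }
  }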

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Mon, Jan 28, 2013 at 10:10 PM, Luis Pedro  wrote:
> Hi there! I should first mention that I'm new to Solr and Lucene.
>
> I am trying to build a web application that interacts with Solr. I am
> already able to index PDF and Word files using Solr and search them through
> the Solr interface, which then gets me the XML with the file details. I have
> a pretty good understanding of schema.xml and solrconfig.xml. My problem is
> that so far I haven't figured out how to bring Solr into my webapp, so to
> speak.
>
> I'm using JSP and servlets mainly. I figured that in my app I would have the
> user enter a query string, and once they hit enter, Solr would return the id
> of the document, which would then be grabbed and looked up in a db so I would
> have the author | title | etc. (meaning I would be able to customize the
> search results to my liking according to details I defined in the db for each
> file id).
>
> At first I thought the DataImportHandler (DIH) would be the way to go (a bit
> silly, I know), but then I realised that the DIH was meant to read the data
> in the table, which isn't exactly what I need. So if anyone sees my Waldo,
> please care to point it out.
>
> I'm open to suggestions on a better approach to look into things!
>
> Much gratitude in advance!


Re: why search time increases without term vectors?

2013-01-29 Thread Artyom
Yes, I guess full index replication is a general bug in 4.x. I tried the
same routine with termVectors and got the same result:

1. stopped all Solr instances
2. cleared data folders of all instances
3. ran master, made full-import with optimize option using DIH
4. after import ran slave and did replication
5. ran solrmeter to send updates to the master and gets from the slave

the full index was replicated every 20 seconds (not only modified segments).

To fix this:

1) I stopped solrmeter
2) stopped the slave and cleared its data directory
3) clicked Optimize in the control panel of the master
4) ran slave and did replication
5) ran solrmeter to send updates to the master and gets from the slave

But this replication bug was not the cause of the increased response time.
There is still a strong correlation between the absence of termVectors and
increased response time. I have no idea why...





Re: indexVersion returns multiple results when called

2013-01-29 Thread davidq
Hi,

We thought we'd sorted this out but it's come back again.

We're on 4.0 GA, which I forgot to mention before. We disabled polling on the
slaves and have a PHP script that gets the current version number of the
master, fires a reindex (optimize, clean), and then loops on a sleep(120)
call, checking the version number each time through. If it is the same as the
original version, indexing is incomplete, so it sleeps again. If it is
different then, in theory, indexing is complete, so it does a fetch.
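
Roughly, the loop described above as a minimal Java sketch (the host names,
the wt parameter, and the raw string comparison are assumptions; indexversion
and fetchindex are real ReplicationHandler commands):

  import java.io.InputStream;
  import java.net.URL;
  import java.util.Scanner;

  public class WaitForNewIndex {
      // Fetch a replication-handler URL and return the raw response body.
      static String get(String url) throws Exception {
          try (InputStream in = new URL(url).openStream();
               Scanner s = new Scanner(in).useDelimiter("\\A")) {
              return s.hasNext() ? s.next() : "";
          }
      }

      public static void main(String[] args) throws Exception {
          String master = "http://master:8983/solr/replication?command=indexversion&wt=json";
          String before = get(master);
          // ... trigger the reindex (optimize, clean) on the master here ...
          String current = before;
          while (current.equals(before)) {  // unchanged version => still indexing
              Thread.sleep(120 * 1000);
              current = get(master);
          }
          // Version changed, so tell the slave to pull the new index.
          get("http://slave:8983/solr/replication?command=fetchindex");
      }
  }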

This worked for 3 days but today obviously didn't and we ended up with a
corrupted tiny index on the slave.

Anyone have any clues why the version number is changing?

Regards,

DQ





Re: How to disable compression on stored fields in Solr 4.1?

2013-01-29 Thread Artyom
I guess, I have to "write a new codec that uses a stored fields format which
does not compress stored fields such as Lucene40StoredFieldsFormat"
http://blog.jpountz.net/post/35667727458/stored-fields-compression-in-lucene-4-1

What is the purpose of the <luceneMatchVersion> tag then? Does it affect
anything at all, or are LUCENE_40 and LUCENE_41 just the same in Solr 4.1?





Re: why search time increases without term vectors?

2013-01-29 Thread Upayavira
No, not at all. Presence or not of term vectors won't impact replication
in that way.

For SolrCloud, it is up to each node to create term vectors when it
receives a document for indexing. Using 3.x style replication, the slave
will pull all changed files making up changed segments on replication,
and this will include term vector files.

Upayavira

On Tue, Jan 29, 2013, at 06:21 AM, Artyom wrote:
> I guess response time increased because I use a master-slave configuration in
> Solr 4.0 and Solr 4.1: if there are no termVectors, the full index is
> replicated; if there are termVectors, only modified segments of the index
> are transferred from the master to slaves. Am I right?
> 
> 
> 


How to disable compression on stored fields in Solr 4.1?

2013-01-29 Thread Artyom
I tried Solr 4.1, reindexed data using DIH (full-import) and compared
response time with version 4.0. Response time increased 1.5-2 times. How to
disable compression on stored fields in Solr 4.1? I tried to change codec
version in solrconfig:

  <luceneMatchVersion>LUCENE_40</luceneMatchVersion>

and reindexed data using DIH (full-import) but it doesn't help, the index is
still compressed and response time is still high...





overlap function query

2013-01-29 Thread Daniel Rosher
Hi,

I'm wondering if there exists, or if someone has implemented, something like
the following as a function query:

overlap(query,field) = number of matching terms in field/number of terms in
field

e.g. with three docs having these tokens (e.g. A, B, C) in a field D:
1:A B B
2:A B
3:A

The overlap for these queries would be as follows (-- marks the likely
highest-scoring doc):

Q:A
1:1/3
2:1/2
3:1/1 --

Q:A B
1:2/3
2:2/2 --
3:1/1

Q:A B C
1:2/3
2:2/2 --
3:1/1

The objective is to pick the most likely doc, using the overlap to boost the
score.
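
A minimal Java sketch of the proposed measure, computed outside Solr (wiring
it in would mean a custom ValueSource, which is an assumption, not an
existing function):

  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Set;

  public class Overlap {
      // Number of distinct query terms present in the field, divided by the
      // total number of tokens in the field (matching the examples above).
      static double overlap(Set<String> queryTerms, List<String> fieldTokens) {
          if (fieldTokens.isEmpty()) return 0.0;
          Set<String> fieldTerms = new HashSet<>(fieldTokens);
          int matching = 0;
          for (String t : queryTerms) {
              if (fieldTerms.contains(t)) matching++;
          }
          return (double) matching / fieldTokens.size();
      }

      public static void main(String[] args) {
          Set<String> q = new HashSet<>(Arrays.asList("A", "B"));
          System.out.println(overlap(q, Arrays.asList("A", "B", "B"))); // 2/3
          System.out.println(overlap(q, Arrays.asList("A", "B")));      // 2/2
          System.out.println(overlap(q, Arrays.asList("A")));           // 1/1
      }
  }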

Cheers,
Dan


Re: indexing Text file in solr

2013-01-29 Thread Edward Garrett
I don't have experience with this, but it looks like you could use, from DIH:

http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor
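
For example, a minimal data-config.xml sketch (the file path and target field
name are invented; LineEntityProcessor exposes each line in a field named
rawLine, and you would still need to supply a uniqueKey for each document,
e.g. via a transformer):

  <dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8"/>
    <document>
      <entity name="tweet"
              processor="LineEntityProcessor"
              url="/data/tweets.txt">
        <!-- map the raw line of text onto a schema field -->
        <field column="rawLine" name="tweet_text"/>
      </entity>
    </document>
  </dataConfig>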


On Sun, Jan 27, 2013 at 10:23 AM, hadyelsahar  wrote:
> I have a large Arabic text file that contains tweets, one tweet per line,
> which I want to index in Solr such that each line of the file is indexed as
> a separate Solr document.
>
> What I have tried so far:
>
> I know how to index SQL database records in Solr.
> I know how to change the Solr schema to fit the data and how to work with
> the Data Import Handler.
> I know the queries used to index data in Solr.
>
> What I want:
>
> to know how to index a text file in Solr such that each line is considered
> a separate Solr document.
>
>
>



-- 
edge


Re: [ANNOUNCE] Web Crawler

2013-01-29 Thread SivaKarthik
Hi,
 I resolved the issue "Access denied for user 'crawler'@'localhost' (using
password: YES)".

 The MySQL user crawler/crawler was created and privileges were added as
described in the tutorial.
 Thank you.





Re: [ANNOUNCE] Web Crawler

2013-01-29 Thread SivaKarthik
Klein,
 Thank you for your reply.

 I hosted the application on an Apache2 server
 and am able to access the link http://localhost/search/

 But while accessing http://localhost/crawler/login.php
 it shows the error message
 "Access denied for user 'crawler'@'localhost' (using password: YES)"

 I tried to access
   http://localhost/crawler/log.php
   http://localhost/crawler/display.php
 but all throw the same error message:
 "Access denied for user 'crawler'@'localhost' (using password: YES)"

 For testing purposes,
 I created test1.html and test2.php under the /opt/crawler/web/crawler/pub
 folder and succeeded in accessing them:
  http://localhost/crawler/test2.php
  http://localhost/crawler/test1.html

 I'm not completely sure why access is denied for the login.php page.
 Any idea?

Regards
 

 





Re: Issue with spellcheck and autosuggest

2013-01-29 Thread Artyom
You should check not the suggestions but the collations in the response XML.


