cores vs indices
Can someone provide me with a succinct definition of what a Solr core is? Is there a one-to-one relationship of cores to Solr indices, or can you have multiple indices per core? Cheers, Daniel
Re: cores vs indices
Hi Daniel, Yes, there is a one-to-one relationship between Solr indices and cores. The one-to-many relationship comes when you look at the relationship between cores and Tomcat/Jetty webapp instances. This gives you the ability to clone, add and swap cores around. See the following for core manipulation functions: http://wiki.apache.org/solr/CoreAdmin Regards, Dave
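For reference, cloning and swapping cores is done through CoreAdmin HTTP calls; a hypothetical SWAP between two cores (host, port and core names are placeholders) might look like:

```
http://localhost:8983/solr/admin/cores?action=SWAP&core=core0&other=core1
```

After the swap, requests that previously hit core0 are served by the index that was core1, and vice versa.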
Can Master push data to slave
Hi, I am using Solr 1.4 and doing a replication process where my slave pulls data from the master. I have 2 questions: a. Can the master push data to the slave? b. How can I make sure that a lock file is not created during replication? Please help. Thanks, Pawan
string cut-off filter?
Hi list, is there a string cut-off filter to limit the length of a KeywordTokenized string? So the string should not be dropped, only limited to a certain length. Regards Bernd
Scoring using POJO/SolrJ
Hi, I am using the SolrJ client library with a POJO using the @Field annotation to index documents and to retrieve documents from the index. I retrieve the documents from the index like so: List<Item> beans = response.getBeans(Item.class). Now in order to add the scores to the beans, I added a field called score with the @Field annotation, and the scores were then returned when I read from the index. But now when I am indexing, I get the error: ERROR:unknown field 'score'. I guess this is because it expects score to be defined in my schema. Now I am thinking that if I define this field in my schema, then rather than returning the document scores it might just go ahead and return actual values for the field (null if I don't add a value). How can I get around this problem? Many thanks.
how to enable MMapDirectory in solr 1.4?
hi all, I read the Apache Solr 3.1 release notes today and found that MMapDirectory is now the default implementation on 64-bit systems. I am currently using Solr 1.4 with a 64-bit JVM on Linux. How can I use MMapDirectory? Will it improve performance?
Multiplexing TokenFilter for multi-language?
Sorry if this has already been discussed, but I have already spent a couple of days googling in vain. The problem: - documents in multiple languages (us, de, fr, es). - language is known (a team of editors determines the language manually, and users are asked to specify a language option for searching). My intended approach: - one index. - a multiplexing token filter, a MultilingualSnowballFilterFactory that instantiates a Snowball stemmer for the appropriate language. - language is a facet, to get rid of cross-language ambiguities with multiple languages mixed in the same field. The problem is how to communicate the language to the MultilingualSnowballFilterFactory. Once the language is known, instantiating the Snowball stemmer for the right language is easy. I have a working version attached below. My solution: - append the language as the first token for the FilterFactory to pick up, e.g. "es This is a Spanish document." - this would mean I need to duplicate the fields: an original version for storing, and a version with the language marker prepended for indexing, e.g. description (indexed=false, stored=true), description_i (indexed=true, stored=false). Is there a better way? Many thanks in advance. Yee http://lucene.472066.n3.nabble.com/file/n3235341/MultilingualSnowballFilterFactory.java MultilingualSnowballFilterFactory.java -- View this message in context: http://lucene.472066.n3.nabble.com/Multiplexing-TokenFilter-for-multi-language-tp3235341p3235341.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: how to enable MMapDirectory in solr 1.4?
If you want to try MMapDirectory with Solr 1.4, then copy the class org.apache.solr.core.MMapDirectoryFactory from 3.x or trunk, and either add it to the .war file (you can just add it under src/java and re-package the war), or put it in its own .jar file in the lib directory under solr_home. Then, in solrconfig.xml, add this entry under the root config element: <directoryFactory class="org.apache.solr.core.MMapDirectoryFactory"/> I'm not sure if MMapDirectory will perform better for you on Linux than NIOFSDir. I'm pretty sure in trunk/4.0 it's the default for Windows and maybe Solaris. On Windows, there is a definite advantage to using MMapDirectory on a 64-bit system. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311
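A sketch of the resulting solrconfig.xml fragment, with the entry placed directly under the root element as advised above (the surrounding elements are illustrative):

```xml
<config>
  <!-- ... other configuration ... -->
  <!-- use the memory-mapped directory implementation (class copied from 3.x/trunk) -->
  <directoryFactory class="org.apache.solr.core.MMapDirectoryFactory"/>
</config>
```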
PositionIncrement gap and multi-valued fields.
Hello! I have a doubt about the behaviour of searching over field types that have positionIncrementGap defined. For example, suppose that: 1. We have a field called test defined as multi-valued and whitespace-tokenized. 2. The index has a single document with test values: <str>TEST1</str> <str>AAA BBB</str> <str>CCC DDD</str> <str>EEE FFF</str> <str>TEST2</str> I read that positionIncrementGap defines the virtual space between the last token of one field instance and the first token of the next instance (source: http://lucene.472066.n3.nabble.com/positionIncrementGap-in-schema-xml-td488338.html). When it says "last token of one field instance", does that mean the last token of the first entry of the multi-valued content? In our example above that would be TEST1. Anyway, I've been doing some tests modifying the positionIncrementGap value with high values and low values. Can anybody explain to me in detail what implications a higher or lower value has on the Solr scoring algorithm? I would like to understand how this value affects matching results in fields and also the final score calculation (maybe more gap implies more spaces and a worse score when the value matches, etc.). Thank you for reading this far!
Re: Weighted facet strings
One kind of hacky way to accomplish some of those tasks involves creating a lot more Solr fields. (This kind of 'de-normalization' is often the answer to how to make Solr do something.) Facet fields are ordinarily not tokenized or normalized at all, but that doesn't work very well for matching query terms. So if you want actual queries to match on these categories, you probably want an additional field that is tokenized/analyzed. If you want to boost different category assignments differently, you probably want _multiple_ additional tokenized/analyzed fields. So for instance, create separate analyzed fields for each category 'weight', perhaps using the default 'text' analysis type: category_text_weight_1, category_text_weight_2, etc. Then use dismax to query, include all those category_text_* fields in the 'qf', and boost the higher weight ones more than the lower weight ones. That will handle a number of your use cases, but not all of them. Your first two cases are the most problematic: "filter: category=some_category_name, query: *:* - Results should be scored by the above mentioned weight" So Solr doesn't really work like that. Normally a filter does not affect the scoring of the actual results _at all_. But if you change the query to: fq=category:some_category q=some_category defType=dismax qf=category_text_weight_1, category_text_weight_2^10, category_text_weight_3^20 THEN, with the multiple analyzed category_text_weight_* fields as described above, I think it should do what you want. You may have to play with exactly what boost to give to each field. But your second use case is still tricky. Solr doesn't really do exactly what you ask, but by using this method I think you can figure out hacky ways to accomplish it. I'm not sure if it will solve all of your use cases, but maybe this will give you a start to figuring it out. On 8/5/2011 6:55 AM, Michael Lorz wrote: Hi all, I have documents which are (manually) tagged with categories. 
Each category-document relation has a weight between 1 and 5: 5: document fits perfectly in this category, ... 1: document may be considered as belonging to this category. I would now like to use this information with Solr. At the moment, I don't use the weight at all: <field name="category" type="string" indexed="true" stored="true" multiValued="true"/> Both the category as well as the document body are specified as query fields (<str name="qf"> in solrconfig.xml). What I would like is the following: - filter: category=some_category_name, query: *:* - results should be scored by the above mentioned weight - filter: category=some_category_name, query: some_keyword - results should be scored by a combination of the score of 'some_keyword' and the above mentioned weight - filter: none, query: some_category_name - documents with category 'some_category_name' should be found as well as documents which contain the term 'some_category_name'; results should be scored by a combination of the score of 'some_keyword' and the above mentioned weight. Do you have any ideas how this could be done? Thanks in advance Michi
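Assembled from the advice in the reply above, a hypothetical dismax request (field names and boosts are placeholders to be tuned) might look like:

```
q=some_category
&fq=category:some_category
&defType=dismax
&qf=category_text_weight_1 category_text_weight_2^10 category_text_weight_3^20
```

The fq restricts results to the category; the boosted qf fields let documents with higher-weight category assignments score higher.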
Re: how to enable MMapDirectory in solr 1.4?
We patched our 1.4.1 build with SOLR-1969 (https://issues.apache.org/jira/browse/SOLR-1969, making MMapDirectory configurable) and realized a 64% search performance boost on our Linux hosts.
solr-ruby: Error undefined method `closed?' for nil:NilClass
Hi, I have seen some of these errors come through from time to time. It looks like: /usr/lib/ruby/1.8/net/http.rb:1060:in `request'\n/usr/lib/ruby/1.8/net/http.rb:845:in `post' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:158:in `post' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:151:in `send' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:174:in `create_and_send_query' /usr/lib/ruby/gems/1.8/gems/solr-ruby-0.0.8/lib/solr/connection.rb:92:in `query' It is as if the http object has gone away. Would it be good to create a new one inside of the connection or is something more serious going on? ubuntu 10.04 passenger 3.0.8 rails 2.3.11 -- Regards, Ian Connor
Re: solr-ruby: Error undefined method `closed?' for nil:NilClass
Ian - What does your code that uses solr-ruby look like? Solr::Connection is lightweight, so you could just construct a new one for each request. Are you keeping an instance around? Erik
edismax configuration
Hello all Can someone direct me to a link with config info in order to allow use of the edismax QueryHandler? Mark
is it possible to do a sort without query?
I am trying to list some data based on a function I run, specifically termfreq(post_text,'indie music'), and I am unable to do it without passing data to the q parameter. Is it possible to get a sorted list without searching for any terms?
Test failures on lucene_solr_3_3 and branch_3x
I've got a consistent test failure on Solr source code checked out from svn. The same thing happens with 3.3 and branch_3x. I have information saved from the failures on branch_3x, which I have gotten to fail about a dozen times in a row. It fails on a test called TestSqlEntityProcessorDelta, part of the dataimporthandler tests. It is consistently reproducible, in a shorter timeframe than normal, with the following command line: ant test -Dtestcase=TestSqlEntityProcessorDelta Comprehensive ant output here, from a full test run: http://pastebin.com/eyAt8Qg8 Platform information: [root@idxst0-a solr]# uname -a Linux idxst0-a 2.6.18-238.12.1.el5.centos.plusxen #1 SMP Wed Jun 1 11:57:54 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux [root@idxst0-a solr]# cat /etc/redhat-release CentOS release 5.6 (Final) [root@idxst0-a solr]# java -version java version 1.6.0_26 Java(TM) SE Runtime Environment (build 1.6.0_26-b03) Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode) [root@idxst0-a yum.repos.d]# yum repolist Loaded plugins: fastestmirror, protectbase Loading mirror speeds from cached hostfile * addons: mirror.san.fastserv.com * base: mirrors.tummy.com * centosplus: mirror.san.fastserv.com * contrib: mirror.san.fastserv.com * epel: mirrors.xmission.com * extras: mirrors.xmission.com * jpackage-generic: jpackage.netmindz.net * jpackage-generic-nonfree: www.mirrorservice.org * jpackage-generic-nonfree-updates: www.mirrorservice.org * jpackage-generic-updates: jpackage.netmindz.net * jpackage-rhel: jpackage.netmindz.net * jpackage-rhel-updates: jpackage.netmindz.net * rpmforge: fr2.rpmfind.net * updates: mirrors.tummy.com
Re: is it possible to do a sort without query?
You can use the standard query parser and pass q=*:* -- *Alexei Martchenko* | *CEO* | Superdownloads ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) 5083.1018/5080.3535/5080.3533
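For example, a match-all request sorted by a function query might look like this (URL, field name and term are assumptions; the space in the sort clause must be URL-encoded):

```
http://localhost:8983/solr/select?q=*:*&sort=termfreq(post_text,'indie')+desc&rows=100
```

Every document matches q=*:*; the sort parameter then orders them by the function's value without requiring any search terms.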
bug in termfreq? was Re: is it possible to do a sort without query?
Alexei, thank you, that does seem to work. My sort results seem to be totally wrong though; I'm not sure if it's because of my sort function or something else. My query consists of: sort=termfreq(all_lists_text,'indie+music') desc&q=*:*&rows=100 And I get back 4571232 hits. All the results don't have the phrase "indie music" anywhere in their data. Does termfreq not support phrases? If not, how can I sort specifically by termfreq of a phrase? -- - sent from my mobile 6176064373
solr 3.1, not indexing entire document?
hi, i have my solr field "text" configured as per earlier discussion:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

and for debugging purposes i am storing the text field as well, so:

<field name="text" type="text" indexed="true" stored="true"/>

now when i do a search against a document that i KNOW has a certain phrase, in this case "official handbook of the Federal Government", my query returns:

<result name="response" numFound="0" start="0" maxScore="0.0"/>
<lst name="debug">
  <str name="rawquerystring">id:062085.1 AND text:"official handbook of the Federal Government"</str>
  <str name="querystring">id:062085.1 AND text:"official handbook of the Federal Government"</str>
  <str name="parsedquery">+id:062085.1 +PhraseQuery(text:"official handbook of the federal government")</str>
  <str name="parsedquery_toString">+id:062085.1 +text:"official handbook of the federal government"</str>
</lst>

i get 0 results. when i search just for that id, i get the result: way way at the end sure enough is the string http://qihealing.net/doc.txt output. is there a document size limit, or is it the fact that im sending to solr using solrj and it's too large?
Re: solr 3.1, not indexing entire document?
Check your maxFieldLength setting.
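For context: maxFieldLength in solrconfig.xml caps how many tokens are indexed per field, and the default of 10000 silently truncates large documents. A sketch of raising it (the value shown is just an example of "effectively unlimited"):

```xml
<!-- in solrconfig.xml; raise the per-field token cap so large documents are indexed fully -->
<maxFieldLength>2147483647</maxFieldLength>
```

After changing it, the affected documents must be re-indexed for the already-truncated fields to be fixed.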
Re: bug in termfreq? was Re: is it possible to do a sort without query?
On 8/8/2011 4:34 PM, Jason Toy wrote: Alexei, thank you, that does seem to work. My sort results seem to be totally wrong though; I'm not sure if it's because of my sort function or something else. My query consists of: sort=termfreq(all_lists_text,'indie+music') desc&q=*:*&rows=100 And I get back 4571232 hits. That would be the total number of docs, I guess, since your query is *:*, i.e. "find everything". All the results don't have the phrase "indie music" anywhere in their data. You are only sorting on the termfreq of "indie music"; you are not querying for documents that contain it.
Re: bug in termfreq? was Re: is it possible to do a sort without query?
Alexei, thank you, that does seem to work. My sort results seem to be totally wrong though; I'm not sure if it's because of my sort function or something else. My query consists of: sort=termfreq(all_lists_text,'indie+music') desc&q=*:*&rows=100 And I get back 4571232 hits. That's normal, you issue a catch-all query. Sorting should work but... All the results don't have the phrase "indie music" anywhere in their data. Does termfreq not support phrases? No, it is TERM frequency, and "indie music" is not one term. I don't know how this function parses your input, but it might not understand your + escape and think it's one term consisting of exactly that. If not, how can I sort specifically by termfreq of a phrase? You cannot. What you can do is index multiple terms as one term using the shingle filter. Take care, it can significantly increase your index size and number of unique terms.
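A sketch of a field type using the shingle filter so that adjacent word pairs such as "indie music" are indexed as single terms (the field type name and analysis chain are illustrative):

```xml
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emit word pairs ("indie music") alongside the single words -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

With a field of this type, termfreq(field,'indie music') can match the shingled term, at the cost of a larger index and more unique terms.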
Re: edismax configuration
http://wiki.apache.org/solr/CommonQueryParameters#defType
Re: edismax configuration
Got it. Thank you. I thought this was going to be much more difficult than it actually was. Mark
Re: PivotFaceting in solr 3.3
As far as I know, there isn't a patch for pivot faceting for 3.x. It'd require extracting the code from trunk and porting it. Perhaps as easy as applying the diff from the pivot commit on trunk to the 3.x codebase? (But probably not quite that easy.) Erik On Aug 3, 2011, at 00:58, Isha Garg wrote: Hi Pranav, I know pivot faceting is a feature in Solr 4.0, but what I want to know is whether there is any patch that can make pivot faceting possible in Solr 3.3. Thanks! Isha On Wednesday 03 August 2011 10:23 AM, Pranav Prakash wrote: From what I know, this is a feature in Solr 4.0 marked as SOLR-792 in JIRA. Is this what you are looking for? https://issues.apache.org/jira/browse/SOLR-792 *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Wed, Aug 3, 2011 at 10:16, Isha Garg isha.g...@orkash.com wrote: Hi All! Can anyone tell me which patch I should apply to Solr 3.3 to enable pivot faceting in it? Thanks in advance! Isha garg
Re: string cut-off filter?
Hi Bernd, I also searched for such a filter but did not find one. Best regards Karsten P.S. I am now using this filter:

public class CutMaxLengthFilter extends TokenFilter {

  public static final int DEFAULT_MAXLENGTH = 15;

  private final int maxLength;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public CutMaxLengthFilter(TokenStream in) {
    this(in, DEFAULT_MAXLENGTH);
  }

  public CutMaxLengthFilter(TokenStream in, int maxLength) {
    super(in);
    this.maxLength = maxLength;
  }

  @Override
  public final boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    int length = termAtt.length();
    if (maxLength > 0 && length > maxLength) {
      termAtt.setLength(maxLength);
    }
    return true;
  }
}

with this factory:

public class CutMaxLengthFilterFactory extends BaseTokenFilterFactory {

  private int maxLength;

  @Override
  public void init(Map<String, String> args) {
    super.init(args);
    maxLength = getInt("maxLength", CutMaxLengthFilter.DEFAULT_MAXLENGTH);
  }

  public TokenStream create(TokenStream input) {
    return new CutMaxLengthFilter(input, maxLength);
  }
}
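Wiring the custom factory into schema.xml might look like this (the field type name and package name are hypothetical):

```xml
<fieldType name="string_cut" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- truncate the single keyword token to at most 15 characters -->
    <filter class="com.example.CutMaxLengthFilterFactory" maxLength="15"/>
  </analyzer>
</fieldType>
```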
Re: Dispatching a query to multiple different cores
You could use Solr's distributed search (shards parameter) capability to do this. However, if you've got somewhat different schemas, that isn't necessarily going to work properly. Perhaps unify your schemas in order to facilitate this using Solr's distributed search feature? Erik On Aug 3, 2011, at 05:22, Ahmed Boubaker wrote: Hello there! I have a multicore Solr with 6 different simple cores and somewhat different schemas, and I defined another meta core which I would like to be a dispatcher: the requests are sent to the simple cores and the results are aggregated before being sent back to the user. Any ideas or hints on how I can achieve this? I am wondering whether writing a custom SearchComponent or a custom SearchHandler are good entry points? Is it possible to access other SolrCores which are in the same container as the meta core? Many thanks for your help. Boubaker
Re: solr 3.1, not indexing entire document?
that was it... thanks. obviously the document is well over 2 MB.
Re: string cut-off filter?
There is none indeed, except using copyField and maxChars. Could you perhaps come up with some regex that replaces the group of chars beyond the desired limit with an empty string? That would fit in a pattern replace char filter.
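A sketch of that idea (the 15-character limit and field type name are examples); the char filter trims the raw input before the keyword tokenizer ever sees it:

```xml
<fieldType name="string_trunc" class="solr.TextField">
  <analyzer>
    <!-- capture the first 15 characters and replace the whole value with that prefix -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^(.{15}).*$" replacement="$1"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```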
Can Solr with the StatsComponent analyze 20+ million files?
Hi, Currently we are in the process of figuring out how to deal with millions of CSV files containing weather data (20+ million files). Each file is about 500 bytes in size. We want to calculate statistics on fields read from the files. For example, the standard deviation of wind speed across all 20+ million files. Processing speed isn't an important issue; the analysis routine can run for days if needed. The StatsComponent (http://wiki.apache.org/solr/StatsComponent) for Solr appears to be able to calculate the statistics we are interested in. Will the StatsComponent in Solr do what we need with minimal configuration? Can the StatsComponent be used on only a subset of the data? For example, only looking at data from certain months? Are there other free programs out there that can parse and analyze 20+ million files? We are still very new to Solr and really appreciate all your help. Thanks, Fred
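For reference, a StatsComponent request limited to a subset of documents via a filter query might look like this (URL and field names are assumptions):

```
http://localhost:8983/solr/select?q=*:*&fq=month:07&stats=true&stats.field=wind_speed&rows=0
```

The response would then include min, max, mean, stddev, etc. for wind_speed computed over only the documents matching the filter.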
Re: bug in termfreq? was Re: is it possible to do a sort without query?
Aren't dismax queries able to search for phrases using the default index (which is what I am using)? If I can already do phrase searches, I don't understand why I would need to reindex to be able to access phrases from a function.
Example Solr Config on EC2
I'm looking for some examples of how to setup Solr on EC2. The configuration I'm looking for would have multiple nodes for redundancy. I've tested in-house with a single master and slave with replication running in Tomcat on Windows Server 2003, but even if I have multiple slaves the single master is a single point of failure. Any suggestions or example configurations? The project I'm working on is a .NET setup, so ideally I'd like to keep this search cluster on Windows Server, even though I prefer Linux. Matthew Shields Owner BeanTown Host - Web Hosting, Domain Names, Dedicated Servers, Colocation, Managed Services www.beantownhost.com www.sysadminvalley.com www.jeeprally.com
Re: Can Solr with the StatsComponent analyze 20+ million files?
This does not seem well matched to Solr. Solr and Lucene are optimized to show the best few matches, not every match. I'd use Hadoop for this. Or MarkLogic, if you'd like to talk about that off-list. wunder Lead Engineer, MarkLogic On Aug 8, 2011, at 1:59 PM, Fred Smith wrote: Hi, Currently we are in the process of figuring out how to deal with millions of CSV files containing weather data (20+ million files). Each file is about 500 bytes in size. We want to calculate statistics on fields read from the file. For example, the standard deviation of wind speed across all 20+ million files. Processing speed isn't an important issue. The analysis routine can run for days, if needed. The StatsComponent (http://wiki.apache.org/solr/StatsComponent) for Solr appears to be able to calculate the statistics we are interested in. Will the StatsComponent in Solr do what we need with minimal configuration? Can the StatsComponent only be used on a subset of the data? For example, only look at data from certain months? Are there other free programs out there that can parse and analyze 20+ million files? We are still very new to Solr and really appreciate all your help. Thanks, Fred
Re: Dispatching a query to multiple different cores
However, if you unify your schemas to do this, I'd consider whether you really want separate cores/shards in the first place. If you want to search over all of them together, what are your reasons to put them in separate Solr indexes in the first place? Ordinarily, if you want to search over them all together, the best place to start is putting them in the same Solr index. Then, the distribution/sharding feature is generally your next step, only if you have so many documents that you need to shard for performance reasons. That is the intended use case of the distribution/sharding feature. On 8/8/2011 4:54 PM, Erik Hatcher wrote: You could use Solr's distributed (shards parameter) capability to do this. However, if you've got somewhat different schemas that isn't necessarily going to work properly. Perhaps unify your schemas in order to facilitate this using Solr's distributed search feature? Erik On Aug 3, 2011, at 05:22, Ahmed Boubaker wrote: Hello there! I have a multicore Solr with 6 different simple cores and somewhat different schemas, and I defined another meta core which I would like to be a dispatcher: the requests are sent to the simple cores and the results are aggregated before being sent back to the user. Any ideas or hints on how I can achieve this? I am wondering whether writing a custom SearchComponent or a custom SearchHandler are good entry points? Is it possible to access other SolrCores which are in the same container as the meta core? Many thanks for your help. Boubaker
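For reference, a distributed request is just the normal select URL plus a shards parameter listing the cores to query (host names and core names below are illustrative):

```
http://host1:8983/solr/core1/select?q=title:bla
    &shards=host1:8983/solr/core1,host2:8983/solr/core2,host3:8983/solr/core3
```

Each listed shard must be able to answer the fields used in the query, which is why a unified schema matters here.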
Re: Can Solr with the StatsComponent analyze 20+ million files?
Hi, Currently we are in the process of figuring out how to deal with millions of CSV files containing weather data (20+ million files). Each file is about 500 bytes in size. We want to calculate statistics on fields read from the file. For example, the standard deviation of wind speed across all 20+ million files. Processing speed isn't an important issue. The analysis routine can run for days, if needed. The StatsComponent (http://wiki.apache.org/solr/StatsComponent) for Solr appears to be able to calculate the statistics we are interested in. Will the StatsComponent in Solr do what we need with minimal configuration? Can the StatsComponent only be used on a subset of the data? For example, only look at data from certain months? If I remember correctly, it cannot. Are there other free programs out there that can parse and analyze 20+ million files? Yes, if analyzing data like yours is all you do (not search, that's Solr's power) then you're most likely much better off not using Solr and writing map/reduce programs for Apache Hadoop instead; it will analyze huge amounts of data. Hadoop can be quite difficult to start with, so you can use the excellent Apache CouchDB database, which supports map/reduce as well. CouchDB is much easier to begin with. You can transform a sample of your data to the JSON format, install CouchDB, load your data, and write a simple map/reduce function, all in 8 hours. Loading and processing all the data will take a bit longer. Cheers We are still very new to Solr and really appreciate all your help. Thanks, Fred
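Whatever tool ends up running the job, the aggregation itself is easy to express as a streaming computation. A minimal plain-Python sketch (not CouchDB- or Hadoop-specific; the file pattern and the wind_speed column name are made-up examples) using Welford's online algorithm, which needs only constant memory no matter how many files it reads:

```python
import csv
import glob
import math

def update(state, x):
    """One step of Welford's online algorithm; O(1) memory per statistic."""
    count, mean, m2 = state
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)
    return count, mean, m2

def finalize(state):
    """Return (mean, population standard deviation) for the values seen."""
    count, mean, m2 = state
    return mean, math.sqrt(m2 / count)

def stddev_over_files(pattern, field):
    # Stream every CSV file matching `pattern`, one row at a time, so
    # 20+ million small files is just a long loop, not a memory problem.
    state = (0, 0.0, 0.0)
    for path in glob.glob(pattern):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                state = update(state, float(row[field]))
    return finalize(state)

# e.g. stddev_over_files("/data/weather/*.csv", "wind_speed")
```

The same update/finalize pair maps directly onto a map/reduce formulation: map emits (count, mean, m2) triples, reduce merges them.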
Re: Example Solr Config on EC2
On 8/8/2011 5:03 PM, Matt Shields wrote: I'm looking for some examples of how to setup Solr on EC2. The configuration I'm looking for would have multiple nodes for redundancy. I've tested in-house with a single master and slave with replication running in Tomcat on Windows Server 2003, but even if I have multiple slaves the single master is a single point of failure. Any suggestions or example configurations? This article describes various configurations: http://www.lucidimagination.com/content/scaling-lucene-and-solr#d0e410
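For reference, the master/slave setup being discussed is configured through the ReplicationHandler in solrconfig.xml; a minimal sketch (host name, poll interval and conf file list are illustrative):

```xml
<!-- solrconfig.xml on the master -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- solrconfig.xml on each slave -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```

This only covers slave redundancy; making the master itself redundant still requires something on top, such as a standby master plus a load balancer or manual promotion.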
Re: csv responsewriter and numfound
Great question. But how would that get returned in the response? It is a drag that the header is lost when results are written in CSV, but there really isn't an obvious spot for that information to be returned. Erik On Aug 4, 2011, at 01:52 , Pooja Verlani wrote: Hi, Is there anyway to get numFound from csv response format? Some parameter? Or shall I change the code for csvResponseWriter for this? Thanks, Pooja
Re: bug in termfreq? was Re: is it possible to do a sort without query?
Dismax queries can. But sort=termfreq(all_lists_text,'indie+music') is not using dismax. Apparently the termfreq function can not? I am not familiar with the termfreq function. To understand why you'd need to reindex, you might want to read up on how Lucene actually works, to get a basic understanding of how different indexing choices affect what is possible at query time. Lucene In Action is a pretty good book. On 8/8/2011 5:02 PM, Jason Toy wrote: Are not Dismax queries able to search for phrases using the default index (which is what I am using)? If I can already do phrase searches, I don't understand why I would need to reindex to be able to access phrases from a function. On Mon, Aug 8, 2011 at 1:49 PM, Markus Jelsma markus.jel...@openindex.io wrote: Alexei, thank you, that does seem to work. My sort results seem to be totally wrong though, I'm not sure if it's because of my sort function or something else. My query consists of: sort=termfreq(all_lists_text,'indie+music')+desc&q=*:*&rows=100 And I get back 4571232 hits. That's normal, you issue a catch-all query. Sorting should work but.. All the results don't have the phrase indie music anywhere in their data. Does termfreq not support phrases? No, it is TERM frequency and indie music is not one term. I don't know how this function parses your input but it might not understand your + escape and think it's one term consisting of exactly that. If not, how can I sort specifically by termfreq of a phrase? You cannot. What you can do is index multiple terms as one term using the shingle filter. Take care, it can significantly increase your index size and number of unique terms. On Mon, Aug 8, 2011 at 1:08 PM, Alexei Martchenko ale...@superdownloads.com.br wrote: You can use the standard query parser and pass q=*:* 2011/8/8 Jason Toy jason...@gmail.com I am trying to list some data based on a function I run, specifically termfreq(post_text,'indie music') and I am unable to do it without passing in data to the q parameter. 
Is it possible to get a sorted list without searching for any terms? -- *Alexei Martchenko* | *CEO* | Superdownloads ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) 5083.1018/5080.3535/5080.3533
Re: bug in termfreq? was Re: is it possible to do a sort without query?
Are not Dismax queries able to search for phrases using the default index (which is what I am using)? If I can already do phrase searches, I don't understand why I would need to reindex to be able to access phrases from a function. Executing a Lucene phrase query is not the same as term frequency (phrase != term). A phrase consists of multiple terms, and Lucene has an inverted term index, not an inverted phrase index (unless you index your data that way). On Mon, Aug 8, 2011 at 1:49 PM, Markus Jelsma markus.jel...@openindex.io wrote: Alexei, thank you, that does seem to work. My sort results seem to be totally wrong though, I'm not sure if it's because of my sort function or something else. My query consists of: sort=termfreq(all_lists_text,'indie+music')+desc&q=*:*&rows=100 And I get back 4571232 hits. That's normal, you issue a catch-all query. Sorting should work but.. All the results don't have the phrase indie music anywhere in their data. Does termfreq not support phrases? No, it is TERM frequency and indie music is not one term. I don't know how this function parses your input but it might not understand your + escape and think it's one term consisting of exactly that. If not, how can I sort specifically by termfreq of a phrase? You cannot. What you can do is index multiple terms as one term using the shingle filter. Take care, it can significantly increase your index size and number of unique terms. On Mon, Aug 8, 2011 at 1:08 PM, Alexei Martchenko ale...@superdownloads.com.br wrote: You can use the standard query parser and pass q=*:* 2011/8/8 Jason Toy jason...@gmail.com I am trying to list some data based on a function I run, specifically termfreq(post_text,'indie music') and I am unable to do it without passing in data to the q parameter. Is it possible to get a sorted list without searching for any terms? -- *Alexei Martchenko* | *CEO* | Superdownloads ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) 5083.1018/5080.3535/5080.3533
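The shingle approach mentioned in the reply would look roughly like this in schema.xml (the field type name and surrounding analyzer choices are illustrative, not from the thread):

```xml
<!-- Indexes adjacent word pairs as single terms, so "indie music"
     becomes one indexed term and termfreq(field,'indie music') can match it. -->
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
            maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

As the reply warns, shingling inflates the number of unique terms, so it is best applied to one extra copy of the field rather than everywhere.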
Re: csv responsewriter and numfound
On Mon, Aug 8, 2011 at 5:12 PM, Erik Hatcher erik.hatc...@gmail.com wrote: Great question. But how would that get returned in the response? It is a drag that the header is lost when results are written in CSV, but there really isn't an obvious spot for that information to be returned. I guess a comment would be one option. -Yonik http://www.lucidimagination.com
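Until something like that comment exists, one workaround is to pay for an extra cheap request: ask for zero rows in a format that still carries the response header, then fetch the CSV (standard select parameters; paths are illustrative):

```
/solr/select?q=<query>&rows=0&wt=json     → read numFound from the response
/solr/select?q=<query>&rows=1000&wt=csv   → fetch the actual rows
```

The rows=0 request skips document retrieval entirely, so the overhead is small compared to the CSV fetch itself.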
Re: Can Solr with the StatsComponent analyze 20+ million files?
On 8/8/2011 5:10 PM, Markus Jelsma wrote: Will the StatsComponent in Solr do what we need with minimal configuration? Can the StatsComponent only be used on a subset of the data? For example, only look at data from certain months? If I remember correctly, it cannot. Well, if you index things properly, you could apply an fq limiting to only certain months, and then use StatsComponent on top. But I'd agree with others that Solr is probably not the best tool for this job. Solr's primary area of competency is text indexing and text search, not mathematical calculation. If you need a whole lot of text indexing and a little bit of math too, you might be able to get StatsComponent to do what you need, although you'll probably run into some tricky parts because this isn't really Solr's focus. But if you need a whole bunch of math and no text indexing at all -- use a tool that has math rather than text search as its prime area of competency/focus; don't make things hard for yourself by using the wrong tool for the job. (StatsComponent, incidentally, performs not-so-great on very large result sets, depending on what you ask it for.)
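For completeness, the filtered-stats request described above would look something like this (the month and wind_speed field names assume such fields were indexed):

```
/solr/select?q=*:*&rows=0
    &fq=month:2011-06
    &stats=true&stats.field=wind_speed
```

The stats section of the response then reports count, sum, mean, stddev, etc. over just the documents matching the filter.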
Re: bug in termfreq? was Re: is it possible to do a sort without query?
Dismax queries can. But sort=termfreq(all_lists_text,'indie+music') is not using dismax. Apparently the termfreq function can not? I am not familiar with the termfreq function. It simply returns the TF of the given _term_ as it is indexed for the current document. Sorting on TF like this seems strange, as by default queries are already sorted that way since TF plays a big role in the final score. To understand why you'd need to reindex, you might want to read up on how Lucene actually works, to get a basic understanding of how different indexing choices affect what is possible at query time. Lucene In Action is a pretty good book. On 8/8/2011 5:02 PM, Jason Toy wrote: Are not Dismax queries able to search for phrases using the default index (which is what I am using)? If I can already do phrase searches, I don't understand why I would need to reindex to be able to access phrases from a function. On Mon, Aug 8, 2011 at 1:49 PM, Markus Jelsma markus.jel...@openindex.io wrote: Alexei, thank you, that does seem to work. My sort results seem to be totally wrong though, I'm not sure if it's because of my sort function or something else. My query consists of: sort=termfreq(all_lists_text,'indie+music')+desc&q=*:*&rows=100 And I get back 4571232 hits. That's normal, you issue a catch-all query. Sorting should work but.. All the results don't have the phrase indie music anywhere in their data. Does termfreq not support phrases? No, it is TERM frequency and indie music is not one term. I don't know how this function parses your input but it might not understand your + escape and think it's one term consisting of exactly that. If not, how can I sort specifically by termfreq of a phrase? You cannot. What you can do is index multiple terms as one term using the shingle filter. Take care, it can significantly increase your index size and number of unique terms. 
On Mon, Aug 8, 2011 at 1:08 PM, Alexei Martchenko ale...@superdownloads.com.br wrote: You can use the standard query parser and pass q=*:* 2011/8/8 Jason Toy jason...@gmail.com I am trying to list some data based on a function I run, specifically termfreq(post_text,'indie music') and I am unable to do it without passing in data to the q parameter. Is it possible to get a sorted list without searching for any terms? -- *Alexei Martchenko* | *CEO* | Superdownloads ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) 5083.1018/5080.3535/5080.3533
Re: Can Master push data to slave
Hi, Hi I am using Solr 1.4. and doing a replication process where my slave is pulling data from Master. I have 2 questions a. Can Master push data to slave Not in current versions. Not sure about exotic patches for this. b. How to make sure that lock file is not created while replication What do you mean? Please help thanks Pawan
Re: Example Solr Config on EC2
Matthew, Here's another resource: http://www.lucidimagination.com/blog/2010/02/01/solr-shines-through-the-cloud-lucidworks-solr-on-ec2/ Michael Bohlig Lucid Imagination - Original Message From: Matt Shields m...@mattshields.org To: solr-user@lucene.apache.org Sent: Mon, August 8, 2011 2:03:20 PM Subject: Example Solr Config on EC2 I'm looking for some examples of how to setup Solr on EC2. The configuration I'm looking for would have multiple nodes for redundancy. I've tested in-house with a single master and slave with replication running in Tomcat on Windows Server 2003, but even if I have multiple slaves the single master is a single point of failure. Any suggestions or example configurations? The project I'm working on is a .NET setup, so ideally I'd like to keep this search cluster on Windows Server, even though I prefer Linux. Matthew Shields Owner BeanTown Host - Web Hosting, Domain Names, Dedicated Servers, Colocation, Managed Services www.beantownhost.com www.sysadminvalley.com www.jeeprally.com
Re: Can Solr with the StatsComponent analyze 20+ million files?
Thank you Walter, Markus and Jonathan for your fast responses and help! We will be looking into CouchDB (and Hadoop if necessary) to process our data. Thanks again, Fred
Re: Is anobdy using lotsofcores feature in production?
Hi Shalin, Does this mean that if I apply the patch mentioned at the link below, Solr still does not support lots of cores? https://issues.apache.org/jira/browse/SOLR-1293 Are you saying this is just a concept and the patch is not an implementation? We are planning to use lots of cores in our commerce system to separate products for each client in search and provide customization for each client. So could you please let us know if this is feasible if we want to create around 500 cores and have around 8-10 load-balancing Solr slaves? Please let us know. Based on your feedback our approach will be decided. Thanks Regards, Umesh On Mon, Jul 25, 2011 at 3:36 AM, Markus Jelsma-2 [via Lucene] ml-node+3196893-77535491-416...@n3.nabble.com wrote: No, I missed something and interpreted the question as using a lot of cores. LotsOfCores does not exist as a feature. It is just a write-up, some jira issues and a couple of patches. Did I miss something? On Sun, Jul 24, 2011 at 8:26 PM, Markus Jelsma [hidden email] wrote: It works fine but you should keep an eye on additional overhead, cores `stealing` too much CPU from others, trouble with cores that merge segments stealing I/O, and of course RAM. It can also result in quite a high number of open file descriptors. There are more, but these seem most common to me. Hi, Is anybody using the lots-of-cores feature in production? Is this feature scalable? I have around 1000 cores and want to use this feature. Will there be any issue in production? http://wiki.apache.org/solr/LotsOfCores Thanks, Umesh -- View this message in context: http://lucene.472066.n3.nabble.com/Is-anobdy-using-lotsofcores-feature-in-production-tp3193798p3193798.html Sent from the Solr - User mailing list archive at Nabble.com. 
Re: Can Master push data to slave
You could configure a PostCommit event listener on the master which would send an HTTP fetchindex request to the slave that you want to carry out the replication - see http://wiki.apache.org/solr/SolrReplication#HTTP_API But why do you want the master to push to the slave? -Simon On Mon, Aug 8, 2011 at 5:26 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, Hi I am using Solr 1.4. and doing a replication process where my slave is pulling data from Master. I have 2 questions a. Can Master push data to slave Not in current versions. Not sure about exotic patches for this. b. How to make sure that lock file is not created while replication What do you mean? Please help thanks Pawan
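A sketch of that approach (the script path and slave host are made up; RunExecutableListener and the fetchindex command are the standard mechanisms documented on the wiki page above):

```xml
<!-- solrconfig.xml on the master: run a script after every commit -->
<updateHandler class="solr.DirectUpdateHandler2">
  <listener event="postCommit" class="solr.RunExecutableListener">
    <str name="exe">/opt/solr/bin/notify-slave.sh</str>
    <str name="dir">.</str>
    <bool name="wait">false</bool>
  </listener>
</updateHandler>
```

where notify-slave.sh would do something like `curl 'http://slave-host:8983/solr/replication?command=fetchindex'`, telling the slave to pull immediately instead of waiting for its poll interval.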
Re: Same id on two shards
Only one should be returned, but it's non-deterministic. See http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations -Simon On Sat, Aug 6, 2011 at 6:27 AM, Pooja Verlani pooja.verl...@gmail.com wrote: Hi, We have a multicore solr with 6 cores. We merge the results using shards parameter or distrib handler. I have a problem, I might post one document on one of the cores and then post it after some days on another core, as I have a time-sliced multicore setup! The question is if I retrieve a document which is posted on both the shards, will solr return me only one document or both. And if only one document will be return, which one? Regards, Pooja
Re: bug in termfreq? was Re: is it possible to do a sort without query?
I am trying to test out and compare different sorts and scoring. When I use dismax to search for indie music with: qf=all_lists_text&q=indie+music&defType=dismax&rows=100 I see some results that seem irrelevant, meaning in the top results I see only 1 or 2 mentions of indie music, but when I look further down the list I do see other docs that have more occurrences of indie music. So I want to test by comparing the different queries versus seeing a list of docs ranked specifically by the count of occurrences of the phrase indie music On Mon, Aug 8, 2011 at 2:19 PM, Markus Jelsma markus.jel...@openindex.io wrote: Dismax queries can. But sort=termfreq(all_lists_text,'indie+music') is not using dismax. Apparently the termfreq function can not? I am not familiar with the termfreq function. It simply returns the TF of the given _term_ as it is indexed for the current document. Sorting on TF like this seems strange, as by default queries are already sorted that way since TF plays a big role in the final score. To understand why you'd need to reindex, you might want to read up on how Lucene actually works, to get a basic understanding of how different indexing choices affect what is possible at query time. Lucene In Action is a pretty good book. On 8/8/2011 5:02 PM, Jason Toy wrote: Are not Dismax queries able to search for phrases using the default index (which is what I am using)? If I can already do phrase searches, I don't understand why I would need to reindex to be able to access phrases from a function. On Mon, Aug 8, 2011 at 1:49 PM, Markus Jelsma markus.jel...@openindex.io wrote: Alexei, thank you, that does seem to work. My sort results seem to be totally wrong though, I'm not sure if it's because of my sort function or something else. My query consists of: sort=termfreq(all_lists_text,'indie+music')+desc&q=*:*&rows=100 And I get back 4571232 hits. That's normal, you issue a catch-all query. Sorting should work but.. 
All the results don't have the phrase indie music anywhere in their data. Does termfreq not support phrases? No, it is TERM frequency and indie music is not one term. I don't know how this function parses your input but it might not understand your + escape and think it's one term consisting of exactly that. If not, how can I sort specifically by termfreq of a phrase? You cannot. What you can do is index multiple terms as one term using the shingle filter. Take care, it can significantly increase your index size and number of unique terms. On Mon, Aug 8, 2011 at 1:08 PM, Alexei Martchenko ale...@superdownloads.com.br wrote: You can use the standard query parser and pass q=*:* 2011/8/8 Jason Toy jason...@gmail.com I am trying to list some data based on a function I run, specifically termfreq(post_text,'indie music') and I am unable to do it without passing in data to the q parameter. Is it possible to get a sorted list without searching for any terms? -- *Alexei Martchenko* | *CEO* | Superdownloads ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) 5083.1018/5080.3535/5080.3533 -- - sent from my mobile 6176064373
Re: bug in termfreq? was Re: is it possible to do a sort without query?
If you want to understand and debug the scoring, you can use debugQuery=true to see how different documents score. Most of the time docs with both terms are on top of the result set unless norms are interfering. To understand this, you should check the Solr relevancy wiki, but the Lucene docs are much better although very low level. http://wiki.apache.org/solr/SolrRelevancyCookbook http://lucene.apache.org/java/3_1_0/api/core/org/apache/lucene/search/Similarity.html Your question is more a relevance question than one about the termfreq function. To be short, don't use those kinds of functions if you don't yet understand similarity as described in the Lucene docs. I am trying to test out and compare different sorts and scoring. When I use dismax to search for indie music with: qf=all_lists_text&q=indie+music&defType=dismax&rows=100 I see some results that seem irrelevant, meaning in the top results I see only 1 or 2 mentions of indie music, but when I look further down the list I do see other docs that have more occurrences of indie music. So I want to test by comparing the different queries versus seeing a list of docs ranked specifically by the count of occurrences of the phrase indie music On Mon, Aug 8, 2011 at 2:19 PM, Markus Jelsma markus.jel...@openindex.io wrote: Dismax queries can. But sort=termfreq(all_lists_text,'indie+music') is not using dismax. Apparently the termfreq function can not? I am not familiar with the termfreq function. It simply returns the TF of the given _term_ as it is indexed for the current document. Sorting on TF like this seems strange, as by default queries are already sorted that way since TF plays a big role in the final score. To understand why you'd need to reindex, you might want to read up on how Lucene actually works, to get a basic understanding of how different indexing choices affect what is possible at query time. Lucene In Action is a pretty good book. 
On 8/8/2011 5:02 PM, Jason Toy wrote: Are not Dismax queries able to search for phrases using the default index (which is what I am using)? If I can already do phrase searches, I don't understand why I would need to reindex to be able to access phrases from a function. On Mon, Aug 8, 2011 at 1:49 PM, Markus Jelsma markus.jel...@openindex.io wrote: Alexei, thank you, that does seem to work. My sort results seem to be totally wrong though, I'm not sure if it's because of my sort function or something else. My query consists of: sort=termfreq(all_lists_text,'indie+music')+desc&q=*:*&rows=100 And I get back 4571232 hits. That's normal, you issue a catch-all query. Sorting should work but.. All the results don't have the phrase indie music anywhere in their data. Does termfreq not support phrases? No, it is TERM frequency and indie music is not one term. I don't know how this function parses your input but it might not understand your + escape and think it's one term consisting of exactly that. If not, how can I sort specifically by termfreq of a phrase? You cannot. What you can do is index multiple terms as one term using the shingle filter. Take care, it can significantly increase your index size and number of unique terms. On Mon, Aug 8, 2011 at 1:08 PM, Alexei Martchenko ale...@superdownloads.com.br wrote: You can use the standard query parser and pass q=*:* 2011/8/8 Jason Toy jason...@gmail.com I am trying to list some data based on a function I run, specifically termfreq(post_text,'indie music') and I am unable to do it without passing in data to the q parameter. Is it possible to get a sorted list without searching for any terms? -- *Alexei Martchenko* | *CEO* | Superdownloads ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) 5083.1018/5080.3535/5080.3533
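The debug request for the dismax query under discussion would look like this (parameters taken from the thread):

```
/solr/select?q=indie+music&defType=dismax&qf=all_lists_text&debugQuery=true
```

The explain section of the response breaks each document's score into its tf, idf, boost and fieldNorm contributions, which usually shows exactly why a document with fewer phrase occurrences still ranks higher.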
Re: Same id on two shards
On 8/8/2011 4:07 PM, simon wrote: Only one should be returned, but it's non-deterministic. See http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations I had heard it was based on which one responded first. This is part of why we have a small index that contains the newest content and only distribute content to the other shards once a day. The hope is that the small index (less than 1GB, fits into RAM on that virtual machine) will always respond faster than the other larger shards (over 18GB each). Is this an incorrect assumption on our part? The build system does do everything it can to ensure that periods of overlap are limited to the time it takes to commit a change across all of the shards, which should amount to just a few seconds once a day. There might be situations when the index gets out of whack and we have duplicate id values for a longer time period, but in practice it hasn't happened yet. Thanks, Shawn
Re: merge factor performance
What version of Solr are you using? And how are you sending your docs to Solr? Bumping your JVM heap size and bumping your RAM buffer size to 128MB also might help. And where are you getting the docs from? Are you sure that Solr is your problem or is it your data acquisition? (hint: just comment out the call to Solr if you're using SolrJ)... Bottom line: there isn't much information to go on here... And have you seen: http://wiki.apache.org/solr/FAQ#How_can_indexing_be_accelerated.3F Best Erick also what about RAM size (default is 32 MB)? Which other factors do we need to consider? When should we consider optimize? Would any other deviation from the defaults help us in achieving the target? We are allocating a JVM max heap size of 512 MB, with the default concurrent mark sweep set for garbage collection. Thanks Naveen
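The two knobs being discussed live in solrconfig.xml's index settings; a sketch with the values mentioned above (whether they actually help depends on the workload):

```xml
<indexDefaults>
  <ramBufferSizeMB>128</ramBufferSizeMB>  <!-- default is 32 -->
  <mergeFactor>10</mergeFactor>           <!-- higher favors indexing speed over search speed -->
</indexDefaults>
```

A larger RAM buffer means fewer segment flushes during bulk indexing, which is usually the cheaper win before touching mergeFactor.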
Re: MultiSearcher/ParallelSearcher - searching over multiple cores?
I think you'll have to make this go yourself; I don't see how to make Solr do it for you. And even if it could, the scores aren't comparable, so combining them for presentation to the user will be interesting... Best Erick On Thu, Aug 4, 2011 at 2:27 PM, Ralf Musick ra...@gmx.de wrote: Hi Erik, I have several types with different properties, but they are supposed to be combined into one search. Imagine a book with property title and a journal with property name. (The types in my project have of course more complex properties.) So I created a new core with combined search fields: field name is indexed, title is indexed, and some shared properties are indexed, like id. Further, an additional Solr field type is created. Of course there are several indexers, one per type. A specific type indexer stores only the fields of that type and also stores the type information, e.g. book. After indexing, all types are in the same core. To search over all types, the query has to look like this: ((title: bla) and (type: book)) or ((name: bla) and (type: journal)). At least you get books or journals sorted by boost factor - and you have the type information as a return field to distinguish the search results. I hope it is coherent. Thanks for your answer, Best Ralf
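In Lucene query syntax, the combined query Ralf describes would be written like this (field names taken from his example):

```
q=(title:bla AND type:book) OR (name:bla AND type:journal)
```

Because all types live in the same core, this single request returns one ranked list of mixed books and journals, and the stored type field tells the application how to render each hit.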
Re: Records skipped when using DataImportHandler
Spend some time in the admin/analysis page, that'll show you what part of the analysis chain is doing what to your data. It'll save you a world of headache... But at a guess, WordDelimiterFilterFactory is your culprit... Best Erick On Thu, Aug 4, 2011 at 6:08 PM, anand sridhar anand.for...@gmail.com wrote: Ok. After analysis, I narrowed the reduced result set down to the fact that the zipcode field is not indexed 'as is'. i.e. the zipcode field values are broken down into tokens and then stored. Hence, if there are 10 documents with zipcode fields varying from 91000-91009, then the zipcode fields are not stored as 91000, 91001 etc.; instead, the most common recurrences are grabbed together and stored as tokens, hence resulting in a reduced result set. The net effect is I cannot search for a value like 91000 since it's not stored as it is. I suspect this has to do with the type of field the zipcode is associated with. Right now, zipcode is a field of type text_general, where the StandardTokenizerFactory may be breaking the values into tokens. However, I want to store them without tokenizing. What's the best field type to do this? I already explored the String fieldtype which is supposed to store the values as is, but I see that the values are still being tokenized. Thanks, Anand On Wed, Aug 3, 2011 at 7:24 PM, Erick Erickson erickerick...@gmail.com wrote: Sorry, I'm on a restricted machine so can't get the precise URL. But there's a debug page for DIH that might allow you to see what the query actually returns. I'd guess one of two things: 1) you aren't getting the number of rows you think. 2) you aren't committing the documents you add. But that's just a guess. Best Erick On Aug 3, 2011 2:15 PM, anand sridhar anand.for...@gmail.com wrote: Hi, I am a newbie to Solr and have been trying to learn using DataImportHandler. I have a query in data-config.xml that fetches about 5 records when I fire it in SQL Query manager. 
However, when Solr does a full import, it skips 4 records and only imports 1. What could be the reason for that? My data-config.xml looks like this:

<dataConfig>
  <dataSource type="JdbcDataSource" name="GeoService"
              driver="net.sourceforge.jtds.jdbc.Driver"
              url="jdbc:jtds:sqlserver://10.168.50.104/ZipCodeLookup"
              user="sa" password="psiuser"/>
  <document>
    <entity name="city" dataSource="GeoService"
            query="select ll.cityId as id, ll.zip as zipCode, c.cityName as cityName,
                   st.stateName as state, ct.countryName as country
                   from latlonginfo ll, city c, state st, country ct
                   where ll.cityId = c.cityID and c.stateID = st.stateID
                   and st.countryID = ct.countryID order by ll.areacode">
      <field column="zipCode" name="zipCode"/>
      <field column="cityName" name="cityName"/>
      <field column="state" name="state"/>
      <field column="country" name="country"/>
    </entity>
  </document>
</dataConfig>

My field definitions in schema.xml look like this:

<field name="CityName" type="text_general" indexed="true" stored="true"/>
<field name="zipCode" type="text_general" indexed="true" stored="true"/>
<field name="state" type="text_general" indexed="true" stored="true"/>
<field name="country" type="text_general" indexed="true" stored="true"/>

One observation I made is that the one record being indexed is the last record in the result set. I have verified that there are no duplicate records being retrieved. For example, if the result set from the database is:

zipcode  CityName    state  country
-------  ----------  -----  -------
91324    Northridge  CA     USA
91325    Northridge  CA     USA
91327    Northridge  CA     USA
91328    Northridge  CA     USA
91329    Northridge  CA     USA
91330    Northridge  CA     USA

the record being indexed is always the last one. Any suggestions are welcome.

Thanks,
Anand
Re: Same id on two shards
I think the first one to respond is indeed the way it works, but that's only deterministic up to a point (if your small index is in the throes of a commit and everything required for a response happens to be cached on the larger shard... who knows?).

On Mon, Aug 8, 2011 at 7:10 PM, Shawn Heisey s...@elyograg.org wrote:

On 8/8/2011 4:07 PM, simon wrote: Only one should be returned, but it's non-deterministic. See http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations

I had heard it was based on which one responded first. This is part of why we have a small index that contains the newest content and only distribute content to the other shards once a day. The hope is that the small index (less than 1 GB, fits into RAM on that virtual machine) will always respond faster than the other, larger shards (over 18 GB each). Is this an incorrect assumption on our part? The build system does everything it can to ensure that periods of overlap are limited to the time it takes to commit a change across all of the shards, which should amount to just a few seconds once a day. There might be situations when the index gets out of whack and we have duplicate id values for a longer period, but in practice it hasn't happened yet.

Thanks,
Shawn
Re: Suggestions for copying fields across cores...
Not that I know of. Separate cores are pretty distinct to Solr, so you're probably stuck with sending the request to each core...

Best
Erick

On Fri, Aug 5, 2011 at 5:51 PM, josh lucas j...@lucasjosh.com wrote:

Is there a suggested way to copy data in fields to additional fields that will only be in a different core? Obviously I could index the data separately, and I could build that into my current indexing process, but I'm curious if there might be an easier, more automated way.

Thanks!
josh
Re: how to enable MMapDirectory in solr 1.4?
Thank you. I will try it.

On Mon, Aug 8, 2011 at 11:18 PM, Rich Cariens richcari...@gmail.com wrote:

We patched our 1.4.1 build with SOLR-1969 (https://issues.apache.org/jira/browse/SOLR-1969, making MMapDirectory configurable) and realized a 64% search performance boost on our Linux hosts.

On Mon, Aug 8, 2011 at 10:05 AM, Dyer, James james.d...@ingrambook.com wrote:

If you want to try MMapDirectory with Solr 1.4, copy the class org.apache.solr.core.MMapDirectoryFactory from 3.x or trunk, and either add it to the .war file (you can just add it under src/java and re-package the war), or put it in its own .jar file in the lib directory under solr_home. Then, in solrconfig.xml, add this entry under the root config element:

<directoryFactory class="org.apache.solr.core.MMapDirectoryFactory"/>

I'm not sure whether MMapDirectory will perform better for you on Linux than NIOFSDir. I'm pretty sure that in trunk/4.0 it's the default for Windows and maybe Solaris. On Windows, there is a definite advantage to using MMapDirectory on a 64-bit system.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Li Li [mailto:fancye...@gmail.com]
Sent: Monday, August 08, 2011 4:09 AM
To: solr-user@lucene.apache.org
Subject: how to enable MMapDirectory in solr 1.4?

Hi all, I read the Apache Solr 3.1 release notes today and found that MMapDirectory is now the default implementation on 64-bit systems. I am now using Solr 1.4 with a 64-bit JVM on Linux. How can I use MMapDirectory? Will it improve performance?
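Pulling James's instructions together, the resulting solrconfig.xml would look roughly like this — a sketch assuming the MMapDirectoryFactory class copied from 3.x/trunk is on Solr's classpath:

```xml
<config>
  <!-- ... existing solrconfig.xml contents ... -->

  <!-- Memory-map index files instead of using regular file I/O.
       Requires org.apache.solr.core.MMapDirectoryFactory backported to 1.4. -->
  <directoryFactory class="org.apache.solr.core.MMapDirectoryFactory"/>
</config>
```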