Re: Best practice for Delta every 2 Minutes.

2010-12-01 Thread stockii

http://10.1.0.10:8983/solr/payment/dataimport?command=delta-import&debug=on
doesn't work. no debug is started =(

thanks. I will try mergeFactor=2
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Best-practice-for-Delta-every-2-Minutes-tp1992714p1997595.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: distributed architecture

2010-12-01 Thread Upayavira
Okay, I'll see what I can do. 

Also for what it is worth, if anyone is in London tomorrow, I'm giving a
presentation which covers this topic at the (free) Online Information
2010 exhibition at Kensington Olympia, at 3:20pm. Anyone interested is
welcome to come along. I believe we're hoping to video it, so if
successful, I expect it'll get put online somewhere.

Upayavira

On Wed, 01 Dec 2010 03:44 +, Jayant Das jayan...@hotmail.com
wrote:
 
 Hi, A diagram will be very much appreciated.
 Thanks,
 Jayant
  
  From: u...@odoko.co.uk
  To: solr-user@lucene.apache.org
  Subject: Re: distributed architecture
  Date: Wed, 1 Dec 2010 00:39:40 +
  
  I cannot say how mature the code for B) is, but it is not yet included
  in a release.
  
  If you want the ability to distribute content across multiple nodes (due
  to volume) and want resilience, then use both.
  
  I've had one setup where we have two master servers, each with four
  cores. Then we have two pairs of slaves. Each pair mirrors the masters,
  so we have two hosts covering each of our cores.
  
  Then comes the complicated bit to explain...
  
  Each of these four slave hosts had a core that was configured with a
  hardwired shards request parameter, which pointed to each of our
  shards. Actually, it pointed to VIPs on a load balancer. Those two VIPs
  then balanced across each of our pair of hosts.
  
  Then, put all four of these servers behind another VIP, and we had a
  single address we could push requests to, for sharded, and resilient
  search.
  
  Now if that doesn't make any sense, let me know and I'll have another go
  at explaining it (or even attempt a diagram).
  
  Upayavira
  
  On Tue, 30 Nov 2010 13:27 -0800, Cinquini, Luca (3880)
  luca.cinqu...@jpl.nasa.gov wrote:
   Hi,
   I'd like to know if anybody has suggestions/opinions on what is currently 
   the best architecture for a distributed search system using Solr. The use 
   case is that of a system composed
   of N indexes, each hosted on a separate machine, each index containing
   unique content.
   
   Options that I know of are:
   
   A) Using Solr distributed search
   B) Using Solr + Zookeeper integration
   C) Using replication, i.e. each node replicates all the others
   
   It seems like options A) and B) would suffer from a fault-tolerance
   standpoint: if any of the nodes goes down, the search won't -at this
   time- return partial results, but instead report an exception.
   Option C) would provide fault tolerance, at least for any search
   initiated at a node that is available, but would incur into a large
   replication overhead.
   
   Did I get any of the above wrong, or does somebody have some insight on
   what is the best system architecture for this use case ?
   
   thanks in advance,
   Luca
 


Re: distributed architecture

2010-12-01 Thread Upayavira
On Tue, 30 Nov 2010 23:11 -0800, Dennis Gearon gear...@sbcglobal.net
wrote:
 Wow, would you put a diagram somewhere up on the Solr site?

 Or, here, and I will put it somewhere there.

I'll see what I can do to make a diagram.

 And, what is a VIP?

Virtual IP. It is what a load balancer uses. You assign a 'virtual IP'
to your load balancer, and it is responsible for forwarding traffic sent to
that IP to one of the hosts in that particular pool.

Upayavira


Re: Dynamically change master

2010-12-01 Thread Upayavira
Note, all extracted from http://wiki.apache.org/solr/SolrReplication

You'd put:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <!-- Replicate on 'startup' and 'commit'. 'optimize' is also a
         valid value for replicateAfter. -->
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>
  </lst>
</requestHandler>

into every box you want to be able to act as a master, then use:

http://slave_host:port/solr/replication?command=fetchindex&masterUrl=<your master URL>

As the above page says better than I can, "It is possible to pass on an
extra attribute 'masterUrl' or other attributes like 'compression' (or
any other parameter which is specified in the <lst name="slave"> tag) to
do a one-time replication from a master. This obviates the need for
hardcoding the master in the slave."
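
For example (host names and ports here are placeholders, not values from
this thread):

http://slave_host:8983/solr/replication?command=fetchindex&masterUrl=http://master_host:8983/solr/replication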

HTH, Upayavira

On Wed, 01 Dec 2010 06:24 +0100, Tommaso Teofili
tommaso.teof...@gmail.com wrote:
 Hi Upayavira,
 this is a good start for solving my problem, can you please tell how does
 such a replication URL look like?
 Thanks,
 Tommaso
 
 2010/12/1 Upayavira u...@odoko.co.uk
 
  Hi Tommaso,
 
  I believe you can tell each server to act as a master (which means it
  can have its indexes pulled from it).
 
  You can then include the master hostname in the URL that triggers a
  replication process. Thus, if you triggered replication from outside
  solr, you'd have control over which master you pull from.
 
  Does this answer your question?
 
  Upayavira
 
 
  On Tue, 30 Nov 2010 09:18 -0800, Ken Krugler
  kkrugler_li...@transpac.com wrote:
   Hi Tommaso,
  
   On Nov 30, 2010, at 7:41am, Tommaso Teofili wrote:
  
Hi all,
   
in a replication environment if the host where the master is running
goes
down for some reason, is there a way to communicate to the slaves to
point
to a different (backup) master without manually changing
configuration (and
restarting the slaves or their cores)?
   
Basically I'd like to be able to change the replication master
dinamically
inside the slaves.
   
Do you have any idea of how this could be achieved?
  
   One common approach is to use VIP (virtual IP) support provided by
   load balancers.
  
   Your slaves are configured to use a VIP to talk to the master, so that
   it's easy to dynamically change which master they use, via updates to
   the load balancer config.
  
   -- Ken
  
   --
   Ken Krugler
   +1 530-210-6378
   http://bixolabs.com
   e l a s t i c   w e b   m i n i n g
  
  
  
  
  
  
 
 


Re: Dynamically change master

2010-12-01 Thread Tommaso Teofili
Thanks Upayavira, that sounds very good.

p.s.:
I read that page some weeks ago and didn't get back to check on it.


2010/12/1 Upayavira u...@odoko.co.uk

 Note, all extracted from http://wiki.apache.org/solr/SolrReplication

 You'd put:

 <requestHandler name="/replication" class="solr.ReplicationHandler">
   <lst name="master">
     <!-- Replicate on 'startup' and 'commit'. 'optimize' is also a
          valid value for replicateAfter. -->
     <str name="replicateAfter">startup</str>
     <str name="replicateAfter">commit</str>
   </lst>
 </requestHandler>

 into every box you want to be able to act as a master, then use:

 http://slave_host:port/solr/replication?command=fetchindex&masterUrl=<your master URL>

 As the above page says better than I can, "It is possible to pass on an
 extra attribute 'masterUrl' or other attributes like 'compression' (or
 any other parameter which is specified in the <lst name="slave"> tag) to
 do a one-time replication from a master. This obviates the need for
 hardcoding the master in the slave."

 HTH, Upayavira

 On Wed, 01 Dec 2010 06:24 +0100, Tommaso Teofili
 tommaso.teof...@gmail.com wrote:
  Hi Upayavira,
  this is a good start for solving my problem, can you please tell how does
  such a replication URL look like?
  Thanks,
  Tommaso
 
  2010/12/1 Upayavira u...@odoko.co.uk
 
   Hi Tommaso,
  
   I believe you can tell each server to act as a master (which means it
   can have its indexes pulled from it).
  
   You can then include the master hostname in the URL that triggers a
   replication process. Thus, if you triggered replication from outside
   solr, you'd have control over which master you pull from.
  
   Does this answer your question?
  
   Upayavira
  
  
   On Tue, 30 Nov 2010 09:18 -0800, Ken Krugler
   kkrugler_li...@transpac.com wrote:
Hi Tommaso,
   
On Nov 30, 2010, at 7:41am, Tommaso Teofili wrote:
   
 Hi all,

 in a replication environment if the host where the master is
 running
 goes
 down for some reason, is there a way to communicate to the slaves
 to
 point
 to a different (backup) master without manually changing
 configuration (and
 restarting the slaves or their cores)?

 Basically I'd like to be able to change the replication master
 dinamically
 inside the slaves.

 Do you have any idea of how this could be achieved?
   
One common approach is to use VIP (virtual IP) support provided by
load balancers.
   
Your slaves are configured to use a VIP to talk to the master, so
 that
it's easy to dynamically change which master they use, via updates to
the load balancer config.
   
-- Ken
   
--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
   
   
   
   
   
   
  
 



Re: ArrayIndexOutOfBoundsException in sort

2010-12-01 Thread Jerry Li
Sorry, I left this out. Following is my schema.xml config; I use IKTokenizer for
Chinese characters.



<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory"
               isMaxWordLength="false"/>
    <!-- <tokenizer class="solr.WhitespaceTokenizerFactory"/> -->
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         add enablePositionIncrements=true in both the index and query
         analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory"
               isMaxWordLength="true"/>
    <!-- <tokenizer class="solr.WhitespaceTokenizerFactory"/> -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>


<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="documentId" type="tlong" indexed="true" stored="true" required="true"/>
<field name="headline" type="text" indexed="true" stored="true" omitNorms="true" required="true"/>
<field name="content" type="text" indexed="true" stored="true" compressed="true" omitNorms="true" required="true"/>
<field name="author" type="text" indexed="true" stored="true" required="true" default=""/>
<field name="pubName" type="text" indexed="true" stored="true" required="true" default=""/>
<field name="pubType" type="tint" indexed="true" stored="true" required="true"/>
<field name="section" type="text" indexed="true" stored="true" required="true"/>
<field name="column" type="text" indexed="true" stored="true" required="true"/>
<field name="folderId" type="tint" indexed="true" stored="true" required="true"/>
<field name="userId" type="string" indexed="true" stored="true" required="true"/>
<field name="readType" type="tint" indexed="true" stored="true" required="true"/>
<field name="downloadType" type="tint" indexed="true" stored="true" required="true"/>
<field name="hasImg" type="tint" indexed="false" stored="true" required="true"/>
<field name="hasText" type="tint" indexed="false" stored="true" required="true"/>
<field name="pubDate" type="tint" indexed="true" stored="true" required="true"/>
<field name="trackingTime" type="tint" indexed="true" stored="true" required="true"/>
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

<uniqueKey>id</uniqueKey>

<defaultSearchField>text</defaultSearchField>

<copyField source="headline" dest="text"/>
<copyField source="content" dest="text"/>


On Wed, Dec 1, 2010 at 2:50 PM, Gora Mohanty g...@mimirtech.com wrote:

 On Wed, Dec 1, 2010 at 10:56 AM, Jerry Li zongjie...@gmail.com wrote:
  Hi team
 
  My solr version is 1.4
  There is an ArrayIndexOutOfBoundsException when i sort one field and the
  following is my code and log info,
  any help will be appreciated.
 
  Code:
 
 SolrQuery query = new SolrQuery();
 query.setSortField("author", ORDER.desc);
 [...]

 Please show us how the field author defined in your
 schema.xml. Sorting has to be done on a non-tokenized
 field, e.g., a StrField.

 Regards,
 Gora




-- 

Best Regards.
Jerry. Li | 李宗杰



Re: ArrayIndexOutOfBoundsException in sort

2010-12-01 Thread Jerry Li
Hi

It seems to work fine again after I changed the author field type from text to
string. Could anybody give some info about why? Much appreciated.

<field name="author" type="string" indexed="true" stored="true" required="true" default=""/>


On Wed, Dec 1, 2010 at 5:20 PM, Jerry Li zongjie...@gmail.com wrote:

 sorry for lost, following is my schema.xml config and I use IKTokenizer for
 Chinese charactor



 <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory"
                isMaxWordLength="false"/>
     <!-- <tokenizer class="solr.WhitespaceTokenizerFactory"/> -->
     <!-- in this example, we will only use synonyms at query time
     <filter class="solr.SynonymFilterFactory"
             synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
     -->
     <!-- Case insensitive stop word removal.
          add enablePositionIncrements=true in both the index and query
          analyzers to leave a 'gap' for more accurate phrase queries.
     -->
     <filter class="solr.StopFilterFactory"
             ignoreCase="true"
             words="stopwords.txt"
             enablePositionIncrements="true"
             />
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1" catenateWords="1"
             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="English"
             protected="protwords.txt"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory"
                isMaxWordLength="true"/>
     <!-- <tokenizer class="solr.WhitespaceTokenizerFactory"/> -->
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory"
             ignoreCase="true"
             words="stopwords.txt"
             enablePositionIncrements="true"
             />
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1" catenateWords="0"
             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="English"
             protected="protwords.txt"/>
   </analyzer>
 </fieldType>


 <field name="id" type="string" indexed="true" stored="true" required="true"/>
 <field name="documentId" type="tlong" indexed="true" stored="true" required="true"/>
 <field name="headline" type="text" indexed="true" stored="true" omitNorms="true" required="true"/>
 <field name="content" type="text" indexed="true" stored="true" compressed="true" omitNorms="true" required="true"/>
 <field name="author" type="text" indexed="true" stored="true" required="true" default=""/>
 <field name="pubName" type="text" indexed="true" stored="true" required="true" default=""/>
 <field name="pubType" type="tint" indexed="true" stored="true" required="true"/>
 <field name="section" type="text" indexed="true" stored="true" required="true"/>
 <field name="column" type="text" indexed="true" stored="true" required="true"/>
 <field name="folderId" type="tint" indexed="true" stored="true" required="true"/>
 <field name="userId" type="string" indexed="true" stored="true" required="true"/>
 <field name="readType" type="tint" indexed="true" stored="true" required="true"/>
 <field name="downloadType" type="tint" indexed="true" stored="true" required="true"/>
 <field name="hasImg" type="tint" indexed="false" stored="true" required="true"/>
 <field name="hasText" type="tint" indexed="false" stored="true" required="true"/>
 <field name="pubDate" type="tint" indexed="true" stored="true" required="true"/>
 <field name="trackingTime" type="tint" indexed="true" stored="true" required="true"/>
 <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

 <uniqueKey>id</uniqueKey>

 <defaultSearchField>text</defaultSearchField>

 <copyField source="headline" dest="text"/>
 <copyField source="content" dest="text"/>



 On Wed, Dec 1, 2010 at 2:50 PM, Gora Mohanty g...@mimirtech.com wrote:

 On Wed, Dec 1, 2010 at 10:56 AM, Jerry Li zongjie...@gmail.com wrote:
  Hi team
 
  My solr version is 1.4
  There is an ArrayIndexOutOfBoundsException when i sort one field and the
  following is my code and log info,
  any help will be appreciated.
 
  Code:
 
  SolrQuery query = new SolrQuery();
  query.setSortField("author", ORDER.desc);
 [...]

 Please show us how the field author defined in your
 schema.xml. Sorting has to be done on a non-tokenized
 field, e.g., a StrField.

 Regards,
 Gora




 --

 Best Regards.
 Jerry. Li | 李宗杰
 




-- 

Best Regards.
Jerry. Li | 李宗杰



Spatial Search

2010-12-01 Thread Aisha Zafar
Hi ,

I am a newbie to Solr. I found it really interesting, especially spatial search.
I am very interested in going into its depth, but I am facing some problems using
it, as I have version 1.4.1 installed on my machine and spatial search is a
feature of version 4.0, which is not released yet. I have also read somewhere
that we can use a patch for this purpose. As I am a newbie I don't know how to
install the patch or where to download it from. If anyone could help me I'll
be very thankful.

thanks in advance and bye




  

Troubles with forming query for solr.

2010-12-01 Thread kolesman

Hi,

I have some troubles with forming query for solr.

Here is my task :
I'm indexing objects with 3 fields, for example {field1, field2, field3}.
In Solr's response I want to get objects in a specific order:
1. First, I want to get objects where all 3 fields are matched
2. Then I want to get objects where ONLY field1 and field2 are matched
3. And finally I want to get objects where ONLY field2 and field3 are
matched.

Could you explain how to form a query for my task?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Troubles-with-forming-query-for-solr-tp1996630p1996630.html
Sent from the Solr - User mailing list archive at Nabble.com.


schema design for related fields

2010-12-01 Thread lee carroll
Hi

I've built a schema for a proof of concept and it is all working fairly
fine; naive maybe, but fine.
However I think we might run into trouble in the future if we ever use
facets.

The data models train destination city routes from an origin city:
Doc:City
Name: cityname [unique key]
CityType: city type values [nine possible values, so good for faceting]
... [other city attributes which relate directly to the doc unique key;
all have a limited vocab, so good for faceting]
FareJanStandard: cheapest standard fare in January (float value)
FareJanFirst: cheapest first class fare in January (float value)
FareFebStandard: cheapest standard fare in Feb (float value)
FareFebFirst: cheapest first class fare in Feb (float value)
... etc

The question is: how would I best facet fare price? The desire is to return

number of cities with Jan prices in a set of ranges
etc
number of cities with first-class prices in a set of ranges
etc

install is 1.4.1 running in weblogic

Any ideas ?



Lee C


Re: ArrayIndexOutOfBoundsException for query with rows=0 and sort param

2010-12-01 Thread Martin Grotzke
On Tue, Nov 30, 2010 at 7:51 PM, Martin Grotzke
martin.grot...@googlemail.com wrote:
 On Tue, Nov 30, 2010 at 3:09 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
 On Tue, Nov 30, 2010 at 8:24 AM, Martin Grotzke
 martin.grot...@googlemail.com wrote:
 Still I'm wondering, why this issue does not occur with the plain
 example solr setup with 2 indexed docs. Any explanation?

 It's an old option you have in your solrconfig.xml that causes a
 different code path to be followed in Solr:

   <!-- An optimization that attempts to use a filter to satisfy a search.
        If the requested sort does not include score, then the filterCache
        will be checked for a filter matching the query. If found, the filter
        will be used as the source of document ids, and then the sort will be
        applied to that. -->
   <useFilterForSortedQuery>true</useFilterForSortedQuery>

 Most apps would be better off commenting that out or setting it to
 false.  It only makes sense when a high number of queries will be
 duplicated, but with different sorts.

 Great, this sounds really promising, would be a very easy fix. I need
 to check this tomorrow on our test/integration server if changing this
 does the trick for us.
I just verified this fix on our test/integration system and it works - cool!

Thanx a lot for this hint,
cheers,
Martin


Re: SOLR for Log analysis feasibility

2010-12-01 Thread phoey

My thoughts exactly: it may seem fairly straightforward, but I fear for
when a client wants a perfectly reasonable new feature to be added to their
report and Solr simply cannot support it.

I am hoping we won't have any real issues with scalability like Loggly,
because we don't index and store large documents of data within Solr. Most
of our documents will be very small.

Does anyone have any experience with using field collapsing in a production
environment?

thank you for all your replies. 

Joe

 

 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-for-Log-analysis-feasibility-tp1992202p1998360.html
Sent from the Solr - User mailing list archive at Nabble.com.


send XML multiValued Field Solr-PHP-Client

2010-12-01 Thread stockii

Hello.

Does anyone use the Solr-PHP-Client?

How are you using multivalued fields with the method addFields()?

Solr says to me: SCHWERWIEGEND: java.lang.NumberFormatException: empty String

when I send raw XML like this:
<doc>
  <field name="uniquekey">24038608</field>
  <field name="user_id">778</field>
  <field name="reason">reason1</field>
  <field name="reason">reason1</field>
</doc>

in the schema I defined:
<field name="reason" type="text" indexed="true" stored="false" multiValued="true"/>
<dynamicField name="reason_*" type="text" indexed="true" stored="false"/>

why doesn't this work? =(
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/send-XML-multiValued-Field-Solr-PHP-Client-tp1998370p1998370.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: QueryNorm and FieldNorm

2010-12-01 Thread Gastone Penzo
Thanx for the answer.
Is it possible to remove the queryNorm,
so that all the bf boosts become an addition to the Solr score?

Is omitNorms about fieldNorm or queryNorm?

thanx

Gastone

2010/11/30 Jayendra Patil jayendra.patil@gmail.com

 fieldNorm is the combination of the length of the field with index- and
 query-time boosts.

   1. lengthNorm = a measure of the importance of a term according to the
      total number of terms in the field
      1. Implementation: 1/sqrt(numTerms)
      2. Implication: a term matched in a field with fewer terms gets a
         higher score
      3. Rationale: a term in a field with fewer terms is more important
         than one with more
   2. boost (index) = boost of the field at index time
      1. If an index-time boost is specified, the fieldNorm value in the
         score includes it.
   3. boost (query) = boost of the field at query time
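
 As a quick worked example (numbers invented purely for illustration): a
 field containing 4 terms gets lengthNorm = 1/sqrt(4) = 0.5, while a field
 containing 16 terms gets 1/sqrt(16) = 0.25, so a match in the shorter field
 contributes the larger fieldNorm, all else being equal.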


 bf is the query time boost for a field and should affect fieldNorm value.

 queryNorm is just a normalization factor so that queries can be compared,
 and it will differ based on the query and results.

   1. queryNorm is not related to the relevance of the document, but rather
      tries to make scores between different queries comparable. It is
      implemented as 1/sqrt(sumOfSquaredWeights).


 You should not be bothered about queryNorm, as for a query it will have the
 same value for all the results.

 Regards,
 Jayendra

 On Tue, Nov 30, 2010 at 9:37 AM, Gastone Penzo gastone.pe...@gmail.com
 wrote:

  Hello,
  someone can explain the difference between queryNorm and FieldNorm in
  debugQuery??
  why if i push one bf boost up, the queryNorm goes down??
  i made some modifies..before the situation was different. why??
  thanx
 
  --
  Gastone Penzo
 




-- 
Gastone Penzo


Re: distributed architecture

2010-12-01 Thread Peter Karich

 Hi,

also take a look at solandra:

https://github.com/tjake/Lucandra/tree/solandra

I don't have it in prod yet, but regarding administration overhead it
looks very promising.
And you'll get some other neat features, like (soft) real time, for free.
So it's the same as A) + C) + X) - Y) ;-)


Regards,
Peter.



Hi,
I'd like to know if anybody has suggestions/opinions on what is 
currently the best architecture for a distributed search system using Solr. The 
use case is that of a system composed
of N indexes, each hosted on a separate machine, each index containing unique 
content.

Options that I know of are:

A) Using Solr distributed search
B) Using Solr + Zookeeper integration
C) Using replication, i.e. each node replicates all the others

It seems like options A) and B) would suffer from a fault-tolerance standpoint: 
if any of the nodes goes down, the search won't -at this time- return partial 
results, but instead report an exception.
Option C) would provide fault tolerance, at least for any search initiated at a 
node that is available, but would incur into a large replication overhead.

Did I get any of the above wrong, or does somebody have some insight on what is 
the best system architecture for this use case ?

thanks in advance,
Luca



--
http://jetwick.com twitter search prototype



Re : Spatial Search

2010-12-01 Thread js . vachon
Check JTeam's spatial search plugin.
Very easy to install.


Aisha Zafar aishazafar...@yahoo.com wrote:

 Hi ,
 
 I am a newbie of solr. I found it really interesting specially spetial 
 search. I am very interested to go in its depth but i am facing some problem 
 to use it as i have 1.4.1 version installed on my machine but the spetial 
 search is a feature of 4.0 version which is not released yet. I have also 
 read somewhere that we can use a patch for this purpose. As i am a newbie I 
 dont know how to install the patch and from where to download it. If anyone 
 could help me i'll be very thankful. 
 
 thanks in advance and bye
 
 
 
 


This e-mail was sent from an Archos 7.


Re: Solr DataImportHandler (DIH) and Cassandra

2010-12-01 Thread David Stuart
This is good timing. I am/was just about to embark on a spike, if anyone is
keen to help out.


On 30 Nov 2010, at 00:37, Mark wrote:

 The DataSource subclass route is what I will probably be interested in. Are 
 there are working examples of this already out there?
 
 On 11/29/10 12:32 PM, Aaron Morton wrote:
 AFAIK there is nothing pre-written to pull the data out for you.
 
 You should be able to create your DataSource sub class 
 http://lucene.apache.org/solr/api/org/apache/solr/handler/dataimport/DataSource.html
  Using the Hector java library to pull data from Cassandra.
 
 I'm guessing you will need to consider how to perform delta imports. Perhaps 
 using the secondary indexes in 0.7* , or maintaining your own queues or 
 indexes to know what has changed.
 
 There is also the Lucandra project, not exactly what your after but may be 
 of interest anyway https://github.com/tjake/Lucandra
 
 Hope that helps.
 Aaron
 
 
 On 30 Nov, 2010,at 05:04 AM, Mark static.void@gmail.com wrote:
 
 Is there anyway to use DIH to import from Cassandra? Thanks



Re: ArrayIndexOutOfBoundsException in sort

2010-12-01 Thread Ahmet Arslan
 It seems work fine again after I change author field type
 from text to
 string, could anybody give some info about it? very
 appriciated.

http://wiki.apache.org/solr/FAQ#Why_Isn.27t_Sorting_Working_on_my_Text_Fields.3F

And also see Erick's explanation 
http://search-lucene.com/m/7fnj1TtNde/sort+on+a+tokenized+fieldsubj=Re+Solr+sorting+problem


  


Re: [PECL-DEV] Re: PHP Solr API

2010-12-01 Thread Stefan Matheis
Hi again,

I'm actually trying to implement spellcheck in a different way, and had the idea
to access /solr/spellcheck to get all required data before executing the
final query to /solr/select - but that seems to be impossible, since
there is no configuration option to change the /select part of the URL? The
part before it can be configured through 'path', but nothing else.

Maybe it would be an idea to allow this part of the URL to be configured,
in whatever way?

Regards
Stefan


Re: [PECL-DEV] Re: PHP Solr API

2010-12-01 Thread Stefan Matheis
oooh, sorry - used the wrong thread for my suggestion ... please, just
ignore this :)

On Wed, Dec 1, 2010 at 2:01 PM, Stefan Matheis 
matheis.ste...@googlemail.com wrote:

 Hi again,

 actually trying to implement spellcheck on a different way, and had the
 idea to access /solr/spellcheck to get all required data, before executing
 the final query to /solr/select - but, that seemed to be impossible - since
 there is no configuration option to change the /select part of the url? the
 part before can be configure through 'path', but nothing else.

 maybe that will be an idea to allow this part of the url to be configured,
 in what-ever way?

 Regards
 Stefan



Re: Failover setup (is this a bad idea)

2010-12-01 Thread robo -
I agree with the Master with multiple slaves setup.  Very easy using
the built-in java setup in 1.4.1.  When we set this up it made our
developers think about how we were writing to Solr.  We were using a
Delta Import Handler (DIH?) for most writes but our app was also
writing 'deletes' directly to Solr.  Since we wanted to load balance
the Slaves we couldn't have the app writing to the Slaves.  Once we
discussed the Master/Slave setup with our developers we found all
areas where we were writing in our app and moved/centralized those
into the DIH. Now the app only does queries against the load balanced
slaves while the Master is used for DIH and backups only.

Thanks,

robo

On Tue, Nov 30, 2010 at 7:58 AM, Jayendra Patil
jayendra.patil@gmail.com wrote:
 Rather have a Master and multiple Slave combination, with master only being
 used for writes and slaves used for reads.
 Master to Slave replication is easily configurable.

 Two Solr instances sharing the same index is not at all good idea with both
 writing to the same index.

 Regards,
 Jayendra

 On Tue, Nov 30, 2010 at 7:13 AM, Keith Pope 
 keith.p...@inflightproductions.com wrote:

 Hi,

 I have a windows cluster that I would like to install Solr onto, there are
 two nodes that provide basic failover. I was thinking of this setup:

 Tomcat installed as win service
 Two solr instances sharing the same index

 The second instance would take over when the first fails, so you should
 never get two writes/reads at once.

 Is this a bad idea? Would I end up corrupting my index?

 Thx

 Keith



 -
 Registered Office: 15 Stukeley Street, London WC2B 5LT, England.
 Registered in England number 1421223

 This message is for the designated recipient only and may contain
 privileged, proprietary, or otherwise private information. If you have
 received it in error, please notify the sender immediately and delete the
 original. Any other use of the email by you is prohibited. Please note that
 the information provided in this e-mail is in any case not legally binding;
 all committing statements require legally binding signatures.


 http://www.inflightproductions.com







Re: schema design for related fields

2010-12-01 Thread Erick Erickson
I'd think that facet.query would work for you, something like:
facet=true&facet.query=FareJanStandard:[price1 TO price2]&facet.query=FareJanStandard:[price2 TO price3]
You can string together as many facet.query clauses as you want, across as many
fields as you want; they're all
independent and will get their own sections in the response.
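
For example, a complete request might look like this (host, core, and price
values are placeholders; the field names are the ones from your schema
sketch):

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.query=FareJanStandard:[0 TO 25]&facet.query=FareJanStandard:[25 TO 50]&facet.query=FareJanFirst:[0 TO 25]

Each facet.query comes back with its own count under facet_counts/facet_queries.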

Best
Erick

On Wed, Dec 1, 2010 at 4:55 AM, lee carroll lee.a.carr...@googlemail.comwrote:

 Hi

 I've built a schema for a proof of concept and it is all working fairly
 fine, niave maybe but fine.
 However I think we might run into trouble in the future if we ever use
 facets.

 The data models train destination city routes from a origin city:
 Doc:City
Name: cityname [uniq key]
CityType: city type values [nine possible values so good for faceting]
... [other city attricbutes which relate directy to the doc unique key]
 all have limited vocab so good for faceting
FareJanStandard:cheapest standard fare in january(float value)
FareJanFirst:cheapest first class fare in january(float value)
FareFebStandard:cheapest standard fare in feb(float value)
FareFebFirst:cheapest first fare in feb(float value)
. etc

 The question is how would i best facet fare price? The desire is to return

 number of citys with jan prices in a set of ranges
 etc
 number of citys with first prices in a set of ranges
 etc

 install is 1.4.1 running in weblogic

 Any ideas ?



 Lee C



Re: Spatial Search

2010-12-01 Thread Erick Erickson
1.4.1 spatial is pretty much superseded by geospatial in the current code,
you can
download a nightly build from here:
https://hudson.apache.org/hudson/

Scroll down to Solr-trunk and pick a nightly build that suits you. Follow
the link through
build artifacts and checkout/solr/dist and you'll find the zip/tar files.

Hudson is reporting some kinda flaky failures, but if you look at the
build results you
can determine whether you care. For instance, the Dec-1 build has a red
ball, but
all the tests pass!

Here's a good place to start with geospatial:
http://wiki.apache.org/solr/SpatialSearch

Best
Erick


On Wed, Dec 1, 2010 at 2:35 AM, Aisha Zafar aishazafar...@yahoo.com wrote:

 Hi ,

 I am a newbie of solr. I found it really interesting specially spetial
 search. I am very interested to go in its depth but i am facing some problem
 to use it as i have 1.4.1 version installed on my machine but the spetial
 search is a feature of 4.0 version which is not released yet. I have also
 read somewhere that we can use a patch for this purpose. As i am a newbie I
 dont know how to install the patch and from where to download it. If anyone
 could help me i'll be very thankful.

 thanks in advance and bye







Re: schema design for related fields

2010-12-01 Thread lee carroll
Hi Erick,
so if I understand you, we could do something like:

if Jan is selected in the user interface and we have 10 price ranges,

the query would have 20 clauses (10 * 2 fare classes)

if first class is selected in the user interface and we have 10 price ranges,
the query would have 120 clauses (12 months * 10 price ranges)

if first class and Jan are selected with 10 price ranges,
the query would have 10 clauses

if we required facets to be returned for all price combinations, we'd need to
supply
240 clauses

the user interface would also need to collate the individual fields into
meaningful aggregates for the user (i.e. numbers by month, numbers by fare
class)

have I understood or missed the point? (I usually have)




On 1 December 2010 15:00, Erick Erickson erickerick...@gmail.com wrote:

 I'd think that facet.query would work for you, something like:
 facet=truefacet.query=FareJanStandard:[price1 TO
 price2]facet.query:fareJanStandard[price2 TO price3]
 You can string as many facet.query clauses as you want, across as many
 fields as you want, they're all
 independent and will get their own sections in the response.

 Best
 Erick

 On Wed, Dec 1, 2010 at 4:55 AM, lee carroll lee.a.carr...@googlemail.com
 wrote:

  Hi
 
  I've built a schema for a proof of concept and it is all working fairly
  fine, niave maybe but fine.
  However I think we might run into trouble in the future if we ever use
  facets.
 
  The data models train destination city routes from a origin city:
  Doc:City
 Name: cityname [uniq key]
 CityType: city type values [nine possible values so good for faceting]
 ... [other city attricbutes which relate directy to the doc unique
 key]
  all have limited vocab so good for faceting
 FareJanStandard:cheapest standard fare in january(float value)
 FareJanFirst:cheapest first class fare in january(float value)
 FareFebStandard:cheapest standard fare in feb(float value)
 FareFebFirst:cheapest first fare in feb(float value)
 . etc
 
  The question is how would i best facet fare price? The desire is to
 return
 
  number of citys with jan prices in a set of ranges
  etc
  number of citys with first prices in a set of ranges
  etc
 
  install is 1.4.1 running in weblogic
 
  Any ideas ?
 
 
 
  Lee C
 



Re: Best practice for Delta every 2 Minutes.

2010-12-01 Thread Jonathan Rochkind
If your index warming takes longer than two minutes, but you're doing a
commit every two minutes, you're going to run into trouble with
overlapping index preparations, eventually leading to an OOM. Could
this be it?


On 11/30/2010 11:36 AM, Erick Erickson wrote:

I don't know, you'll have to debug it to see if it's the thing that takes so
long. Solr
should be able to handle 1,200 updates in a very short time unless there's
something
else going on, like you're committing after every update or something.

This may help you track down performance with DIH

http://wiki.apache.org/solr/DataImportHandler#interactive

http://wiki.apache.org/solr/DataImportHandler#interactiveBest
Erick

On Tue, Nov 30, 2010 at 9:01 AM, stockiist...@shopgate.com  wrote:


how do you think is the deltaQuery better ? XD
--
View this message in context:
http://lucene.472066.n3.nabble.com/Best-practice-for-Delta-every-2-Minutes-tp1992714p1992774.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Jonathan Rochkind

On 11/29/2010 5:43 PM, Robert Muir wrote:

On Mon, Nov 29, 2010 at 5:41 PM, Jonathan Rochkindrochk...@jhu.edu  wrote:

* As a tokenizer, I use the WhitespaceTokenizer.

* Then I apply a custom filter that looks for CJK chars, and re-tokenizes
any CJK chars into one-token-per-char. This custom filter was written by
someone other than me; it is open source; but I'm not sure if it's actually
in a public repo, or how well documented it is.  I can put you in touch with
the author to try and ask. There may also be a more standard filter other
than the custom one I'm using that does the same thing?


You are describing what standardtokenizer does.



Wait, StandardTokenizer already handles CJK and will put each CJK char
into its own token? Really? I had no idea! Is that documented
anywhere, or do you just have to look at the source to see it?


I had assumed that StandardTokenizer didn't have any special handling of
bytes known to be UTF-8 CJK, because that wasn't mentioned in the
documentation -- but it does? That would be convenient and would not require
my custom code.


Jonathan



Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Robert Muir
(Jonathan, I apologize for emailing you twice, i meant to hit reply-all)

On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind rochk...@jhu.edu wrote:

 Wait, standardtokenizer already handles CJK and will put each CJK char into
 it's own token?  Really? I had no idea!  Is that documented anywhere, or you
 just have to look at the source to see it?


Yes, you are right, the documentation should have been more explicit:
in previous releases it doesn't say anything about how it tokenizes
CJK. But it does tokenize them this way, and tags them with the CJ
token type.

I think the documentation issue is fixed in branch_3x and trunk:

 * As of Lucene version 3.1, this class implements the Word Break rules from the
 * Unicode Text Segmentation algorithm, as specified in
 * <a href="http://unicode.org/reports/tr29/">Unicode Standard Annex #29</a>.
(from 
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java)

So you can read the UAX#29 report and then you know how it tokenizes text
You can also just use this demo app to see how the new one works:
http://unicode.org/cldr/utility/breaks.jsp (choose Word)


Re: schema design for related fields

2010-12-01 Thread Geert-Jan Brits
if first is selected in the user interface and we have 10 price ranges
query would be 120 cluases (12 months * 10 price ranges)

What would you intend to do with the returned facet-results in this
situation? I doubt you want to display 12 categories (1 for each month) ?

When a user hasn't selected a date, perhaps it would be more useful to show
the cheapest fare regardless of month and facet on that?

This would involve introducing 2 new fields:
FareDateDontCareStandard, FareDateDontCareFirst

Populate these fields at indexing time, by calculating the cheapest fares
over all months.

This then results in every query having to support at most 20 price ranges
(10 for normal and 10 for first class)
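
A minimal sketch of how those two fields might be declared in schema.xml
(assuming a float field type; the type name is not from this thread):

<field name="FareDateDontCareStandard" type="float" indexed="true" stored="true"/>
<field name="FareDateDontCareFirst" type="float" indexed="true" stored="true"/>

The facet.query ranges would then only need to run over these two fields,
rather than over every per-month field.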

HTH,
Geert-Jan



2010/12/1 lee carroll lee.a.carr...@googlemail.com

 Hi Erick,
 so if i understand you we could do something like:

 if Jan is selected in the user interface and we have 10 price ranges

 query would be 20 cluases in the query (10 * 2 fare clases)

 if first is selected in the user interface and we have 10 price ranges
 query would be 120 cluases (12 months * 10 price ranges)

 if first and jan selected with 10 price ranges
 query would be 10 cluases

 if we required facets to be returned for all price combinations we'd need
 to
 supply
 240 cluases

 the user interface would also need to collate the individual fields into
 meaningful aggragates for the user (ie numbers by month, numbers by fare
 class)

 have I understood or missed the point (i usually have)




 On 1 December 2010 15:00, Erick Erickson erickerick...@gmail.com wrote:

  I'd think that facet.query would work for you, something like:
  facet=truefacet.query=FareJanStandard:[price1 TO
  price2]facet.query:fareJanStandard[price2 TO price3]
  You can string as many facet.query clauses as you want, across as many
  fields as you want, they're all
  independent and will get their own sections in the response.
 
  Best
  Erick
 
  On Wed, Dec 1, 2010 at 4:55 AM, lee carroll 
 lee.a.carr...@googlemail.com
  wrote:
 
   Hi
  
   I've built a schema for a proof of concept and it is all working fairly
   fine, niave maybe but fine.
   However I think we might run into trouble in the future if we ever use
   facets.
  
   The data models train destination city routes from a origin city:
   Doc:City
  Name: cityname [uniq key]
  CityType: city type values [nine possible values so good for
 faceting]
  ... [other city attricbutes which relate directy to the doc unique
  key]
   all have limited vocab so good for faceting
  FareJanStandard:cheapest standard fare in january(float value)
  FareJanFirst:cheapest first class fare in january(float value)
  FareFebStandard:cheapest standard fare in feb(float value)
  FareFebFirst:cheapest first fare in feb(float value)
  . etc
  
   The question is how would i best facet fare price? The desire is to
  return
  
   number of citys with jan prices in a set of ranges
   etc
   number of citys with first prices in a set of ranges
   etc
  
   install is 1.4.1 running in weblogic
  
   Any ideas ?
  
  
  
   Lee C
  
 



RE: how to set maxFieldLength to unlimitd

2010-12-01 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Does anyone know how to index a very large PDF file (more than 100MB in size)?

Thanks so much,
Xiaohui 
-Original Message-
From: Ma, Xiaohui (NIH/NLM/LHC) [C] 
Sent: Tuesday, November 30, 2010 4:22 PM
To: 'solr-user@lucene.apache.org'
Subject: RE: how to set maxFieldLength to unlimitd

I set maxFieldLength to 2147483647, restarted tomcat and re-indexed pdf files 
again. I also commented out the one in the mainIndex section. Unfortunately 
the files are still chopped out if the size of file is more than 20MB.

Any suggestions? I really appreciate your help!
Xiaohui 

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, November 30, 2010 2:01 PM
To: solr-user@lucene.apache.org
Subject: Re: how to set maxFieldLength to unlimitd

Set the maxFieldLength value in solrconfig.xml to, say, 2147483647

Also, see this thread for a common gotcha:
http://lucene.472066.n3.nabble.com/Solr-ignoring-maxFieldLength-td473263.html
,
it appears you can just comment out the one in the mainIndex section.
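
For reference, a minimal sketch of that setting as it might appear in
solrconfig.xml (the value is the one suggested above; the surrounding
indexDefaults element is an assumption based on the 1.4 example config):

<indexDefaults>
  ...
  <maxFieldLength>2147483647</maxFieldLength>
  ...
</indexDefaults>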

Best
Erick

On Tue, Nov 30, 2010 at 1:48 PM, Ma, Xiaohui (NIH/NLM/LHC) [C] 
xiao...@mail.nlm.nih.gov wrote:

 I need index and search some pdf files which are very big (around 1000
 pages each). How can I set maxFieldLength to unlimited?

 Thanks so much for your help in advance,
 Xiaohui



Re: schema design for related fields

2010-12-01 Thread Erick Erickson
Hmmm, that's getting to be a pretty clunky query, sure enough. Now you're
going to
have to ensure that HTTP requests that long get through, and stuff like
that.

I'm reaching a bit here, but you can facet on a tokenized field. Although
that's not
often done there's no prohibition against it.

So, what if you had just one field for each city that contained some
abstract
information about your fares etc. Something like
janstdfareclass1 jancheapfareclass3 febstdfareclass6

Now just facet on that field? Not #values# in that field, just the field
itself. You'd then have to make those into human-readable text, but that
would considerably simplify your query. Probably only works if your user is
selecting from pre-defined ranges, if they expect to put in arbitrary ranges
this scheme probably wouldn't work...
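
A minimal sketch of that idea, assuming a whitespace-tokenized field (the
field and type names below are invented for illustration, not from this
thread):

<fieldType name="text_ws" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="fareTags" type="text_ws" indexed="true" stored="false"/>

Each city document would get a value such as "janstdfareclass1
jancheapfareclass3 febstdfareclass6", and the request would then just use
facet=true&facet.field=fareTags.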

Best
Erick

On Wed, Dec 1, 2010 at 10:22 AM, lee carroll
lee.a.carr...@googlemail.comwrote:

 Hi Erick,
 so if i understand you we could do something like:

 if Jan is selected in the user interface and we have 10 price ranges

 query would be 20 cluases in the query (10 * 2 fare clases)

 if first is selected in the user interface and we have 10 price ranges
 query would be 120 cluases (12 months * 10 price ranges)

 if first and jan selected with 10 price ranges
 query would be 10 cluases

 if we required facets to be returned for all price combinations we'd need
 to
 supply
 240 cluases

 the user interface would also need to collate the individual fields into
 meaningful aggragates for the user (ie numbers by month, numbers by fare
 class)

 have I understood or missed the point (i usually have)




 On 1 December 2010 15:00, Erick Erickson erickerick...@gmail.com wrote:

  I'd think that facet.query would work for you, something like:
  facet=truefacet.query=FareJanStandard:[price1 TO
  price2]facet.query:fareJanStandard[price2 TO price3]
  You can string as many facet.query clauses as you want, across as many
  fields as you want, they're all
  independent and will get their own sections in the response.
 
  Best
  Erick
 
  On Wed, Dec 1, 2010 at 4:55 AM, lee carroll 
 lee.a.carr...@googlemail.com
  wrote:
 
   Hi
  
   I've built a schema for a proof of concept and it is all working fairly
   fine, niave maybe but fine.
   However I think we might run into trouble in the future if we ever use
   facets.
  
   The data models train destination city routes from a origin city:
   Doc:City
  Name: cityname [uniq key]
  CityType: city type values [nine possible values so good for
 faceting]
  ... [other city attricbutes which relate directy to the doc unique
  key]
   all have limited vocab so good for faceting
  FareJanStandard:cheapest standard fare in january(float value)
  FareJanFirst:cheapest first class fare in january(float value)
  FareFebStandard:cheapest standard fare in feb(float value)
  FareFebFirst:cheapest first fare in feb(float value)
  . etc
  
   The question is how would i best facet fare price? The desire is to
  return
  
   number of citys with jan prices in a set of ranges
   etc
   number of citys with first prices in a set of ranges
   etc
  
   install is 1.4.1 running in weblogic
  
   Any ideas ?
  
  
  
   Lee C
  
 



${dataimporter.last_index_time} Format?

2010-12-01 Thread sahid
Hello All,

I have a simple problem;

In my conf/dataimport.properties I have last_index_time with this
format: '%Y-%m-%d %H:%M:%S',
for example: last_index_time=2010-12-01 16\:53\:16.

But when I use this property in my data-config.conf the value format
becomes %Y-%m-%d;
for example:
url="http://server/_solr/?last_time=${dataimporter.last_index_time}"
makes: http://server/_solr/?last_time=2010-12-01
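
One possible workaround (an assumption on my part: it relies on the DIH
'encodeUrl' evaluator being available in this Solr version) would be to
URL-encode the value so the space in the timestamp survives:

url="http://server/_solr/?last_time=${dataimporter.functions.encodeUrl(dataimporter.last_index_time)}"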

Do you have an idea for me?

Thanks a lot!

-- 
~sahid


RE: how to set maxFieldLength to unlimitd

2010-12-01 Thread jan.kurella
You just can't set it to unlimited. What you could do is ignore the
positions and put in a filter that sets the position increment for all but
the first token to 0 (meaning the field length will be just 1, with all
tokens stacked on the first position).
You could also break per page, so you put each page at a new position.

Jan

-Original Message-
From: ext Ma, Xiaohui (NIH/NLM/LHC) [C] [mailto:xiao...@mail.nlm.nih.gov]
Sent: Dienstag, 30. November 2010 19:49
To: solr-user@lucene.apache.org; 'solr-user-i...@lucene.apache.org'; 
'solr-user-...@lucene.apache.org'
Subject: how to set maxFieldLength to unlimitd

I need index and search some pdf files which are very big (around 1000 pages 
each). How can I set maxFieldLength to unlimited?

Thanks so much for your help in advance,
Xiaohui


Re: schema design for related fields

2010-12-01 Thread lee carroll
Geert

The UI would be something like:
user selections
for the facet price
max price: £100
fare class: any

city attributes facet
cityattribute1 etc: xxx

results displayed something like

Facet price
Standard fares [10]
First fares [3]
in Jan [9]
in feb [10]
in march [1]
etc
is this compatible with your approach ?

Erick, the price is an interval scale, i.e. a fare can be any value (not high,
low, medium, etc.)

How sensible would the following approach be?
Index city docs with fields only related to the city unique key;
in the same index also index fare docs, which would be something like:
Fare:
cityID: xxx
Fareclass:standard
FareMonth: Jan
FarePrice: 100

the query would be something like:
q=FarePrice:[* TO 100] FareMonth:Jan&fl=cityID
returning facets for FareClass and FareMonth. Hold on, this will not facet
city docs correctly. Sorry, that's not going to work.







On 1 December 2010 16:25, Erick Erickson erickerick...@gmail.com wrote:

 Hmmm, that's getting to be a pretty clunky query sure enough. Now you're
 going to
 have to insure that HTTP request that long get through and stuff like
 that

 I'm reaching a bit here, but you can facet on a tokenized field. Although
 that's not
 often done there's no prohibition against it.

 So, what if you had just one field for each city that contained some
 abstract
 information about your fares etc. Something like
 janstdfareclass1 jancheapfareclass3 febstdfareclass6

 Now just facet on that field? Not #values# in that field, just the field
 itself. You'd then have to make those into human-readable text, but that
 would considerably simplify your query. Probably only works if your user is
 selecting from pre-defined ranges, if they expect to put in arbitrary
 ranges
 this scheme probably wouldn't work...

 Best
 Erick

 On Wed, Dec 1, 2010 at 10:22 AM, lee carroll
 lee.a.carr...@googlemail.comwrote:

  Hi Erick,
  so if i understand you we could do something like:
 
  if Jan is selected in the user interface and we have 10 price ranges
 
  query would be 20 cluases in the query (10 * 2 fare clases)
 
  if first is selected in the user interface and we have 10 price ranges
  query would be 120 cluases (12 months * 10 price ranges)
 
  if first and jan selected with 10 price ranges
  query would be 10 cluases
 
  if we required facets to be returned for all price combinations we'd need
  to
  supply
  240 cluases
 
  the user interface would also need to collate the individual fields into
  meaningful aggragates for the user (ie numbers by month, numbers by fare
  class)
 
  have I understood or missed the point (i usually have)
 
 
 
 
  On 1 December 2010 15:00, Erick Erickson erickerick...@gmail.com
 wrote:
 
   I'd think that facet.query would work for you, something like:
   facet=truefacet.query=FareJanStandard:[price1 TO
   price2]facet.query:fareJanStandard[price2 TO price3]
   You can string as many facet.query clauses as you want, across as many
   fields as you want, they're all
   independent and will get their own sections in the response.
  
   Best
   Erick
  
   On Wed, Dec 1, 2010 at 4:55 AM, lee carroll 
  lee.a.carr...@googlemail.com
   wrote:
  
Hi
   
I've built a schema for a proof of concept and it is all working
 fairly
fine, niave maybe but fine.
However I think we might run into trouble in the future if we ever
 use
facets.
   
The data models train destination city routes from a origin city:
Doc:City
   Name: cityname [uniq key]
   CityType: city type values [nine possible values so good for
  faceting]
   ... [other city attricbutes which relate directy to the doc unique
   key]
all have limited vocab so good for faceting
   FareJanStandard:cheapest standard fare in january(float value)
   FareJanFirst:cheapest first class fare in january(float value)
   FareFebStandard:cheapest standard fare in feb(float value)
   FareFebFirst:cheapest first fare in feb(float value)
   . etc
   
The question is how would i best facet fare price? The desire is to
   return
   
number of citys with jan prices in a set of ranges
etc
number of citys with first prices in a set of ranges
etc
   
install is 1.4.1 running in weblogic
   
Any ideas ?
   
   
   
Lee C
   
  
 



Solr 3x segments file and deleting index

2010-12-01 Thread Burton-West, Tom
If I want to delete an entire index and start over, in previous versions of
Solr I could stop Solr, delete all files in the index directory, and restart
Solr. Solr would then create empty segments files and I could start
indexing. In Solr 3x, if I delete all the files in the index directory I get
a large stack trace with this error:

org.apache.lucene.index.IndexNotFoundException: no segments* file found

As a workaround, whenever I delete an index (by deleting all files in the index 
directory), I copy the segments files that come with the Solr example to the 
index directory and then restart Solr.

Is this a feature or a bug?   What is the rationale?

Tom

Tom Burton-West



RE: how to set maxFieldLength to unlimitd

2010-12-01 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thanks so much for your reply, Jan. I just found that I cannot index PDF files
with a file size of more than 20MB.

I use curl to index them, and didn't get any error either. Do you have any
suggestions for indexing PDF files of more than 20MB?

Thanks,
Xiaohui 

-Original Message-
From: jan.kure...@nokia.com [mailto:jan.kure...@nokia.com] 
Sent: Wednesday, December 01, 2010 11:30 AM
To: solr-user@lucene.apache.org; solr-user-i...@lucene.apache.org; 
solr-user-...@lucene.apache.org
Subject: RE: how to set maxFieldLength to unlimitd

You just can't set it to unlimited. What you could do, is ignoring the 
positions and put a filter in, that sets the token for all but the first token 
to 0 (means the field length will be just 1, all tokens stacked on the first 
position)
You could also break per page, so you put each page on a new position.

Jan

-Original Message-
From: ext Ma, Xiaohui (NIH/NLM/LHC) [C] [mailto:xiao...@mail.nlm.nih.gov]
Sent: Dienstag, 30. November 2010 19:49
To: solr-user@lucene.apache.org; 'solr-user-i...@lucene.apache.org'; 
'solr-user-...@lucene.apache.org'
Subject: how to set maxFieldLength to unlimitd

I need index and search some pdf files which are very big (around 1000 pages 
each). How can I set maxFieldLength to unlimited?

Thanks so much for your help in advance,
Xiaohui


Re: schema design for related fields

2010-12-01 Thread lee carroll
Sorry Geert, I missed off the price value bit from the user interface, so we'd
display:

Facet price
Standard fares [10]
First fares [3]

When traveling
in Jan [9]
in feb [10]
in march [1]

Fare Price
0 - 25 :  [20]
25 - 50: [10]
50 - 100 [2]

cheers lee c


On 1 December 2010 17:00, lee carroll lee.a.carr...@googlemail.com wrote:

 Geert

 The UI would be something like:
 user selections
 for the facet price
 max price: £100
 fare class: any

 city attributes facet
 cityattribute1 etc: xxx

 results displayed something like

 Facet price
 Standard fares [10]
 First fares [3]
 in Jan [9]
 in feb [10]
 in march [1]
 etc
 is this compatible with your approach ?

 Erick the price is an interval scale ie a fare can be any value (not high,
 low, medium etc)

 How sensible would the following approach be
 index city docs with fields only related to the city unique key
 in the same index also index fare docs which would be something like:
 Fare:
 cityID: xxx
 Fareclass:standard
 FareMonth: Jan
 FarePrice: 100

 the query would be something like:
 q=FarePrice:[* TO 100] FareMonth:Jan fl=cityID
 returning facets for FareClass and FareMonth. hold on this will not facet
 city docs correctly. sorry thasts not going to work.








 On 1 December 2010 16:25, Erick Erickson erickerick...@gmail.com wrote:

 Hmmm, that's getting to be a pretty clunky query sure enough. Now you're
 going to
 have to insure that HTTP request that long get through and stuff like
 that

 I'm reaching a bit here, but you can facet on a tokenized field. Although
 that's not
 often done there's no prohibition against it.

 So, what if you had just one field for each city that contained some
 abstract
 information about your fares etc. Something like
 janstdfareclass1 jancheapfareclass3 febstdfareclass6

 Now just facet on that field? Not #values# in that field, just the field
 itself. You'd then have to make those into human-readable text, but that
 would considerably simplify your query. Probably only works if your user
 is
 selecting from pre-defined ranges, if they expect to put in arbitrary
 ranges
 this scheme probably wouldn't work...

 Best
 Erick

 On Wed, Dec 1, 2010 at 10:22 AM, lee carroll
 lee.a.carr...@googlemail.comwrote:

  Hi Erick,
  so if i understand you we could do something like:
 
  if Jan is selected in the user interface and we have 10 price ranges
 
  query would be 20 cluases in the query (10 * 2 fare clases)
 
  if first is selected in the user interface and we have 10 price ranges
  query would be 120 cluases (12 months * 10 price ranges)
 
  if first and jan selected with 10 price ranges
  query would be 10 cluases
 
  if we required facets to be returned for all price combinations we'd
 need
  to
  supply
  240 cluases
 
  the user interface would also need to collate the individual fields into
  meaningful aggragates for the user (ie numbers by month, numbers by fare
  class)
 
  have I understood or missed the point (i usually have)
 
 
 
 
  On 1 December 2010 15:00, Erick Erickson erickerick...@gmail.com
 wrote:
 
   I'd think that facet.query would work for you, something like:
   facet=true&facet.query=FareJanStandard:[price1 TO
   price2]&facet.query=fareJanStandard:[price2 TO price3]
   You can string as many facet.query clauses as you want, across as many
   fields as you want, they're all
   independent and will get their own sections in the response.
  
   Best
   Erick
  
   On Wed, Dec 1, 2010 at 4:55 AM, lee carroll 
  lee.a.carr...@googlemail.com
   wrote:
  
Hi
   
I've built a schema for a proof of concept and it is all working
 fairly
fine, niave maybe but fine.
However I think we might run into trouble in the future if we ever
 use
facets.
   
The data models train destination city routes from a origin city:
Doc:City
   Name: cityname [uniq key]
   CityType: city type values [nine possible values so good for
  faceting]
   ... [other city attricbutes which relate directy to the doc
 unique
   key]
all have limited vocab so good for faceting
   FareJanStandard:cheapest standard fare in january(float value)
   FareJanFirst:cheapest first class fare in january(float value)
   FareFebStandard:cheapest standard fare in feb(float value)
   FareFebFirst:cheapest first fare in feb(float value)
   . etc
   
The question is how would i best facet fare price? The desire is to
   return
   
number of citys with jan prices in a set of ranges
etc
number of citys with first prices in a set of ranges
etc
   
install is 1.4.1 running in weblogic
   
Any ideas ?
   
   
   
Lee C
   
  
 





RE: entire farm fails at the same time with OOM issues

2010-12-01 Thread Robert Petersen
It has typically been when query traffic was lowest!  We are at a 12 GB heap, so 
I will try to bump it to 14 GB.  We have 64GB of main memory installed now.  Here 
are our settings, do they look OK?

export JAVA_OPTS="-Xmx12228m -Xms12228m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
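
(For reference, a sketch of the same settings plus the heap-dump-on-OOM flags
suggested elsewhere in this thread; the 14 GB figure and the dump path are
illustrative:)

export JAVA_OPTS="-Xmx14336m -Xms14336m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode \
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/solr/heapdumps"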



-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Tuesday, November 30, 2010 6:44 PM
To: solr-user@lucene.apache.org
Subject: Re: entire farm fails at the same time with OOM issues

On Tue, Nov 30, 2010 at 6:04 PM, Robert Petersen rober...@buy.com wrote:
 My question is this.  Why in the world would all of my slaves, after
 running fine for some days, suddenly all at the exact same minute
 experience OOM heap errors and go dead?

If there is no change in query traffic when this happens, then it's
due to what the index looks like.

My guess is a large index merge happened, which means that when the
searchers re-open on the new index, it requires more memory than
normal (much less can be shared with the previous index).

I'd try bumping the heap a little bit, and then optimizing once a day
during off-peak hours.
If you still get OOM errors, bump the heap a little more.

-Yonik
http://www.lucidimagination.com
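
(For reference, the daily optimize suggested above can be triggered with a plain
update request; host, port and core path are illustrative:)

curl 'http://localhost:8983/solr/update' -H 'Content-type:text/xml' --data-binary '<optimize/>'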


Re: distributed architecture

2010-12-01 Thread Cinquini, Luca (3880)
Hi,
thanks all, this has been very instructive. It looks like in the short 
term using a combination of replication and sharding, based on Upayavira's 
setup, might be the safest thing to do, while in the longer term following the 
zookeeper integration and solandra development might provide a more dynamic 
environment and perhaps easier setup.
Please keep the good suggestions coming if you feel like it.
thanks again,
Luca

On Dec 1, 2010, at 4:17 AM, Peter Karich wrote:

  Hi,
 
 also take a look at solandra:
 
 https://github.com/tjake/Lucandra/tree/solandra
 
 I don't have it in prod yet but regarding administration overhead it 
 looks very promising.
 And you'll get some other neat features like (soft) real time, for free. 
 So its same like A) +  C) + X) - Y) ;-)
 
 Regards,
 Peter.
 
 
 Hi,
  I'd like to know if anybody has suggestions/opinions on what is 
 currently the best architecture for a distributed search system using Solr. 
 The use case is that of a system composed
 of N indexes, each hosted on a separate machine, each index containing 
 unique content.
 
 Options that I know of are:
 
 A) Using Solr distributed search
 B) Using Solr + Zookeeper integration
 C) Using replication, i.e. each node replicates all the others
 
 It seems like options A) and B) would suffer from a fault-tolerance 
 standpoint: if any of the nodes goes down, the search won't -at this time- 
 return partial results, but instead report an exception.
 Option C) would provide fault tolerance, at least for any search initiated 
 at a node that is available, but would incur into a large replication 
 overhead.
 
 Did I get any of the above wrong, or does somebody have some insight on what 
 is the best system architecture for this use case ?
 
 thanks in advance,
 Luca
 
 
 -- 
 http://jetwick.com twitter search prototype
 



Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Jacob Elder
On Wed, Dec 1, 2010 at 11:01 AM, Robert Muir rcm...@gmail.com wrote:

 (Jonathan, I apologize for emailing you twice, i meant to hit reply-all)

 On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind rochk...@jhu.edu
 wrote:
 
  Wait, standardtokenizer already handles CJK and will put each CJK char
 into
  it's own token?  Really? I had no idea!  Is that documented anywhere, or
 you
  just have to look at the source to see it?
 

 Yes, you are right, the documentation should have been more explicit:
 in previous releases it doesn't say anything about how it tokenizes
 CJK in the documentation. But it does do them this way, and tagged
 them as CJ token type.

 I think the documentation issue is fixed in branch_3x and trunk:

  * As of Lucene version 3.1, this class implements the Word Break rules
 from the
  * Unicode Text Segmentation algorithm, as specified in
  * <a href="http://unicode.org/reports/tr29/">Unicode Standard Annex
 #29</a>.
 (from
 http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java
 )

 So you can read the UAX#29 report and then you know how it tokenizes text
 You can also just use this demo app to see how the new one works:
 http://unicode.org/cldr/utility/breaks.jsp (choose Word)


What does this mean to those of us on Solr 1.4 and Lucene 2.9.3? Does the
current stable StandardTokenizer handle CJK?

-- 
Jacob Elder
@jelder
(646) 535-3379


Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Robert Muir
On Wed, Dec 1, 2010 at 12:25 PM, Jacob Elder jel...@locamoda.com wrote:

 What does this mean to those of us on Solr 1.4 and Lucene 2.9.3? Does the
 current stable StandardTokenizer handle CJK?


yes


Re: entire farm fails at the same time with OOM issues

2010-12-01 Thread Ken Krugler


On Nov 30, 2010, at 5:16pm, Robert Petersen wrote:


What would I do with the heap dump though?  Run one of those java heap
analyzers looking for memory leaks or something?  I have no experience
with those. I saw there was a bug fix in solr 1.4.1 for a 100 byte memory
leak occurring on each commit, but it would take thousands of commits to
make that add up to anything right?


Typically when I run out of memory in Solr, it's during an index  
update, when the new index searcher is getting warmed up.


Looking at the heap often shows ways to reduce memory requirements,  
e.g. you'll see a really big chunk used for a sorted field.
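
(As a rough illustration, a dump written by -XX:+HeapDumpOnOutOfMemoryError can be
browsed with the standard JDK 6 tools; file names and the pid are made up:)

jhat -J-mx4g /var/log/solr/java_pid12345.hprof
# or capture a dump from a live instance first:
jmap -dump:live,format=b,file=/tmp/solr-heap.hprof 12345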


See http://wiki.apache.org/solr/SolrCaching and http://wiki.apache.org/solr/SolrPerformanceFactors 
 for more details.


-- Ken




-Original Message-
From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Tuesday, November 30, 2010 3:12 PM
To: solr-user@lucene.apache.org
Subject: Re: entire farm fails at the same time with OOM issues

Hi Robert,

I'd recommend launching Tomcat with -XX:+HeapDumpOnOutOfMemoryError
and -XX:HeapDumpPath=path to where you want the file to go, so then
you have something to look at versus a Gedankenexperiment :)

-- Ken

On Nov 30, 2010, at 3:04pm, Robert Petersen wrote:


Greetings, we are running one master and four slaves of our multicore
solr setup.  We just served searches for our catalog of 8 million
products with this farm during black Friday and cyber Monday, our
busiest days of the year, and the servers did not break a sweat!
Index
size is about 28GB.

However, twice now recently during a time of low load we have had a
fire
drill where I have seen tomcat/solr fail and become unresponsive  
after

some OOM heap errors.  Solr wouldn't even serve up its admin pages.
I've had to go in and manually knock tomcat out of memory and then
restart it.  These solr slaves are load balanced and the load
balancers
always probe the solr slaves so if they stop serving up searches they
are automatically removed from the load balancer.  When all four
fail at
the same time we have an issue!

My question is this.  Why in the world would all of my slaves, after
running fine for some days, suddenly all at the exact same minute
experience OOM heap errors and go dead?  The load balancer kicks them
all out at the same time each time.  Each slave only talks to the
master
and not to each other, but the master show no errors in the logs at
all.
Something must be triggering this though.  The only other odd thing I
saw in the logs was after the first OOM errors were recorded, the
slaves
started occasionally not being able to get to the master.

This behavior makes me a little nervous...=:-o  eek!





Environment:  Lucid Imagination distro of Solr 1.4 on Tomcat



Platform: RHEL with Sun JRE 1.6.0_18 on dual quad xeon machines with
64GB memory etc etc










http://ken-blog.krugler.org
+1 530-265-2225






--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: Solr 3x segments file and deleting index

2010-12-01 Thread Shawn Heisey

On 12/1/2010 10:12 AM, Burton-West, Tom wrote:

If I want to delete an entire index and start over, in previous versions of 
Solr, you could stop Solr, delete all files in the index directory and restart 
Solr.  Solr would then create empty segments files and you could start 
indexing.   In Solr 3x if I delete all the files in the index  directory I get 
a large stack trace with this error:


You have to delete the index directory entirely.  This looks like a 
change in Lucene, not Solr specifically.  If the directory exists, but 
has nothing in it, it throws an exception.  I'll leave the rationale 
question that you also asked to someone who might actually know.  I 
personally think it shouldn't behave this way, but the dev team may have 
encountered something that required that the directory either be a valid 
index or not exist at all.
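
(In practice that means something like the following; the path is illustrative:)

# stop Solr first, then remove the whole index directory, not just its contents
rm -rf /path/to/solr/data/index
# on restart, Solr/Lucene recreates the directory with a fresh segments file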


Shawn



Re: schema design for related fields

2010-12-01 Thread Geert-Jan Brits
Ok longer answer than anticipated (and good conceptual practice ;-)

Yeah, I believe that would work if I understand correctly that:

'in Jan [9]
in feb [10]
in march [1]'

has nothing to do with pricing, but only with availability?

If so you could separate it out as two separate issues:

1. ) showing pricing (based on context)
2. ) showing availabilities (based on context)

For 1.)  you get 39 pricefields ([jan,feb,..,dec,dc] * [standard,first,dc])
note: 'dc' indicates 'don't care'.

depending on the context you query the correct pricefield to populate the
price facet-values.
for discussion let's call the fields: _p[fare][date].
In other words the price field for no preference at all would become: _pdcdc


For 2.) define a multivalued field 'FaresPerDate 'which indicate
availability, which is used to display:

A)
Standard fares [10]
First fares [3]

B)
in Jan [9]
in feb [10]
in march [1]

A) depends on your selection (or not caring) about a month
B) vice versa depends on your selection (or not caring) about a fare type

given all possible date values: [jan,feb,..dec,dontcare]
given all possible fare values:[standard,first,dontcare]

FaresPerDate consists of multiple values per document where each value
indicates the availability of a combination of 'fare' and 'date':
(standardJan,firstJan,DCjan...,standardJan,firstDec,DCdec,standardDC,firstDC,DCDC)
Note that the nr of possible values = 39.

Example:
1. ) the user hasn't selected any preference:

q=*:*&facet.field=FaresPerDate&facet.query=_pdcdc:[0 TO
20]&facet.query=_pdcdc:[20 TO 40], etc.

in the client you have to make sure to select the correct values of
'FaresPerDate' for display:
in this case:

Standard fares [10] -- FaresPerDate.standardDC
First fares [3] -- FaresPerDate.firstDC

in Jan [9] - FaresPerDate.DCJan
in feb [10] - FaresPerDate.DCFeb
in march [1]- FaresPerDate.DCMarch

2) the user has selected January
q=*:*&facet.field=FaresPerDate&fq=FaresPerDate:DCJan&facet.query=_pDCJan:[0
TO 20]&facet.query=_pDCJan:[20 TO 40]

Standard fares [10] -- FaresPerDate.standardJan
First fares [3] -- FaresPerDate.firstJan

in Jan [9] - FaresPerDate.DCJan
in feb [10] - FaresPerDate.DCFeb
in march [1]- FaresPerDate.DCMarch

Hope that helps,
Geert-Jan
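
(A rough schema.xml sketch of the fields described above; the names follow the
discussion and the field types assume the stock example schema:)

  <!-- one price field per fare/month combination, e.g. _pstandardjan, _pdcdc -->
  <dynamicField name="_p*" type="float" indexed="true" stored="true"/>
  <!-- availability tokens such as standardJan, DCJan, standardDC, DCDC -->
  <field name="FaresPerDate" type="string" indexed="true" stored="false" multiValued="true"/>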


2010/12/1 lee carroll lee.a.carr...@googlemail.com

 Sorry Geert missed of the price value bit from the user interface so we'd
 display

 Facet price
 Standard fares [10]
 First fares [3]

 When traveling
 in Jan [9]
 in feb [10]
 in march [1]

 Fare Price
 0 - 25 :  [20]
 25 - 50: [10]
 50 - 100 [2]

 cheers lee c


 On 1 December 2010 17:00, lee carroll lee.a.carr...@googlemail.com
 wrote:

  Geert
 
  The UI would be something like:
  user selections
  for the facet price
  max price: £100
  fare class: any
 
  city attributes facet
  cityattribute1 etc: xxx
 
  results displayed something like
 
  Facet price
  Standard fares [10]
  First fares [3]
  in Jan [9]
  in feb [10]
  in march [1]
  etc
  is this compatible with your approach ?
 
  Erick the price is an interval scale ie a fare can be any value (not
 high,
  low, medium etc)
 
  How sensible would the following approach be
  index city docs with fields only related to the city unique key
  in the same index also index fare docs which would be something like:
  Fare:
  cityID: xxx
  Fareclass:standard
  FareMonth: Jan
  FarePrice: 100
 
  the query would be something like:
  q=FarePrice:[* TO 100] FareMonth:Jan fl=cityID
  returning facets for FareClass and FareMonth. hold on this will not facet
  city docs correctly. sorry thasts not going to work.
 
 
 
 
 
 
 
 
  On 1 December 2010 16:25, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Hmmm, that's getting to be a pretty clunky query sure enough. Now you're
  going to
  have to insure that HTTP request that long get through and stuff like
  that
 
  I'm reaching a bit here, but you can facet on a tokenized field.
 Although
  that's not
  often done there's no prohibition against it.
 
  So, what if you had just one field for each city that contained some
  abstract
  information about your fares etc. Something like
  janstdfareclass1 jancheapfareclass3 febstdfareclass6
 
  Now just facet on that field? Not #values# in that field, just the field
  itself. You'd then have to make those into human-readable text, but that
  would considerably simplify your query. Probably only works if your user
  is
  selecting from pre-defined ranges, if they expect to put in arbitrary
  ranges
  this scheme probably wouldn't work...
 
  Best
  Erick
 
  On Wed, Dec 1, 2010 at 10:22 AM, lee carroll
  lee.a.carr...@googlemail.comwrote:
 
   Hi Erick,
   so if i understand you we could do something like:
  
   if Jan is selected in the user interface and we have 10 price ranges
  
   query would be 20 cluases in the query (10 * 2 fare clases)
  
   if first is selected in the user interface and we have 10 price ranges
   query would be 120 cluases (12 months * 10 price ranges)
  
   if first and jan selected 

Re: schema design for related fields

2010-12-01 Thread Geert-Jan Brits
Also, filtering and sorting on price can be done as well. Just be sure to
use the correct price- field.
Geert-Jan

2010/12/1 Geert-Jan Brits gbr...@gmail.com

 Ok longer answer than anticipated (and good conceptual practice ;-)

 Yeah I belief that would work if I understand correctly that:

 'in Jan [9]
 in feb [10]
 in march [1]'

 has nothing to do with pricing, but only with availability?

 If so you could seperate it out as two seperate issues:

 1. ) showing pricing (based on context)
 2. ) showing availabilities (based on context)

 For 1.)  you get 39 pricefields ([jan,feb,..,dec,dc] *
 [standard,first,dc])
 note: 'dc' indicates 'don't care.

 depending on the context you query the correct pricefield to populate the
 price facet-values.
 for discussion lets call the fields: _p[fare][date].
 IN other words the price field for no preference at all would become:
 _pdcdc


 For 2.) define a multivalued field 'FaresPerDate 'which indicate
 availability, which is used to display:

 A)
 Standard fares [10]
 First fares [3]

 B)
 in Jan [9]
 in feb [10]
 in march [1]

 A) depends on your selection (or dont caring) about a month
 B) vice versa depends on your selection (or dont caring)  about a fare type

 given all possible date values: [jan,feb,..dec,dontcare]
 given all possible fare values:[standard,first,dontcare]

 FaresPerDate consists of multiple values per document where each value
 indicates the availability of a combination of 'fare' and 'date':

 (standardJan,firstJan,DCjan...,standardJan,firstDec,DCdec,standardDC,firstDC,DCDC)
 Note that the nr of possible values = 39.

 Example:
 1. ) the user hasn't selected any preference:

 q=*:*facet.field:FaresPerDatefacet.query=_pdcdc:[0 TO
 20]facet.query=_pdcdc:[20 TO 40], etc.

 in the client you have to make sure to select the correct values of
 'FaresPerDate' for display:
 in this case:

 Standard fares [10] -- FaresPerDate.standardDC
 First fares [3] -- FaresPerDate.firstDC

 in Jan [9] - FaresPerDate.DCJan
 in feb [10] - FaresPerDate.DCFeb
 in march [1]- FaresPerDate.DCMarch

 2) the user has selected January
 q=*:*facet.field:FaresPerDatefq=FaresPerDate:DCJanfacet.query=_pDCJan:[0
 TO 20]facet.query=_pDCJan:[20 TO 40]

 Standard fares [10] -- FaresPerDate.standardJan
 First fares [3] -- FaresPerDate.firstJan

 in Jan [9] - FaresPerDate.DCJan
 in feb [10] - FaresPerDate.DCFeb
 in march [1]- FaresPerDate.DCMarch

 Hope that helps,
 Geert-Jan


 2010/12/1 lee carroll lee.a.carr...@googlemail.com

 Sorry Geert missed of the price value bit from the user interface so we'd
 display

 Facet price
 Standard fares [10]
 First fares [3]

 When traveling
 in Jan [9]
 in feb [10]
 in march [1]

 Fare Price
 0 - 25 :  [20]
 25 - 50: [10]
 50 - 100 [2]

 cheers lee c


 On 1 December 2010 17:00, lee carroll lee.a.carr...@googlemail.com
 wrote:

  Geert
 
  The UI would be something like:
  user selections
  for the facet price
  max price: £100
  fare class: any
 
  city attributes facet
  cityattribute1 etc: xxx
 
  results displayed something like
 
  Facet price
  Standard fares [10]
  First fares [3]
  in Jan [9]
  in feb [10]
  in march [1]
  etc
  is this compatible with your approach ?
 
  Erick the price is an interval scale ie a fare can be any value (not
 high,
  low, medium etc)
 
  How sensible would the following approach be
  index city docs with fields only related to the city unique key
  in the same index also index fare docs which would be something like:
  Fare:
  cityID: xxx
  Fareclass:standard
  FareMonth: Jan
  FarePrice: 100
 
  the query would be something like:
  q=FarePrice:[* TO 100] FareMonth:Jan fl=cityID
  returning facets for FareClass and FareMonth. hold on this will not
 facet
  city docs correctly. sorry thasts not going to work.
 
 
 
 
 
 
 
 
  On 1 December 2010 16:25, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Hmmm, that's getting to be a pretty clunky query sure enough. Now
 you're
  going to
  have to insure that HTTP request that long get through and stuff like
  that
 
  I'm reaching a bit here, but you can facet on a tokenized field.
 Although
  that's not
  often done there's no prohibition against it.
 
  So, what if you had just one field for each city that contained some
  abstract
  information about your fares etc. Something like
  janstdfareclass1 jancheapfareclass3 febstdfareclass6
 
  Now just facet on that field? Not #values# in that field, just the
 field
  itself. You'd then have to make those into human-readable text, but
 that
  would considerably simplify your query. Probably only works if your
 user
  is
  selecting from pre-defined ranges, if they expect to put in arbitrary
  ranges
  this scheme probably wouldn't work...
 
  Best
  Erick
 
  On Wed, Dec 1, 2010 at 10:22 AM, lee carroll
  lee.a.carr...@googlemail.comwrote:
 
   Hi Erick,
   so if i understand you we could do something like:
  
   if Jan is selected in the user interface and we have 10 price ranges
  
   query 

Re: schema design for related fields

2010-12-01 Thread lee carroll
Hi Geert,

OK, I think I follow. The magic is in the multi-valued field.

The only danger would be complexity if we allow users to multi-select
months/prices/fare classes. For example, they could search for first prices in
Jan, April and November. I think what you describe is possible in this case,
just complicated. I'll see if I can hack some facets into the prototype
tomorrow. Thanks for your help.

Lee C

On 1 December 2010 17:57, Geert-Jan Brits gbr...@gmail.com wrote:

 Ok longer answer than anticipated (and good conceptual practice ;-)

 Yeah I belief that would work if I understand correctly that:

 'in Jan [9]
 in feb [10]
 in march [1]'

 has nothing to do with pricing, but only with availability?

 If so you could seperate it out as two seperate issues:

 1. ) showing pricing (based on context)
 2. ) showing availabilities (based on context)

 For 1.)  you get 39 pricefields ([jan,feb,..,dec,dc] * [standard,first,dc])
 note: 'dc' indicates 'don't care.

 depending on the context you query the correct pricefield to populate the
 price facet-values.
 for discussion lets call the fields: _p[fare][date].
 IN other words the price field for no preference at all would become:
 _pdcdc


 For 2.) define a multivalued field 'FaresPerDate 'which indicate
 availability, which is used to display:

 A)
 Standard fares [10]
 First fares [3]

 B)
 in Jan [9]
 in feb [10]
 in march [1]

 A) depends on your selection (or dont caring) about a month
 B) vice versa depends on your selection (or dont caring)  about a fare type

 given all possible date values: [jan,feb,..dec,dontcare]
 given all possible fare values:[standard,first,dontcare]

 FaresPerDate consists of multiple values per document where each value
 indicates the availability of a combination of 'fare' and 'date':

 (standardJan,firstJan,DCjan...,standardJan,firstDec,DCdec,standardDC,firstDC,DCDC)
 Note that the nr of possible values = 39.

 Example:
 1. ) the user hasn't selected any preference:

 q=*:*facet.field:FaresPerDatefacet.query=_pdcdc:[0 TO
 20]facet.query=_pdcdc:[20 TO 40], etc.

 in the client you have to make sure to select the correct values of
 'FaresPerDate' for display:
 in this case:

 Standard fares [10] -- FaresPerDate.standardDC
 First fares [3] -- FaresPerDate.firstDC

 in Jan [9] - FaresPerDate.DCJan
 in feb [10] - FaresPerDate.DCFeb
 in march [1]- FaresPerDate.DCMarch

 2) the user has selected January
 q=*:*facet.field:FaresPerDatefq=FaresPerDate:DCJanfacet.query=_pDCJan:[0
 TO 20]facet.query=_pDCJan:[20 TO 40]

 Standard fares [10] -- FaresPerDate.standardJan
 First fares [3] -- FaresPerDate.firstJan

 in Jan [9] - FaresPerDate.DCJan
 in feb [10] - FaresPerDate.DCFeb
 in march [1]- FaresPerDate.DCMarch

 Hope that helps,
 Geert-Jan


 2010/12/1 lee carroll lee.a.carr...@googlemail.com

  Sorry Geert missed of the price value bit from the user interface so we'd
  display
 
  Facet price
  Standard fares [10]
  First fares [3]
 
  When traveling
  in Jan [9]
  in feb [10]
  in march [1]
 
  Fare Price
  0 - 25 :  [20]
  25 - 50: [10]
  50 - 100 [2]
 
  cheers lee c
 
 
  On 1 December 2010 17:00, lee carroll lee.a.carr...@googlemail.com
  wrote:
 
   Geert
  
   The UI would be something like:
   user selections
   for the facet price
   max price: £100
   fare class: any
  
   city attributes facet
   cityattribute1 etc: xxx
  
   results displayed something like
  
   Facet price
   Standard fares [10]
   First fares [3]
   in Jan [9]
   in feb [10]
   in march [1]
   etc
   is this compatible with your approach ?
  
   Erick the price is an interval scale ie a fare can be any value (not
  high,
   low, medium etc)
  
   How sensible would the following approach be
   index city docs with fields only related to the city unique key
   in the same index also index fare docs which would be something like:
   Fare:
   cityID: xxx
   Fareclass:standard
   FareMonth: Jan
   FarePrice: 100
  
   the query would be something like:
   q=FarePrice:[* TO 100] FareMonth:Jan fl=cityID
   returning facets for FareClass and FareMonth. hold on this will not
 facet
   city docs correctly. sorry thasts not going to work.
  
  
  
  
  
  
  
  
   On 1 December 2010 16:25, Erick Erickson erickerick...@gmail.com
  wrote:
  
   Hmmm, that's getting to be a pretty clunky query sure enough. Now
 you're
   going to
   have to insure that HTTP request that long get through and stuff like
   that
  
   I'm reaching a bit here, but you can facet on a tokenized field.
  Although
   that's not
   often done there's no prohibition against it.
  
   So, what if you had just one field for each city that contained some
   abstract
   information about your fares etc. Something like
   janstdfareclass1 jancheapfareclass3 febstdfareclass6
  
   Now just facet on that field? Not #values# in that field, just the
 field
   itself. You'd then have to make those into human-readable text, but
 that
   would considerably simplify your query. Probably only 

ramBufferSizeMB not reflected in segment sizes in index

2010-12-01 Thread Burton-West, Tom
We are using a recent Solr 3.x (See below for exact version).

We have set the ramBufferSizeMB to 320 in both the indexDefaults and the 
mainIndex sections of our solrconfig.xml:

<ramBufferSizeMB>320</ramBufferSizeMB>
<mergeFactor>20</mergeFactor>

We expected that this would mean that the index would not write to disk until 
it reached somewhere approximately over 300MB in size.
However, we see many small segments that look to be around 80MB in size.

We have not yet issued a single commit so nothing else should force a write to 
disk.

With a merge factor of 20 we also expected to see larger segments somewhere 
around 320 * 20 = 6GB in size, however we see several around 1GB.

We understand that the sizes are approximate, but these seem nowhere near what 
we expected.

Can anyone explain what is going on?

BTW
maxBufferedDocs is commented out, so this should not be affecting the buffer 
flushes
<!--<maxBufferedDocs>1000</maxBufferedDocs>-->


Solr Specification Version: 3.0.0.2010.11.19.16.00.54
Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
Lucene Specification Version: 3.1-SNAPSHOT
Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10

Tom Burton-West



Re: how to set maxFieldLength to unlimitd

2010-12-01 Thread jan.kurella
I don't know about upload limitations, but for sure there are some in  
the default settings, this could explain the limit of 20MB. Which  
upload mechanism on solr side do you use? I guess this is not a lucene  
problem but rather the http-layer of solr.

If you manage to stream your PDF and start parsing it on the stream  
you then should go for the filter, that sets the positionIncrement to  
0 as mentioned.

What we did once for PDF files: we parsed them into plain text beforehand
and indexed that (but we were using lucene directly) with a
streamReader.


Regards, Jan
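
(One concrete place to look, though this is an assumption about the setup rather
than something verified here, is the multipart upload cap in solrconfig.xml:)

  <requestDispatcher handleSelect="true">
    <!-- value is in KB; raise it well above the size of the largest PDF -->
    <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2097152"/>
  </requestDispatcher>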

Am 01.12.2010 um 18:13 schrieb ext Ma, Xiaohui (NIH/NLM/LHC) [C] 
xiao...@mail.nlm.nih.gov 
 :

 Thanks so much for your replay, Jan. I just found I cannot index pdf  
 files with the file size more than 20MB.

 I use curl index them, didn't get any error either. Do you have any  
 suggestions to index pdf files with more than 20MB?

 Thanks,
 Xiaohui

 -Original Message-
 From: jan.kure...@nokia.com [mailto:jan.kure...@nokia.com]
 Sent: Wednesday, December 01, 2010 11:30 AM
 To: solr-user@lucene.apache.org; solr-user-i...@lucene.apache.org; 
 solr-user-...@lucene.apache.org
 Subject: RE: how to set maxFieldLength to unlimitd

 You just can't set it to unlimited. What you could do, is ignoring  
 the positions and put a filter in, that sets the token for all but  
 the first token to 0 (means the field length will be just 1, all  
 tokens stacked on the first position)
 You could also break per page, so you put each page on a new  
 position.

 Jan

 -Original Message-
 From: ext Ma, Xiaohui (NIH/NLM/LHC) [C]  
 [mailto:xiao...@mail.nlm.nih.gov]
 Sent: Dienstag, 30. November 2010 19:49
 To: solr-user@lucene.apache.org; 'solr-user- 
 i...@lucene.apache.org'; 'solr-user-...@lucene.apache.org'
 Subject: how to set maxFieldLength to unlimitd

 I need index and search some pdf files which are very big (around  
 1000 pages each). How can I set maxFieldLength to unlimited?

 Thanks so much for your help in advance,
 Xiaohui


Re: ramBufferSizeMB not reflected in segment sizes in index

2010-12-01 Thread Michael McCandless
The ram efficiency (= size of segment once flushed divided by size of
RAM buffer) can vary drastically.

Because the in-RAM data structures must be growable (to append new
docs to the postings as they are encountered), the efficiency is never
100%.  I think 50% is actually a good ram efficiency, and lower than
that (even down to 27%) I think is still normal.

Do you have many unique or low-doc-freq terms?  That brings the efficiency down.

If you turn on IndexWriter's infoStream and post the output we can see
if anything odd is going on...

80 * 20 = ~1.6 GB so I'm not sure why you're getting 1 GB segments.
Do you do any deletions in this run?  A merged segment size will often
be less than the sum of the parts, especially if there are many terms
that are shared across segments; the infoStream will also show what merges
are taking place.

Mike

On Wed, Dec 1, 2010 at 2:13 PM, Burton-West, Tom tburt...@umich.edu wrote:
 We are using a recent Solr 3.x (See below for exact version).

 We have set the ramBufferSizeMB to 320 in both the indexDefaults and the 
 mainIndex sections of our solrconfig.xml:

 <ramBufferSizeMB>320</ramBufferSizeMB>
 <mergeFactor>20</mergeFactor>

 We expected that this would mean that the index would not write to disk until 
 it reached somewhere approximately over 300MB in size.
 However, we see many small segments that look to be around 80MB in size.

 We have not yet issued a single commit so nothing else should force a write 
 to disk.

 With a merge factor of 20 we also expected to see larger segments somewhere 
 around 320 * 20 = 6GB in size, however we see several around 1GB.

 We understand that the sizes are approximate, but these seem nowhere near 
 what we expected.

 Can anyone explain what is going on?

 BTW
 maxBufferedDocs is commented out, so this should not be affecting the buffer 
 flushes
 <!--<maxBufferedDocs>1000</maxBufferedDocs>-->


 Solr Specification Version: 3.0.0.2010.11.19.16.00.54
 Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
 Lucene Specification Version: 3.1-SNAPSHOT
 Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10

 Tom Burton-West




Solr highlighting is double-quotes-aware?

2010-12-01 Thread Scott Gonyea
Not sure how to write that subject line.  I'm getting some weird behavior out 
of the highlighter in Solr.  It seems like an edge case, but I'm curious to 
hear if this is known about, or if it's something worth looking into further.

Background:

I'm using Solr's highlighting facility to tag words, found in content crawled 
via Nutch. I split up the content based on those tags, which is later fed into 
a moderation process.

Sample Data (snippet from larger content):
[url=\http://www.sampleurl.com/baffle_prices.html\]baffle[/url]

(My hl.simple.pre is set to TEST_KEYWORD_START and my hl.simple.post is 
set to TEST_KEYWORD_END)

Query for baffle, and solr highlights it thus:

TEST_KEYWORD_STARTbaffle_prices.html\]baffleTEST_KEYWORD_END

What should be happening, is this:

TEST_KEYWORD_STARTbaffleTEST_KEYWORD_END_prices.html\]TEST_KEYWORD_STARTbaffleTEST_KEYWORD_END


Is there something about this data that makes the highlighter not want to split 
it up? Do I have to have Solr tokenize the words by some character that I 
somehow excluded?

Thank you,
Scott Gonyea

Re: ramBufferSizeMB not reflected in segment sizes in index

2010-12-01 Thread Shawn Heisey

On 12/1/2010 12:13 PM, Burton-West, Tom wrote:

We have set the ramBufferSizeMB to 320 in both the indexDefaults and the 
mainIndex sections of our solrconfig.xml:

<ramBufferSizeMB>320</ramBufferSizeMB>
<mergeFactor>20</mergeFactor>

We expected that this would mean that the index would not write to disk until 
it reached somewhere approximately over 300MB in size.
However, we see many small segments that look to be around 80MB in size.

We have not yet issued a single commit so nothing else should force a write to 
disk.

With a merge factor of 20 we also expected to see larger segments somewhere 
around 320 * 20 = 6GB in size, however we see several around 1GB.

We understand that the sizes are approximate, but these seem nowhere near what 
we expected.


I have seen this.  In Solr 1.4.1, the .fdt, .fdx, and the .tv* files do 
not segment, but all the other files do.  I can't remember whether it 
behaves the same under 3.1, or whether it also creates these files in 
each segment.


Here's the first segment created during a test reindex I just started, 
excluding the previously mentioned files, which will be prefixed by _57 
until I choose to optimize the index:


-rw-r--r-- 1 ncindex ncindex315 Dec  1 12:40 _58.fnm
-rw-r--r-- 1 ncindex ncindex   26000115 Dec  1 12:40 _58.frq
-rw-r--r-- 1 ncindex ncindex 399124 Dec  1 12:40 _58.nrm
-rw-r--r-- 1 ncindex ncindex   23879227 Dec  1 12:40 _58.prx
-rw-r--r-- 1 ncindex ncindex 205874 Dec  1 12:40 _58.tii
-rw-r--r-- 1 ncindex ncindex   16000953 Dec  1 12:40 _58.tis

My ramBufferSize is 256MB, and those files add up to about 66MB.  My 
guess is that it takes  256MB of RAM to represent what condenses down to 
66MB on the disk.


When it had accumulated 16 segments, it merged them down to this, all 
the while continuing to index.  This is about 870MB:


-rw-r--r-- 1 ncindex ncindex338 Dec  1 12:56 _5n.fnm
-rw-r--r-- 1 ncindex ncindex  376423659 Dec  1 12:58 _5n.frq
-rw-r--r-- 1 ncindex ncindex5726860 Dec  1 12:58 _5n.nrm
-rw-r--r-- 1 ncindex ncindex  331890058 Dec  1 12:58 _5n.prx
-rw-r--r-- 1 ncindex ncindex2037072 Dec  1 12:58 _5n.tii
-rw-r--r-- 1 ncindex ncindex  154470775 Dec  1 12:58 _5n.tis

If this merge were to happen 16 more times (256 segments created), it 
would then do a super-merge down to one very large segment.  In your 
case, with a mergeFactor of 20, that would take 400 segments.  I only 
ever saw this happen once - when I built a single index with all 49 
million documents in it.


Shawn



Re: schema design for related fields

2010-12-01 Thread Geert-Jan Brits
Indeed, selecting the best price for January OR April OR November and
sorting on it isn't possible with this solution (if that's what you mean).
However, any combination of selecting 1 month and/or 1 price-range and/or 1
fare-type IS possible.

2010/12/1 lee carroll lee.a.carr...@googlemail.com

 Hi Geert,

 Ok I think I follow. the magic is in the multi-valued field.

 The only danger would be complexity if we allow users to multi select
 months/prices/fare classes. For example they can search for first prices in
 jan, april and november. I think what you describe is possible in this case
 just complicated. I'll see if i can hack some facets into the proto type
 tommorrow. Thanks for your help

 Lee C

 On 1 December 2010 17:57, Geert-Jan Brits gbr...@gmail.com wrote:

  Ok longer answer than anticipated (and good conceptual practice ;-)
 
  Yeah I belief that would work if I understand correctly that:
 
  'in Jan [9]
  in feb [10]
  in march [1]'
 
  has nothing to do with pricing, but only with availability?
 
  If so you could seperate it out as two seperate issues:
 
  1. ) showing pricing (based on context)
  2. ) showing availabilities (based on context)
 
  For 1.)  you get 39 pricefields ([jan,feb,..,dec,dc] *
 [standard,first,dc])
  note: 'dc' indicates 'don't care.
 
  depending on the context you query the correct pricefield to populate the
  price facet-values.
  for discussion lets call the fields: _p[fare][date].
  IN other words the price field for no preference at all would become:
  _pdcdc
 
 
  For 2.) define a multivalued field 'FaresPerDate 'which indicate
  availability, which is used to display:
 
  A)
  Standard fares [10]
  First fares [3]
 
  B)
  in Jan [9]
  in feb [10]
  in march [1]
 
  A) depends on your selection (or dont caring) about a month
  B) vice versa depends on your selection (or dont caring)  about a fare
 type
 
  given all possible date values: [jan,feb,..dec,dontcare]
  given all possible fare values:[standard,first,dontcare]
 
  FaresPerDate consists of multiple values per document where each value
  indicates the availability of a combination of 'fare' and 'date':
 
 
 (standardJan,firstJan,DCjan...,standardJan,firstDec,DCdec,standardDC,firstDC,DCDC)
  Note that the nr of possible values = 39.
 
  Example:
  1. ) the user hasn't selected any preference:
 
  q=*:*facet.field:FaresPerDatefacet.query=_pdcdc:[0 TO
  20]facet.query=_pdcdc:[20 TO 40], etc.
 
  in the client you have to make sure to select the correct values of
  'FaresPerDate' for display:
  in this case:
 
  Standard fares [10] -- FaresPerDate.standardDC
  First fares [3] -- FaresPerDate.firstDC
 
  in Jan [9] - FaresPerDate.DCJan
  in feb [10] - FaresPerDate.DCFeb
  in march [1]- FaresPerDate.DCMarch
 
  2) the user has selected January
 
 q=*:*facet.field:FaresPerDatefq=FaresPerDate:DCJanfacet.query=_pDCJan:[0
  TO 20]facet.query=_pDCJan:[20 TO 40]
 
  Standard fares [10] -- FaresPerDate.standardJan
  First fares [3] -- FaresPerDate.firstJan
 
  in Jan [9] - FaresPerDate.DCJan
  in feb [10] - FaresPerDate.DCFeb
  in march [1]- FaresPerDate.DCMarch
 
  Hope that helps,
  Geert-Jan
 
 
  2010/12/1 lee carroll lee.a.carr...@googlemail.com
 
   Sorry Geert missed of the price value bit from the user interface so
 we'd
   display
  
   Facet price
   Standard fares [10]
   First fares [3]
  
   When traveling
   in Jan [9]
   in feb [10]
   in march [1]
  
   Fare Price
   0 - 25 :  [20]
   25 - 50: [10]
   50 - 100 [2]
  
   cheers lee c
  
  
   On 1 December 2010 17:00, lee carroll lee.a.carr...@googlemail.com
   wrote:
  
Geert
   
The UI would be something like:
user selections
for the facet price
max price: £100
fare class: any
   
city attributes facet
cityattribute1 etc: xxx
   
results displayed something like
   
Facet price
Standard fares [10]
First fares [3]
in Jan [9]
in feb [10]
in march [1]
etc
is this compatible with your approach ?
   
Erick the price is an interval scale ie a fare can be any value (not
   high,
low, medium etc)
   
How sensible would the following approach be
index city docs with fields only related to the city unique key
in the same index also index fare docs which would be something like:
Fare:
cityID: xxx
Fareclass:standard
FareMonth: Jan
FarePrice: 100
   
the query would be something like:
q=FarePrice:[* TO 100] FareMonth:Jan fl=cityID
returning facets for FareClass and FareMonth. hold on this will not
  facet
city docs correctly. sorry thasts not going to work.
   
   
   
   
   
   
   
   
On 1 December 2010 16:25, Erick Erickson erickerick...@gmail.com
   wrote:
   
Hmmm, that's getting to be a pretty clunky query sure enough. Now
  you're
going to
have to insure that HTTP request that long get through and stuff
 like
that
   
I'm reaching a bit here, but you can facet on a tokenized field.
   Although
 

RE: how to set maxFieldLength to unlimitd

2010-12-01 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thanks so much, Jan. I use curl to index pdf files. Is there another way to do it?

I changed the positionIncrement to 0, but I didn't get it to work either.

Thanks,
Xiaohui 
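
(For context, a typical curl call for pushing a PDF through the
ExtractingRequestHandler / Solr Cell, assuming that handler is configured;
the URL, literal id and field mapping are illustrative:)

curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=text&commit=true' \
  -F "myfile=@/path/to/big.pdf"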

-Original Message-
From: jan.kure...@nokia.com [mailto:jan.kure...@nokia.com] 
Sent: Wednesday, December 01, 2010 2:34 PM
To: solr-user@lucene.apache.org
Subject: Re: how to set maxFieldLength to unlimitd

I don't know about upload limitations, but for sure there are some in  
the default settings, this could explain the limit of 20MB. Which  
upload mechanism on solr side do you use? I guess this is not a lucene  
problem but rather the http-layer of solr.

If you manage to stream your PDF and start parsing it on the stream  
you then should go for the filter, that sets the positionIncrement to  
0 as mentioned.

What we did once for PDF files, we parsed them befor into plain text  
and where indexing this (but we were using lucene directly) with a  
streamReader.


Regards, Jan

Am 01.12.2010 um 18:13 schrieb ext Ma, Xiaohui (NIH/NLM/LHC) [C] 
xiao...@mail.nlm.nih.gov 
 :

 Thanks so much for your replay, Jan. I just found I cannot index pdf  
 files with the file size more than 20MB.

 I use curl index them, didn't get any error either. Do you have any  
 suggestions to index pdf files with more than 20MB?

 Thanks,
 Xiaohui

 -Original Message-
 From: jan.kure...@nokia.com [mailto:jan.kure...@nokia.com]
 Sent: Wednesday, December 01, 2010 11:30 AM
 To: solr-user@lucene.apache.org; solr-user-i...@lucene.apache.org; 
 solr-user-...@lucene.apache.org
 Subject: RE: how to set maxFieldLength to unlimitd

 You just can't set it to unlimited. What you could do, is ignoring  
 the positions and put a filter in, that sets the token for all but  
 the first token to 0 (means the field length will be just 1, all  
 tokens stacked on the first position)
 You could also break per page, so you put each page on a new  
 position.

 Jan

 -Original Message-
 From: ext Ma, Xiaohui (NIH/NLM/LHC) [C]  
 [mailto:xiao...@mail.nlm.nih.gov]
 Sent: Dienstag, 30. November 2010 19:49
 To: solr-user@lucene.apache.org; 'solr-user- 
 i...@lucene.apache.org'; 'solr-user-...@lucene.apache.org'
 Subject: how to set maxFieldLength to unlimitd

 I need index and search some pdf files which are very big (around  
 1000 pages each). How can I set maxFieldLength to unlimited?

 Thanks so much for your help in advance,
 Xiaohui


RE: ramBufferSizeMB not reflected in segment sizes in index

2010-12-01 Thread Burton-West, Tom
Thanks Mike,

Yes we have many unique terms due to dirty OCR and 400 languages and probably 
lots of low doc freq terms as well (although with the ICUTokenizer and 
ICUFoldingFilter we should get fewer terms due to bad tokenization and 
normalization.)

Is this additional overhead because each unique term takes a certain amount of 
space compared to adding entries to a list for an existing term?

Does turning on IndexWriter's infoStream have a significant impact on memory use 
or indexing speed?

If it does, I'll reproduce this on our test server rather than turning it on 
for a bit on the production indexer.  If it doesn't I'll turn it on and post 
here.

Tom
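
(For reference, assuming the stock solrconfig.xml layout, infoStream is switched on
in the indexing section like so, and the resulting file is what would get posted:)

  <indexDefaults>
    ...
    <!-- low-level IndexWriter debugging output, including flush and merge details -->
    <infoStream file="INFOSTREAM.txt">true</infoStream>
  </indexDefaults>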

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Wednesday, December 01, 2010 2:43 PM
To: solr-user@lucene.apache.org
Subject: Re: ramBufferSizeMB not reflected in segment sizes in index

The ram efficiency (= size of segment once flushed divided by size of
RAM buffer) can vary drastically.

Because the in-RAM data structures must be growable (to append new
docs to the postings as they are encountered), the efficiency is never
100%.  I think 50% is actually a good ram efficiency, and lower than
that (even down to 27%) I think is still normal.

Do you have many unique or low-doc-freq terms?  That brings the efficiency down.

If you turn on IndexWriter's infoStream and post the output we can see
if anything odd is going on...

80 * 20 = ~1.6 GB so I'm not sure why you're getting 1 GB segments.
Do you do any deletions in this run?  A merged segment size will often
be less than the sum of the parts, especially if there are many terms
but across segments these terms are shared but the infoStream will
also show what merges are taking place.

Mike

On Wed, Dec 1, 2010 at 2:13 PM, Burton-West, Tom tburt...@umich.edu wrote:
 We are using a recent Solr 3.x (See below for exact version).

 We have set the ramBufferSizeMB to 320 in both the indexDefaults and the 
 mainIndex sections of our solrconfig.xml:

 <ramBufferSizeMB>320</ramBufferSizeMB>
 <mergeFactor>20</mergeFactor>

 We expected that this would mean that the index would not write to disk until 
 it reached somewhere approximately over 300MB in size.
 However, we see many small segments that look to be around 80MB in size.

 We have not yet issued a single commit so nothing else should force a write 
 to disk.

 With a merge factor of 20 we also expected to see larger segments somewhere 
 around 320 * 20 = 6GB in size, however we see several around 1GB.

 We understand that the sizes are approximate, but these seem nowhere near 
 what we expected.

 Can anyone explain what is going on?

 BTW
 maxBufferedDocs is commented out, so this should not be affecting the buffer 
 flushes
 <!--<maxBufferedDocs>1000</maxBufferedDocs>-->


 Solr Specification Version: 3.0.0.2010.11.19.16.00.54
 Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
 Lucene Specification Version: 3.1-SNAPSHOT
 Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10

 Tom Burton-West




Re: entire farm fails at the same time with OOM issues

2010-12-01 Thread Peter Karich

 also try to reduce maxWarmingSearchers to 1(?) or 2.
And decrease cache usage (especially autowarming) if possible at all. 
But again: only if it doesn't affect performance ...


Regards,
Peter.
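
(In solrconfig.xml terms that would look roughly like this; the values are only examples:)

  <maxWarmingSearchers>2</maxWarmingSearchers>
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>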


On Tue, Nov 30, 2010 at 6:04 PM, Robert Petersenrober...@buy.com  wrote:

My question is this.  Why in the world would all of my slaves, after
running fine for some days, suddenly all at the exact same minute
experience OOM heap errors and go dead?

If there is no change in query traffic when this happens, then it's
due to what the index looks like.

My guess is a large index merge happened, which means that when the
searchers re-open on the new index, it requires more memory than
normal (much less can be shared with the previous index).

I'd try bumping the heap a little bit, and then optimizing once a day
during off-peak hours.
If you still get OOM errors, bump the heap a little more.

-Yonik
http://www.lucidimagination.com




Re: ramBufferSizeMB not reflected in segment sizes in index

2010-12-01 Thread Michael McCandless
On Wed, Dec 1, 2010 at 3:16 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Thanks Mike,

 Yes we have many unique terms due to dirty OCR and 400 languages and probably 
 lots of low doc freq terms as well (although with the ICUTokenizer and 
 ICUFoldingFilter we should get fewer terms due to bad tokenization and 
 normalization.)

OK likely this explains the lowish RAM efficiency.

 Is this additional overhead because each unique term takes a certain amount 
 of space compared to adding entries to a list for an existing term?

Exactly.  There's a highish startup cost for each term but then
appending docs/positions to that term is more efficient especially for
higher frequency terms.  In the limit, a single unique term  across
all docs will have very high RAM efficiency...

 Does turning on IndexWriters infostream have a significant impact on memory 
 use or indexing speed?

I don't believe so

Mike


RE: entire farm fails at the same time with OOM issues

2010-12-01 Thread Robert Petersen
Good idea.  Our farm is behind Akamai so that should be ok to do.

-Original Message-
From: Peter Karich [mailto:peat...@yahoo.de] 
Sent: Wednesday, December 01, 2010 12:21 PM
To: solr-user@lucene.apache.org
Subject: Re: entire farm fails at the same time with OOM issues


  also try to minimize maxWarming searchers to 1(?) or 2.
And decrease cache usage (especially autowarming) if possible at all. 
But again: only if it doesn't affect performance ...

Regards,
Peter.

 On Tue, Nov 30, 2010 at 6:04 PM, Robert Petersenrober...@buy.com
wrote:
 My question is this.  Why in the world would all of my slaves, after
 running fine for some days, suddenly all at the exact same minute
 experience OOM heap errors and go dead?
 If there is no change in query traffic when this happens, then it's
 due to what the index looks like.

 My guess is a large index merge happened, which means that when the
 searchers re-open on the new index, it requires more memory than
 normal (much less can be shared with the previous index).

 I'd try bumping the heap a little bit, and then optimizing once a day
 during off-peak hours.
 If you still get OOM errors, bump the heap a little more.

 -Yonik
 http://www.lucidimagination.com



Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Jacob Elder
On Tue, Nov 30, 2010 at 10:07 AM, Robert Muir rcm...@gmail.com wrote:

 On Tue, Nov 30, 2010 at 9:45 AM, Jacob Elder jel...@locamoda.com wrote:
  Right. CJK doesn't tend to have a lot of whitespace to begin with. In the
  past, we were using a patched version of StandardTokenizer which treated
  @twitteruser and #hashtag better, but this became a release engineering
  nightmare so we switched to Whitespace.

 in this case, have you considered using a CharFilter (e.g.
 MappingCharFilter) before the tokenizer?

 This way you could map your special things such as @ and # to some
 other string that the tokenizer doesnt split on,
 e.g. # = HASH_.

 then your #foobar goes to HASH_foobar.
 If you want searches of #foobar to only match #foobar and not also
 foobar itself, and vice versa, you are done.
 Maybe you want searches of #foobar to only match #foobar, but searches
 of foobar to match both #foobar and foobar.
 In this case, you would probably use a worddelimiterfilter w/
 preserveOriginal at index-time only , followed by a StopFilter
 containing HASH, so you index HASH_foobar and foobar.

 anyway i think you have a lot of flexibility to reuse
 standardtokenizer but customize things like this without maintaining
 your own tokenizer, this is the purpose of CharFilters.


That worked brilliantly. Thank you very much, Robert.
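
(For anyone following the thread, a sketch of that kind of setup; the type name,
mapping file name and exact mappings below are made up:)

  <fieldType name="text_social" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- map twitter-style prefixes to strings the tokenizer will not split on -->
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-social.txt"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
    </analyzer>
  </fieldType>

with mapping-social.txt containing lines such as:

  "#" => "HASH_"
  "@" => "AT_"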

-- 
Jacob Elder
@jelder
(646) 535-3379


Re: Return Lucene DocId in Solr Results

2010-12-01 Thread Sasank Mudunuri
Take this with a sizeable grain of salt as I haven't actually tried doing
this. But you might try using an IndexReader which it looks like you can get
from this class:

http://lucene.apache.org/solr/api/org/apache/solr/core/StandardIndexReaderFactory.html

sasank

On Tue, Nov 30, 2010 at 6:45 AM, Lohrenz, Steven
steven.lohr...@hmhpub.comwrote:

 Hmm, I found some similar queries on stackoverflow and they did not
 recommend exposing the lucene docId.

 So, I guess my question becomes: What is the best way, from within my
 custom QParser, to take a list of solr primary keys (that were retrieved
 from elsewhere) and turn them into docIds? I also saw something about
 cacheing them using a Field Cache - how would I do that?

 Thanks,
 Steve

 -Original Message-
 From: Lohrenz, Steven [mailto:steven.lohr...@hmhpub.com]
 Sent: 30 November 2010 11:57
 To: solr-user@lucene.apache.org
 Subject: Return Lucene DocId in Solr Results

 Hi,

 I was wondering how I would go about getting the lucene docid included in
 the results from a solr query?

 I've built a QueryParser to query another solr instance and and join the
 results of the two instances through the use of a Filter.  The Filter needs
 the lucene docid to work. This is the only bit I'm missing right now.

 Thanks,
 Steve
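
(Purely as a sketch, untested, assuming Lucene 2.9 / Solr 1.4-era APIs and a unique
key field called "id", resolving key values to internal doc ids inside a filter
might look like this:)

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.util.OpenBitSet;

public class KeyToDocIdMapper {
  /** Resolve unique-key values to a bit set of internal doc ids for this reader. */
  public static OpenBitSet docIdsForKeys(IndexReader reader, String keyField,
                                         Iterable<String> keys) throws IOException {
    OpenBitSet bits = new OpenBitSet(reader.maxDoc());
    TermDocs termDocs = reader.termDocs();
    try {
      for (String key : keys) {
        termDocs.seek(new Term(keyField, key));
        while (termDocs.next()) {   // a unique key should match at most one live doc
          bits.set(termDocs.doc());
        }
      }
    } finally {
      termDocs.close();
    }
    return bits;
  }
}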




Re: Return Lucene DocId in Solr Results

2010-12-01 Thread Erick Erickson
On the face of it, this doesn't make sense, so perhaps you can explain a
bit. The doc IDs
from one Solr instance have no relation to the doc IDs from another Solr
instance. So anything
that uses doc IDs from one Solr instance to create a filter on another
instance doesn't seem
to be something you'd want to do...

Which may just mean I don't understand what you're trying to do. Can you
back up a bit
and describe the higher-level problem? This seems like it may be an XY
problem, see:
http://people.apache.org/~hossman/#xyproblem

Best
Erick

On Tue, Nov 30, 2010 at 6:57 AM, Lohrenz, Steven
steven.lohr...@hmhpub.comwrote:

 Hi,

 I was wondering how I would go about getting the lucene docid included in
 the results from a solr query?

 I've built a QueryParser to query another solr instance and and join the
 results of the two instances through the use of a Filter.  The Filter needs
 the lucene docid to work. This is the only bit I'm missing right now.

 Thanks,
 Steve




Re: ArrayIndexOutOfBoundsException in sort

2010-12-01 Thread Jerry Li
Got it with thanks.

On Wed, Dec 1, 2010 at 8:02 PM, Ahmet Arslan iori...@yahoo.com wrote:

  It seems work fine again after I change author field type
  from text to
  string, could anybody give some info about it? very
  appriciated.


 http://wiki.apache.org/solr/FAQ#Why_Isn.27t_Sorting_Working_on_my_Text_Fields.3F

 And also see Erick's explanation

 http://search-lucene.com/m/7fnj1TtNde/sort+on+a+tokenized+fieldsubj=Re+Solr+sorting+problem
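
(For reference, the usual pattern, with illustrative field names, is to keep the
tokenized field for searching and copy it into an untokenized field for sorting:)

  <field name="author" type="text" indexed="true" stored="true"/>
  <field name="author_sort" type="string" indexed="true" stored="false"/>
  <copyField source="author" dest="author_sort"/>
  <!-- then sort with: sort=author_sort asc -->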






-- 

Best Regards.
Jerry. Li



Re: spatial query parinsg error: org.apache.lucene.queryParser.ParseException

2010-12-01 Thread Jean-Sebastien Vachon

Try this...

http://localhost:8080/solr/select?wt=json&indent=true&q={!spatial%20lat=37.326375%20lng=-121.892639%20radius=3%20unit=km%20threadCount=3}title:Art%20Loft

- Original Message - 
From: Dennis Gearon gear...@sbcglobal.net

To: solr-user@lucene.apache.org
Sent: Wednesday, December 01, 2010 7:51 PM
Subject: spatial query parinsg error: 
org.apache.lucene.queryParser.ParseException



I am trying to get spatial search to work on my Solr installation. I am 
running

version 1.4.1 with the Jayway Team spatial-solr-plugin. I am performing the
search with the following url:

http://localhost:8080/solr/select?wt=json&indent=true&q=title:Art%20Loft{!spatial%20lat=37.326375%20lng=-121.892639%20radius=3%20unit=km%20threadCount=3}


The result that I get is the following error:

HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse
'title:Art Loft{!spatial lat=37.326375 lng=-121.892639 radius=3 unit=km
threadCount=3}': Encountered  RANGEEX_GOOP lng=-121.892639  at line 1,
column 38. Was expecting: }

Not sure why it would be complaining about the lng parameter in the query. I
double-checked to make sure that I had the right name for the longitude 
field in

my solrconfig.xml file.

Any help/suggestions would be greatly appreciated

Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better
idea to learn from others' mistakes, so you do not have to make them 
yourself.

from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



Re: Preventing index segment corruption when windows crashes

2010-12-01 Thread Lance Norskog
Is there any chance that Windows 7 and the disk drivers are not honoring
the fsync() calls? That would cause files and/or blocks to get written out
of order.

On Tue, Nov 30, 2010 at 3:24 PM, Peter Sturge peter.stu...@gmail.com wrote:
 After a recent Windows 7 crash (:-\), upon restart, Solr starts giving
 LockObtainFailedException errors: (excerpt)

   30-Nov-2010 23:10:51 org.apache.solr.common.SolrException log
   SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock
 obtain timed out:
 nativefsl...@solr\.\.\data0\index\lucene-ad25f73e3c87e6f192c4421756925f47-write.lock


 When I run CheckIndex, I get: (excerpt)

  30 of 30: name=_2fi docCount=857
    compound=false
    hasProx=true
    numFiles=8
    size (MB)=0.769
    diagnostics = {os.version=6.1, os=Windows 7, lucene.version=3.1-dev 
 ${svnver
 sion} - 2010-09-11 11:09:06, source=flush, os.arch=amd64, 
 java.version=1.6.0_18,
 java.vendor=Sun Microsystems Inc.}
    no deletions
    test: open reader.FAILED
    WARNING: fixIndex() would remove reference to this segment; full exception:
 org.apache.lucene.index.CorruptIndexException: did not read all bytes from 
 file
 _2fi.fnm: read 1 vs size 512
        at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:367)
        at org.apache.lucene.index.FieldInfos.init(FieldInfos.java:71)
        at 
 org.apache.lucene.index.SegmentReader$CoreReaders.init(SegmentReade
 r.java:119)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:583)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:561)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:467)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:878)

 WARNING: 1 broken segments (containing 857 documents) detected


 This seems to happen every time Windows 7 crashes, and it would seem
 extraordinarily bad luck for this tiny test index to be in the middle of
 a commit every time.
 (it is set to commit every 40secs, but for such a small index it only
 takes millis to complete)

 Does this seem right? I don't remember seeing so many corruptions in
 the index - maybe it is the world of Win7 dodgy drivers, but it would
 be worth investigating if there's something amiss in Solr/Lucene when
 things go down unexpectedly...

 Thanks,
 Peter


 On Tue, Nov 30, 2010 at 9:19 AM, Peter Sturge peter.stu...@gmail.com wrote:
 The index itself isn't corrupt - just one of the segment files. This
 means you can read the index (less the offending segment(s)), but once
 this happens it's no longer possible to
 access the documents that were in that segment (they're gone forever),
 nor write/commit to the index (depending on the env/request, you get
 'Error reading from index file..' and/or WriteLockError)
 (note that for my use case, documents are dynamically created so can't
 be re-indexed).

 Restarting Solr fixes the write lock errors (an indirect environmental
 symptom of the problem), and running CheckIndex -fix is the only way
 I've found to repair the index so it can be written to (rewrites the
 corrupted segment(s)).

 I guess I was wondering if there's a mechanism that would support
 something akin to a transactional rollback for segments.

 Thanks,
 Peter



 On Mon, Nov 29, 2010 at 5:33 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
 On Mon, Nov 29, 2010 at 10:46 AM, Peter Sturge peter.stu...@gmail.com 
 wrote:
 If a Solr index is running at the time of a system halt, this can
 often corrupt a segments file, requiring the index to be -fix'ed by
 rewriting the offending file.

 Really?  That shouldn't be possible (if you mean the index is truly
 corrupt - i.e. you can't open it).

 -Yonik
 http://www.lucidimagination.com






-- 
Lance Norskog
goks...@gmail.com
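
[For reference, the CheckIndex -fix step mentioned above can also be run
programmatically. A minimal sketch, assuming the Lucene 2.9/3.x API bundled
with Solr 1.4 and an illustrative index path; Solr must be stopped first, and
fixing drops whatever documents were in the broken segments.]

import java.io.File;

import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class RepairIndex {
  public static void main(String[] args) throws Exception {
    // Path is illustrative - point it at the affected core's data/index directory.
    Directory dir = FSDirectory.open(new File("solr/data/index"));
    CheckIndex checker = new CheckIndex(dir);
    checker.setInfoStream(System.out);        // print the same report CheckIndex's main() prints
    CheckIndex.Status status = checker.checkIndex();
    if (!status.clean) {
      checker.fixIndex(status);               // rewrites the segments file, dropping broken segments
    }
    dir.close();
  }
}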


Re: spatial query parsing error: org.apache.lucene.queryParser.ParseException

2010-12-01 Thread Dennis Gearon
Thanks Jean-Sebastien. I forwarded it to my partner. His membership is still 
being held up.

I'll be the go between until he has access.

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others' mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Jean-Sebastien Vachon js.vac...@videotron.ca
To: solr-user@lucene.apache.org
Sent: Wed, December 1, 2010 7:12:20 PM
Subject: Re: spatial query parsing error: 
org.apache.lucene.queryParser.ParseException

Try this...

http://localhost:8080/solr/select?wt=json&indent=true&q={!spatial%20lat=37.326375%20lng=-121.892639%20radius=3%20unit=km%20threadCount=3}title:Art%20Loft


- Original Message - From: Dennis Gearon gear...@sbcglobal.net
To: solr-user@lucene.apache.org
Sent: Wednesday, December 01, 2010 7:51 PM
Subject: spatial query parsing error: 
org.apache.lucene.queryParser.ParseException


I am trying to get spatial search to work on my Solr installation. I am running
version 1.4.1 with the Jayway Team spatial-solr-plugin. I am performing the
search with the following url:

http://localhost:8080/solr/select?wt=json&indent=true&q=title:Art%20Loft{!spatial%20lat=37.326375%20lng=-121.892639%20radius=3%20unit=km%20threadCount=3}



The result that I get is the following error:

HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse
'title:Art Loft{!spatial lat=37.326375 lng=-121.892639 radius=3 unit=km
threadCount=3}': Encountered  RANGEEX_GOOP lng=-121.892639  at line 1,
column 38. Was expecting: }

Not sure why it would be complaining about the lng parameter in the query. I
double-checked to make sure that I had the right name for the longitude field in
my solrconfig.xml file.

Any help/suggestions would be greatly appreciated

Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a better
idea to learn from others' mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.


Re: spatial query parsing error: org.apache.lucene.queryParser.ParseException

2010-12-01 Thread Jean-Sebastien Vachon
I just saw the parameter 'lng' in your query... I believe it should be 
'long'. Give it a try if the link I sent you is not working


- Original Message - 
From: Dennis Gearon gear...@sbcglobal.net

To: solr-user@lucene.apache.org
Sent: Wednesday, December 01, 2010 11:39 PM
Subject: Re: spatial query parsing error: 
org.apache.lucene.queryParser.ParseException



Thanks Jean-Sebastien. I forwarded it to my partner. His membership is still
being held up.

I'll be the go between until he has access.

Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better
idea to learn from others' mistakes, so you do not have to make them 
yourself.

from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Jean-Sebastien Vachon js.vac...@videotron.ca
To: solr-user@lucene.apache.org
Sent: Wed, December 1, 2010 7:12:20 PM
Subject: Re: spatial query parsing error:
org.apache.lucene.queryParser.ParseException

Try this...

http://localhost:8080/solr/select?wt=json&indent=true&q={!spatial%20lat=37.326375%20lng=-121.892639%20radius=3%20unit=km%20threadCount=3}title:Art%20Loft


- Original Message - From: Dennis Gearon gear...@sbcglobal.net
To: solr-user@lucene.apache.org
Sent: Wednesday, December 01, 2010 7:51 PM
Subject: spatial query parsing error:
org.apache.lucene.queryParser.ParseException


I am trying to get spatial search to work on my Solr installation. I am 
running

version 1.4.1 with the Jayway Team spatial-solr-plugin. I am performing the
search with the following url:

http://localhost:8080/solr/select?wt=json&indent=true&q=title:Art%20Loft{!spatial%20lat=37.326375%20lng=-121.892639%20radius=3%20unit=km%20threadCount=3}



The result that I get is the following error:

HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse
'title:Art Loft{!spatial lat=37.326375 lng=-121.892639 radius=3 unit=km
threadCount=3}': Encountered  RANGEEX_GOOP lng=-121.892639  at line 1,
column 38. Was expecting: }

Not sure why it would be complaining about the lng parameter in the query. I
double-checked to make sure that I had the right name for the longitude 
field in

my solrconfig.xml file.

Any help/suggestions would be greatly appreciated

Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better
idea to learn from others' mistakes, so you do not have to make them 
yourself.

from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die. 



best way to get maxDocs in java (i.e. as on stats.jsp page).

2010-12-01 Thread Will Milspec
hi all,

What's the best way, programmatically in Java, to get the 'maxDoc' attribute
(as seen on the stats.jsp page)?

I don't see any hooks in the SolrJ API.

Currently I plan to use an HTTP client to fetch stats.jsp (which returns XML)
and parse it with XPath.

If anyone can recommend a better approach, please opine.

thanks

will


Re: spatial query parsing error: org.apache.lucene.queryParser.ParseException

2010-12-01 Thread Dennis Gearon
Forwarded to my partner, thx, will let you know.

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others' mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Jean-Sebastien Vachon js.vac...@videotron.ca
To: solr-user@lucene.apache.org
Sent: Wed, December 1, 2010 8:50:58 PM
Subject: Re: spatial query parsing error: 
org.apache.lucene.queryParser.ParseException

I just saw the parameter 'lng' in your query... I believe it should be 'long'. 
Give it a try if the link I sent you is not working

- Original Message - From: Dennis Gearon gear...@sbcglobal.net
To: solr-user@lucene.apache.org
Sent: Wednesday, December 01, 2010 11:39 PM
Subject: Re: spatial query parsing error: 
org.apache.lucene.queryParser.ParseException


Thanks Jean-Sebastien. I forwarded it to my partner. His membership is still
being held up.

I'll be the go between until he has access.

Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a better
idea to learn from others' mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Jean-Sebastien Vachon js.vac...@videotron.ca
To: solr-user@lucene.apache.org
Sent: Wed, December 1, 2010 7:12:20 PM
Subject: Re: spatial query parsing error:
org.apache.lucene.queryParser.ParseException

Try this...

http://localhost:8080/solr/select?wt=json&indent=true&q={!spatial%20lat=37.326375%20lng=-121.892639%20radius=3%20unit=km%20threadCount=3}title:Art%20Loft



- Original Message - From: Dennis Gearon gear...@sbcglobal.net
To: solr-user@lucene.apache.org
Sent: Wednesday, December 01, 2010 7:51 PM
Subject: spatial query parsing error:
org.apache.lucene.queryParser.ParseException


I am trying to get spatial search to work on my Solr installation. I am running
version 1.4.1 with the Jayway Team spatial-solr-plugin. I am performing the
search with the following url:

http://localhost:8080/solr/select?wt=json&indent=true&q=title:Art%20Loft{!spatial%20lat=37.326375%20lng=-121.892639%20radius=3%20unit=km%20threadCount=3}




The result that I get is the following error:

HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse
'title:Art Loft{!spatial lat=37.326375 lng=-121.892639 radius=3 unit=km
threadCount=3}': Encountered  RANGEEX_GOOP lng=-121.892639  at line 1,
column 38. Was expecting: }

Not sure why it would be complaining about the lng parameter in the query. I
double-checked to make sure that I had the right name for the longitude field in
my solrconfig.xml file.

Any help/suggestions would be greatly appreciated

Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a better
idea to learn from others' mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die. 



Re: best way to get maxDocs in java (i.e. as on stats.jsp page).

2010-12-01 Thread Koji Sekiguchi

(10/12/02 13:51), Will Milspec wrote:

hi all,

What's the best way, programmatically in Java, to get the 'maxDoc' attribute
(as seen on the stats.jsp page)?

I don't see any hooks in the SolrJ API.

Currently I plan to use an HTTP client to fetch stats.jsp (which returns XML)
and parse it with XPath.

If anyone can recommend a better approach, please opine.

thanks

will


Will,

Try:
http://localhost:8983/solr/admin/luke

LukeRequestHandler
http://wiki.apache.org/solr/LukeRequestHandler

Koji
--
http://www.rondhuit.com/en/
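
[For reference, the Luke handler can also be reached through SolrJ rather than
parsing stats.jsp by hand. A minimal sketch, assuming SolrJ 1.4 and that
/admin/luke is registered (it is in the example solrconfig.xml); the URL is
illustrative.]

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.LukeRequest;
import org.apache.solr.client.solrj.response.LukeResponse;
import org.apache.solr.common.util.NamedList;

public class MaxDocExample {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    LukeRequest req = new LukeRequest();      // defaults to the /admin/luke handler
    LukeResponse rsp = req.process(server);
    // The "index" section of the Luke response carries numDocs/maxDoc.
    NamedList<Object> indexInfo = rsp.getIndexInfo();
    System.out.println("maxDoc = " + indexInfo.get("maxDoc"));
    System.out.println("numDocs = " + indexInfo.get("numDocs"));
  }
}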


problems with custom SolrCache.init() - fails on startup

2010-12-01 Thread Kevin Osborn
My project has a couple of custom caches that descend from FastLRUCache. These 
worked fine in Solr 1.3. Then I started migrating my project to Solr 1.4.1 and 
had problems during startup.

I believe the problem is that I attempt to access the core in the init process. 
I currently use the deprecated SolrCore.getSolrCore(), but had the same problem 
when attempting to use CoreContainer. During some initialization process, I 
need 
access to the IndexSchema object. I assume the problem is because startup must 
create objects in a different order now.

Does anyone have any suggestions on how to get access to the core 
infrastructure 
at the startup of the caches?


  

Restrict access to localhost

2010-12-01 Thread Ganesh
Hello all,

1)
I want to restrict access to Solr to localhost only. How can I achieve that?

2)
What if I want to allow clients to search but not to delete? How can I restrict
that access?

Any thoughts?

Regards
Ganesh.
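
[No replies in this digest, but for (1) the usual approach with the Jetty that
ships in the Solr example is to bind the connector to the loopback interface. A
hedged etc/jetty.xml sketch, Jetty 6 era; the connector class shown is
illustrative, so use whichever connector your jetty.xml already declares.]

<!-- Bind the HTTP connector to localhost only. -->
<Call name="addConnector">
  <Arg>
    <New class="org.mortbay.jetty.bio.SocketConnector">
      <Set name="host">127.0.0.1</Set>
      <Set name="port">8983</Set>
    </New>
  </Arg>
</Call>

For (2), the common pattern is to put Solr behind a reverse proxy (or use
servlet-container security constraints) so that only the /select handler is
exposed to clients, while /update and the admin handlers stay internal.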