Re: Need help with DIH dataconfig.xml
Use TemplateTransformer:

<dataConfig>
  <dataSource name="wld" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/wld" user="root" password="pass"/>
  <document name="variants">
    <entity name="III_1_1" query="SELECT * FROM `wld`.`III_1_1`" transformer="TemplateTransformer">
      <field column="id" template="${III_1_1.id}III_1_1"/>
      <field column="lemmatitel" name="lemma"/>
      <field column="vraagtekst" name="vraagtekst"/>
      <field column="lexical_variant" name="variant"/>
    </entity>
    <entity name="III_1_2" query="SELECT * FROM `wld`.`III_1_2`">
      <field column="id" name="${III_1_2_ + id}"/>
      <field column="lemmatitel" name="lemma"/>
      <field column="vraagtekst" name="vraagtekst"/>
      <field column="lexical_variant" name="variant"/>
    </entity>
  </document>
</dataConfig>

On Wed, Jun 15, 2011 at 4:41 PM, MartinS martin.snijd...@gmail.com wrote:

Hello,

I want to perform a data import from a relational database. That all works well. However, I want to dynamically create a unique id for my Solr documents while importing, by using my data config file. I can't get it to work; maybe it's not possible this way, but I thought I would ask you all. (I set up schema.xml to use the field id as the unique id for Solr documents.) My Solr config looks like this:

<dataConfig>
  <dataSource name="wld" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/wld" user="root" password="pass"/>
  <document name="variants">
    <entity name="III_1_1" query="SELECT * FROM `wld`.`III_1_1`">
      <field column="id" name="${variants.name + id}"/>
      <field column="lemmatitel" name="lemma"/>
      <field column="vraagtekst" name="vraagtekst"/>
      <field column="lexical_variant" name="variant"/>
    </entity>
    <entity name="III_1_2" query="SELECT * FROM `wld`.`III_1_2`">
      <field column="id" name="${III_1_2_ + id}"/>
      <field column="lemmatitel" name="lemma"/>
      <field column="vraagtekst" name="vraagtekst"/>
      <field column="lexical_variant" name="variant"/>
    </entity>
  </document>
</dataConfig>

For a unique id I would like to concatenate the primary key of the table (column id) with the table name. How can I do this? Both ways shown in the example data config don't work while importing. Any help is appreciated.

Martin

--
Noble Paul
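To see what the transformer buys (my own illustration, assuming DIH's usual ${entity.column} placeholder resolution):

  row in III_1_1 with id=42
  template "${III_1_1.id}III_1_1"  ->  document id "42III_1_1"

so ids from different tables can no longer collide. The same transformer="TemplateTransformer" plus template attribute would be applied to the III_1_2 entity as well; the name="${III_1_2_ + id}" form from the original config is not evaluated as concatenation, since the name attribute only renames the target field.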
Re: fieldCache problem OOM exception
Hi Erik,

yes, I'm sorting and faceting.

1) Fields for sorting:
sort=f_dccreator_sort, sort=f_dctitle, sort=f_dcyear
The parameter facet.sort= is empty; I only use the parameter sort=.

2) Fields for faceting:
f_dcperson, f_dcsubject, f_dcyear, f_dccollection, f_dclang, f_dctypenorm, f_dccontenttype
Other faceting parameters:
...&facet=true&facet.mincount=1&facet.limit=100&facet.sort=&facet.prefix=...

3) The LukeRequestHandler takes too long for my huge index, so these unique-term counts are from the standalone Luke (compiled for Solr 3.2):

f_dccreator_sort = 10.029.196
f_dctitle        = 21.514.939
f_dcyear         =      1.471
f_dcperson       = 14.138.165
f_dcsubject      =  8.012.319
f_dccollection   =      1.863
f_dclang         =        299
f_dctypenorm     =         14
f_dccontenttype  =        497

numDocs: 28.940.964
numTerms: 686.813.235
optimized: true
hasDeletions: false

What can you read/calculate from these values? Is my index too big for Lucene/Solr? What I don't understand is why the fieldCache is not garbage collected, and therefore reduced in size, from time to time.

Regards,
Bernd

Am 15.06.2011 17:50, schrieb Erick Erickson:

The first question I have is whether you're sorting and/or faceting on many unique string values? I'm guessing that somewhere you are. So, some questions to help pin it down:
1) what fields are you sorting on?
2) what fields are you faceting on?
3) how many unique terms in each (see the solr admin page)?

Best
Erick

On Wed, Jun 15, 2011 at 8:22 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote:

Dear list,

after getting an OOM exception after one week of operation with Solr 3.2, I used MemoryAnalyzer on the heap dump file. It looks like the fieldCache eats up all memory:

Class                                                      Objects    Shallow Heap      Retained Heap
org.apache.lucene.search.FieldCache                              0               0   >= 14,636,950,632
org.apache.lucene.search.FieldCacheImpl                          1              32   >= 14,636,950,384
org.apache.lucene.search.FieldCacheImpl$StringIndexCache         1              32   >= 14,636,947,080
org.apache.lucene.search.FieldCache$StringIndex                 10             320   >= 14,636,944,352
java.lang.String[]                                             519     567,811,040   >= 13,503,733,312
char[]                                                  81,766,595  11,604,293,712   >= 11,604,293,712

fieldCache retains over 14g of heap. When looking on the stats page under fieldCache, the description says: "Provides introspection of the Lucene FieldCache, this is **NOT** a cache that is managed by Solr." So is this a Jetty problem and not Solr? Why is fieldCache growing and growing until OOM?

Regards,
Bernd
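The dump itself allows a rough consistency check (my own back-of-the-envelope arithmetic, assuming Lucene 3.x's FieldCache.StringIndex layout of one int per document plus one String per unique term, per sorted/faceted field):

  81,766,595 cached char[] holding 11,604,293,712 B  ->  ~142 B of character data per cached term
  order array per field: 28,940,964 docs * 4 B       ->  ~110 MB

With roughly 54 million unique values summed over the listed fields (f_dctitle alone has 21.5M), the term strings plus their String wrappers plus the per-field int arrays land in the same mid-teens-GB region the dump reports, so this looks like the genuine cost of sorting/faceting on high-cardinality string fields rather than a leak. FieldCache entries are keyed by the IndexReader and only become collectible once that reader is closed, which is why the garbage collector never shrinks the cache while a searcher is open (and why usage can transiently grow while an old and a new searcher coexist).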
Re: Copying few field using copyField to non multiValued field
Hi Omri,

there are two limitations:
1. You can't sort on a multiValued field. (Anyway, on which of the copied fields would you want to sort first?)
2. You can't make the multiValued field the unique key.

Neither is a real limitation:
1. Better to sort on at_country, at_state, at_city instead.
2. Simply choose another unique key field. (Your location wouldn't be unique anyway.)

Greetings,
Kuli

Am 16.06.2011 06:40, schrieb Omri Cohen:

I just don't want to suffer all the limitations a multiValued field has.. (it does have some limitations, doesn't it?) I just remember I read somewhere that it does.
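A minimal schema sketch of that setup (my own example with invented field names; untested):

<field name="at_country" type="string" indexed="true" stored="true"/>
<field name="at_state"   type="string" indexed="true" stored="true"/>
<field name="at_city"    type="string" indexed="true" stored="true"/>
<!-- search-only catch-all; multiValued because several sources are copied in -->
<field name="location" type="string" indexed="true" stored="false" multiValued="true"/>

<copyField source="at_country" dest="location"/>
<copyField source="at_state"   dest="location"/>
<copyField source="at_city"    dest="location"/>

Queries then search the combined location field, while sorting stays on the single-valued originals, e.g. sort=at_country asc,at_state asc,at_city asc.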
Re: DIH abort doesn't close datasources
On Wed, Jun 15, 2011 at 8:10 PM, Frank Wesemann f.wesem...@fotofinder.net wrote:

Hi,
I just came across this: if I abort an import via /dataimport/?command=abort, the connections to the (in my case) database stay open. Shouldn't DocBuilder#rollback() call something like cleanup(), which in turn tries to close EntityProcessors, DataSources etc., instead of relying on finalize() to sometimes do its job?

The abort command just sets an atomic boolean flag which is checked frequently by the import threads to see if they should stop. If you look at DataImporter.java's doFullImport or doDeltaImport methods, you'll see that config.clearCaches is the cleanup method which is called in a finally block. So the data sources should be closed once the import actually aborts. Note that there may be a time lag between calling the abort method and the import actually getting aborted if the import threads are waiting for I/O.

--
Regards,
Shalin Shekhar Mangar.
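A self-contained sketch of the control flow Shalin describes (paraphrased; the real DIH classes and signatures differ):

import java.util.concurrent.atomic.AtomicBoolean;

public class ImportRunner {
    private final AtomicBoolean stop = new AtomicBoolean(false);

    /** The abort command only raises this flag; nothing is torn down here. */
    public void abort() {
        stop.set(true);
    }

    /** Shape of doFullImport/doDeltaImport: cleanup always runs in the finally block. */
    public void runImport() {
        try {
            while (!stop.get()) {
                processNextRow(); // may block on I/O, delaying the next flag check
            }
        } finally {
            clearCaches(); // despite the name, this also closes data sources etc.
        }
    }

    private void processNextRow() { /* fetch and index one row */ }

    private void clearCaches() { /* close EntityProcessors, DataSources, ... */ }
}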
RE: Multiple indexes
Are there any plans to support a kind of federated search in a future Solr version? I think there are reasons to use separate indexes for each document type but do combined searches on these indexes (for example if you need separate TFs for each document type). I am aware of http://wiki.apache.org/solr/DistributedSearch and a workaround to do federated search with sharding (http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-return-one-result-set), but this seems to be too much network and maintenance overhead.

Perhaps it is worth a try to use an IndexReaderFactory which returns a Lucene MultiReader!? Is the IndexReaderFactory still "Experimental"? https://issues.apache.org/jira/browse/SOLR-1366

Regards,
Kai Gülzau

-----Original Message-----
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Wednesday, June 15, 2011 8:43 PM
To: solr-user@lucene.apache.org
Subject: Re: Multiple indexes

Next, however, I predict you're going to ask how you do a 'join' or otherwise query across both these cores at once. You can't do that in Solr.

On 6/15/2011 1:00 PM, Frank Wesemann wrote:
You'll configure multiple cores: http://wiki.apache.org/solr/CoreAdmin

Hi. How to have multiple indexes in SOLR, with different fields and different types of data? Thank you very much! Bye.
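The MultiReader idea might look roughly like this (an untested sketch against the Lucene/Solr 3.x APIs; the second index path is invented, and uniqueKey clashes plus merged term statistics would need thought):

import java.io.File;
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.solr.core.IndexReaderFactory;

public class CombinedReaderFactory extends IndexReaderFactory {
    @Override
    public IndexReader newReader(Directory indexDir, boolean readOnly) throws IOException {
        IndexReader own = IndexReader.open(indexDir, readOnly);
        IndexReader other = IndexReader.open(
                FSDirectory.open(new File("/indexes/otherDocType")), readOnly);
        // Present both physical indexes to Solr as one virtual index.
        return new MultiReader(new IndexReader[] { own, other });
    }
}

wired into solrconfig.xml with something like:

<indexReaderFactory name="IndexReaderFactory" class="CombinedReaderFactory"/>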
Field Collapsing and Grouping in Solr 3.2
Hello. Does anybody know if Field Collapsing and Grouping is available in Solr 3.2? I mean directly available, not as a patch. I have read conflicting statements about it... Thanks a lot!

Sergio Martín Cantero
playence KG
http://www.playence.com/
Re: DIH abort doesn't close datasources
Shalin, thank you for the answer. I indeed didn't look into clearCache(). I thought it would just do that (clear caches). :)

Shalin Shekhar Mangar schrieb:
The abort command just sets an atomic boolean flag which is checked frequently by the import threads to see if they should stop. [...]

--
mit freundlichem Gruß,
Frank Wesemann
Fotofinder GmbH
Showing facet of first N docs
Hi all, Do you know if it is possible to show the facets for a particular field related only to the first N docs of the total number of results? It seems facet.limit doesn't help with it as it defines a window in the facet constraints returned. Thanks in advance, Tommaso
Re: Field Collapsing and Grouping in Solr 3.2
Alas, no, not yet... grouping/field collapse has had a long history with Solr. There were many iterations on SOLR-236, but that impl was never committed. Instead, SOLR-1682 was committed, but committed only to trunk (never backported to 3.x despite requests). Then, a new grouping module was factored out of Solr's trunk implementation and backported to 3.x. Finally, there is now an effort to cut over Solr trunk (SOLR-2564) and Solr 3.x (SOLR-2524) to the new grouping module, which looks like it's close to being done! So hopefully for 3.3, but no promises! This is open-source...

Mike McCandless

http://blog.mikemccandless.com

2011/6/16 Sergio Martín sergio.mar...@playence.com:
Hello. Does anybody know if Field Collapsing and Grouping is available in Solr 3.2? I mean directly available, not as a patch. I have read conflicting statements about it... Thanks a lot!
RE: Field Collapsing and Grouping in Solr 3.2
Mike, thanks a lot for your quick and precise answer!

Sergio Martín Cantero
playence KG

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: jueves, 16 de junio de 2011 12:51
To: solr-user@lucene.apache.org
Subject: Re: Field Collapsing and Grouping in Solr 3.2

[...]
Re: DIH abort doesn't close datasources
On Thu, Jun 16, 2011 at 3:46 PM, Frank Wesemann f.wesem...@fotofinder.netwrote: Shalin, thank you for the answer. I indeed didn't look into clearCache(). I thought it would just do that ( clear caches ). :) Yeah, it is not the most aptly named method :) Thanks for reviewing the code though! -- Regards, Shalin Shekhar Mangar.
Re: Mahout Solr
You're right... It would be nice to be able to see the cluster results coming from Solr, though...

Adam

On Thu, Jun 16, 2011 at 3:21 AM, Andrew Clegg andrew.clegg+mah...@gmail.com wrote:

Well, it does have the ability to pull TermVectors from an index:
https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html#CreatingVectorsfromText-FromLucene
Nothing Solr-specific about it, though.

On 15 June 2011 15:38, Mark static.void@gmail.com wrote:

"Apache Mahout is a new Apache TLP project to create scalable, machine learning algorithms under the Apache license. It is related to other Apache Lucene projects and integrates well with Solr." How does Mahout integrate well with Solr? Can someone give a brief overview of what's available? I'm guessing one of the features would be replacing the Carrot2 clustering algorithm with something a little more sophisticated?
Thanks

--
http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
Complex situation
Hello,

First I will try to explain the situation: I have some companies with opening hours. Some companies have multiple seasons with different opening hours. Here is some example data:

Companyid   Startdate(d-m)   Enddate(d-m)   Openinghours_end
1           01-01            01-04          17:00
1           01-04            01-08          18:00
1           01-08            31-12          17:30
2           01-01            31-12          20:00
3           01-01            01-06          17:00
3           01-06            31-12          18:00

What I want is some facets on the left side of my page. They have to look like this:

Closing today at:
17:00 (23)
18:00 (2)
20:00 (1)

So I need to use NOW to know which opening hours (seasons) I need in my facet results. How should my index look? Can anybody help me with how to store this data in the Solr index?
Performance loss - querying more than 64 cores (randomly)
Hi,

I set up a Solr instance with 512 cores. Each core has 100k documents and 15 fields. Solr is running on a CPU with 4 cores (2.7GHz) and 16GB RAM. Now I've done some benchmarks with JMeter; on each thread iteration JMeter queries another core at random. Here are the results (duration: each run 180 seconds):

Randomly queried cores | queries per second
                     1 | 2016
                     2 | 2001
                     4 | 1978
                     8 | 1958
                    16 | 2047
                    32 | 1959
                    64 | 1879
                   128 | 1446
                   256 | 1009
                   512 |  428

Why are the queries per second constant up to 64 cores, and then the performance decreases rapidly? Solr only uses 10GB of the 16GB memory, so I think it is not a memory issue.
Re: query routing with shards
Hi Otis,

I followed your recommendation and decided to implement the SearchComponent::modifyRequest(ResponseBuilder rb, SearchComponent who, ShardRequest sreq) method, where the query routing happens. So far it is working OK for the non-facet search, which is good news. The bad news is that it fails on the facet search.

This is how the request modification happens:

[code_snippet, SearchComponent::modifyRequest]
SolrQueryRequest req_routed = rb.req;
req_routed = routeRequest(req_routed);
rb.req = req_routed;
sreq.shards = shards.toString().split(",");
[/code_snippet]

where shards is a StringBuilder that accumulates the shards the request should go to. req_routed also contains the target shards. Those are set like this:

[code_snippet, my function routeRequest(SolrQueryRequest req)]
// could not find clone(), used ref reassignment
SolrQueryRequest req_local = req;
ModifiableSolrParams params = new ModifiableSolrParams(req_local.getParams());
...
params.remove(ShardParams.SHARDS);
params.set(ShardParams.SHARDS, getShardsParams(yearToQuarterMap));
params.remove(ShardParams.IS_SHARD);
params.set(ShardParams.IS_SHARD, true);
req_local.setParams(params);
...
return req_local;
[/code_snippet]

The NPE happens down the road during the facet search, in FacetComponent::countFacets(); the cause is that OpenBitSet obs is null for shardNum=0. Do you have any idea why this happens? Should some other field of ResponseBuilder, SearchComponent or ShardRequest be changed?

BTW, I have tried to call the FacetInfo::parse method inside FacetComponent::modifyRequest() and countFacets(). Where do the fi.facets.values() get initiated, is there some method to call?

Thanks,
Dmitry

On Fri, Jun 3, 2011 at 8:00 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

Nah, if you can quickly figure out which shard a given query maps to, then all this component needs to do is stick the appropriate shards param value in the request and let the request pass through to the other SearchComponents in the chain, including QueryComponent, which will know what to do with the shards param.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
From: Dmitry Kan dmitry@gmail.com
To: solr-user@lucene.apache.org
Sent: Fri, June 3, 2011 12:56:15 PM
Subject: Re: query routing with shards

Hi Otis,

Thanks! This sounds promising. This custom implementation, will it hurt in any way the stability of the front-end SOLR? After implementing it, can I run some tests to verify the stability / performance?

Dmitry

On Fri, Jun 3, 2011 at 4:49 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

Hi Dmitry,

Yes, you could also implement your own custom SearchComponent. In this component you could grab the query param, examine the query value, and based on that add the shards URL param with appropriate value, so that when the regular QueryComponent grabs stuff from the request, it has the correct shard in there already.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
From: Dmitry Kan dmitry@gmail.com
To: solr-user@lucene.apache.org
Sent: Fri, June 3, 2011 2:47:00 AM
Subject: Re: query routing with shards

Hi Otis,

I merely followed on the gmail's suggestion to include other people into the recipients list, Yonik was the first one :) I won't do it next time. Thanks for a rapid reply.
The reason for doing this query routing is that we abstract the distributed SOLR from the client code for security reasons (that is, we don't want to expose the entire shard farm to the world, but only the frontend SOLR) and for better decoupling. Is it possible to implement a plugin to SOLR that would map queries to shards? We have other choices too, they'll take quite some time, that's why I decided to quickly ask, if I was missing something from the SOLR main components design and configuration. Dmitry On Fri, Jun 3, 2011 at 8:25 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi Dmitry (you may not want to additionally copy Yonik, he's subscribed to this list, too) It sounds like you have the knowledge of which query maps to which shard. If so, why not control/change the value of shards param in the request to your front-end Solr (aka distributed request dispatcher) within your app, which is the one calling Solr? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original
Re: Boost Strangeness
Fascinating! Thank you so much Erik, I'm slowly beginning to understand. So I've discovered that by defining splitOnNumerics="0" on the filter class solr.WordDelimiterFilterFactory (for ONLY the query analyzer) I can get *closer* to my required goal! Now something else odd is occurring: it only returns 2 results where there are over 70? Why is that? I can't find where this is explained :(

query:

/solr/select?omitNorms=true&q=b006m86d&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=type,id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on&omitNorms=true

output:

{
  "responseHeader": {
    "status": 0,
    "QTime": 51,
    "params": {
      "debugQuery": "on",
      "fl": "type,id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score",
      "indent": "on",
      "q": "b006m86d",
      "qf": "id^10 parent_id^9 brand_container_id^8 series_container_id^8 subseries_container_id^8 clip_container_id^1 clip_episode_id^1",
      "wt": "json",
      "omitNorms": ["true", "true"],
      "defType": "dismax"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "maxScore": 13.473297,
    "docs": [
      {
        "parent_id": "",
        "id": "b006m86d",
        "type": "brand",
        "score": 13.473297
      },
      {
        "series_container_id": "",
        "id": "b00y1w9h",
        "type": "episode",
        "brand_container_id": "b006m86d",
        "subseries_container_id": "",
        "clip_episode_id": "",
        "score": 11.437143
      }
    ]
  },
  "debug": {
    "rawquerystring": "b006m86d",
    "querystring": "b006m86d",
    "parsedquery": "+DisjunctionMaxQuery((id:b006m86d^10.0 | clip_episode_id:b006m86d | subseries_container_id:b006m86d^8.0 | series_container_id:b006m86d^8.0 | clip_container_id:b006m86d | brand_container_id:b006m86d^8.0 | parent_id:b006m86d^9.0)) ()",
    "parsedquery_toString": "+(id:b006m86d^10.0 | clip_episode_id:b006m86d | subseries_container_id:b006m86d^8.0 | series_container_id:b006m86d^8.0 | clip_container_id:b006m86d | brand_container_id:b006m86d^8.0 | parent_id:b006m86d^9.0) ()",
    "explain": {
      "b006m86d": "13.473297 = (MATCH) sum of: 13.473297 = (MATCH) max of: 13.473297 = (MATCH) fieldWeight(id:b006m86d in 27636), product of: 1.0 = tf(termFreq(id:b006m86d)=1) 13.473297 = idf(docFreq=2, maxDocs=783800) 1.0 = fieldNorm(field=id, doc=27636)",
      "b00y1w9h": "11.437143 = (MATCH) sum of: 11.437143 = (MATCH) max of: 11.437143 = (MATCH) weight(brand_container_id:b006m86d^8.0 in 61), product of: 0.82407516 = queryWeight(brand_container_id:b006m86d^8.0), product of: 8.0 = boost 13.878762 = idf(docFreq=1, maxDocs=783800) 0.007422088 = queryNorm 13.878762 = (MATCH) fieldWeight(brand_container_id:b006m86d in 61), product of: 1.0 = tf(termFreq(brand_container_id:b006m86d)=1) 13.878762 = idf(docFreq=1, maxDocs=783800) 1.0 = fieldNorm(field=brand_container_id, doc=61)"
    },
    "QParser": "DisMaxQParser",
    "altquerystring": null,
    "boostfuncs": null,
    "timing": {
      "time": 51,
      "prepare": {
        "time": 6,
        "org.apache.solr.handler.component.QueryComponent": { "time": 5 },
        "org.apache.solr.handler.component.FacetComponent": { "time": 0 },
        "org.apache.solr.handler.component.MoreLikeThisComponent": { "time": 0 },
        "org.apache.solr.handler.component.HighlightComponent": { "time": 1 },
        "org.apache.solr.handler.component.StatsComponent": { "time": 0 },
        "org.apache.solr.handler.component.DebugComponent": { "time": 0 }
      },
      "process": {
        "time": 45,
        "org.apache.solr.handler.component.QueryComponent": { "time": 27 },
        "org.apache.solr.handler.component.FacetComponent": { "time": 0 },
        "org.apache.solr.handler.component.MoreLikeThisComponent": { "time": 0 },
        "org.apache.solr.handler.component.HighlightComponent": { "time": 0 },
        "org.apache.solr.handler.component.StatsComponent": {
Re: Showing facet of first N docs
http://wiki.apache.org/solr/SimpleFacetParameters facet.offset This param indicates an offset into the list of constraints to allow paging. The default value is 0. This parameter can be specified on a per field basis. Dmitry On Thu, Jun 16, 2011 at 1:39 PM, Tommaso Teofili tommaso.teof...@gmail.comwrote: Hi all, Do you know if it is possible to show the facets for a particular field related only to the first N docs of the total number of results? It seems facet.limit doesn't help with it as it defines a window in the facet constraints returned. Thanks in advance, Tommaso -- Regards, Dmitry Kan
Re: Performance loss - querying more than 64 cores (randomly)
On 6/16/11 3:22 PM, Mark Schoy wrote:
[...] Why are the queries per second constant up to 64 cores, and then the performance decreases rapidly? Solr only uses 10GB of the 16GB memory, so I think it is not a memory issue.

This may be an OS-level disk buffer issue. With limited disk buffer space, the more random IO occurs from different files, the higher the churn rate is; and if the buffers are full, the churn rate may increase dramatically (and the performance will drop then). Modern OSes try to keep as much data in memory as possible, so the memory usage itself is not that informative - but check what the pagein/pageout rates are when you start hitting the 32 vs 64 cores.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
RE: getFieldValue always returns an ArrayList?
Interesting. You guessed right. I changed "multivalued" to "multiValued" and all of a sudden I get Strings. But doesn't multiValued default to false? In my schema, I originally did not set multivalued at all. I only put in multivalued="false" after I experienced this issue.

-Rich

For the record, I had a number of fields which had no settings for multivalued, because none of them were multivalued and I expected the default to be false. When I experienced this problem, I added multivalued="false" to all of them. I still had the problem. So, I added a method to deal with the returned ArrayLists:

private Object getFieldValue(String field, SolrDocument document) {
    ArrayList list = (ArrayList) document.getFieldValue(field);
    return list.get(0);
}

I deliberately did not test if the returned Object was an ArrayList because I wanted to get an exception if any of them were Strings; I got no exceptions, so they were all returned as ArrayLists. I then changed one of the fields to use multiValued="false", and I got an exception, trying to cast String to ArrayList! So, I changed all the troublesome fields to use multiValued, and changed my helper method to look like this:

private Object getFieldValue(String field, SolrDocument document) {
    Object o = document.getFieldValue(field);
    if (o instanceof ArrayList) {
        System.out.println("### Field " + field + " is an instance of ArrayList.");
        ArrayList list = (ArrayList) document.getFieldValue(field);
        return list.get(0);
    } else {
        if (!(o instanceof String)) {
            System.out.println("## ERROR");
        } else {
            System.out.println("### Field " + field + " is an instance of String.");
        }
        return o;
    }
}

Here's the output, interspersed with the schema definitions of the fields:

<field name="uri" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
### Field uri is an instance of String.

<field name="entity_label" type="string" indexed="false" stored="true" required="false"/>
### Field entity_label is an instance of ArrayList.

<field name="institution_uri" type="string" indexed="true" stored="true" required="false"/>
### Field institution_uri is an instance of ArrayList.

<field name="asserted_type_uri" type="string" indexed="true" stored="true" required="false"/>
### Field asserted_type_uri is an instance of ArrayList.

<field name="asserted_type_label" type="text_eaglei" indexed="true" stored="true" required="false"/>
### Field asserted_type_label is an instance of ArrayList.

<field name="provider_uri" type="string" indexed="true" stored="true" multiValued="false" required="false"/>
### Field provider_uri is an instance of String.

<field name="provider_label" type="string" indexed="true" stored="true" multiValued="false" required="false"/>
### Field provider_label is an instance of String.

As you can see, the ones with no declaration for multivalued are returned as ArrayLists, while the ones with multiValued="false" are returned as Strings. So, it looks like there are two problems here: "multivalued" (small v) is not recognized, since using that in the schema still causes all fields to be returned as ArrayLists; and multivalued does not default to false (or, at least, not setting it causes a field to be returned as an ArrayList, as though it were set to true).

-Rich

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, June 15, 2011 4:25 PM
To: solr-user@lucene.apache.org
Subject: Re: getFieldValue always returns an ArrayList?

Hmmm, I admit I'm not using embedded, and I'm using 3.2, but I'm not seeing the behavior you are.
My question about reindexing could have been better stated; I was just making sure you didn't have some leftover cruft where your field was multi-valued from previous experiments, but if you're reindexing each time, that's not the problem.

Arrrh, camel case may be striking again. Try multiValued, not multivalued.

If that's still not it, can we see the code?

Best
Erick

On Wed, Jun 15, 2011 at 3:47 PM, Simon, Richard T richard_si...@hms.harvard.edu wrote:

We rebuild the index from scratch each time we start (for now). The fields in question are not multi-valued; in fact, I explicitly set multi-valued to false, just to be sure. Yes, this is SolrJ, using the embedded server, if that matters. Using Solr/Lucene 3.1.0.

-Rich

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, June 15, 2011 3:44 PM
To: solr-user@lucene.apache.org
Subject: Re: getFieldValue always returns an ArrayList?

Did you perhaps change the schema but not re-index? I'm
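As an aside, SolrJ's SolrDocument also offers getFirstValue(), which sidesteps the cast entirely (a sketch, from memory of the SolrJ 3.x API):

import org.apache.solr.common.SolrDocument;

public class FieldAccess {
    /** Works whether the stored value is a bare object or a list of values. */
    static Object firstValue(SolrDocument document, String field) {
        return document.getFirstValue(field);
    }
}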
Re: Performance loss - querying more than 64 cores (randomly)
I am assuming that you are running on Linux here; I have found atop to be very useful to see what is going on:

http://freshmeat.net/projects/atop/

dstat is also very useful, but needs a little more work to 'decode'. Obviously there is contention going on; you just need to figure out where it is. Most likely it is disk I/O, but it could also be the number of cores you have. Also, I would not say that performance is decreasing rapidly, probably more of a gentle slope down if you plot it (you double the number of cores every time). I would be very interested in hearing about what you find.

Cheers

François

On Jun 16, 2011, at 10:00 AM, Andrzej Bialecki wrote:
[...]
RE: getFieldValue always returns an ArrayList?
FYI: Using multiValued="false" for all string fields results in the following output:

### Field uri is an instance of String.
### Field entity_label is an instance of String.
### Field institution_uri is an instance of String.
### Field asserted_type_uri is an instance of String.
### Field asserted_type_label is an instance of String.
### Field provider_uri is an instance of String.
### Field provider_label is an instance of String.

-Rich

-----Original Message-----
From: Simon, Richard T
Sent: Thursday, June 16, 2011 10:08 AM
To: solr-user@lucene.apache.org
Cc: Simon, Richard T
Subject: RE: getFieldValue always returns an ArrayList?

[...]
Re: Showing facet of first N docs
Thanks Dmitry, but maybe I didn't explain correctly, as I am not sure facet.offset is the right solution: I'd like not to page but to filter facets. I'll try to explain better with an example. Imagine I make a query and the first 2 docs in the results have both 'xyz' and 'abc' as values for field 'lemmas', while other docs in the results also have 'xyz' or 'abc' as values of field 'lemmas'; then I would like to show facets coming from only the first 2 docs in the results, thus having:

<lst name="lemmas">
  <str name="xyz">2</str>
  <str name="abc">2</str>
</lst>

You can imagine this like a 'give me only facets related to the most relevant docs in the results' functionality. Any idea on how to do that?

Tommaso

2011/6/16 Dmitry Kan dmitry@gmail.com:
[...]
Re: How to index correctly a text save with tinyMCE
I have the following problem: I am using the Spanish analyzer to index and query, but because I am using tinyMCE some characters of the text get HTML-encoded; for example the text "En españa ..." is changed to "En espa&ntilde;a", so I need a way to decode that text to make queries work correctly. Could you help me please?

Regards
Ariel

On Wed, Jun 15, 2011 at 9:49 PM, Erick Erickson erickerick...@gmail.com wrote:

Please review this page: http://wiki.apache.org/solr/UsingMailingLists

You haven't stated what your problem is. Some examples of what your inputs and desired outputs are would be helpful. Meanwhile, see this page: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters but that's a wild guess.

Best
Erick

On Wed, Jun 15, 2011 at 2:30 PM, Ariel isaacr...@gmail.com wrote:

Hi everybody, I am using tinyMCE to save the text I am indexing, but as you know the characters with accents are changed. Could anybody tell me how to solve that problem? Are there any analyzers that recognize rich text? I would appreciate your help.

Regards,
Ariel
Re: query routing with shards
Hi Otis,

I have fixed it by assigning to rb the same value as assigned to sreq:

rb.shards = shards.toString().split(",");

Not tested fully yet, but distributed faceting works, at least on my PC (3 shards + 1 router setup).

Dmitry

On Thu, Jun 16, 2011 at 4:53 PM, Dmitry Kan dmitry@gmail.com wrote:
[...]
Encoding of alternate fields in highlighting
I have an index with various fields and I want to highlight query matches on the title and content fields. These fields may contain HTML tags, so I've configured the HtmlFormatter for highlighting. The problem is that if the query doesn't match the text of the field, Solr returns the value of the configured alternate field without encoding it. Is there any way to get the encoded value also for alternate fields? And in general, is there a way to do HTML escaping on values returned from a response writer?

I'm using Solr 3.1, and here is an excerpt from the requestHandler configuration:

[...]
<str name="wt">json</str>
<str name="hl">true</str>
<str name="hl.fl">title,content</str>
<str name="hl.simple.pre"><![CDATA[<b>]]></str>
<str name="hl.simple.post"><![CDATA[</b>]]></str>
<str name="f.title.hl.fragsize">1024</str>
<str name="f.title.hl.alternateField">title</str>
<str name="f.title.hl.maxAlternateFieldLength">512</str>
<int name="f.title.hl.snippets">1</int>
<str name="f.content.hl.alternateField">content</str>
<str name="f.content.hl.maxAlternateFieldLength">512</str>
<int name="f.content.hl.snippets">2</int>
[...]

and from the highlighting configuration:

[...]
<highlighting>
  <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true"></formatter>
  <encoder name="html" class="org.apache.solr.highlight.HtmlEncoder" default="true"/>
  <fragmentsBuilder name="default" class="org.apache.solr.highlight.ScoreOrderFragmentsBuilder" default="true"/>
</highlighting>
[...]

Thanks
Massimo
Re: Complex situation
Am I right that you are only interested in results / facets for the current season? If so, then you can index the start/end dates as separate number fields and build your search filter like this:

fq=+start_date_month:[* TO 6] +start_date_day:[* TO 17] +end_date_month:[6 TO *] +end_date_day:[16 TO *]

where 6/16 is the current month/day.

On Thu, Jun 16, 2011 at 5:20 PM, roySolr royrutten1...@gmail.com wrote:

Hello,

First I will try to explain the situation: I have some companies with opening hours, and some companies have multiple seasons with different opening hours. [...] So I need to use NOW to know which opening hours (seasons) I need in my facet results. How should my index look? Can anybody help me with how to store this data in the Solr index?
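To make that concrete, a sketch of how the rows could be indexed and queried (my own illustration; field names are invented, and closing_time is a plain string used only for faceting):

One Solr document per company/season row, e.g. for company 1's first season:

  companyid=1, start_month=1, start_day=1, end_month=4, end_day=1, closing_time=17:00

The left-column facet is then a single request combining the seasonal filter with a field facet:

  q=*:*&fq=<seasonal filter as above>&facet=true&facet.field=closing_time&facet.mincount=1

A variant that avoids comparing months and days separately (again my own suggestion) is to index each bound as one sortable number month*100+day, so June 16 becomes 616 and the filter collapses to fq=+start_mmdd:[* TO 616] +end_mmdd:[616 TO *].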
Re: Showing facet of first N docs
Hi Tommaso,

the FacetComponent works with DocListAndSet#docSet. It should be easy to switch it to DocListAndSet#docList, which contains only the documents of the current result page (by default the top 10, but e.g. docs 15-25 if start=15&rows=11). That means changing the source code. Instead of changing the source code, the easier way should be to send a second request with a relevance filter (if your sort criterion is relevance):

http://lucene.472066.n3.nabble.com/Filter-by-relevance-td1837486.html

Best regards
Karsten

http://lucene.472066.n3.nabble.com/Showing-facet-of-first-N-docs-td3071395.html

-------- Original-Nachricht --------
Datum: Thu, 16 Jun 2011 12:39:32 +0200
Von: Tommaso Teofili tommaso.teof...@gmail.com
An: solr-user@lucene.apache.org
Betreff: Showing facet of first N docs

Hi all, Do you know if it is possible to show the facets for a particular field related only to the first N docs of the total number of results? It seems facet.limit doesn't help with it as it defines a window in the facet constraints returned.

Thanks in advance,
Tommaso
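The second-request route could be sketched like this (my own example, assuming the {!frange} parser over the query() function; the score cutoff has to be read from the first response):

  1) q=foo&rows=2&fl=id,score
     -> note the score of the last doc you want facets for, say 1.234

  2) q=foo&rows=0&facet=true&facet.field=lemmas&fq={!frange l=1.234}query($q)
     -> facets are computed only over documents scoring at least 1.234,
        i.e. roughly the top-2 set (ties permitting)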
Re: Performance loss - querying more than 64 cores (randomly)
Thanks for your answers. Andrzej was right with his assumption. Solr only needs about 9GB of memory, but the system needs the rest of it for disk IO buffering:

64 cores: 64 * 100MB index size = 6.4GB, + 9GB Solr cache + about 600MB OS = 16GB

Conclusion: my system can buffer the data of exactly 64 cores. Every additional core can't be buffered, and the performance decreases.

2011/6/16 François Schiettecatte fschietteca...@gmail.com:
[...]
Document Scoring
Hi,

I am designing my indexes to have 1 write-only master core and 2 read-only slave cores. That means the read-only cores will only have snapshots pulled from the master and will not have near-real-time changes. I was thinking about adding a hybrid read-and-write master core that will have the most recent changes from my primary data source. I am thinking of querying the hybrid master together with the read-only slaves and somehow combining the results in order to support near-real-time full-text search. Is this feasible?

Thank you,
Zarni
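If feasibility is the question: the usual way to search several cores as one logical index, rather than combining result sets by hand, is Solr's distributed shards parameter; a sketch with invented host and core names:

  http://host:8983/solr/nrt_master/select?q=foo&shards=host:8983/solr/nrt_master,host:8983/solr/slave1,host:8983/solr/slave2

One caveat from the DistributedSearch wiki: the uniqueKey is expected to be unique across all shards, so a document present in both the fresh hybrid core and a stale slave snapshot makes merging non-deterministic; the overlap needs handling, e.g. by keeping a document in the hybrid core only until it has reached the slaves.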
Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)
So a search for a product, once the user logs in and searches for only the products that he has access to, will translate to something like this (the product ids are obtained from the db for a particular user and can run into n number):

q=<search term>&fq=product_id:(100 10001 ... n)

but we are currently running into a "too many boolean clauses" expansion error. We are also not able to tie the users into roles, as each user is basically anyone who comes to the site and purchases a product.

I'm wondering if the new trunk Solr join functionality can help here:

* http://wiki.apache.org/solr/Join

In theory you can index your products (product_id, ...) and the user_id-to-product many-to-many relation (user_product_id, user_id) into a single core (or different cores) and then do a join, like:

q=<search terms>&fq={!join from=user_product_id to=product_id}user_id:10101

But I haven't tried that, so I'm just speculating.
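To make the speculation concrete, a tiny worked example with invented data (note that the "from" field must exist on the documents matched by the inner query, i.e. the relation docs here):

  product docs:   {product_id:100, title:...}   {product_id:101, title:...}
  relation docs:  {user_product_id:100, user_id:10101}   {user_product_id:101, user_id:20202}

  q=<search terms>&fq={!join from=user_product_id to=product_id}user_id:10101

user_id:10101 matches the relation docs, their user_product_id values (here 100) are collected, and the filter keeps only products whose product_id is in that set, replacing the huge product_id:(...) boolean list that blows the clause limit.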
RE: How to index correctly a text save with tinyMCE
Hi Ariel,

On 6/16/2011 at 10:45 AM, Ariel wrote:
I have the following problem: I am using the Spanish analyzer to index and query, but because I am using tinyMCE some characters of the text get HTML-encoded; for example the text "En españa ..." is changed to "En espa&ntilde;a", so I need a way to decode that text to make queries work correctly.

HTMLStripCharFilterFactory, which strips out HTML tags, also converts named character entities like &ntilde; to their equivalent character.

Steve
Re: How to index correctly a text save with tinyMCE
Thanks for your answer, I have just put the filter in my schema.xml but it doesn't work. I am using Solr 1.4 and my conf is:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.HTMLStripCharFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

In the Tomcat 6 logs I get this error:

java.lang.ClassCastException: org.apache.solr.analysis.HTMLStripCharFilterFactory cannot be cast to org.apache.solr.analysis.TokenFilterFactory
        at org.apache.solr.schema.IndexSchema$6.<init>(IndexSchema.java:831)
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:149)
        at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:835)
        at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:58)
        at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:424)
        at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:447)
        at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:141)
        at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:456)
        at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:95)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:426)
        at org.apache.solr.core.CoreContainer.load(CoreContainer.java:278)
        at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:117)
        ...

Any idea? How can I solve this problem?

Regards
Ariel

On Thu, Jun 16, 2011 at 6:24 PM, Steven A Rowe sar...@syr.edu wrote:
[...]
Re: How to index correctly a text save with tinyMCE
On 6/16/2011 11:12 AM, Ariel wrote: Thanks for your answer. I have just put the filter in my schema.xml but it doesn't work. I am using Solr 1.4 and my conf is:

  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.HTMLStripCharFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

But it doesn't work; in the Tomcat 6 logs I get this error:

  java.lang.ClassCastException: org.apache.solr.analysis.HTMLStripCharFilterFactory cannot be cast to org.apache.solr.analysis.TokenFilterFactory

According to the wiki, the output of that filter must be passed to either another CharFilter or a Tokenizer. Try moving it before WhitespaceTokenizerFactory. Shawn
RE: getFieldValue always returns an ArrayList?
: and all of a sudden I get Strings. But, doesn't multivalued default to
: false? In my schema, I originally did not set multivalued. I only put in
: multivalued=false after I experienced this issue.

That's dependent on the version of Solr, and it's where the version property of the schema comes in. (As the default behavior in Solr changes, it does so dependent on what version you specify in your schema, to prevent radical behavior changes if you upgrade but keep the same configs)...

  <schema name="example" version="1.4">
    <!-- attribute "name" is the name of this schema and is only used for display purposes.
         Applications should change this to reflect the nature of the search collection.
         version="1.4" is Solr's version number for the schema syntax and semantics.
         It should not normally be changed by applications.
         1.0: multiValued attribute did not exist, all fields are multiValued by nature
         1.1: multiValued attribute introduced, false by default
         1.2: omitTermFreqAndPositions attribute introduced, true by default except for text fields.
         1.3: removed optional field compress feature
         1.4: default auto-phrase (QueryParser feature) to off
    -->

-Hoss
RE: getFieldValue always returns an ArrayList?
We haven't changed Solr versions. We've been using 3.1.0 all along. Plus, I have some code that runs during indexing and retrieves the fields from a SolrInputDocument, rather than a SolrDocument. That code gets Strings without any problem, and always has, even without saying multiValued=false. -Rich

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Thursday, June 16, 2011 2:18 PM
To: solr-user@lucene.apache.org
Cc: Simon, Richard T
Subject: RE: getFieldValue always returns an ArrayList?

: and all of a sudden I get Strings. But, doesn't multivalued default to
: false? In my schema, I originally did not set multivalued. I only put in
: multivalued=false after I experienced this issue.

That's dependent on the version of Solr, and it's where the version property of the schema comes in. (As the default behavior in Solr changes, it does so dependent on what version you specify in your schema, to prevent radical behavior changes if you upgrade but keep the same configs)...

  <schema name="example" version="1.4">
    <!-- attribute "name" is the name of this schema and is only used for display purposes.
         Applications should change this to reflect the nature of the search collection.
         version="1.4" is Solr's version number for the schema syntax and semantics.
         It should not normally be changed by applications.
         1.0: multiValued attribute did not exist, all fields are multiValued by nature
         1.1: multiValued attribute introduced, false by default
         1.2: omitTermFreqAndPositions attribute introduced, true by default except for text fields.
         1.3: removed optional field compress feature
         1.4: default auto-phrase (QueryParser feature) to off
    -->

-Hoss
Re: Strange behavior
Have you stopped Solr before manually copying the data? This way you can be sure that the index is the same and you didn't have any new docs on the fly.

2011/6/14 Denis Kuzmenok forward...@ukr.net: What should I provide? The OS is the same, the environment is the same, Solr is completely copied, searches work, except that one, and that is strange..

I think you will need to provide more information than this, no-one on this list is omniscient AFAIK. François

On Jun 14, 2011, at 10:44 AM, Denis Kuzmenok wrote: Hi. I've debugged search on a test machine; after copying the entire Solr directory to the production server, I've noticed that one query (SDR S70EE K) does match on the test server, and does not on production. How can that be?
Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)
Peter, thanks for the clarification. The reason I specifically asked is that we have many search instances (200+) in a single JVM. Each of these instances could have n users, and each user can subscribe to n products. Now, according to your suggestion, I need to maintain an in-memory list of all users and their subscribed products for each of the instances, and use this list to filter a given query. We are maintaining the user and subscription details in a DB. I was wondering if, instead, it would make more sense (with respect to memory) to dynamically get the subscribed product ids whenever a user logs in (as access is only for the user session) and use this data to filter the query?

And we really do not have budget, and hence won't be able to contract LI for this, though I will certainly need to get some Java experts' help within my org. Thanks for your time. Regards Sujatha

On Wed, Jun 15, 2011 at 11:29 PM, Peter Sturge peter.stu...@gmail.com wrote: Hi, By in-memory, I mean you hold a list of users (+ some other parameters like order number, expiry, whatever else you need) in one of those Greek HashMaps, and use this list to determine what query parameters/results will be processed for a given search request (SOLR-1872 reads an acl file to populate such a list). So if you had 500 users who had purchased stuff at a given moment, you'd have 500 entries in the table that hold the relevant data to filter/not filter searches/results. This won't cause a memory problem unless you have a million users and stored their autobiography in each entry. I wouldn't call this sort of thing a novice or even journeyman's task; you would definitely need to know about using and maintaining tables etc. Would you be able to contract someone to do the work on your behalf? There are some excellent resources around, and Lucid would certainly do a great job, but of course you'd need budget for this approach. Alternatively, maybe you can tap some Java expertise within your organization to help out? HTH, Peter

On Wed, Jun 15, 2011 at 6:17 PM, Sujatha Arun suja.a...@gmail.com wrote: Thanks, Peter. I am not a Java programmer and hence the code seems all Greek and Latin to me. I do have a basic knowledge, but all this Map, HashMap, Hashlist, NamedList I don't understand. However, I would like to implement the solution that you have mentioned, so any pointers would be great. I would also try to dig deep into Java. What is meant by in-memory? Is it the RAM? So if I have n concurrent users, each having n products subscribed, what would be the impact on memory? Regards Sujatha

On Tue, Jun 14, 2011 at 5:43 PM, Peter Sturge peter.stu...@gmail.com wrote: SOLR-1872 doesn't add discrete booleans to the query, it does it programmatically, so you shouldn't see this problem. (If you have a look at the code, you'll see how it filters queries.) I suppose you could modify SOLR-1872 to use an in-memory, dynamically-updated user list (+ associated filters) instead of using the acl file. This would give you the 'changing users' and 'expiry' functionality you need.

On Tue, Jun 14, 2011 at 10:08 AM, Sujatha Arun suja.a...@gmail.com wrote: Thanks Peter, for your input. I really would like a document- and schema-agnostic solution as in SOLR-1872. Am I right in my assumption that SOLR-1872 is the same as the solution that we currently have, where we add a filter query of the products to the original query, and hence (SOLR-1872) will also run into the too-many-Boolean-clauses expansion error?

Regards Sujatha

On Tue, Jun 14, 2011 at 1:53 PM, Peter Sturge peter.stu...@gmail.com wrote: Hi, SOLR-1834 is good when the original documents' ACL is accessible. SOLR-1872 is good where the usernames are persistent; neither of these really fits your use case. It sounds like you need more of an 'in-memory', transient access control mechanism. Does the access have to exist beyond the user's session (or the Solr VM session)? Your best bet is probably something like a custom SearchComponent or similar that keeps track of user purchases and either adjusts/limits the query or the results to suit. With your own module in the query chain, you can then decide when the 'expiry' is and limit results accordingly. SearchComponents are pretty easy to write and integrate. Have a look at http://wiki.apache.org/solr/SearchComponent for info on SearchComponent and its usage.

On Tue, Jun 14, 2011 at 8:18 AM, Sujatha Arun suja.a...@gmail.com wrote: Hello, our use case is as follows: several Solr webapps (one JVM), each webapp catering to one client. Each client has their users who can purchase products from the site. Once they purchase, they have full access to the products, otherwise
RE: getFieldValue always returns an ArrayList?
Ah! That was the problem. The version was 1.0. I'll change it to 1.2. Thanks! -Rich

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Thursday, June 16, 2011 2:33 PM
To: Simon, Richard T
Cc: solr-user@lucene.apache.org
Subject: RE: getFieldValue always returns an ArrayList?

: We haven't changed Solr versions. We've been using 3.1.0 all along.

But that's not what I'm talking about. I'm talking about the schema version ... a specific property declared in your schema.xml file. Did you check it? (Even when people start with Solr X, they sometimes are using schema.xml files provided by external packages -- Drupal, WordPress, etc. -- and don't notice that those are from older versions.)

: Plus, I have some code that runs during indexing and retrieves the
: fields from a SolrInputDocument, rather than a SolrDocument. That code
: gets Strings without any problem, and always has, even without saying
: multiValued=false.

SolrInputDocuments are irrelevant: they are used to index data, but they don't know anything about the schema. A SolrInputDocument might be completely invalid because of multiple values for single-valued fields, or missing values for required fields, etc. What comes back from a search *is* consistent with the schema (even when there is only one value stored in a multiValued field).

-Hoss
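P.S. For anyone else who hits this: the fix is literally just the version attribute on the root element of schema.xml, e.g. changing

  <schema name="example" version="1.0">

to

  <schema name="example" version="1.2">

(the name value is whatever your schema already uses; only version matters here, and per the list above 1.1+ makes multiValued default to false).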
Re: Updating only one indexed field for all documents quickly.
with the integer field. If you just want to influence the score, then a plain external file field should work for you.

: Is this an appropriate solution, given our use case?

Yes, check out ExternalFileField:
* http://search.lucidimagination.com/search/document/CDRG_ch04_4.4.4
* http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
* http://www.slideshare.net/greggdonovan/solr-lucene-etsy-by-gregg-donovan/28
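To sketch what that looks like in practice (the field and file names here are illustrative, not from this thread; see the links above for the authoritative details):

  <!-- schema.xml -->
  <fieldType name="externalRank" class="solr.ExternalFileField" keyField="id" defVal="0" stored="false" indexed="false" valType="pfloat"/>
  <field name="rank" type="externalRank"/>

The values go in a plain-text file named external_rank in the index data directory, one key=value line per document:

  doc1=1.5
  doc2=0.2

The field isn't really indexed, so it is only usable from function queries, e.g. q={!boost b=rank}your query to fold it into the score. Edit the file and commit, and the new values are picked up by the next searcher without reindexing the documents.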
It's not possible to decide at run-time which similarity class to use, right?
Hello, I'm testing out different Similarity implementations; to try a different similarity class I change the class attribute of the similarity element in schema.xml and restart Solr each time. Besides running multiple cores, each with its own schema, is there a way to tell the RequestHandler which similarity class to use? -- Regards, K. Gabriele --- unchanged since 20/9/10 --- P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this). If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).
Re: Minimum Should Match + External Field + Function Query with boost
: Seem to have a solution but I am still trying to figure out how/why it works.
:
: Addition of defType=edismax in the boost query seems to honor MM and
: correct boosting based on the external file source.

You didn't post enough details in your original question to be 100% certain (we would have needed to see the *full* Solr URL, including path, and your requestHandler declaration from solrconfig.xml to be sure), but I suspect the problem you were having is that you weren't actually using dismax (or edismax) at all until you added the explicit defType you mentioned...

: The new query syntax
: q={!boost b=dishRating v=$qq defType=edismax}&qq=hot chicken wings

Compare the parsedquery_toString in the debug output of your previous message with the debug output you get now, and I think you'll see a clear indication of when a DisjunctionMaxQuery is used (and what the mm is set to).

-Hoss
RE: HTMLStripTransformer will remove the content in XML??
FYI: There's a new patch specifically for dealing with xml tags and entities that handles the CDATA case... https://issues.apache.org/jira/browse/SOLR-2597

: Date: Fri, 27 May 2011 17:01:26 +0800
: From: Ellery Leung elleryle...@be-o.com
: Reply-To: solr-user@lucene.apache.org, elleryle...@be-o.com
: To: solr-user@lucene.apache.org
: Subject: RE: HTMLStripTransformer will remove the content in XML??
:
: Got it. Actually I use solr.MappingCharFilterFactory to replace the "<![CDATA[" and "]]>" with the empty string first, and then use HTMLStripCharFilterFactory to get "hello" and "solr".
:
: For future reference, here is part of schema.xml:
:
: <fieldType name="textMaxWord" class="solr.TextField">
:   <analyzer type="index">
:     <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
:     <charFilter class="solr.HTMLStripCharFilterFactory"/>
:     ...
:
: In mappings.txt (2 lines):
:
: "<![CDATA[" => ""
: "]]>" => ""
:
: Restart Solr. It works.
:
: Thank you
:
: -----Original Message-----
: From: bryan rasmussen [mailto:rasmussen.br...@gmail.com]
: Sent: May 27, 2011 4:20 PM
: To: solr-user@lucene.apache.org; elleryle...@be-o.com
: Subject: Re: HTMLStripTransformer will remove the content in XML??
:
: I would expect that it doesn't understand CDATA and thinks of everything between < and > as a 'tag'.
:
: Best Regards,
: Bryan Rasmussen
:
: On Fri, May 27, 2011 at 9:41 AM, Ellery Leung elleryle...@be-o.com wrote:
: I have an XML string like this:
:
: <?xml version="1.0" encoding="UTF-8"?><language><intl><![CDATA[hello]]></intl><loc><![CDATA[solr]]></loc></language>
:
: By using HTMLStripTransformer, I expect to get 'hello,solr'.
: But actually this transformer will remove ALL THE TEXT INSIDE!
: Did I do something silly, or is it a bug?
: Thank you

-Hoss
Re: It's not possible to decide at run-time which similarity class to use, right?
No, there's not a way to control Similarity on a per-request basis. Some factors from Similarity are computed at index-time, though. What factors are you trying to tweak that way, and why? Maybe doing boosting using some other mechanism (boosting functions, boosting clauses) would be a better way to go? Erik

On Jun 16, 2011, at 14:55, Gabriele Kahlout wrote: Hello, I'm testing out different Similarity implementations; to try a different similarity class I change the class attribute of the similarity element in schema.xml and restart Solr each time. Besides running multiple cores, each with its own schema, is there a way to tell the RequestHandler which similarity class to use? -- Regards, K. Gabriele --- unchanged since 20/9/10 --- P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this). If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).
Re: Performance loss - querying more than 64 cores (randomly)
On 6/16/11 5:31 PM, Mark Schoy wrote: Thanks for your answers. Andrzej was right with his assumption. Solr only needs about 9GB memory, but the system needs the rest of it for disk IO: 64 cores * 100MB index size = 6.4GB, + 9GB Solr cache + about 600MB OS = 16GB. Conclusion: my system can buffer the data of exactly 64 cores. Every additional core can't be buffered, and the performance decreases.

Glad to be of help... You could formulate this conclusion in a different way, too: if you specify too large a heap size then you stifle the OS disk buffers. Solr won't be able to use that excess of memory, but it won't be available for OS-level disk IO either. Therefore reducing the heap size may actually increase your performance.

-- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: It's not possible to decide at run-time which similarity class to use, right?
On Thu, Jun 16, 2011 at 9:14 PM, Erik Hatcher erik.hatc...@gmail.com wrote: No, there's not a way to control Similarity on a per-request basis. Some factors from Similarity are computed at index-time though.

You got me on this.

What factors are you trying to tweak that way and why? Maybe doing boosting using some other mechanism (boosting functions, boosting clauses) would be a better way to go?

I'm trying to assess the impact of coord (search-time) on QTime. In one implementation coord returns 1, while in another it's actually computed. Running multiple cores adds considerable complication (you must configure them to share data but not conf). Patching the request handler to change similarity (I didn't yet look into this) will only change the search-time similarity. How about breaking similarity up into search-time and index-time parts, so the RequestHandler could take a parameter to 'safely' set the search-time similarity? I think many would welcome such a separation of responsibilities.

Erik On Jun 16, 2011, at 14:55, Gabriele Kahlout wrote: Hello, I'm testing out different Similarity implementations; to try a different similarity class I change the class attribute of the similarity element in schema.xml and restart Solr each time. Besides running multiple cores, each with its own schema, is there a way to tell the RequestHandler which similarity class to use? -- Regards, K. Gabriele --- unchanged since 20/9/10 --- P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this). If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)). -- Regards, K. Gabriele --- unchanged since 20/9/10 --- P.S. If the subject contains [LON] or the addressee acknowledges the receipt within 48 hours then I don't resend the email. subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this). If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with X. ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).
RE: How to index correctly a text save with tinyMCE
Hi Ariel, As Shawn says, char filters come before tokenizers. You need to use a <charFilter> tag instead of a <filter> tag. I've updated the HTMLStripCharFilter documentation on the Solr wiki to include this information: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory Steve

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Thursday, June 16, 2011 1:32 PM
To: solr-user@lucene.apache.org
Subject: Re: How to index correctly a text save with tinyMCE

On 6/16/2011 11:12 AM, Ariel wrote: Thanks for your answer. I have just put the filter in my schema.xml but it doesn't work. I am using Solr 1.4 and my conf is:

  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.HTMLStripCharFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

But it doesn't work; in the Tomcat 6 logs I get this error:

  java.lang.ClassCastException: org.apache.solr.analysis.HTMLStripCharFilterFactory cannot be cast to org.apache.solr.analysis.TokenFilterFactory

According to the wiki, the output of that filter must be passed to either another CharFilter or a Tokenizer. Try moving it before WhitespaceTokenizerFactory. Shawn
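To spell out what that ends up looking like, the corrected analyzer would be something like this (the surrounding fieldType wrapper here is illustrative):

  <fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

Note that the charFilter element comes before the tokenizer: char filters run on the raw text, so the HTML tags and entities are decoded before tokenization and before the stemmer ever sees the terms.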
getting started
Hello, I am new to Solr and am in the beginning planning stage of a large project and could use some advice so as not to make a huge design blunder that I will regret down the road. Currently I have about 10 MySQL databases that store information about different archival collections. For example, we have data and metadata about a political poster collection, a television program, documents and photographs of and about a famous author, etc. My job is to work with the staff archivists to come up with a standard metadata template so the 10 databases can be consolidated into one. Currently the info in these databases is accessed through 10 different sets of PHP pages that were written a long time ago for PHP 4. My plan is to write a new Java application that will handle both public display of the info as well as an administrative interface so that staff members can add or edit the records. I have decided to use Solr as the search mechanism for this project. Because the info in each of our 10 collections is slightly different (e.g., a record about a poster does not contain duration information, but a record about a TV show does) I was thinking it would be good to separate each collection's index into a separate Solr core so that commits coming from one collection do not bog down the other unrelated collections. One reservation I have is that eventually we would like to be able to type in Iraq and find records across all of the collections at once instead of having to search each collection separately. Although I don't know anything about it at this stage, I did Google sharding after reading someone's recent post on this list and it sounds like that may be a potential answer to my question. Does anyone have any advice on how I should initially set up Solr for my situation? I am slowly making my way through the wiki and RTFMing, but I wanted to see what the experts have to say because at this point I don't really know where to start. Thank you very much, Mari
Re: It's not possible to decide at run-time which similarity class to use, right?
On Thu, Jun 16, 2011 at 3:23 PM, Gabriele Kahlout gabri...@mysimpatico.com wrote: I'm trying to assess the impact of coord (search-time) on QTime. In one implementation coord returns 1, while in another it's actually computed.

At query time? coord should be really cheap (unless your impl does something like calculate a million digits of pi), as it is not actually computed per-document. Instead, the result of all possible coord factors (e.g. 1/5, 2/5, 3/5, 4/5, 5/5) is computed up-front by BooleanQuery's scorers into a table. See http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/BooleanScorer.java and http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/BooleanScorer2.java
Re: getting started
On 6/16/2011 4:41 PM, Mari Masuda wrote: One reservation I have is that eventually we would like to be able to type in Iraq and find records across all of the collections at once instead of having to search each collection separately. Although I don't know anything about it at this stage, I did Google sharding after reading someone's recent post on this list and it sounds like that may be a potential answer to my question.

So this kind of stuff can be tricky, but with that eventual requirement I would NOT put these in separate cores. Sharding isn't (IMO; if someone disagrees, they will hopefully say so!) a good answer to searching across entirely different 'schemas', or to avoiding frequent-commit issues. Sharding is really just for scaling/performance when your index gets very, very large. (Which it doesn't sound like yours will be, but you can deal with that as a separate issue if it becomes so.)

If you're going to want to search across all the collections, put them all in the same core, either in the exact same indexed fields, or using certain common indexed fields; those common ones are the ones you'll be able to search across all collections on. It's okay if some collections have unique indexed fields too: documents in the core that don't belong to that collection just won't have any terms in an indexed field that is only used by a certain collection, no problem. (Then you can distribute this single core into shards if you need to for performance reasons related to the number of documents/size of index.)

You're right to be thinking about the fact that very frequent commits can be a performance issue in Solr. But separating into different cores is going to create more problems for yourself (if you want to be able to search across all collections) in an attempt to solve that one. (Among other things, not every Solr feature works in a distributed/sharded environment; it's just a more complicated and somewhat less mature setup for Solr.)

The way I deal with the frequent-commit issue is by NOT doing frequent commits to my production Solr. Instead, I use Solr replication to have a 'master' Solr index that I do commits to whenever I want, and a 'slave' Solr index that serves the production searches and only replicates from the master periodically, not so often as to cause too-frequent commits. That seems to be a somewhat common solution, if that use pattern works for you. There are also some near-real-time features in more recent versions of Solr that I'm not very familiar with (not sure if any are included in the current latest release, or if they are all still only in the repo). My sense is that they too only work for certain use patterns; they aren't magic bullets for committing whatever you want, as often as you want, to Solr. In general Solr isn't so great at very frequent major changes to the index.

Depending on exactly what sort of use pattern you are predicting/planning for your commits, maybe people can give you advice on how (or whether) to do it. But I personally don't think your idea of splitting your collections (which you'll eventually want to search across in a single search) into shards is a good solution to frequent-commit issues. You'd be complicating your setup and causing other problems for yourself, and not really even entirely addressing the too-frequent-commit issue with that setup.
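To make the shared-core idea concrete, here's a rough sketch of the kind of schema I mean (all the field names are invented for illustration):

  <!-- common fields, searched across all collections -->
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="collection" type="string" indexed="true" stored="true"/>
  <field name="title" type="text" indexed="true" stored="true"/>
  <field name="description" type="text" indexed="true" stored="true"/>
  <!-- collection-specific fields; docs from other collections just leave them empty -->
  <field name="duration" type="int" indexed="true" stored="true"/>

Then q=iraq against the common fields searches everything at once, and adding fq=collection:posters narrows a search to one collection.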
Re: getting started
Hi Mari, it depends ... * How many records are stored in your MySQL databases? * How often will updates occur? * How many db records / index documents are changed per update? I would suggest to start with a single Solr core first. Thereby, you can concentrate on the basics and do not need to deal with more advanced things like sharding. In case you encounter performance issues later on, you can switch to a multi-core setup. -Sascha Mari Masuda wrote: Hello, I am new to Solr and am in the beginning planning stage of a large project and could use some advice so as not to make a huge design blunder that I will regret down the road. Currently I have about 10 MySQL databases that store information about different archival collections. For example, we have data and metadata about a political poster collection, a television program, documents and photographs of and about a famous author, etc. My job is to work with the staff archivists to come up with a standard metadata template so the 10 databases can be consolidated into one. Currently the info in these databases is accessed through 10 different sets of PHP pages that were written a long time ago for PHP 4. My plan is to write a new Java application that will handle both public display of the info as well as an administrative interface so that staff members can add or edit the records. I have decided to use Solr as the search mechanism for this project. Because the info in each of our 10 collections is slightly different (e.g., a record about a poster does not contain duration information, but a record about a TV show does) I was thinking it would be good to separate each collection's index into a separate Solr core so that commits coming from one collection do not bog down the other unrelated collections. One reservation I have is that eventually we would like to be able to type in Iraq and find records across all of the collections at once instead of having to search each collection separately. Although I don't know anything about it at this stage, I did Google sharding after reading someone's recent post on this list and it sounds like that may be a potential answer to my question. Does anyone have any advice on how I should initially set up Solr for my situation? I am slowly making my way through the wiki and RTFMing, but I wanted to see what the experts have to say because at this point I don't really know where to start. Thank you very much, Mari
sending results of function query to range query
I am not sure if I can use function queries this way. I have a query like this: attributeX:[* TO ?] in my DB. I replace the ? with input from the front end. Obviously, this works fine. However, what I really want to do is attributeX:[* TO (3 * ?)]. Is there any way to embed the result of a function query inside the query?
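One thing I'm considering, assuming we can use Solr 1.4+ where the frange query parser is available: since frange filters on the value of an arbitrary function rather than a field, the multiplication can be folded into the function, e.g.

  fq={!frange u=10}div(attributeX,3)

keeps documents where attributeX/3 <= 10, i.e. attributeX <= 3 * 10, with 10 being the raw front-end input. The fallback is to just compute 3 * ? in the application before building the plain range query.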
Re: Encoding of alternate fields in highlighting
(11/06/17 0:15), Massimo Schiavon wrote: I have an index with various fields and I want to highlight query matches on the title and content fields. These fields can contain HTML tags, so I've configured the HtmlFormatter for highlighting. The problem is that if the query doesn't match the text of the field, Solr returns the value of the configured alternate field without encoding it. Is there a way to get the encoded value for alternate fields as well? And, in general, is there a way to do HTML escaping on values returned from a response writer?

Massimo, at first glance I think the requirement is reasonable. As long as we support HtmlEncoder, we had better support it with the alternateField option too. Please open a JIRA issue, and if possible suggest an appropriate option and attach a patch (a patch is not required, but it is very helpful). koji -- http://www.rondhuit.com/en/
SOlR -- Out of Memory exception
We just started using SOLR. I am trying to load a single file with 20 million records into SOLR using the CSV uploader. I keep getting an Out of Memory after loading 7 million records. Here is the config:

  <autoCommit>
    <maxDocs>1</maxDocs>
    <maxTime>6</maxTime>
  </autoCommit>

I also encountered a LockObtainFailedException:

  org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@D:\work\solr\.\data\index\write.lock
    at org.apache.lucene.store.Lock.obtain(Lock.java:84)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1097)

So I changed the lockType to single; now again I am getting an Out of Memory exception. I also increased the JVM heap space to 2048M but am still getting an Out of Memory.

-- View this message in context: http://lucene.472066.n3.nabble.com/SOlR-Out-of-Memory-exception-tp3074636p3074636.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: fieldCache problem OOM exception
Well, if my theory is right, you should be able to generate OOMs at will by sorting and faceting on all your fields in one query. But Lucene's cache should be garbage collected; can you take some memory snapshots during the week? It should hit a point and stay steady there. How much memory are you giving your JVM? It looks like a lot, given your memory snapshot. Best Erick

On Thu, Jun 16, 2011 at 3:01 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: Hi Erick, yes I'm sorting and faceting.

1) Fields for sorting: sort=f_dccreator_sort, sort=f_dctitle, sort=f_dcyear. The parameter facet.sort= is empty, only using parameter sort=.
2) Fields for faceting: f_dcperson, f_dcsubject, f_dcyear, f_dccollection, f_dclang, f_dctypenorm, f_dccontenttype. Other faceting parameters: ...facet=true&facet.mincount=1&facet.limit=100&facet.sort=&facet.prefix=...
3) The LukeRequestHandler takes too long for my huge index, so this is from standalone Luke (compiled for Solr 3.2):

  f_dccreator_sort = 10.029.196
  f_dctitle = 21.514.939
  f_dcyear = 1.471
  f_dcperson = 14.138.165
  f_dcsubject = 8.012.319
  f_dccollection = 1.863
  f_dclang = 299
  f_dctypenorm = 14
  f_dccontenttype = 497
  numDocs: 28.940.964
  numTerms: 686.813.235
  optimized: true
  hasDeletions: false

What can you read/calculate from these values? Is my index too big for Lucene/Solr? What I don't understand is why the fieldCache is not garbage collected and therefore reduced in size from time to time. Regards Bernd

Am 15.06.2011 17:50, schrieb Erick Erickson: The first question I have is whether you're sorting and/or faceting on many unique string values? I'm guessing that sometime you are. So, some questions to help pin it down:
1> what fields are you sorting on?
2> what fields are you faceting on?
3> how many unique terms in each (see the solr admin page)?

Best Erick

On Wed, Jun 15, 2011 at 8:22 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: Dear list, after getting an OOM exception after one week of operation with Solr 3.2, I used MemoryAnalyzer on the heap dump file. It looks like the fieldCache eats up all memory.

  Objects / Shallow Heap / Retained Heap:
  org.apache.lucene.search.FieldCache 0 / 0 / 14,636,950,632
  org.apache.lucene.search.FieldCacheImpl 1 / 32 / 14,636,950,384
  org.apache.lucene.search.FieldCacheImpl$StringIndexCache 1 / 32 / 14,636,947,080
  org.apache.lucene.search.FieldCache$StringIndex 10 / 320 / 14,636,944,352
  java.lang.String[] 519 / 567,811,040 / 13,503,733,312
  char[] 81,766,595 / 11,604,293,712 / 11,604,293,712

fieldCache retains over 14g of heap. When looking on the stats page under fieldCache, the description says: Provides introspection of the Lucene FieldCache, this is **NOT** a cache that is managed by Solr. So is this a jetty problem and not solr? Why is fieldCache growing and growing until OOM? Regards Bernd
Re: Boost Strangeness
Right, if you've only changed WordDelimiterFilterFactory in the query analyzer, then the tokens you're analyzing may be split up. Try running some of the terms through the admin/analysis page. Unless you have catenateAll=1 in the definition, the whole term won't be there. It becomes a question of why you even want WDFF in there in the first place: do you ever want to split these fields up this way? Maybe start by just taking it out completely? Best Erick

On Thu, Jun 16, 2011 at 9:55 AM, Judioo cont...@judioo.com wrote: Fascinating. Thank you so much Erick, I'm slowly beginning to understand. So I've discovered that by defining 'splitOnNumerics=0' on the filter class 'solr.WordDelimiterFilterFactory' (for ONLY the query analyzer) I can get *closer* to my required goal! Now something else odd is occurring: it only returns 2 results when there are over 70. Why is that? I can't find where this is explained :(

query

/solr/select?omitNorms=true&q=b006m86d&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=type,id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on&omitNorms=true

output

{
  responseHeader: {
    status: 0,
    QTime: 51,
    params: {
      debugQuery: on,
      fl: type,id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score,
      indent: on,
      q: b006m86d,
      qf: id^10 parent_id^9 brand_container_id^8 series_container_id^8 subseries_container_id^8 clip_container_id^1 clip_episode_id^1,
      wt: json,
      omitNorms: [true, true],
      defType: dismax
    }
  },
  response: {
    numFound: 2,
    start: 0,
    maxScore: 13.473297,
    docs: [
      { parent_id: , id: b006m86d, type: brand, score: 13.473297 },
      { series_container_id: , id: b00y1w9h, type: episode, brand_container_id: b006m86d, subseries_container_id: , clip_episode_id: , score: 11.437143 }
    ]
  },
  debug: {
    rawquerystring: b006m86d,
    querystring: b006m86d,
    parsedquery: +DisjunctionMaxQuery((id:b006m86d^10.0 | clip_episode_id:b006m86d | subseries_container_id:b006m86d^8.0 | series_container_id:b006m86d^8.0 | clip_container_id:b006m86d | brand_container_id:b006m86d^8.0 | parent_id:b006m86d^9.0)) (),
    parsedquery_toString: +(id:b006m86d^10.0 | clip_episode_id:b006m86d | subseries_container_id:b006m86d^8.0 | series_container_id:b006m86d^8.0 | clip_container_id:b006m86d | brand_container_id:b006m86d^8.0 | parent_id:b006m86d^9.0) (),
    explain: {
      b006m86d: 13.473297 = (MATCH) sum of: 13.473297 = (MATCH) max of: 13.473297 = (MATCH) fieldWeight(id:b006m86d in 27636), product of: 1.0 = tf(termFreq(id:b006m86d)=1) 13.473297 = idf(docFreq=2, maxDocs=783800) 1.0 = fieldNorm(field=id, doc=27636),
      b00y1w9h: 11.437143 = (MATCH) sum of: 11.437143 = (MATCH) max of: 11.437143 = (MATCH) weight(brand_container_id:b006m86d^8.0 in 61), product of: 0.82407516 = queryWeight(brand_container_id:b006m86d^8.0), product of: 8.0 = boost 13.878762 = idf(docFreq=1, maxDocs=783800) 0.007422088 = queryNorm 13.878762 = (MATCH) fieldWeight(brand_container_id:b006m86d in 61), product of: 1.0 = tf(termFreq(brand_container_id:b006m86d)=1) 13.878762 = idf(docFreq=1, maxDocs=783800) 1.0 = fieldNorm(field=brand_container_id, doc=61)
    },
    QParser: DisMaxQParser,
    altquerystring: null,
    boostfuncs: null,
    timing: {
      time: 51,
      prepare: {
        time: 6,
        org.apache.solr.handler.component.QueryComponent: { time: 5 },
        org.apache.solr.handler.component.FacetComponent: { time: 0 },
        org.apache.solr.handler.component.MoreLikeThisComponent: { time: 0 },
        org.apache.solr.handler.component.HighlightComponent: { time: 1 },
        org.apache.solr.handler.component.StatsComponent: { time: 0 },
        org.apache.solr.handler.component.DebugComponent: { time: 0 }
      },
      process: {
        time: 45, ...
Re: Document Scoring
I really wouldn't go there, it sounds like there are endless opportunities for errors! How real-time is real-time? Could you fix this entirely by:
1> adjusting expectations to, say, 5 minutes, and
2> adjusting your commit (on the master) and poll (on the slave) intervals appropriately?

Best Erick

On Thu, Jun 16, 2011 at 11:41 AM, zarni aung zau...@gmail.com wrote: Hi, I am designing my indexes to have 1 write-only master core and 2 read-only slave cores. That means the read-only cores will only have snapshots pulled from the master and will not have near-real-time changes. I was thinking about adding a hybrid read-and-write master core that will have the most recent changes from my primary data source. I am thinking to query the hybrid master and the read-only slaves and somehow intersect the results in order to support near-real-time full text search. Is this feasible? Thank you, Zarni
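For reference, the commit/poll knobs I mean are plain replication config, roughly like this (the host name and the 5-minute poll are illustrative). On the master, in solrconfig.xml:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
    </lst>
  </requestHandler>

And on each slave:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master-host:8983/solr/replication</str>
      <str name="pollInterval">00:05:00</str>
    </lst>
  </requestHandler>

The pollInterval (HH:mm:ss) is effectively the how-real-time-is-real-time dial.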
Re: SOlR -- Out of Memory exception
Hmmm, are you still getting your OOM after 7M records? Or some larger number? And how are you using the CSV uploader? Best Erick

On Thu, Jun 16, 2011 at 9:14 PM, jyn7 jyotsna.namb...@gmail.com wrote: We just started using SOLR. I am trying to load a single file with 20 million records into SOLR using the CSV uploader. I keep getting an Out of Memory after loading 7 million records. Here is the config:

  <autoCommit>
    <maxDocs>1</maxDocs>
    <maxTime>6</maxTime>
  </autoCommit>

I also encountered a LockObtainFailedException:

  org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@D:\work\solr\.\data\index\write.lock
    at org.apache.lucene.store.Lock.obtain(Lock.java:84)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1097)

So I changed the lockType to single; now again I am getting an Out of Memory exception. I also increased the JVM heap space to 2048M but am still getting an Out of Memory.

-- View this message in context: http://lucene.472066.n3.nabble.com/SOlR-Out-of-Memory-exception-tp3074636p3074636.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOlR -- Out of Memory exception
Yes Erick, after changing the lock type to single, I got an OOM after loading 5.5 million records. I am using the curl command to upload the CSV. -- View this message in context: http://lucene.472066.n3.nabble.com/SOlR-Out-of-Memory-exception-tp3074636p3074765.html Sent from the Solr - User mailing list archive at Nabble.com.
omitTermFreqAndPositions in a TextField fieldType
Is it possible to use omitTermFreqAndPositions="true" in a fieldType declaration that uses class="solr.TextField"? I've tried doing this and it does not seem to work (i.e., the prx file size does not change). Using it in a field declaration does work, but I'd rather set it in the fieldType so I don't have to repeat it multiple times in my schema. From my schema.xml file:

  <fieldType name="foobar" class="solr.TextField" sortMissingLast="true" omitNorms="true" omitTermFreqAndPositions="true" indexed="true" stored="true" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

In the TextField class I found that it disables OMIT_TF_POSITIONS, which I'm assuming is the cause of my problem:

  if (schema.getVersion() > 1.1f) properties &= ~OMIT_TF_POSITIONS;

Does it even make sense to use omitTermFreqAndPositions for a TextField, or am I perhaps doing something I shouldn't be? -Michael
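P.S. For reference, the per-field form that does work for me looks like this (the field name is invented for illustration):

  <field name="product_code" type="foobar" indexed="true" stored="true" omitTermFreqAndPositions="true"/>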
Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)
Alexey, Do you mean that we keep the current index as it is and have a separate core which holds only the user-id/product-id relation, and while querying, do a join between the two cores based on the user-id? This would require us to index/delete the relation docs as and when a user's subscription for a product changes, which would introduce some latency if the indexing (we have a queue system for indexing across the various instances) or deletion is delayed.

If we want to go ahead with this solution: we are currently using Solr 1.3, so is this functionality available as a patch for Solr 1.3? Also, would it be possible to do this with a separate index instead of a core? Then I could create only one index common to all our instances and use that instance to do the join.

Thanks Sujatha

On Thu, Jun 16, 2011 at 9:27 PM, Alexey Serba ase...@gmail.com wrote: So a search for a product, once the user logs in, has to be limited to only the products that he has access to, and will translate to something like this (the product ids are obtained from the db for a particular user and can run into n values): q=search term&fq=product_id:(100 10001 ... n), but we are currently running into the too-many-Boolean-clauses expansion error. We are not able to tie the user into roles either, as each user is simply anyone who comes to the site and purchases a product. I'm wondering if the new trunk Solr join functionality can help here. * http://wiki.apache.org/solr/Join In theory you can index your products (product_id, ...) and the user-to-product many-to-many relation (user_product_id, user_id) into a single core (or different cores) and then do a join, like q=search terms&fq={!join from=product_id to=user_product_id}user_id:10101 But I haven't tried that, so I'm just speculating.
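P.S. From the wiki page it looks like the trunk join parser also accepts a fromIndex local param for joining from another core on the same Solr instance, something like

  fq={!join fromIndex=subscriptions from=user_product_id to=product_id}user_id:10101

(the core and field names here are only my guesses). But that is still core-to-core within one Solr instance, not a truly separate index, and since the join parser only exists on trunk it would presumably have to be backported to run on 1.3.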
Re: SOlR -- Out of Memory exception
If you are sending the whole CSV in a single HTTP request using curl, why not consider sending it in smaller chunks? -- View this message in context: http://lucene.472066.n3.nabble.com/SOlR-Out-of-Memory-exception-tp3074636p3075091.html Sent from the Solr - User mailing list archive at Nabble.com.
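For example, after splitting the file into pieces (each piece either needs the CSV header line, or you pass header=false&fieldnames=... instead; the URL and file names below are illustrative), something along these lines keeps each request small and defers the commit to the end:

  curl 'http://localhost:8983/solr/update/csv' --data-binary @records_part_01.csv -H 'Content-type: text/plain; charset=utf-8'
  curl 'http://localhost:8983/solr/update/csv' --data-binary @records_part_02.csv -H 'Content-type: text/plain; charset=utf-8'
  ...
  curl 'http://localhost:8983/solr/update' --data-binary '<commit/>' -H 'Content-type: text/xml; charset=utf-8'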