Re: Need help with DIH dataconfig.xml

2011-06-16 Thread Noble Paul നോബിള്‍ नोब्ळ्
Use TemplateTransformer

<dataConfig>
   <dataSource name="wld"
               type="JdbcDataSource"
               driver="com.mysql.jdbc.Driver"
               url="jdbc:mysql://localhost/wld"
               user="root"
               password="pass"/>
   <document name="variants">
     <entity name="III_1_1" query="SELECT * FROM `wld`.`III_1_1`"
             transformer="TemplateTransformer">
       <field column="id" template="${III_1_1.id}III_1_1"/>
       <field column="lemmatitel" name="lemma"/>
       <field column="vraagtekst" name="vraagtekst"/>
       <field column="lexical_variant" name="variant"/>
     </entity>
     <entity name="III_1_2" query="SELECT * FROM `wld`.`III_1_2`"
             transformer="TemplateTransformer">
       <field column="id" template="${III_1_2.id}III_1_2"/>
       <field column="lemmatitel" name="lemma"/>
       <field column="vraagtekst" name="vraagtekst"/>
       <field column="lexical_variant" name="variant"/>
     </entity>
   </document>
</dataConfig>
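
For reference, a minimal schema.xml sketch of the uniqueKey setup Martin
describes below (the "string" field type is an assumption):

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<uniqueKey>id</uniqueKey>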


On Wed, Jun 15, 2011 at 4:41 PM, MartinS martin.snijd...@gmail.com wrote:
 Hello,

 I want to perform a data import from a relational database.
 That all works well.
 However, i want to dynamically create a unique id for my solr documents
 while importing by using my data config file. I cant get it to work, maybe
 its not possible this way, but i thought i would ask you ll.
 (I set up schema.xml to use the field id as the unique id for solr
 documents)

 My Solr config looks like this:

 <dataConfig>
        <dataSource name="wld"
                    type="JdbcDataSource"
                    driver="com.mysql.jdbc.Driver"
                    url="jdbc:mysql://localhost/wld"
                    user="root"
                    password="pass"/>
        <document name="variants">
          <entity name="III_1_1" query="SELECT * FROM `wld`.`III_1_1`">
            <field column="id" name="${variants.name + id}"/>
            <field column="lemmatitel" name="lemma"/>
            <field column="vraagtekst" name="vraagtekst"/>
            <field column="lexical_variant" name="variant"/>
          </entity>
          <entity name="III_1_2" query="SELECT * FROM `wld`.`III_1_2`">
            <field column="id" name="${III_1_2_ + id}"/>
            <field column="lemmatitel" name="lemma"/>
            <field column="vraagtekst" name="vraagtekst"/>
            <field column="lexical_variant" name="variant"/>
          </entity>
    </document>
 </dataConfig>

 For a unique id I would like to concatenate the primary key of the table
 (column id) with the table name.
 How can I do this? Both ways shown in the example data config don't work
 while importing.

 Any help is appreciated.
 Martin

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Need-help-with-DIH-dataconfig-xml-tp3066855p3066855.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
-
Noble Paul


Re: fieldCache problem OOM exception

2011-06-16 Thread Bernd Fehling

Hi Erick,

yes I'm sorting and faceting.

1) Fields for sorting:
   sort=f_dccreator_sort, sort=f_dctitle, sort=f_dcyear
   The parameter facet.sort= is empty, only using parameter sort=.

2) Fields for faceting:
   f_dcperson, f_dcsubject, f_dcyear, f_dccollection, f_dclang, f_dctypenorm, 
f_dccontenttype
   Other faceting parameters:
   ...&facet=true&facet.mincount=1&facet.limit=100&facet.sort=&facet.prefix=...

3) The LukeRequestHandler takes too long on my huge index, so this is from
   the standalone Luke (compiled for Solr 3.2):
   f_dccreator_sort = 10.029.196
   f_dctitle        = 21.514.939
   f_dcyear         =      1.471
   f_dcperson       = 14.138.165
   f_dcsubject      =  8.012.319
   f_dccollection   =      1.863
   f_dclang         =        299
   f_dctypenorm     =         14
   f_dccontenttype  =        497

numDocs:      28.940.964
numTerms:    686.813.235
optimized:    true
hasDeletions: false

What can you read/calculate from these values?

Is my index too big for Lucene/Solr?

What I don't understand is why the fieldCache is not garbage collected
and therefore reduced in size from time to time.

Regards
Bernd

On 15.06.2011 17:50, Erick Erickson wrote:

The first question I have is whether you're sorting and/or
faceting on many unique string values? I'm guessing
that sometime you are. So, some questions to help
pin it down:
1  what fields are you sorting on?
2  what fields are you faceting on?
3  how many unique terms in each (see the solr admin page).

Best
Erick

On Wed, Jun 15, 2011 at 8:22 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de  wrote:

Dear list,

after getting an OOM exception after one week of operation with
Solr 3.2, I used MemoryAnalyzer on the heap dump file.
It looks like the fieldCache eats up all memory.

Class                                                        Objects    Shallow Heap      Retained Heap
org.apache.lucene.search.FieldCache                                0               0  >= 14,636,950,632
org.apache.lucene.search.FieldCacheImpl                            1              32  >= 14,636,950,384
org.apache.lucene.search.FieldCacheImpl$StringIndexCache           1              32  >= 14,636,947,080
org.apache.lucene.search.FieldCache$StringIndex                   10             320  >= 14,636,944,352
java.lang.String[]                                               519     567,811,040  >= 13,503,733,312
char[]                                                    81,766,595  11,604,293,712  >= 11,604,293,712


The fieldCache retains over 14 GB of heap.

When looking on stats page under fieldCache the description says:
Provides introspection of the Lucene FieldCache, this is **NOT** a cache
that is managed by Solr.

So is this a Jetty problem and not Solr?

Why is fieldCache growing and growing until OOM?

Regards
Bernd



Re: Copying few field using copyField to non multiValued field

2011-06-16 Thread Michael Kuhlmann
Hi Omri,

there are two limitations:
1. You can't sort on a multiValued field. (Anyway, on which of the
copied fields would you want to sort first?)
2. You can't make the multiValued field the unique key.

Neither is a real limitation:
1. Better sort on at_country, at_state, at_city instead.
2. Simply choose another unique key field. (Your location wouldn't be
unique anyway.)
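
For illustration, a minimal schema.xml sketch of this setup (the at_* field
names come from this thread; the "string" type and the "at_all" destination
field are assumptions):

<field name="at_country" type="string" indexed="true" stored="true"/>
<field name="at_state"   type="string" indexed="true" stored="true"/>
<field name="at_city"    type="string" indexed="true" stored="true"/>
<field name="at_all"     type="string" indexed="true" stored="true" multiValued="true"/>

<copyField source="at_country" dest="at_all"/>
<copyField source="at_state"   dest="at_all"/>
<copyField source="at_city"    dest="at_all"/>

Queries can then target at_all, while sorting stays on the single-valued
source fields.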

Greetings,
Kuli

On 16.06.2011 06:40, Omri Cohen wrote:
 I just don't want to suffer all the limitations a multiValued field has...
 (it does have some limitations, doesn't it?) I just remember reading
 somewhere that it does.




Re: DIH abort doesn't close datasources

2011-06-16 Thread Shalin Shekhar Mangar
On Wed, Jun 15, 2011 at 8:10 PM, Frank Wesemann
f.wesem...@fotofinder.netwrote:

 Hi,
 I just came across this:
 If I abort an import via /dataimport/?command=abort the connections to the
 (in my case) database stay open.
 Shouldn't DocBuilder#rollback() call something like cleanup(), which in turn
 tries to close EntityProcessors, DataSources, etc.,
 instead of relying on finalize() to eventually do its job?


The abort command just sets an atomic boolean flag which is checked
frequently by the import threads to see if they should stop. If you look at
DataImport.java's doFullImport or doDeltaImport methods, you'll see that
config.clearCaches is the cleanup method which is called in a finally
block. So the data sources should be closed once the import actually aborts.
Note that there may be a time lag between calling the abort command and the
import actually aborting if the import threads are waiting on I/O.

-- 
Regards,
Shalin Shekhar Mangar.


RE: Multiple indexes

2011-06-16 Thread Kai Gülzau
Are there any plans to support a kind of federated search
in a future solr version?

I think there are reasons to use separate indexes for each document type
but do combined searches on these indexes
(for example if you need separate TFs for each document type).

I am aware of http://wiki.apache.org/solr/DistributedSearch
and a workaround to do federated search with sharding
http://stackoverflow.com/questions/2139030/search-multiple-solr-cores-and-return-one-result-set
but this seems to be too much network- and maintenance overhead.

Perhaps it is worth a try to use an IndexReaderFactory which
returns a Lucene MultiReader!?
Is IndexReaderFactory still experimental?
https://issues.apache.org/jira/browse/SOLR-1366
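
For reference, a minimal solrconfig.xml sketch of that idea; the factory
class and its parameter are hypothetical, not an existing implementation:

<indexReaderFactory name="IndexReaderFactory"
                    class="com.example.MultiIndexReaderFactory">
  <!-- assumed custom parameter: an additional index to wrap into a MultiReader -->
  <str name="otherIndexDir">/path/to/other/index</str>
</indexReaderFactory>

The custom factory's newReader(Directory, boolean) would open the extra
directory and return a Lucene MultiReader over both indexes.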


Regards,

Kai Gülzau

 

 -Original Message-
 From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
 Sent: Wednesday, June 15, 2011 8:43 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Multiple indexes
 
 Next, however, I predict you're going to ask how you do a 'join' or 
 otherwise query across both these cores at once. You can't do 
 that in Solr.
 
 On 6/15/2011 1:00 PM, Frank Wesemann wrote:
  You'll configure multiple cores:
  http://wiki.apache.org/solr/CoreAdmin
  Hi.
 
  How to have multiple indexes in SOLR, with different fields and
  different types of data?
 
  Thank you very much!
  Bye.
 
 
 

Field Collapsing and Grouping in Solr 3.2

2011-06-16 Thread Sergio Martín
Hello.

 

Does anybody know if Field Collapsing and Grouping is available in Solr 3.2?

I mean directly available, not as a patch.

 

I have read conflicting statements about it...

 

Thanks a lot!

 

 


Sergio Martín Cantero

playence KG

Penthouse office Soho II - Top 1

Grabenweg 68

6020 Innsbruck

Austria

Mobile: (+34)654464222

eMail:   mailto:sergio.mar...@playence.com sergio.mar...@playence.com

Web:www.playence.com

 


 



Re: DIH abort doesn't close datasources

2011-06-16 Thread Frank Wesemann

Shalin,
thank you for the answer.
I indeed didn't look into clearCache().
I thought it would just do that ( clear caches ). :)

Shalin Shekhar Mangar wrote:

The abort command just sets an atomic boolean flag which is checked
frequently by the import threads to see if they should stop. If you look at
DataImport.java's doFullImport or doDeltaImport methods, you'll see that
config.clearCaches is the cleanup method which is called in a finally
block. So the data sources should be closed once the import actually aborts.
Note that there may be a time lag between calling the abort command and the
import actually aborting if the import threads are waiting on I/O.

  



--
Kind regards,

Frank Wesemann
Fotofinder GmbH              USt-IdNr. DE812854514
Software Entwicklung         Web: http://www.fotofinder.com/
Potsdamer Str. 96            Tel: +49 30 25 79 28 90
10785 Berlin                 Fax: +49 30 25 79 28 999

Sitz: Berlin
Amtsgericht Berlin Charlottenburg (HRB 73099)
Geschäftsführer: Ali Paczensky





Showing facet of first N docs

2011-06-16 Thread Tommaso Teofili
Hi all,
Do you know if it is possible to show the facets for a particular field
related only to the first N docs of the total number of results?
It seems facet.limit doesn't help with it as it defines a window in the
facet constraints returned.
Thanks in advance,
Tommaso


Re: Field Collapsing and Grouping in Solr 3.2

2011-06-16 Thread Michael McCandless
Alas, no, not yet... grouping/field collapse has had a long history
with Solr.

There were many iterations on SOLR-236, but that impl was never
committed.  Instead, SOLR-1682 was committed, but committed only to
trunk (never backported to 3.x despite requests).

Then, a new grouping module was factored out of Solr's trunk
implementation, and was backported to 3.x.

Finally, there is now an effort to cut over Solr trunk (SOLR-2564) and
Solr 3.x (SOLR-2524) to the new grouping module, which looks like it's
close to being done!

So hopefully for 3.3, but no promises!  This is open source...

Mike McCandless

http://blog.mikemccandless.com


2011/6/16 Sergio Martín sergio.mar...@playence.com






RE: Field Collapsing and Grouping in Solr 3.2

2011-06-16 Thread Sergio Martín
Mike, thanks a lot for your quick and precise answer!

Sergio Martín Cantero
playence KG
Penthouse office Soho II - Top 1
Grabenweg 68
6020 Innsbruck
Austria
Mobile: (+34)654464222
eMail:  sergio.mar...@playence.com
Web:www.playence.com





Re: DIH abort doesn't close datasources

2011-06-16 Thread Shalin Shekhar Mangar
On Thu, Jun 16, 2011 at 3:46 PM, Frank Wesemann
f.wesem...@fotofinder.netwrote:

 Shalin,
 thank you for the answer.
 I indeed didn't look into clearCache().
 I thought it would just do that ( clear caches ). :)


Yeah, it is not the most aptly named method :)

Thanks for reviewing the code though!

-- 
Regards,
Shalin Shekhar Mangar.


Re: Mahout Solr

2011-06-16 Thread Adam Estrada
You're right...It would be nice to be able to see the cluster results coming
from Solr though...

Adam

On Thu, Jun 16, 2011 at 3:21 AM, Andrew Clegg andrew.clegg+mah...@gmail.com
 wrote:

 Well, it does have the ability to pull TermVectors from an index:


 https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html#CreatingVectorsfromText-FromLucene

 Nothing Solr-specific about it though.

 On 15 June 2011 15:38, Mark static.void@gmail.com wrote:
  Apache Mahout is a new Apache TLP project to create scalable, machine
  learning algorithms under the Apache license. It is related to other
 Apache
  Lucene projects and integrates well with Solr.
 
  How does Mahout integrate well with Solr? Can someone give a brief
  overview of what's available? I'm guessing one of the features would be
  replacing the Carrot2 clustering algorithm with something a little more
  sophisticated?
 
  Thanks
 



 --

 http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg



Complex situation

2011-06-16 Thread roySolr
Hello,

First I will try to explain the situation:

I have some companies with opening hours. Some companies have multiple seasons
with different opening hours. I will show some example data:

Companyid   Startdate(d-m)   Enddate(d-m)   Openinghours_end
1           01-01            01-04          17:00
1           01-04            01-08          18:00
1           01-08            31-12          17:30

2           01-01            31-12          20:00

3           01-01            01-06          17:00
3           01-06            31-12          18:00

What I want is some facets on the left side of my page. They have to look
like this:

Closing today on:
17:00(23)
18:00(2)
20:00(1)

So I need to use NOW to know which opening hours (seasons) I need in my
facet results. What should my index look like?
Can anybody help me with how to store this data in the Solr index?





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Complex-situation-tp3071936p3071936.html
Sent from the Solr - User mailing list archive at Nabble.com.


Performance loss - querying more than 64 cores (randomly)

2011-06-16 Thread Mark Schoy
Hi,

I set up a Solr instance with 512 cores. Each core has 100k documents and 15
fields. Solr is running on a CPU with 4 cores (2.7 GHz) and 16 GB RAM.

Now I've done some benchmarks with JMeter. On each thread iteration, JMeter
queries another core at random. Here are the results (duration: 180 seconds
each):

Randomly queried cores | queries per second
                     1 | 2016
                     2 | 2001
                     4 | 1978
                     8 | 1958
                    16 | 2047
                    32 | 1959
                    64 | 1879
                   128 | 1446
                   256 | 1009
                   512 |  428

Why are the queries per second constant up to 64 cores, and why does the
performance then decrease rapidly?

Solr only uses 10 GB of the 16 GB of memory, so I think it is not a memory
issue.


Re: query routing with shards

2011-06-16 Thread Dmitry Kan
Hi Otis,

I followed your recommendation and decided to implement the
SearchComponent::modifyRequest(ResponseBuilder rb, SearchComponent who,
ShardRequest sreq) method, where the query routing happens. So far it is
working OK for non-facet searches, which is good news. The bad news is that
it fails on facet searches.

This is how the request modification happens:
This is how request modification happens:

[code_snippet, SearchComponent::modifyRequest]
SolrQueryRequest req_routed = rb.req;
req_routed = routeRequest(req_routed);
rb.req = req_routed;
sreq.shards = shards.toString().split(",");
[/code_snippet]

where shards is a StringBuilder that accumulates the shards the request
should go to. req_routed also contains the target shards. Those are set like
this:


[code_snippet, my function routeRequest(SolrQueryRequest req)]
// could not find clone(), used ref reassignment
SolrQueryRequest req_local = req;
ModifiableSolrParams params = new
ModifiableSolrParams(req_local.getParams());
...
params.remove(ShardParams.SHARDS);
params.set(ShardParams.SHARDS, getShardsParams(yearToQuarterMap));
params.remove(ShardParams.IS_SHARD);
params.set(ShardParams.IS_SHARD, true);
req_local.setParams(params);
...
return req_local;
[/code_snippet]

The NPE happens down the road during the facet search, in
FacetComponent::countFacets(); the cause is that OpenBitSet obs is
null for shardNum=0.

Do you have any idea why this happens? Should some other field
of ResponseBuilder, SearchComponent or ShardRequest be changed?

BTW, I have tried to call the FacetInfo::parse method inside
FacetComponent::modifyRequest() and countFacets(). Where do
the fi.facets.values() get initialized; is there some method to call?

Thanks,
Dmitry

On Fri, Jun 3, 2011 at 8:00 PM, Otis Gospodnetic otis_gospodne...@yahoo.com
 wrote:

 Nah, if you can quickly figure out which shard a given query maps to, then
 all this component needs to do is stick the appropriate shards param value
 in the request and let the request pass through to the other
 SearchComponents in the chain, including QueryComponent, which will know
 what to do with the shards param.

 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



 - Original Message 
  From: Dmitry Kan dmitry@gmail.com
  To: solr-user@lucene.apache.org
  Sent: Fri, June 3, 2011 12:56:15 PM
  Subject: Re: query routing with shards
 
  Hi Otis,
 
  Thanks! This sounds promising. This custom implementation, will  it hurt
 in
  any way the stability of the front end SOLR? After implementing  it, can
 I
  run some tests to verify the stability /  performance?
 
  Dmitry
  On Fri, Jun 3, 2011 at 4:49 PM, Otis Gospodnetic  
 otis_gospodne...@yahoo.com
wrote:
 
   Hi Dmitry,
  
   Yes, you could also implement your  own custom SearchComponent.  In
 this
   component you could grab the  query param, examine the query value, and
   based on
   that add the  shards URL param with appropriate value, so that when the
regular
   QueryComponent grabs stuff from the request, it has the correct  shard
 in
   there
   already.
  
   Otis

   Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
   Lucene ecosystem  search :: http://search-lucene.com/
  
  
  
   - Original  Message 
From: Dmitry Kan dmitry@gmail.com
To: solr-user@lucene.apache.org
  Sent: Fri, June 3, 2011 2:47:00 AM
Subject: Re: query routing  with shards
   
Hi Otis,
   
I  merely followed on the gmail's suggestion to include other  people
 into
   the
recipients list, Yonik was the first one :) I  won't do it  next
 time.
   
Thanks for a rapid reply.  The reason for doing this query  routing
 is
   that we
 abstract the distributed SOLR from the client code for  security
  reasons
(that is, we don't want to expose the entire shard farm  to  the
 world,
   but
only the frontend SOLR) and for  better decoupling.
   
Is  it possible to implement a  plugin to SOLR that would map queries
  to
shards?

We have other choices too, they'll take quite some time,   that's why
 I
decided to quickly ask, if I was missing something  from the SOLR
  main
components design and  configuration.
   
Dmitry
   
On  Fri, Jun 3,  2011 at 8:25 AM, Otis Gospodnetic 
   otis_gospodne...@yahoo.com
   wrote:
   
 Hi Dmitry (you may not  want to additionally copy Yonik, he's
subscribed to
  this
 list, too)

 
 It sounds  like you have the knowledge of which  query maps to
 which
   shard.
   If
  so, why not control/change the value of shards param in the
  request
to
 your
 front-end Solr  (aka distributed request dispatcher)  within your
 app,
which
 is
 the one calling Solr?
 
  Otis
 
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
  Lucene  ecosystem search :: http://search-lucene.com/

 

 - Original  

Re: Boost Strangeness

2011-06-16 Thread Judioo
fascinating

Thank you so much, Erick. I'm slowly beginning to understand.

So I've discovered that by defining splitOnNumerics="0" on the filter
class solr.WordDelimiterFilterFactory (for ONLY the query analyzer) I
can get *closer* to my required goal, as sketched below.
Now something else odd is occurring.

It only returns 2 results when there are over 70.

Why is that? I can't find where this is explained :(

query

/solr/select?omitNorms=true&q=b006m86d&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=type,id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on&omitNorms=true

output

{
  "responseHeader": {
    "status": 0,
    "QTime": 51,
    "params": {
      "debugQuery": "on",
      "fl": "type,id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score",
      "indent": "on",
      "q": "b006m86d",
      "qf": "id^10 parent_id^9 brand_container_id^8 series_container_id^8 subseries_container_id^8 clip_container_id^1 clip_episode_id^1",
      "wt": "json",
      "omitNorms": ["true", "true"],
      "defType": "dismax"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "maxScore": 13.473297,
    "docs": [
      {
        "parent_id": "",
        "id": "b006m86d",
        "type": "brand",
        "score": 13.473297
      },
      {
        "series_container_id": "",
        "id": "b00y1w9h",
        "type": "episode",
        "brand_container_id": "b006m86d",
        "subseries_container_id": "",
        "clip_episode_id": "",
        "score": 11.437143
      }
    ]
  },
  "debug": {
    "rawquerystring": "b006m86d",
    "querystring": "b006m86d",
    "parsedquery": "+DisjunctionMaxQuery((id:b006m86d^10.0 | clip_episode_id:b006m86d | subseries_container_id:b006m86d^8.0 | series_container_id:b006m86d^8.0 | clip_container_id:b006m86d | brand_container_id:b006m86d^8.0 | parent_id:b006m86d^9.0)) ()",
    "parsedquery_toString": "+(id:b006m86d^10.0 | clip_episode_id:b006m86d | subseries_container_id:b006m86d^8.0 | series_container_id:b006m86d^8.0 | clip_container_id:b006m86d | brand_container_id:b006m86d^8.0 | parent_id:b006m86d^9.0) ()",
    "explain": {
      "b006m86d": "13.473297 = (MATCH) sum of: 13.473297 = (MATCH) max of: 13.473297 = (MATCH) fieldWeight(id:b006m86d in 27636), product of: 1.0 = tf(termFreq(id:b006m86d)=1) 13.473297 = idf(docFreq=2, maxDocs=783800) 1.0 = fieldNorm(field=id, doc=27636)",
      "b00y1w9h": "11.437143 = (MATCH) sum of: 11.437143 = (MATCH) max of: 11.437143 = (MATCH) weight(brand_container_id:b006m86d^8.0 in 61), product of: 0.82407516 = queryWeight(brand_container_id:b006m86d^8.0), product of: 8.0 = boost 13.878762 = idf(docFreq=1, maxDocs=783800) 0.007422088 = queryNorm 13.878762 = (MATCH) fieldWeight(brand_container_id:b006m86d in 61), product of: 1.0 = tf(termFreq(brand_container_id:b006m86d)=1) 13.878762 = idf(docFreq=1, maxDocs=783800) 1.0 = fieldNorm(field=brand_container_id, doc=61)"
    },
    "QParser": "DisMaxQParser",
    "altquerystring": null,
    "boostfuncs": null,
    "timing": {
      "time": 51,
      "prepare": {
        "time": 6,
        "org.apache.solr.handler.component.QueryComponent": { "time": 5 },
        "org.apache.solr.handler.component.FacetComponent": { "time": 0 },
        "org.apache.solr.handler.component.MoreLikeThisComponent": { "time": 0 },
        "org.apache.solr.handler.component.HighlightComponent": { "time": 1 },
        "org.apache.solr.handler.component.StatsComponent": { "time": 0 },
        "org.apache.solr.handler.component.DebugComponent": { "time": 0 }
      },
      "process": {
        "time": 45,
        "org.apache.solr.handler.component.QueryComponent": { "time": 27 },
        "org.apache.solr.handler.component.FacetComponent": { "time": 0 },
        "org.apache.solr.handler.component.MoreLikeThisComponent": { "time": 0 },
        "org.apache.solr.handler.component.HighlightComponent": { "time": 0 },
        "org.apache.solr.handler.component.StatsComponent": {

Re: Showing facet of first N docs

2011-06-16 Thread Dmitry Kan
http://wiki.apache.org/solr/SimpleFacetParameters
facet.offset

This param indicates an offset into the list of constraints to allow paging.

The default value is 0.

This parameter can be specified on a per field basis.
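
For example, assuming a facet field named "lemmas", the second page of ten
constraints would be requested with:

facet=true&facet.field=lemmas&facet.limit=10&facet.offset=10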


Dmitry


On Thu, Jun 16, 2011 at 1:39 PM, Tommaso Teofili
tommaso.teof...@gmail.comwrote:

 Hi all,
 Do you know if it is possible to show the facets for a particular field
 related only to the first N docs of the total number of results?
 It seems facet.limit doesn't help with it as it defines a window in the
 facet constraints returned.
 Thanks in advance,
 Tommaso




-- 
Regards,

Dmitry Kan


Re: Performance loss - querying more than 64 cores (randomly)

2011-06-16 Thread Andrzej Bialecki

On 6/16/11 3:22 PM, Mark Schoy wrote:

Hi,

I set up a Solr instance with 512 cores. Each core has 100k documents and 15
fields. Solr is running on a CPU with 4 cores (2.7Ghz) and 16GB RAM.

Now I've done some benchmarks with JMeter. On each thread iteration JMeter
queriing another Core by random. Here are the results (Duration:  each with
180 second):

Randomly queried cores | queries per second
1| 2016
2 | 2001
4 | 1978
8 | 1958
16 | 2047
32 | 1959
64 | 1879
128 | 1446
256 | 1009
512 | 428

Why are the queries per second until 64 constant and then the performance is
degreasing rapidly?

Solr only uses 10GB of the 16GB memory so I think it is not a memory issue.



This may be an OS-level disk buffer issue. With limited disk buffer 
space, the more random I/O occurs across different files, the higher the 
churn rate; if the buffers are full, the churn rate may increase 
dramatically (and the performance will drop). Modern OSes try to 
keep as much data in memory as possible, so the memory usage itself is 
not that informative - but check the pagein/pageout rates when 
you start hitting the 32 vs 64 cores.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: getFieldValue always returns an ArrayList?

2011-06-16 Thread Simon, Richard T
Interesting. You guessed right. I changed "multivalued" to "multiValued" and 
all of a sudden I get Strings. But doesn't multiValued default to false? In my 
schema, I originally did not set it at all. I only put in multivalued="false" 
after I experienced this issue.

-Rich

For the record, I had a number of fields which had no setting for 
multiValued, because none of them were multivalued and I expected the default 
to be false. When I experienced this problem, I added multivalued="false" to 
all of them. I still had the problem. So I added a method to deal with the 
returned ArrayLists:

private Object getFieldValue(String field, SolrDocument document) {
    ArrayList list = (ArrayList) document.getFieldValue(field);
    return list.get(0);
}


I deliberately did not test whether the returned Object was an ArrayList, 
because I wanted to get an exception if any of them were Strings; I got no 
exceptions, so they were all returned as ArrayLists.

I then changed one of the fields to use multiValued="false", and I got an 
exception trying to cast String to ArrayList! So I changed all the 
troublesome fields to use "multiValued", and changed my helper method to look 
like this:

private Object getFieldValue(String field, SolrDocument document) {
    Object o = document.getFieldValue(field);

    if (o instanceof ArrayList) {
        System.out.println("### Field " + field + " is an instance of ArrayList.");
        ArrayList list = (ArrayList) document.getFieldValue(field);
        return list.get(0);
    } else {
        if (!(o instanceof String)) {
            System.out.println("## ERROR");
        } else {
            System.out.println("### Field " + field + " is an instance of String.");
        }
        return o;
    }
}


Here's the output, interspersed with the schema definitions of the fields:

<field name="uri" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
### Field uri is an instance of String.

<field name="entity_label" type="string" indexed="false" stored="true" required="false"/>
### Field entity_label is an instance of ArrayList.

<field name="institution_uri" type="string" indexed="true" stored="true" required="false"/>
### Field institution_uri is an instance of ArrayList.

<field name="asserted_type_uri" type="string" indexed="true" stored="true" required="false"/>
### Field asserted_type_uri is an instance of ArrayList.

<field name="asserted_type_label" type="text_eaglei" indexed="true" stored="true" required="false"/>
### Field asserted_type_label is an instance of ArrayList.

<field name="provider_uri" type="string" indexed="true" stored="true" multiValued="false" required="false"/>
### Field provider_uri is an instance of String.

<field name="provider_label" type="string" indexed="true" stored="true" multiValued="false" required="false"/>
### Field provider_label is an instance of String.


As you can see, the ones with no declaration for multiValued are returned as 
ArrayLists, while the ones with multiValued="false" are returned as Strings.

So it looks like there are two problems here: "multivalued" (small v) is not 
recognized, since using that spelling in the schema still causes all fields to 
be returned as ArrayLists; and multiValued does not default to false (or, at 
least, not setting it causes a field to be returned as an ArrayList, as though 
it were set to true).

-Rich


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, June 15, 2011 4:25 PM
To: solr-user@lucene.apache.org
Subject: Re: getFieldValue always returns an ArrayList?

Hmmm, I admit I'm not using embedded, and I'm using 3.2, but I'm
not seeing the behavior you are.

My question about reindexing could have been better stated, I
was just making sure you didn't have some leftover cruft where
your field was multi-valued from previous experiments, but if
you're reindexing each time that's not the problem.

Arrrh, camel case may be striking again. Try "multiValued", not
"multivalued".

If that's still not it, can we see the code?

Best
Erick

On Wed, Jun 15, 2011 at 3:47 PM, Simon, Richard T
richard_si...@hms.harvard.edu wrote:
 We rebuild the index from scratch each time we start (for now). The fields in 
 question are not multi-valued; in fact, I explicitly set multiValued to 
 false, just to be sure.

 Yes, this is SolrJ, using the embedded server, if that matters.

 Using Solr/Lucene 3.1.0.

 -Rich

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Wednesday, June 15, 2011 3:44 PM
 To: solr-user@lucene.apache.org
 Subject: Re: getFieldValue always returns an ArrayList?

 Did you perhaps change the schema but not re-index? I'm 

Re: Performance loss - querying more than 64 cores (randomly)

2011-06-16 Thread François Schiettecatte
I am assuming that you are running on Linux here; I have found atop to be very 
useful to see what is going on.

http://freshmeat.net/projects/atop/

dstat is also very useful but needs a little more work to 'decode'.

Obviously there is contention going on; you just need to figure out where it 
is. Most likely it is disk I/O, but it could also be the number of CPU cores 
you have. Also, I would not say that performance is decreasing rapidly - 
probably more of a gentle slope down if you plot it (you double the number of 
cores every time).

I would be very interested in hearing about what you find.

Cheers

François

On Jun 16, 2011, at 10:00 AM, Andrzej Bialecki wrote:




RE: getFieldValue always returns an ArrayList?

2011-06-16 Thread Simon, Richard T
FYI: Using multiValued="false" for all string fields results in the following 
output:

### Field uri is an instance of String.
### Field entity_label is an instance of String.
### Field institution_uri is an instance of String.
### Field asserted_type_uri is an instance of String.
### Field asserted_type_label is an instance of String.
### Field provider_uri is an instance of String.
### Field provider_label is an instance of String.

-Rich

-Original Message-
From: Simon, Richard T 
Sent: Thursday, June 16, 2011 10:08 AM
To: solr-user@lucene.apache.org
Cc: Simon, Richard T
Subject: RE: getFieldValue always returns an ArrayList?


Re: Showing facet of first N docs

2011-06-16 Thread Tommaso Teofili
Thanks Dmitry, but maybe I didn't explain it correctly, as I am not sure
facet.offset is the right solution; I'd like not to page but to filter the
facets.
I'll try to explain better with an example.
Imagine I make a query and the first 2 docs in the results have both 'xyz' and
'abc' as values for the field 'lemmas', while other docs in the results also
have 'xyz' or 'abc' as values of the field 'lemmas'; then I would like to show
facets coming from only the first 2 docs in the results, thus having:

<lst name="lemmas">
  <str name="xyz">2</str>
  <str name="abc">2</str>
</lst>

You can imagine this as a 'give me only facets related to the most
relevant docs in the results' functionality.
Any idea how to do that?
Tommaso


2011/6/16 Dmitry Kan dmitry@gmail.com

 http://wiki.apache.org/solr/SimpleFacetParameters
 facet.offset

 This param indicates an offset into the list of constraints to allow
 paging.

 The default value is 0.

 This parameter can be specified on a per field basis.


 Dmitry


 On Thu, Jun 16, 2011 at 1:39 PM, Tommaso Teofili
 tommaso.teof...@gmail.comwrote:

  Hi all,
  Do you know if it is possible to show the facets for a particular field
  related only to the first N docs of the total number of results?
  It seems facet.limit doesn't help with it as it defines a window in the
  facet constraints returned.
  Thanks in advance,
  Tommaso
 



 --
 Regards,

 Dmitry Kan



Re: How to index correctly a text save with tinyMCE

2011-06-16 Thread Ariel
I have the following problem: I am using the Spanish analyzer to index and
query, but because I am using tinyMCE some characters of the text are stored
HTML-encoded; for example the text "En españa ..." is changed to
"En espa&ntilde;a", so I need a way to decode that text to make queries
work correctly.

Could you help me please?
Regards,
Ariel

On Wed, Jun 15, 2011 at 9:49 PM, Erick Erickson erickerick...@gmail.comwrote:

 Please review this page:
 http://wiki.apache.org/solr/UsingMailingLists

 You haven't stated what your problem is. Some
 examples of what your inputs and desired outputs
 are would be helpful

 Meanwhile, see this page:
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
 but that's a wild guess.

 Best
 Erick

 On Wed, Jun 15, 2011 at 2:30 PM, Ariel isaacr...@gmail.com wrote:
  Hi everybody, I am using tinyMCE to save the text I am indexing, but as you
  know the characters with accents are changed. Could anybody tell me how to
  solve that problem? Is there an analyzer that recognizes rich text?
 
  I would appreciate your help.
  Regards,
  Ariel
 



Re: query routing with shards

2011-06-16 Thread Dmitry Kan
Hi Otis,

I have fixed it by assigning to rb the same value as assigned to sreq:

rb.shards = shards.toString().split(",");


I haven't tested that fully yet, but distributed faceting works, at least on
my PC (3 shards + 1 router setup).

Dmitry


On Thu, Jun 16, 2011 at 4:53 PM, Dmitry Kan dmitry@gmail.com wrote:


Encoding of alternate fields in highlighting

2011-06-16 Thread Massimo Schiavon
I have an index with various fields and I want to highlight query 
matches on the title and content fields.
These fields can contain HTML tags, so I've configured the HtmlFormatter 
for highlighting. The problem is that if the query doesn't match the 
text of the field, Solr returns the value of the configured alternate field 
without encoding it.
Is there any way to get an encoded value for alternate fields too? And in 
general, is there a way to do HTML escaping on values returned from a 
response writer?


I'm using Solr 3.1; here is an excerpt from the requestHandler configuration:

[...]
<str name="wt">json</str>
<str name="hl">true</str>
<str name="hl.fl">title,content</str>
<str name="hl.simple.pre"><![CDATA[<b>]]></str>
<str name="hl.simple.post"><![CDATA[</b>]]></str>
<str name="f.title.hl.fragsize">1024</str>
<str name="f.title.hl.alternateField">title</str>
<str name="f.title.hl.maxAlternateFieldLength">512</str>
<int name="f.title.hl.snippets">1</int>
<str name="f.content.hl.alternateField">content</str>
<str name="f.content.hl.maxAlternateFieldLength">512</str>
<int name="f.content.hl.snippets">2</int>
[...]

and from the highlighting configuration:

[...]
<highlighting>
  <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter"
             default="true"/>
  <encoder name="html" class="org.apache.solr.highlight.HtmlEncoder"
           default="true"/>
  <fragmentsBuilder name="default"
                    class="org.apache.solr.highlight.ScoreOrderFragmentsBuilder"
                    default="true"/>
</highlighting>
[...]

Thanks
Massimo




Re: Complex situation

2011-06-16 Thread Alexey Serba
Am I right that you are only interested in results / facets for the
current season? If so, then you can index the start/end dates as
separate number fields and build your search filters like this:
fq=+start_date_month:[* TO 6] +start_date_day:[* TO 17]
+end_date_month:[* TO 6] +end_date_day:[16 TO *] where 6/16 is
current month/day.
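
For illustration, a sketch of how the index could be laid out, with each
season row indexed as its own Solr document (field names and types here are
assumptions, not from the thread; encoding each start/end date as a single
MMDD integer, e.g. 616 for June 16, avoids comparing months and days
separately):

<field name="companyid"        type="string" indexed="true" stored="true"/>
<field name="start_mmdd"       type="tint"   indexed="true" stored="true"/>
<field name="end_mmdd"         type="tint"   indexed="true" stored="true"/>
<field name="openinghours_end" type="string" indexed="true" stored="true"/>

On June 16 the filter would then be fq=+start_mmdd:[* TO 616] +end_mmdd:[616 TO *],
and faceting on openinghours_end yields the "Closing today on:" counts.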

On Thu, Jun 16, 2011 at 5:20 PM, roySolr royrutten1...@gmail.com wrote:



Re: Showing facet of first N docs

2011-06-16 Thread karsten-solr
Hi Tommaso,

the FacetComponent works with DocListAndSet#docSet.
It should be easy to switch to DocListAndSet#docList, which contains only the
documents of the result list (default: the top 10, but possibly e.g. docs
15-25 if start=15 and rows=11). That would mean changing the source code.

Instead of changing the source code, the easier way should be to send a second
request with a relevance filter (if your sort criterion is relevance):
 http://lucene.472066.n3.nabble.com/Filter-by-relevance-td1837486.html
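
As a concrete sketch of that approach: a first request reads the score of the
Nth document, which then becomes the lower bound of a {!frange} filter over
the query() function in a second request (the cutoff 10 here is a made-up
value):

q=foo&fq={!frange l=10}query($q)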

Best regards
  Karsten

http://lucene.472066.n3.nabble.com/Showing-facet-of-first-N-docs-td3071395.html
-------- Original Message --------
Date: Thu, 16 Jun 2011 12:39:32 +0200
From: Tommaso Teofili tommaso.teof...@gmail.com
To: solr-user@lucene.apache.org
Subject: Showing facet of first N docs

 Hi all,
 Do you know if it is possible to show the facets for a particular field
 related only to the first N docs of the total number of results?
 It seems facet.limit doesn't help with it as it defines a window in the
 facet constraints returned.
 Thanks in advance,
 Tommaso


Re: Performance loss - querying more than 64 cores (randomly)

2011-06-16 Thread Mark Schoy
Thanks for your answers.

Andrzej was right with his assumption. Solr only needs about 9 GB of memory,
but the system needs the rest for disk I/O buffering:

64 cores: 64 * 100 MB index size = 6.4 GB + 9 GB Solr cache + about 600 MB OS
= 16 GB

Conclusion: my system can buffer the data of exactly 64 cores. Every
additional core can't be buffered, and performance decreases.



2011/6/16 François Schiettecatte fschietteca...@gmail.com





Document Scoring

2011-06-16 Thread zarni aung
Hi,

I am designing my indexes to have 1 write-only master core, 2 read-only
slave cores.  That means the read-only cores will only have snapshots pulled
from the master and will not have near real time changes.  I was thinking
about adding a hybrid read and write master core that will have the most
recent changes from my primary data source.  I am thinking to query the
hybrid master and the read-only slaves and somehow try to intersect the
results in order to support near real time full text search.  Is this
feasible?

Thank you,

Zarni


Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)

2011-06-16 Thread Alexey Serba
 So a search for a product, once the user logs in and searches for only the
 products that he has access to, will translate to something like this (the
 product ids are obtained from the db for a particular user and can run
 into n number):

 q=search term&fq=product_id:(100 10001 ... n)

 but we are currently running into the too many Boolean expansions error. We are
 not able to tie the user also into roles, as each user is mainly anyone who
 comes to the site and purchases a product.

I'm wondering if the new trunk Solr join functionality can help here.

* http://wiki.apache.org/solr/Join

In theory you can index your products (product_id, ...) and
user_id-product many-to-many relations (user_product_id, user_id) into
single/different cores and then do a join, like
q="search terms"&fq={!join from=product_id to=user_product_id}user_id:10101

But I haven't tried that, so I'm just speculating.
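
For illustration, a minimal sketch under those assumptions (field names and values are hypothetical). You would index both document types, e.g.:

<add>
  <doc>
    <field name="product_id">100</field>
    <field name="title">Some product</field>
  </doc>
  <doc>
    <field name="user_id">10101</field>
    <field name="user_product_id">100</field>
  </doc>
</add>

and then filter products by the user's subscriptions with a join filter. Per the wiki, {!join from=F to=T}Q runs Q, collects the F values of the matching docs, and returns docs whose T field matches, so the from/to direction should be double-checked: to return product docs it would read fq={!join from=user_product_id to=product_id}user_id:10101.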


RE: How to index correctly a text save with tinyMCE

2011-06-16 Thread Steven A Rowe
Hi Ariel,

On 6/16/2011 at 10:45 AM, Ariel wrote:
 I have the following problem: I am using the Spanish analyzer to index
 and query, but because I am using tinymce some characters of the text are
 changed, encoded as HTML; for example the text "En españa ..." is
 changed to "En espa&ntilde;a", so I need a way to recode that text to
 make queries work correctly.

HTMLStripCharFilterFactory, which strips out HTML tags, also converts named 
character entities like &ntilde; to their equivalent character.

Steve


Re: How to index correctly a text save with tinyMCE

2011-06-16 Thread Ariel
Thanks for your answer. I have just put the filter in my schema.xml but it
doesn't work. I am using Solr 1.4 and my conf is:

<code>
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.HTMLStripCharFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</code>


But it doesn't work in tomcat 6 logs I get this error:

 java.lang.ClassCastException:
org.apache.solr.analysis.HTMLStripCharFilterFactory cannot be cast to
org.apache.solr.analysis.TokenFilterFactory
at org.apache.solr.schema.IndexSchema$6.<init>(IndexSchema.java:831)
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:149)
at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:835)
at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:58)
at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:424)
at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:447)
at
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:141)
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:456)
at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:95)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:426)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:278)
at
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:117)
...

Any idea? How can I solve this problem?

Regards
Ariel



On Thu, Jun 16, 2011 at 6:24 PM, Steven A Rowe sar...@syr.edu wrote:

 Hi Ariel,

  On 6/16/2011 at 10:45 AM, Ariel wrote:
   I have the following problem: I am using the Spanish analyzer to index
   and query, but because I am using tinymce some characters of the text are
   changed, encoded as HTML; for example the text "En españa ..." is
   changed to "En espa&ntilde;a", so I need a way to recode that text to
   make queries work correctly.

  HTMLStripCharFilterFactory, which strips out HTML tags, also converts named
  character entities like &ntilde; to their equivalent character.

 Steve



Re: How to index correctly a text save with tinyMCE

2011-06-16 Thread Shawn Heisey

On 6/16/2011 11:12 AM, Ariel wrote:

Thanks for your answer, I have just put the filter in my schema.xml but it
doesn't work I am using solr 1.4 and my conf is:

<code>
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.HTMLStripCharFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</code>


But it doesn't work in tomcat 6 logs I get this error:

  java.lang.ClassCastException:
org.apache.solr.analysis.HTMLStripCharFilterFactory cannot be cast to
org.apache.solr.analysis.TokenFilterFactory


According to the wiki, the output of that filter must be passed to 
either another CharFilter or a Tokenizer.  Try moving it before 
WhitespaceTokenizerFactory.
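
For example, a corrected version of that analyzer might look like this (a sketch; note that the HTML stripper is declared with a charFilter element, which must come before the tokenizer):

<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>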


Shawn



RE: getFieldValue always returns an ArrayList?

2011-06-16 Thread Chris Hostetter

: and all of a sudden I get Strings. But, doesn't multivalued default to 
: false? In my schema, I originally did not set multivalued. I only put in 
: multivalued=false after I experienced this issue.

That's dependent on the version of Solr, and it's is where the 
version property of the schema comes in.  (as the default behavior in 
solr changes, it does so dependent on what version you specify in your 
schema to prevent radical behavior changes if you upgrade but keep the 
same configs)...

<schema name="example" version="1.4">
  <!-- attribute "name" is the name of this schema and is only used for display
       purposes.  Applications should change this to reflect the nature of the
       search collection.
       version="1.4" is Solr's version number for the schema syntax and
       semantics.  It should not normally be changed by applications.
       1.0: multiValued attribute did not exist, all fields are multiValued by
            nature
       1.1: multiValued attribute introduced, false by default
       1.2: omitTermFreqAndPositions attribute introduced, true by default
            except for text fields.
       1.3: removed optional field compress feature
       1.4: default auto-phrase (QueryParser feature) to off
  -->



-Hoss


RE: getFieldValue always returns an ArrayList?

2011-06-16 Thread Simon, Richard T
We haven't changed Solr versions. We've been using 3.1.0 all along.

Plus, I have some code that runs during indexing and retrieves the fields from 
a SolrInputDocument, rather than a SolrDocument. That code gets Strings without 
any problem, and always has, even without saying multiValued=false.

-Rich

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Thursday, June 16, 2011 2:18 PM
To: solr-user@lucene.apache.org
Cc: Simon, Richard T
Subject: RE: getFieldValue always returns an ArrayList?


: and all of a sudden I get Strings. But, doesn't multivalued default to 
: false? In my schema, I originally did not set multivalued. I only put in 
: multivalued=false after I experienced this issue.

That's dependent on the version of Solr, and it's is where the 
version property of the schema comes in.  (as the default behavior in 
solr changes, it does so dependent on what version you specify in your 
schema to prevent radical behavior changes if you upgrade but keep the 
same configs)...

<schema name="example" version="1.4">
  <!-- attribute "name" is the name of this schema and is only used for display
       purposes.  Applications should change this to reflect the nature of the
       search collection.
       version="1.4" is Solr's version number for the schema syntax and
       semantics.  It should not normally be changed by applications.
       1.0: multiValued attribute did not exist, all fields are multiValued by
            nature
       1.1: multiValued attribute introduced, false by default
       1.2: omitTermFreqAndPositions attribute introduced, true by default
            except for text fields.
       1.3: removed optional field compress feature
       1.4: default auto-phrase (QueryParser feature) to off
  -->



-Hoss


Re: Strange behavior

2011-06-16 Thread Alexey Serba
Have you stopped Solr before manually copying the data? This way you
can be sure that index is the same and you didn't have any new docs on
the fly.

2011/6/14 Denis Kuzmenok forward...@ukr.net:
 What should I provide? The OS is the same, the environment is the same, solr
 is completely copied, searches work, except that one, and that is
 strange..

 I think you will need to provide more information than this, no-one on this 
 list is omniscient AFAIK.

 François

 On Jun 14, 2011, at 10:44 AM, Denis Kuzmenok wrote:

 Hi.

 I've debugged search on a test machine; after copying the entire directory
 (the entire solr directory) to the production server, I've noticed that one
 query (SDR S70EE K) does match on the test server, and does not in
 production.
 How can that be?








Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)

2011-06-16 Thread Sujatha Arun
Peter ,

Thanks for the clarification.

Why I specifically asked was because we have many search instances
(200+) in a single JVM.

Each of these instances could have n users, and each user can subscribe to
n products. Now, according to your suggestion, I need to maintain an
in-memory list of all users and their subscribed products for each of the
instances and use this list to filter a given query. We are maintaining
the user and subscription details in a DB.

I was wondering if, instead, it would make more sense (with respect to
memory) to dynamically get the subscribed product ids whenever a user
logs in (as access is only for the user session) and use this data to
filter the query?

And we really do not have budget and hence won't be able to contract LI for
this, though I will certainly need to get some Java experts' help within my
org.

Thanks for your time

Regards
Sujatha



On Wed, Jun 15, 2011 at 11:29 PM, Peter Sturge peter.stu...@gmail.com wrote:

 Hi,

 By in-memory, I mean you hold a list of users (+ some other parameters
 like order number, expiry, what ever else you need) in one of those
 Greek HashMaps, and use this list to determine what query
 parameters/results will be processed for a given search request
 (SOLR-1872 reads an acl file to populate such a list). So if you had
 500 users who had purchased stuff at a given moment, you'd have 500
 entries in the table that hold the relevant data to filter/not filter
 searches/results.
 This won't cause a memory problem unless you have a million users and
 stored their autobiography in each entry.
 I wouldn't call this sort of thing a novice or even journeyman's task,
 you would definitely need to know about using and maintaining tables
 etc.
 Would you be able to contract someone to do the work on your behalf?
 There are some excellent resources around, and Lucid would certainly
 do a great job, but of course you'd need budget for this approach.
 Alternatively, maybe you can tap some java expertise within your
 organization to help out?

 HTH,
 Peter


 On Wed, Jun 15, 2011 at 6:17 PM, Sujatha Arun suja.a...@gmail.com wrote:
  Thanks, Peter.
 
  I am not a Java programmer, and hence the code seems all Greek and Latin
  to me. I do have basic knowledge, but all this Map, HashMap,
  HashList, NamedList... I don't understand.
 
  However, I would like to implement the solution that you have mentioned,
  so if you have any pointers for me, that would be great. I would also try
  to dig deep into Java.
 
  What is meant by in-memory? Is it the RAM? So if I have n
  concurrent users, each having n products subscribed, what would be the
  impact on memory?
 
 
 
  Regards
  Sujatha
 
 
  On Tue, Jun 14, 2011 at 5:43 PM, Peter Sturge peter.stu...@gmail.com
 wrote:
 
  SOLR-1872 doesn't add discrete booleans to the query, it does it
  programmatically, so you shouldn't see this problem. (if you have a
  look at the code, you'll see how it filters queries)
  I suppose you could modify SOLR-1872 to use an in-memory,
  dynamically-updated user list (+ associated filters) instead of using
  the acl file.
  This would give you the 'changing users' and 'expiry' functionailty you
  need.
 
 
 
  On Tue, Jun 14, 2011 at 10:08 AM, Sujatha Arun suja.a...@gmail.com
  wrote:
   Thanks Peter , for your input .
  
   I really  would like a document and schema agnostic   solution as  in
  solr
   1872.
  
    Am I right in my assumption that SOLR-1872 is the same as the solution
    that we currently have, where we add a filter query of the products to
    the original query, and hence SOLR-1872 will also run into the too many
    boolean clauses expansion error?
  
   Regards
   Sujatha
  
  
   On Tue, Jun 14, 2011 at 1:53 PM, Peter Sturge peter.stu...@gmail.com
  wrote:
  
   Hi,
  
   SOLR-1834 is good when the original documents' ACL is accessible.
   SOLR-1872 is good where the usernames are persistent - neither of
   these really fit your use case.
   It sounds like you need more of an 'in-memory', transient access
   control mechanism. Does the access have to exist beyond the user's
   session (or the Solr vm session)?
   Your best bet is probably something like a custom SearchComponent or
   similar, that keeps track of user purchases, and either
 adjusts/limits
   the query or the results to suit.
   With your own module in the query chain, you can then decide when the
   'expiry' is, and limit results accordingly.
  
   SearchComponent's are pretty easy to write and integrate. Have a look
  at:
 http://wiki.apache.org/solr/SearchComponent
   for info on SearchComponent and its usage.
  
  
  
  
   On Tue, Jun 14, 2011 at 8:18 AM, Sujatha Arun suja.a...@gmail.com
  wrote:
Hello,
   
   
Our Use Case is as follows
   
 Several solr webapps (one JVM), each webapp catering to one client.
 Each client has their users, who can purchase products from the site.
 Once they purchase, they have full access to the products; otherwise 

RE: getFieldValue always returns an ArrayList?

2011-06-16 Thread Simon, Richard T
Ah! That was the problem. The version was 1.0. I'll change it to 1.2. Thanks!

-Rich

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Thursday, June 16, 2011 2:33 PM
To: Simon, Richard T
Cc: solr-user@lucene.apache.org
Subject: RE: getFieldValue always returns an ArrayList?


: We haven't changed Solr versions. We've been using 3.1.0 all along.

but that's not what i'm talking about.  I'm talking about the schema 
version ... a specific property declared in your schema.xml file.

did you check it?

(even when people start with Solr X, they sometimes are using schema.xml 
files provided by external packages -- Drupal, wordpress, etc... -- and 
don't notice that those are from older versions)

: Plus, I have some code that runs during indexing and retrieves the 
: fields from a SolrInputDocument, rather than a SolrDocument. That code 
: gets Strings without any problem, and always has, even without saying 
: multiValued=false.

SolrInputDocuments are irrelevant.  They are used to index data, but they 
don't know anything about the schema.  A SolrInputDocument might be 
completely invalid because of multiple values for single-valued fields, or 
missing values for required fields, etc...   What comes back from a search 
*is* consistent with the schema (even when there is only one value stored 
in a multiValued field)

-Hoss


Re: Updating only one indexed field for all documents quickly.

2011-06-16 Thread Alexey Serba
 with the integer field. If you just want to influence the
 score, then just plain external field fields should work for
 you.

 Is this an appropriate solution, give our use case?

Yes, check out ExternalFileField

* http://search.lucidimagination.com/search/document/CDRG_ch04_4.4.4
* 
http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
* http://www.slideshare.net/greggdonovan/solr-lucene-etsy-by-gregg-donovan/28
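
For reference, a minimal sketch of the schema side, based on the example schema of that era (the field name dishRating is just an illustration):

<fieldType name="externalScore" keyField="id" defVal="0" stored="false"
           indexed="false" class="solr.ExternalFileField" valType="pfloat"/>
<field name="dishRating" type="externalScore"/>

The values then live in a file named external_dishRating in the index data directory, one docid=value pair per line (e.g. doc42=3.5); the file can be swapped out without reindexing, and the field is usable in function queries such as {!boost b=dishRating}.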


It's not possible to decide at run-time which similarity class to use, right?

2011-06-16 Thread Gabriele Kahlout
Hello,

I'm testing out different Similarity implementations, and to do that I
restart Solr each time I want to try a different similarity class; I change
the class attribute of the similarity element in schema.xml. Besides running
multiple cores, each with its own schema, is there a way to tell the
RequestHandler which similarity class to use?

-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains [LON] or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
 Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with X.
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).


Re: Minimum Should Match + External Field + Function Query with boost

2011-06-16 Thread Chris Hostetter
: Seem to have a solution but I am still trying to figure out how/why it works. 
: 
: Addition of defType=edismax in the boost query seems to honor MM and
: correct boosting based on the external file source.

You didn't post enough details in your original question to be 100% 
certain (I would have needed to see the *full* solr url, including path, and 
your requestHandler declaration from solrconfig.xml, to be sure) but I 
suspect the problem you were having is that you weren't actually using 
dismax (or edismax) at all until you added the explicit defType you 
mentioned...

: The new query syntax
: q={!boost b=dishRating v=$qq defType=edismax}&qq=hot chicken wings 

compare the parsedquery_toString in the debug output of your previous 
message with the debug output you get now, and I think you'll see a clear 
indication of when a DisjunctionMaxQuery is used (and what the mm is set 
to)
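
For completeness, a sketch of making (e)dismax the default without repeating defType in every URL, via the handler declaration in solrconfig.xml (the handler name and mm value here are illustrative, not from the original question):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="mm">2&lt;-25%</str>
  </lst>
</requestHandler>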


-Hoss


RE: HTMLStripTransformer will remove the content in XML??

2011-06-16 Thread Chris Hostetter

FYI: There's a new patch specificly for dealing with xml tags and entities 
that handles the CDATA case...

https://issues.apache.org/jira/browse/SOLR-2597

: Date: Fri, 27 May 2011 17:01:26 +0800
: From: Ellery Leung elleryle...@be-o.com
: Reply-To: solr-user@lucene.apache.org, elleryle...@be-o.com
: To: solr-user@lucene.apache.org
: Subject: RE: HTMLStripTransformer will remove the content in XML??
: 
: Got it.  Actually I use solr.MappingCharFilterFactory to replace the 
<![CDATA[ and ]]> with empty strings first, and use HTMLStripCharFilterFactory to get 
hello and solr.
: 
: For future reference, here is part of schema.xml
: 
: <fieldType name="textMaxWord" class="solr.TextField">
:   <analyzer type="index">
:   <charFilter class="solr.MappingCharFilterFactory" 
mapping="mappings.txt"/>
:   <charFilter class="solr.HTMLStripCharFilterFactory" />
: ...
: 
: In mappings.txt (2 lines)
: 
: "<![CDATA[" => ""
: 
: "]]>" => ""
: 
: Restart Solr
: 
: It works.
: 
: Thank you
: 
: -Original Message-
: From: bryan rasmussen [mailto:rasmussen.br...@gmail.com] 
: Sent: 27 May 2011, 4:20 PM
: To: solr-user@lucene.apache.org; elleryle...@be-o.com
: Subject: Re: HTMLStripTransformer will remove the content in XML??
: 
: I would expect that it doesn't understand CDATA and thinks of
: everything between < and > as a 'tag'.
: 
: Best Regards,
: Bryan Rasmussen
: 
: On Fri, May 27, 2011 at 9:41 AM, Ellery Leung elleryle...@be-o.com wrote:
:  I have an XML string like this:
: 
: 
: 
:  <?xml version="1.0"
:  encoding="UTF-8"?><language><intl><![CDATA[hello]]></intl><loc><![CDATA[solr
:  ]]></loc></language>
: 
: 
: 
:  By using HTMLStripTransformer, I expect to get 'hello,solr'.
: 
: 
: 
:  But actual this transformer will remove ALL THE TEXT INSIDE!
: 
: 
: 
:  Did I do something silly, or is it a bug?
: 
: 
: 
:  Thank you
: 
: 
: 
: 

-Hoss

Re: It's not possible to decide at run-time which similarity class to use, right?

2011-06-16 Thread Erik Hatcher
No, there's not a way to control Similarity on a per-request basis.  

Some factors from Similarity are computed at index-time though.

What factors are you trying to tweak that way and why?  Maybe doing boosting 
using some other mechanism (boosting functions, boosting clauses) would be a 
better way to go?

Erik




On Jun 16, 2011, at 14:55 , Gabriele Kahlout wrote:

 Hello,
 
 I'm testing out different Similarity implementations, and to do that I
 restart Solr each time I want to try a different similarity class; I change
 the class attribute of the similarity element in schema.xml. Besides running
 multiple cores, each with its own schema, is there a way to tell the
 RequestHandler which similarity class to use?
 
 -- 
 Regards,
 K. Gabriele
 
 --- unchanged since 20/9/10 ---
 P.S. If the subject contains [LON] or the addressee acknowledges the
 receipt within 48 hours then I don't resend the email.
 subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
  Now + 48h) ⇒ ¬resend(I, this).
 
 If an email is sent by a sender that is not a trusted contact or the email
 does not contain a valid code then the email is not received. A valid code
 starts with a hyphen and ends with X.
 ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
 L(-[a-z]+[0-9]X)).



Re: Performance loss - querying more than 64 cores (randomly)

2011-06-16 Thread Andrzej Bialecki

On 6/16/11 5:31 PM, Mark Schoy wrote:

Thanks for your answers.

Andrzej was right with his assumption. Solr only needs about 9GB of memory, but
the system needs the rest of it for disk IO:

64 cores: 64*100MB index size = 6.4GB + 9GB Solr cache + about 600MB OS =
16GB

Conclusion: my system can buffer the data of exactly 64 cores. Every
additional core can't be buffered, and performance decreases.


Glad to be of help... You could formulate this conclusion in a different 
way, too: if you specify too large a heap size then you stifle the OS 
disk buffers - Solr won't be able to use that excess memory, but it 
also won't be available for OS-level disk IO. Therefore reducing the heap 
size may actually increase your performance.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: It's not possible to decide at run-time which similarity class to use, right?

2011-06-16 Thread Gabriele Kahlout
On Thu, Jun 16, 2011 at 9:14 PM, Erik Hatcher erik.hatc...@gmail.com wrote:

 No, there's not a way to control Similarity on a per-request basis.

 Some factors from Similarity are computed at index-time though.


You got me on this.


 What factors are you trying to tweak that way and why?  Maybe doing
 boosting using some other mechanism (boosting functions, boosting clauses)
 would be a better way to go?

 I'm trying to assess the impact of coord (search-time) on QTime. In one
implementation coord returns 1, while in another it's actually computed.

Running multiple cores adds considerable complication (you must specify
sharing the data but not the conf).
Patching the request handler to change similarity (I haven't yet looked into
this) will only change 'search-time' similarity. How about breaking
similarity up into run-time and index-time parts? Then the request handler
could take a parameter to 'safely' set the run-time similarity.
I think many would welcome such a distinction of responsibilities.


Erik




 On Jun 16, 2011, at 14:55 , Gabriele Kahlout wrote:

  Hello,
 
   I'm testing out different Similarity implementations, and to do that I
   restart Solr each time I want to try a different similarity class; I
   change the class attribute of the similarity element in schema.xml.
   Besides running multiple cores, each with its own schema, is there a way
   to tell the RequestHandler which similarity class to use?
 
  --
  Regards,
  K. Gabriele
 
  --- unchanged since 20/9/10 ---
  P.S. If the subject contains [LON] or the addressee acknowledges the
  receipt within 48 hours then I don't resend the email.
  subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
 time(x)
   Now + 48h) ⇒ ¬resend(I, this).
 
  If an email is sent by a sender that is not a trusted contact or the
 email
  does not contain a valid code then the email is not received. A valid
 code
  starts with a hyphen and ends with X.
  ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
  L(-[a-z]+[0-9]X)).




-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains [LON] or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
 Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with X.
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).


RE: How to index correctly a text save with tinyMCE

2011-06-16 Thread Steven A Rowe
Hi Ariel,

As Shawn says, char filters come before tokenizers.

You need to use a <charFilter> tag instead of a <filter> tag.

I've updated the HTMLStripCharFilter documentation on the Solr wiki to include 
this information: 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

Steve

 -Original Message-
 From: Shawn Heisey [mailto:s...@elyograg.org]
 Sent: Thursday, June 16, 2011 1:32 PM
 To: solr-user@lucene.apache.org
 Subject: Re: How to index correctly a text save with tinyMCE
 
 On 6/16/2011 11:12 AM, Ariel wrote:
  Thanks for your answer, I have just put the filter in my schema.xml but
 it
  doesn't work I am using solr 1.4 and my conf is:
 
  <code>
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.HTMLStripCharFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  </code>
 
 
  But it doesn't work in tomcat 6 logs I get this error:
 
java.lang.ClassCastException:
  org.apache.solr.analysis.HTMLStripCharFilterFactory cannot be cast to
  org.apache.solr.analysis.TokenFilterFactory
 
 According to the wiki, the output of that filter must be passed to
 either another CharFilter or a Tokenizer.  Try moving it before
 WhitespaceTokenizerFactory.
 
 Shawn



getting started

2011-06-16 Thread Mari Masuda
Hello,

I am new to Solr and am in the beginning planning stage of a large project and 
could use some advice so as not to make a huge design blunder that I will 
regret down the road.

Currently I have about 10 MySQL databases that store information about 
different archival collections.  For example, we have data and metadata about a 
political poster collection, a television program, documents and photographs of 
and about a famous author, etc.  My job is to work with the staff archivists to 
come up with a standard metadata template so the 10 databases can be 
consolidated into one.  

Currently the info in these databases is accessed through 10 different sets of 
PHP pages that were written a long time ago for PHP 4.  My plan is to write a 
new Java application that will handle both public display of the info as well 
as an administrative interface so that staff members can add or edit the 
records.

I have decided to use Solr as the search mechanism for this project.  Because 
the info in each of our 10 collections is slightly different (e.g., a record 
about a poster does not contain duration information, but a record about a TV 
show does) I was thinking it would be good to separate each collection's index 
into a separate Solr core so that commits coming from one collection do not bog 
down the other unrelated collections.  One reservation I have is that 
eventually we would like to be able to type in "Iraq" and find records across 
all of the collections at once instead of having to search each collection 
separately.  Although I don't know anything about it at this stage, I did 
Google "sharding" after reading someone's recent post on this list and it 
sounds like that may be a potential answer to my question.  Does anyone have 
any advice on how I should initially set up Solr for my situation?  I am slowly 
making my way through the wiki and RTFMing, but I wanted to see what the 
experts have to say because at this point I don't really know where to start.

Thank you very much,
Mari

Re: It's not possible to decide at run-time which similarity class to use, right?

2011-06-16 Thread Robert Muir
On Thu, Jun 16, 2011 at 3:23 PM, Gabriele Kahlout
gabri...@mysimpatico.com wrote:
 I'm trying to assess the impact of coord (search-time) on Qtime. In one
 implementation coord returns 1, while in another it's actually computed.

On query time?

coord should be really cheap (unless your impl does something like
calculate a million digits of pi), as it is not actually computed
per-document.
instead, the result of all possible coord factors (e.g. 1/5, 2/5, 3/5,
4/5, 5/5) is computed up-front by BooleanQuery's scorers into a table.

See 
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/BooleanScorer.java
and 
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/BooleanScorer2.java


Re: getting started

2011-06-16 Thread Jonathan Rochkind

On 6/16/2011 4:41 PM, Mari Masuda wrote:

One reservation I have is that eventually we would like to be able to type in "Iraq" and 
find records across all of the collections at once instead of having to search each collection 
separately.  Although I don't know anything about it at this stage, I did Google 
"sharding" after reading someone's recent post on this list and it sounds like that may 
be a potential answer to my question.


So this kind of stuff can be tricky, but with that eventual requirement 
I would NOT put these in separate cores. Sharding isn't (IMO; if someone 
disagrees, they will hopefully say so!) a good answer to searching 
across entirely different 'schemas', or avoiding frequent-commit issues 
-- sharding is really just for scaling/performance when your index gets 
very very large. (Which it doesn't sound like yours will be, but you can 
deal with that as a separate issue if it becomes so.)


If you're going to want to search across all the collections, put them 
all in the same core.  Either in the exact same indexed fields, or using 
certain common indexed fields -- those common ones are the ones you'll 
be able to search across all collections on. It's okay if some 
collections have unique indexed fields too --- documents in the core 
that don't belong to that collection just won't have any terms in that 
indexed field that is only used by a certain collection, no problem. 
(Then you can distribute this single core into shards if you need to for 
performance reasons related to number of documents/size of index.)


You're right to be thinking about the fact that very frequent commits 
can be performance issues in Solr. But separating into different cores is 
going to create more problems for yourself (if you want to be able to 
search across all collections), in an attempt to solve that one.  
(Among other things, not every Solr feature works in a 
distributed/sharded environment; it's just a more complicated and 
somewhat less mature setup for Solr.)


The way I deal with the frequent-commit issue is by NOT doing frequent 
commits to my production Solr. Instead, I use Solr replication to have a 
'master' Solr index that I do commits to whenever I want, and a 'slave' 
Solr index that serves the production searches, and which only 
replicates from master periodically -- not too often to be 
too-frequent-commits.  That seems to be a somewhat common solution, if 
that use pattern works for you.
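
For reference, a minimal sketch of that master/slave setup with the Java ReplicationHandler (Solr 1.4+); the host name and poll interval are placeholders:

On the master, in solrconfig.xml:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
  </lst>
</requestHandler>

On the slave:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>

The pollInterval (HH:mm:ss) is what keeps slave commits infrequent no matter how often the master commits.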


There are also some near real time features in more recent versions of 
Solr, that I'm not very familiar with. (not sure if any are included in 
the current latest release, or if they are all only still in the repo)  
My sense is that they too only work for certain use patterns; they 
aren't magic bullets for "commit whatever you want as often as you want" 
to Solr.  In general Solr isn't so great at very frequent major changes 
to the index.   Depending on exactly what sort of use pattern you are 
predicting/planning for your commits, maybe people can give you advice 
on how (or if) to do it.


But I personally don't think your idea of splitting your collections 
(that you'll eventually want to search across in a single search) 
into shards is a good solution to frequent-commit issues. You'd be 
complicating your setup and causing other problems for yourself, and not 
really even entirely addressing the too-frequent-commit issue with that 
setup.


Re: getting started

2011-06-16 Thread Sascha SZOTT

Hi Mari,

it depends ...

* How many records are stored in your MySQL databases?
* How often will updates occur?
* How many db records / index documents are changed per update?

I would suggest to start with a single Solr core first. Thereby, you can 
concentrate on the basics and do not need to deal with more advanced 
things like sharding. In case you encounter performance issues later on, 
you can switch to a multi-core setup.


-Sascha

Mari Masuda wrote:

Hello,

I am new to Solr and am in the beginning planning stage of a large project and 
could use some advice so as not to make a huge design blunder that I will 
regret down the road.

Currently I have about 10 MySQL databases that store information about 
different archival collections.  For example, we have data and metadata about a 
political poster collection, a television program, documents and photographs of 
and about a famous author, etc.  My job is to work with the staff archivists to 
come up with a standard metadata template so the 10 databases can be 
consolidated into one.

Currently the info in these databases is accessed through 10 different sets of 
PHP pages that were written a long time ago for PHP 4.  My plan is to write a 
new Java application that will handle both public display of the info as well 
as an administrative interface so that staff members can add or edit the 
records.

I have decided to use Solr as the search mechanism for this project.  Because the info in each of 
our 10 collections is slightly different (e.g., a record about a poster does not contain duration 
information, but a record about a TV show does) I was thinking it would be good to separate each 
collection's index into a separate Solr core so that commits coming from one collection do not bog 
down the other unrelated collections.  One reservation I have is that eventually we would like to 
be able to type in "Iraq" and find records across all of the collections at once instead 
of having to search each collection separately.  Although I don't know anything about it at this 
stage, I did Google "sharding" after reading someone's recent post on this list and it 
sounds like that may be a potential answer to my question.  Does anyone have any advice on how I 
should initially set up Solr for my situation?  I am slowly making my way through the wiki and 
RTFMing, but I wanted to see what

the experts have to say because at this point I don't really know where to 
start.


Thank you very much,
Mari


sending results of function query to range query

2011-06-16 Thread Kevin Osborn
I am not sure if I can use function queries this way. I have a query like 
this: attributeX:[* TO ?] in my DB. I replace the ? with input from the front 
end. Obviously, this works fine. However, what I really want to do is 
attributeX:[* TO (3 * ?)]. Is there any way to embed the results of a function 
query inside the query?

Re: Encoding of alternate fields in highlighting

2011-06-16 Thread Koji Sekiguchi

(11/06/17 0:15), Massimo Schiavon wrote:

I have an index with various fields and I want to highlight query matches on the "title" 
and "content"
fields.
These fields could contain html tags so I've configured HtmlFormatter for 
highlighting. The problem
is that if the query doesn't match the text of the field, solr returns the 
value of configured
alternate field without encoding it.
Is there any way to get encoded value also for alternate fields? And in general 
there is a way to do
html escaping on values returned from a response writer?


Massimo,

At first impression, I think the requirement is reasonable. As long as we 
support HtmlEncoder,
we had better support it with the alternateField option. Please open a jira issue, 
and if possible,
suggest an appropriate option and attach a patch (a patch is not required, but it is 
very helpful).

koji
--
http://www.rondhuit.com/en/


SOlR -- Out of Memory exception

2011-06-16 Thread jyn7
We just started using SOLR. I am trying to load a single file with 20 million
records into SOLR using the CSV uploader. I keep getting an out-of-memory
error after loading 7 million records. Here is the config:

<autoCommit>
  <maxDocs>1</maxDocs>
  <maxTime>6</maxTime>
</autoCommit>

I also encountered a LockObtainFailedException:
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out:
NativeFSLock@D:\work\solr\.\data\index\write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:84)
at
org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1097)

So I changed the lockType to single; now again I am getting an Out of
Memory Exception. I also increased the JVM heap space to 2048M but am still
getting an Out of Memory.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOlR-Out-of-Memory-exception-tp3074636p3074636.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: fieldCache problem OOM exception

2011-06-16 Thread Erick Erickson
Well, if my theory is right, you should be able to generate OOMs at will by
sorting and faceting on all your fields in one query.
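
For instance, a single stress query along these lines (using the field names from your earlier mail; a sketch, adjust to your handler):

  q=*:*&sort=f_dccreator_sort asc,f_dctitle asc,f_dcyear asc&facet=true&facet.field=f_dcperson&facet.field=f_dcsubject&facet.field=f_dcyear&facet.field=f_dccollection&facet.field=f_dclang&facet.field=f_dctypenorm&facet.field=f_dccontenttype

should populate all the relevant FieldCache entries at once.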

But Lucene's cache should be garbage collected, can you take some memory
snapshots during the week? It should hit a point and stay steady there.

How much memory are you giving your JVM? It looks like a lot given your
memory snapshot.

Best
Erick

On Thu, Jun 16, 2011 at 3:01 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:
 Hi Erik,

 yes I'm sorting and faceting.

 1) Fields for sorting:
   sort=f_dccreator_sort, sort=f_dctitle, sort=f_dcyear
   The parameter facet.sort= is empty, only using parameter sort=.

 2) Fields for faceting:
   f_dcperson, f_dcsubject, f_dcyear, f_dccollection, f_dclang, f_dctypenorm,
 f_dccontenttype
   Other faceting parameters:

 ...facet=truefacet.mincount=1facet.limit=100facet.sort=facet.prefix=...

 3) The LukeRequestHandler takes too long for my huge index so this is from
   the standalone luke (compiled for solr3.2):
   f_dccreator_sort = 10.029.196
   f_dctitle        = 21.514.939
   f_dcyear         =      1.471
   f_dcperson       = 14.138.165
   f_dcsubject      =  8.012.319
   f_dccollection   =      1.863
   f_dclang         =        299
   f_dctypenorm     =         14
   f_dccontenttype  =        497

 numDocs:    28.940.964
 numTerms:  686.813.235
 optimized:        true
 hasDeletions:    false

 What can you read/calculate from these values?

 Is my index too big for Lucene/Solr?

 What I don't understand, why fieldCache is not garbage collected
 and therefore reduced in size from time to time.

 Regards
 Bernd

 Am 15.06.2011 17:50, schrieb Erick Erickson:

 The first question I have is whether you're sorting and/or
 faceting on many unique string values? I'm guessing
 that sometime you are. So, some questions to help
 pin it down:
 1  what fields are you sorting on?
 2  what fields are you faceting on?
 3  how many unique terms in each (see the solr admin page).

 Best
 Erick

 On Wed, Jun 15, 2011 at 8:22 AM, Bernd Fehling
 bernd.fehl...@uni-bielefeld.de  wrote:

 Dear list,

 after getting OOM exception after one week of operation with
 solr 3.2 I used MemoryAnalyzer for the heapdumpfile.
 It looks like the fieldCache eats up all memory.

                                                            Objects    Shallow Heap    Retained Heap
  org.apache.lucene.search.FieldCache                             0               0  >= 14,636,950,632
  org.apache.lucene.search.FieldCacheImpl                         1              32  >= 14,636,950,384
  org.apache.lucene.search.FieldCacheImpl$StringIndexCache        1              32  >= 14,636,947,080
  org.apache.lucene.search.FieldCache$StringIndex                10             320  >= 14,636,944,352
  java.lang.String[]                                             519     567,811,040  >= 13,503,733,312
  char[]                                                  81,766,595  11,604,293,712  >= 11,604,293,712

 fieldCache retains over 14g of heap.

 When looking on stats page under fieldCache the description says:
 Provides introspection of the Lucene FieldCache, this is **NOT** a cache
 that is managed by Solr.

 So is this a jetty problem and not solr?

 Why is fieldCache growing and growing until OOM?

 Regards
 Bernd




Re: Boost Strangeness

2011-06-16 Thread Erick Erickson
Right, if you've only changed WordDelimiterFilterFactory in the query analyzer,
then the tokens you're analyzing may be split up. Try running some of the
terms through the admin/analysis page. Unless you have
catenateAll=1 in the definition, the whole term won't be there.

It becomes a question of why you even want WDFF in there in the first
place. Do you ever want to split these fields up this way? Maybe start
by just taking it out completely?
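
For example, a sketch of a query-time WDFF declaration that keeps the whole term alongside the parts (parameter values here are illustrative, not your current config):

<filter class="solr.WordDelimiterFilterFactory"
        splitOnNumerics="0" catenateAll="1"
        generateWordParts="1" generateNumberParts="1"/>

With catenateAll="1", an input like b006m86d also produces the concatenated token, so exact-id matches survive the splitting.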

Best
Erick

On Thu, Jun 16, 2011 at 9:55 AM, Judioo cont...@judioo.com wrote:
 fascinating

 Thank you so much Erik, I'm slowly beginning to understand.

 So I've discovered that by defining 'splitOnNumerics=0' on the filter
 class 'solr.WordDelimiterFilterFactory' (for ONLY the query analyzer) I
 can get *closer* to my required goal!

 Now something else odd is occurring.

 It only returns 2 results when there are over 70?

 Why is that? I can't find where this is explained :(

 query

 /solr/select?omitNorms=true&q=b006m86d&defType=dismax&qf=id^10%20parent_id^9%20brand_container_id^8%20series_container_id^8%20subseries_container_id^8%20clip_container_id^1%20clip_episode_id^1&debugQuery=on&fl=type,id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score&wt=json&indent=on&omitNorms=true

 output

 {
   "responseHeader": {
     "status": 0,
     "QTime": 51,
     "params": {
       "debugQuery": "on",
       "fl": "type,id,parent_id,brand_container_id,series_container_id,subseries_container_id,clip_episode_id,clip_episode_id,score",
       "indent": "on",
       "q": "b006m86d",
       "qf": "id^10 parent_id^9 brand_container_id^8 series_container_id^8 subseries_container_id^8 clip_container_id^1 clip_episode_id^1",
       "wt": "json",
       "omitNorms": ["true", "true"],
       "defType": "dismax"
     }
   },
   "response": {
     "numFound": 2,
     "start": 0,
     "maxScore": 13.473297,
     "docs": [
       {
         "parent_id": "",
         "id": "b006m86d",
         "type": "brand",
         "score": 13.473297
       },
       {
         "series_container_id": "",
         "id": "b00y1w9h",
         "type": "episode",
         "brand_container_id": "b006m86d",
         "subseries_container_id": "",
         "clip_episode_id": "",
         "score": 11.437143
       }
     ]
   },
   "debug": {
     "rawquerystring": "b006m86d",
     "querystring": "b006m86d",
     "parsedquery": "+DisjunctionMaxQuery((id:b006m86d^10.0 | clip_episode_id:b006m86d | subseries_container_id:b006m86d^8.0 | series_container_id:b006m86d^8.0 | clip_container_id:b006m86d | brand_container_id:b006m86d^8.0 | parent_id:b006m86d^9.0)) ()",
     "parsedquery_toString": "+(id:b006m86d^10.0 | clip_episode_id:b006m86d | subseries_container_id:b006m86d^8.0 | series_container_id:b006m86d^8.0 | clip_container_id:b006m86d | brand_container_id:b006m86d^8.0 | parent_id:b006m86d^9.0) ()",
     "explain": {
       "b006m86d": "13.473297 = (MATCH) sum of: 13.473297 = (MATCH) max of: 13.473297 = (MATCH) fieldWeight(id:b006m86d in 27636), product of: 1.0 = tf(termFreq(id:b006m86d)=1) 13.473297 = idf(docFreq=2, maxDocs=783800) 1.0 = fieldNorm(field=id, doc=27636)",
       "b00y1w9h": "11.437143 = (MATCH) sum of: 11.437143 = (MATCH) max of: 11.437143 = (MATCH) weight(brand_container_id:b006m86d^8.0 in 61), product of: 0.82407516 = queryWeight(brand_container_id:b006m86d^8.0), product of: 8.0 = boost 13.878762 = idf(docFreq=1, maxDocs=783800) 0.007422088 = queryNorm 13.878762 = (MATCH) fieldWeight(brand_container_id:b006m86d in 61), product of: 1.0 = tf(termFreq(brand_container_id:b006m86d)=1) 13.878762 = idf(docFreq=1, maxDocs=783800) 1.0 = fieldNorm(field=brand_container_id, doc=61)"
     },
     "QParser": "DisMaxQParser",
     "altquerystring": null,
     "boostfuncs": null,
     "timing": {
       "time": 51,
       "prepare": {
         "time": 6,
         "org.apache.solr.handler.component.QueryComponent": { "time": 5 },
         "org.apache.solr.handler.component.FacetComponent": { "time": 0 },
         "org.apache.solr.handler.component.MoreLikeThisComponent": { "time": 0 },
         "org.apache.solr.handler.component.HighlightComponent": { "time": 1 },
         "org.apache.solr.handler.component.StatsComponent": { "time": 0 },
         "org.apache.solr.handler.component.DebugComponent": { "time": 0 }
       },
       "process": {
         "time": 45,

Re: Document Scoring

2011-06-16 Thread Erick Erickson
I really wouldn't go there, it sounds like there are endless
opportunities for errors!

How real-time is real-time? Could you fix this entirely
by
1> adjusting expectations for, say, 5 minutes?
2> adjusting your commit (on the master) and poll (on the slave) appropriately?

Best
Erick

On Thu, Jun 16, 2011 at 11:41 AM, zarni aung zau...@gmail.com wrote:
 Hi,

 I am designing my indexes to have 1 write-only master core, 2 read-only
 slave cores.  That means the read-only cores will only have snapshots pulled
 from the master and will not have near real time changes.  I was thinking
 about adding a hybrid read and write master core that will have the most
 recent changes from my primary data source.  I am thinking to query the
 hybrid master and the read-only slaves and somehow try to intersect the
 results in order to support near real time full text search.  Is this
 feasible?

 Thank you,

 Zarni



Re: SOlR -- Out of Memory exception

2011-06-16 Thread Erick Erickson
Hmmm, are you still getting your OOM after 7M records? Or some larger
number? And how are you using the CSV uploader?

Best
Erick

On Thu, Jun 16, 2011 at 9:14 PM, jyn7 jyotsna.namb...@gmail.com wrote:
 We just started using SOLR. I am trying to load a single file with 20 million
 records into SOLR using the CSV uploader. I keep getting an out-of-memory
 error after loading 7 million records. Here is the config:

 <autoCommit>
         <maxDocs>1</maxDocs>
         <maxTime>6</maxTime>
 </autoCommit>
 I also encountered a LockObtainFailedException:
 org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out:
 NativeFSLock@D:\work\solr\.\data\index\write.lock
        at org.apache.lucene.store.Lock.obtain(Lock.java:84)
        at
 org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1097)

 So I changed the lockType to single; now again I am getting an Out of
 Memory Exception. I also increased the JVM heap space to 2048M but am still
 getting an Out of Memory.




 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/SOlR-Out-of-Memory-exception-tp3074636p3074636.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: SOlR -- Out of Memory exception

2011-06-16 Thread jyn7
Yes Erick, after changing the lock type to single, I got an OOM after loading
5.5 million records. I am using the curl command to upload the CSV.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOlR-Out-of-Memory-exception-tp3074636p3074765.html
Sent from the Solr - User mailing list archive at Nabble.com.


omitTermFreqAndPositions in a TextField fieldType

2011-06-16 Thread Michael Ryan
Is it possible to use omitTermFreqAndPositions="true" in a fieldType 
declaration that uses class="solr.TextField"? I've tried doing this and it does 
not seem to work (i.e., the prx file size does not change). Using it in a 
field declaration does work, but I'd rather set it in the fieldType so I 
don't have to repeat it multiple times in my schema.

From my schema.xml file:
<fieldType name="foobar" class="solr.TextField" sortMissingLast="true"
    omitNorms="true" omitTermFreqAndPositions="true" indexed="true"
    stored="true" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

In the TextField class I found that it disables OMIT_TF_POSITIONS, which I'm 
assuming is the cause of my problem:
if (schema.getVersion() > 1.1f) properties &= ~OMIT_TF_POSITIONS;

Does it even make sense to use omitTermFreqAndPositions for a TextField, or am 
I perhaps doing something I shouldn't be?
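
For reference, a sketch of the per-field form that does work, repeated on each field using the type (the field name is just an example):

<field name="title" type="foobar" indexed="true" stored="true"
       omitTermFreqAndPositions="true"/>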

-Michael


Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)

2011-06-16 Thread Sujatha Arun
Alexey,

Do you mean that we keep the current index as it is and have a separate core
which has only the user-id, product-id relation, and while querying, do a
join between the two cores based on the user-id?

This would require us to index/delete the product as and when the user
subscription for a product changes. This would involve some amount of
latency if the indexing (we have a queue system for indexing across the
various instances) or deletion is delayed.

If we want to go ahead with this solution: we currently are using Solr 1.3,
so is this functionality available as a patch for Solr 1.3? Would it be
possible to do this with a separate index instead of a core? Then I could
create only one index common to all our instances and use that instance to
do the join.

Thanks
Sujatha

On Thu, Jun 16, 2011 at 9:27 PM, Alexey Serba ase...@gmail.com wrote:

   So a search for a product, once the user logs in and searches for only the
   products that he has access to, will translate to something like this (the
   product ids are obtained from the db for a particular user and can run
   into n number):

   q=search term&fq=product_id:(100 10001 ... n)

   but we are currently running into the too many Boolean expansions error. We
   are not able to tie the user also into roles, as each user is mainly anyone
   who comes to the site and purchases a product.

 I'm wondering if the new trunk Solr join functionality can help here.

 * http://wiki.apache.org/solr/Join

 In theory you can index your products (product_id, ...) and
 user_id-product many-to-many relations (user_product_id, user_id) into
 single/different cores and then do a join, like
 q="search terms"&fq={!join from=product_id to=user_product_id}user_id:10101

 But I haven't tried that, so I'm just speculating.



Re: SOlR -- Out of Memory exception

2011-06-16 Thread pravesh
If you are sending the whole CSV in a single HTTP request using curl, why not
consider sending it in smaller chunks?
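
For example, a sketch using standard Unix split and the CSV update handler (file names are placeholders; this assumes the handler is mapped at /update/csv as in the example solrconfig.xml):

  split -l 1000000 big.csv chunk_
  for f in chunk_*; do
    curl 'http://localhost:8983/solr/update/csv?commit=true' \
      --data-binary @$f -H 'Content-type: text/plain; charset=utf-8'
  done

Note that the header line is only present in the first chunk, so the remaining chunks need the field names supplied via the header=false and fieldnames parameters of the CSV handler.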


--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOlR-Out-of-Memory-exception-tp3074636p3075091.html
Sent from the Solr - User mailing list archive at Nabble.com.