Re: Problem with XML encode UTF-8

2011-02-22 Thread Jan Høydahl
Hi,

Please explain some more.
a) What version of Solr?
b) Are you trying to feed XML or PDF?
c) What request handler are you feeding to? /update or /update/extract ?
d) Can you copy/paste some more lines from the error log?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 21. feb. 2011, at 15.02, jayronsoares wrote:

 
 Hi, I'm using solr py to store PDF files; however, when I run the script,
 it shows me this issue:
 
 An invalid XML character (Unicode: 0xc) was found in the element content of
 the document.
 
 Could someone give me some help?
 
 cheers
 jayron
 -- 
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Any-new-python-libraries-tp493419p2545020.html
 Sent from the Solr - User mailing list archive at Nabble.com.
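
The error above names character 0xc (form feed), which XML 1.0 does not allow in element content and which commonly appears as a page break in text extracted from PDFs. If the text is extracted on the client side and posted to /update as XML, one possible workaround is to strip such characters first. The sketch below is only that: the function name is hypothetical, and it assumes the extracted text is available as a Python string before it is handed to the Solr client.

```python
import re

# Matches anything outside the XML 1.0 "Char" production
# (tab, LF, CR, and the ranges U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF).
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text, replacement=" "):
    """Replace characters that are not legal in XML 1.0 content.

    Hypothetical helper: call it on each field value before building the
    <add><doc>...</doc></add> message.
    """
    return _INVALID_XML_CHARS.sub(replacement, text)


if __name__ == "__main__":
    # A form feed (0x0C) between two pages of extracted PDF text.
    print(strip_invalid_xml_chars("page one\x0cpage two"))
```

If the PDFs are instead sent to /update/extract, the extraction happens inside Solr and this kind of client-side cleanup would not apply.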



Re: Faceting

2011-02-22 Thread Jan Høydahl
Hi,

Even if the customer types a correct product name, how do you know that 
merchant A and merchant B both have registered that exact product in the same 
way?

Merchant A may give the product name as "White Sony LCD TV XY123" while the
other says "Sony XY123 LCD TV, colour=white".

If you're serious about a price comparison service, I think you need to invest
in finding out which products are the same before indexing, and then tag them
with some unique normalized name. Then, after a search, you show a facet on
that normalized name, and only once the user has selected the correct facet
can you be 100% certain that you're comparing apples to apples.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
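
To make the "unique normalized name" idea a bit more concrete, here is a toy sketch of a pre-indexing step that maps both merchant strings from the example to the same key. Everything in it is an assumption for illustration: the function name, the noise-word list, and the choice of a sorted token set as the key. Real product matching would need far more than this (model-number extraction, brand dictionaries, manual review).

```python
import re

# Hypothetical noise words; a real list would come from your own catalogue.
NOISE_TOKENS = {"colour", "color"}


def normalized_product_key(raw_name):
    """Build an order-insensitive key from a merchant's product name."""
    # Lowercase and keep only alphanumeric runs, so "colour=white" -> colour, white.
    tokens = re.findall(r"[a-z0-9]+", raw_name.lower())
    # Drop noise words and sort, so word order no longer matters.
    return " ".join(sorted(set(tokens) - NOISE_TOKENS))


if __name__ == "__main__":
    a = normalized_product_key("White Sony LCD TV XY123")
    b = normalized_product_key("Sony XY123 LCD TV, colour=white")
    print(a == b)
```

The resulting key could be indexed in an untokenized string field and used as the facet field, so that facet values group offers for the same product.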

On 22. feb. 2011, at 07.23, Praveen Parameswaran wrote:

 Hi ,
 @Tommaso @Jan Høydahl Thanks for the response :)
 
 I've done it almost the same way Tommaso suggested, and yes, it's about
 70-80% accurate.
 I understand the contradiction in the search: the customer should find stuff
 without the exact right wording (recall), while at the same time you want the
 query to be precise (precision).
 
 In my scenario both cases are present as well, but mostly a customer would
 know which product name he is searching for and will be interested in
 comparing the prices that different merchants offer. What I feel is that
 maybe the search itself has to be classified based on context.
 
 Would it be possible in Solr to have both of the following?
 1. A customer searches with the correct product name and gets accurate
 results.
 2. A customer searches with a keyword, or without the exact name, and gets
 the most relevant results.
 
 The 2nd part is fine, as it's working well. The 1st part is where I'm struggling.
 
 thanks
 Praveen
 
 On Mon, Feb 21, 2011 at 5:23 PM, Tommaso Teofili
 tommaso.teof...@gmail.com wrote:
 
 Hi Praveen,
 as far as I understand, you have to set the type of the field(s) you are
 searching over to be conservative.
 So, for example, you won't include stemmer and lowercase filters and will use
 only a whitespace tokenizer; moreover, you should search with the default
 operator set to AND.
 Then faceting over those field(s) will depend on those type settings.
 You may find the following wiki page useful:
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
 My 2 cents,
 
 
 2011/2/21 Praveen Parameswaran buz.p...@gmail.com
 
 Hi,
 
 Is it possible to have 100% accuracy for facet counts using Solr? Since
 this is for a product price comparison site, I need the search to
 return accurate results. For example, if I search for "sony lcd Tv" I do not
 want "sony Led Tv" to be returned in the results. Please let me know if this
 is possible, and how.
 
 
 Thanks
 
 Prav
 
 



Configure 2 or more Tomcat instances.

2011-02-22 Thread rajini maski
   I have a Tomcat 6.0 instance running on my system, with
connector port 8090, shutdown port 8005, AJP/1.3 port 8009 and redirect
port 8443 in server.xml (path = C:\Program Files\Apache Software
Foundation\Tomcat 6.0\conf\server.xml).

   How do I configure one more independent Tomcat instance
on the same system? I went through many sites but couldn't get
this working. If anyone knows the proper configuration steps, please reply.

Regards,
Rajani Maski


Re: Configure 2 or more Tomcat instances.

2011-02-22 Thread Jonathan DeMello
Hey Rajani,

From what I've seen, you just need to copy the Tomcat folder and change the
following ports in server.xml: shutdown, connector, AJP. Then you can start
them up independently.

Regards,

Jonathan
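
As a rough illustration of the port change described above, the sketch below shifts the shutdown port (the port attribute on the Server element) and every Connector port by a fixed offset in a copy of server.xml. The function name and the offset of 100 are hypothetical. The redirectPort attribute is left untouched here because the advice above only names three ports; it may also need changing if an HTTPS connector is actually enabled on it. Note that ElementTree drops XML comments, so this is better suited to checking the idea than to rewriting a production file.

```python
import xml.etree.ElementTree as ET


def shift_ports(server_xml, offset):
    """Return server.xml text with shutdown and connector ports shifted.

    Hypothetical helper: assumes the root <Server> element carries the
    shutdown port and each <Connector> carries its own port attribute.
    """
    root = ET.fromstring(server_xml)
    # Shutdown port lives on the root <Server> element.
    root.set("port", str(int(root.get("port")) + offset))
    # HTTP and AJP connectors each have their own port.
    for connector in root.iter("Connector"):
        connector.set("port", str(int(connector.get("port")) + offset))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<Server port="8005" shutdown="SHUTDOWN"><Service name="Catalina">'
        '<Connector port="8090" protocol="HTTP/1.1" redirectPort="8443" />'
        '<Connector port="8009" protocol="AJP/1.3" redirectPort="8443" />'
        "</Service></Server>"
    )
    print(shift_ports(sample, 100))
```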





solr indexing

2011-02-22 Thread satya swaroop
Hi all,
   Out of keen interest in Solr's indexing mechanism, I started digging through
the Solr indexing code (/update/extract) and read up on the index file formats
and the scoring procedure. I have a question about this:
1) Scoring is performed on dynamic and precalculated values (doc
boost, field boost, lengthNorm). When calculating the score, suppose a term
in the index matches nearly one million docs: does Solr calculate the
score for each and every doc containing the term and then pick the top docs
from the index? Or is there some mechanism that limits the
score calculation to only particular docs?

If anybody knows about this, or of any documentation on it, please let me
know.


Regards,
satya


Re: Any plan to make Field Collapsing available for distributed search?

2011-02-22 Thread Koji Sekiguchi

(11/02/22 13:46), Andy wrote:

Hello,

I'm looking into Field Collapsing. According to the documentation one limitation is that 
distributed search support for result grouping has not yet been implemented.

Just wondered if there's any plan to add distributed search support to field 
collapsing. Or is there any technical obstacle that makes such a feature 
unlikely?

Thanks

Andy


Andy,

There is an open ticket for it:
https://issues.apache.org/jira/browse/SOLR-2066

Koji
--
http://www.rondhuit.com/en/


disable replication in a persistent way

2011-02-22 Thread Ahmet Arslan
Hello,

solr/replication?command=disablepoll disables replication on slave(s). However 
it is not persistent. After solr/tomcat restart, slave(s) will continue 
polling. 

Is there a built-in way to disable replication on slave side in a persistent 
manner?

Currently I am using system property substitution along with 
solrcore.properties file to simulate this.

<lst name="slave">
  <str name="enable">${enable.slave:false}</str>

#solrcore.properties in slave
enable.master=true

I then modify solrcore.properties with a custom Solr request handler after the 
disablepoll command, to make it persistent. It seems that there is no existing 
mechanism to write the solrconfig.properties file; am I correct?

Thanks,
Ahmet




  


Re: Datetime problems with dataimport

2011-02-22 Thread MOuli

Ok i got it.

It should look like yyyy-mm-ddThh:mm:ssZ
for example: 2011-02-22T15:07:00Z
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Datetime-problems-with-dataimport-tp2545654p2552477.html
Sent from the Solr - User mailing list archive at Nabble.com.
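
For anyone producing these values from a script rather than from the database, here is a small sketch of formatting a timestamp into the shape shown above (for example 2011-02-22T15:07:00Z). The function name is hypothetical, and the code assumes that a datetime without timezone information is already in UTC.

```python
from datetime import datetime, timezone


def to_solr_date(value):
    """Format a datetime as yyyy-mm-ddThh:mm:ssZ in UTC."""
    if value.tzinfo is None:
        # Assumption: naive datetimes are already expressed in UTC.
        value = value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    print(to_solr_date(datetime(2011, 2, 22, 15, 7, 0)))
```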


Re: Datetime problems with dataimport

2011-02-22 Thread Adam Estrada
I logged an issue in Jira that relates to this and it looks like Yonik picked 
it up.

https://issues.apache.org/jira/browse/SOLR-2286

Adam





Multiple Blocked threads on UnInvertedField.getUnInvertedField() SegmentReader$CoreReaders.getTermsReader

2011-02-22 Thread Rachita Choudhary
Hi Solr Users,

We are upgrading from Solr 1.3 to Solr 1.4.1.
While using Solr 1.3, we were seeing multiple blocked active threads on
org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal().

After upgrading to Solr 1.4.1 to take advantage of NIO, we see a different
set of blocked threads, on
org.apache.solr.request.UnInvertedField.getUnInvertedField() and
SegmentReader$CoreReaders.getTermsReader.
Due to this, QTimes shoot up from a few hundred to thousands of
msec, even going up to 30-40 secs for a single query.

- The blocked threads show up after a few thousand queries.
- We do not have faceting and sorting on the same fields.
- Our facet fields are multivalued text fields, but no large text values are
present.
- Index size: around 10 GB
- We have not specified any method for faceting in our schema.xml.
- Our field value cache settings are:
 <fieldValueCache
   class="solr.FastLRUCache"
   size="175"
   autowarmCount="0"
   showItems="10"
 />

Can someone please tell us why we are seeing these blocked threads?
Also, if they are related to our field value cache, wouldn't a cache of size
175 fill up after very few initial queries, so that we should see the blocked
threads right after that?
What difference will it make if we use facet.method=enum?
Is this all related to fieldValueCache, or is there some other configuration
which we need to set to avoid these blocked threads?

Thanks,
Rachita

Cache values example:

facetField1_27443 :
{field=facet1_27443,memSize=4214884,tindexSize=52,time=22,phase1=15,nTerms=4,bigTerms=0,termInstances=6,uses=1}

facetField1_70 :
{field=facetField1_70,memSize=4223310,tindexSize=308,time=28,phase1=21,nTerms=636,bigTerms=0,termInstances=14404,uses=1}

facetField2 : 
{field=facetField2,memSize=4262644,tindexSize=3156,time=273,phase1=267,nTerms=12188,bigTerms=0,termInstances=1255522,uses=7031}

Stack trace for
org.apache.solr.request.UnInvertedField.getUnInvertedField() - BLOCKED

 at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:837)
 at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:250)
 at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:283)
 at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:166)
 at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:72)
 at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
 at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
 at com.caucho.server.dispatch.FilterFilterChain.doFilter(FilterFilterChain.java:87)
 at com.caucho.server.webapp.WebAppFilterChain.doFilter(WebAppFilterChain.java:187)
 at com.caucho.server.dispatch.ServletInvocation.service(ServletInvocation.java:266)
 at com.caucho.server.http.HttpRequest.handleRequest(HttpRequest.java:270)
 at com.caucho.server.port.TcpConnection.run(TcpConnection.java:678)
 at com.caucho.util.ThreadPool$Item.runTasks(ThreadPool.java:721)
 at com.caucho.util.ThreadPool$Item.run(ThreadPool.java:643)
 at java.lang.Thread.run(Thread.java:595)


org.apache.lucene.index.SegmentReader$CoreReaders.getTermsReader() - BLOCKED

 at org.apache.lucene.index.SegmentReader$CoreReaders.getTermsReader(SegmentReader.java:170)
 at org.apache.lucene.index.SegmentTermDocs.<init>(SegmentTermDocs.java:52)
 at org.apache.lucene.index.SegmentReader.termDocs(SegmentReader.java:987)
 at org.apache.lucene.index.IndexReader.termDocs(IndexReader.java:1102)
 at org.apache.lucene.index.SegmentReader.termDocs(SegmentReader.java:981)
 at org.apache.solr.search.SolrIndexReader.termDocs(SolrIndexReader.java:320)
 at org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:640)
 at org.apache.solr.search.SolrIndexSearcher.getPositiveDocSet(SolrIndexSearcher.java:563)
 at org.apache.solr.search.SolrIndexSearcher.numDocs(SolrIndexSearcher.java:1422)
 at com.askme.solrenhancements.facet.ExtendedFacet.getCustomFacetCount(ExtendedFacet.java:132)
 at com.askme.solrenhancements.facet.ExtendedFacet.getCustomFacetCount(ExtendedFacet.java:92)
 at com.askme.solrenhancements.facet.ExtendedFacet.getFacetAdditionalInfo(ExtendedFacet.java:69)
 at com.askme.solrenhancements.facet.ExtendedFacet.getFacetInfo(ExtendedFacet.java:56)
 at com.askme.solrenhancements.facet.CustomFacetComponent.process(CustomFacetComponent.java:43)
 at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
 at 

Re: Datetime problems with dataimport

2011-02-22 Thread MOuli

Can you give me an example?

Should it look like 2011-02-22'T'14:55:20 or 2011-02-22T14:55:20 or
2011-02-22 14:55:20? I tested every one of these formats, but got the
exception anyway.

Invalid Date String:'2009-12-09'T'00:00:00'
Invalid Date String:'2009-12-09 00:00:00'
Invalid Date String:'2009-12-09T00:00:00'

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Datetime-problems-with-dataimport-tp2545654p2552422.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Configure 2 or more Tomcat instances.

2011-02-22 Thread Paul Libbrecht
Rajini,

you need to make the (~3) ports defined in conf/server.xml different.

paul





Re: Multiple Blocked threads on UnInvertedField.getUnInvertedField() SegmentReader$CoreReaders.getTermsReader

2011-02-22 Thread Bill Bell
+1 for more investigation

Bill Bell
Sent from mobile



Re: Question About Highlighting

2011-02-22 Thread Ahsan |qbal
Hi All

I even tried that (Appending hl.usePhraseHighlighter=true) but it still
does not work.

Please help
Regards
Ahsan Iqbal

On Fri, Feb 18, 2011 at 12:30 AM, Ahmet Arslan iori...@yahoo.com wrote:

  I had a requirement to implement phrase proximity like ["a b c" w/5 "d e f"].
  For this I implemented a custom query parser plugin in which I use nested
  span queries. It looks like documents are filtered correctly, but there is
  an issue in highlighting: terms that stand alone (not in a phrase) are also
  highlighted. Can somebody suggest a fix for this issue?
 

 Appending hl.usePhraseHighlighter=true should work.






Question about Nested Span Near Query

2011-02-22 Thread Ahsan |qbal
Hi All

I had a requirement to implement queries that involve phrase proximity.
For example, a user should be able to search "ab cd" w/5 "de fg", where both
phrases as a whole should be within 5 words of each other. For this I
implemented a query parser that makes use of nested span queries, so the above
query would be parsed as

spanNear([spanNear([Contents:ab, Contents:cd], 0, true),
spanNear([Contents:de, Contents:fg], 0, true)], 5, false)

Queries like this seem to work really well when the phrases are small, but
when the phrases are large they do not work well. Now my question: is there
any limitation in SpanNearQuery such that we cannot handle large phrases in
this way?

please help

Regards
Ahsan


Tokenizer that Protects Phrases

2011-02-22 Thread David Yang
Hi,

 

I am trying to tokenize a string field of products. Two different
products are "camera" and "security camera". What I would like is for
"security camera" to be treated differently from "camera", and only be
displayed when the search is for "security camera"; otherwise, the
results should only display "camera".

In other words, even though they share the English word "camera", their
meanings are different.

My guess is that the best way to deal with this is to manually
provide a file of words that together form a single token, e.g. "laptop
battery", "security camera". Kind of like protwords, but more like
"protphrases".

Is this a good idea for solving this problem? How do I implement it if it
is the right way? If there is a better way of dealing with this, what is
it?

 

Thanks for your time,

David
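
The "protphrases" idea above can be sketched outside Solr as a greedy longest-match over a list of protected phrases. This is only an illustration of the logic in Python, not a Solr TokenizerFactory; the function name and the in-memory phrase list (standing in for the proposed file) are hypothetical.

```python
def tokenize_with_protected_phrases(text, protected_phrases):
    """Whitespace-tokenize text, keeping protected phrases as single tokens."""
    words = text.lower().split()
    phrases = {tuple(p.lower().split()) for p in protected_phrases}
    max_len = max((len(p) for p in phrases), default=1)
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, down to two-word phrases.
        for size in range(min(max_len, len(words) - i), 1, -1):
            if tuple(words[i:i + size]) in phrases:
                tokens.append(" ".join(words[i:i + size]))
                i += size
                break
        else:
            # No protected phrase starts here: emit the single word.
            tokens.append(words[i])
            i += 1
    return tokens


if __name__ == "__main__":
    protected = ["security camera", "laptop battery"]
    print(tokenize_with_protected_phrases("wireless security camera", protected))
    print(tokenize_with_protected_phrases("digital camera", protected))
```

With this behaviour applied at both index and query time, a search for "camera" would no longer match the single token "security camera", which is the separation asked for above.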

 



Re: Multiple Blocked threads on UnInvertedField.getUnInvertedField() SegmentReader$CoreReaders.getTermsReader

2011-02-22 Thread Yonik Seeley
On Tue, Feb 22, 2011 at 9:13 AM, Rachita Choudhary
rachita.choudh...@burrp.com wrote:
 Hi Solr Users,

 We are upgrading from Solr 1.3 to Solr 1.4.1.
 While using Solr 1.3 , we were seeing multiple blocking active threads on
 org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal() .

 To utilize the benefits of NIO, on upgrading to Solr 1.4.1, we see other
 type of multiple blocking threads on
 org.apache.solr.request.UnInvertedField.getUnInvertedField()  

 SegmentReader$CoreReaders.getTermsReader.
 Due to this, the QTimes shoots up from few hundreds to thousand of
 msec.. even going upto 30-40 secs for a single query.

 - The multiple blocking threads show up after few thousands of queries.
 - We do not have faceting and sorting on the same fields.
 - Our facet fields are multivalued text fields, but no large text values are
 present.
 - Index size - around 10 GB
 - We have not specified any method for faceting in our schema.xml.
 - Our field value cache settings are:
  fieldValueCache
        class=solr.FastLRUCache
        size=175
        autowarmCount=0
        showItems=10
  /

 Can someone please tell us the why we are seeing these blocked threads ?
 Also if they are related to our field value cache , then a cache of size 175
 will be filled up with very few initial queries and right after that we
 should see multiple blocking threads ?
 What difference it will make if we have facet.method = enum ?

fc method on a multivalued field instantiates an UnInvertedField (like
a multi-valued field cache) which can take some time.
Just like sorting, you may want to use some warming faceting queries
to make sure that real queries don't pay the cost of the initial entry
construction.

From your fieldValueCache statistics, it looks like the number of
terms is low enough that the enum method may be fine here.

-Yonik
http://lucidimagination.com


 Is this all related to fieldValueCache or is there some other configuration
 which we need to set to avoid these blocking threads?

 Thanks,
 Rachita

 *Cache values example:
 *facetField1_27443 :
 {field=facet1_27443,memSize=4214884,tindexSize=52,time=22,phase1=15,nTerms=4,bigTerms=0,termInstances=6,uses=1}

 facetField1_70 :
 {field=facetField1_70,memSize=4223310,tindexSize=308,time=28,phase1=21,nTerms=636,bigTerms=0,termInstances=14404,uses=1}

 facetField2 : 
 {field=facetField2,memSize=4262644,tindexSize=3156,time=273,phase1=267,nTerms=12188,bigTerms=0,termInstances=1255522,uses=7031}
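
To try the enum suggestion from the reply above, facet.method can be passed as a request parameter. The sketch below just builds such a request URL; the host, port, handler path and function name are assumptions, while facetField2 is taken from the cache statistics in this thread. The same parameters could also serve as the body of a warming query, if you prefer to stay with the fc method and pay the UnInvertedField construction cost up front.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, query, facet_fields, method="enum"):
    """Build a select URL that facets on the given fields with facet.method."""
    params = [
        ("q", query),
        ("rows", "0"),  # only the facet counts are of interest here
        ("facet", "true"),
        ("facet.method", method),
    ]
    params += [("facet.field", field) for field in facet_fields]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical local Solr instance.
    print(facet_query_url("http://localhost:8983/solr/select", "*:*", ["facetField2"]))
```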


Snippet in results

2011-02-22 Thread Rosa (Anuncios)

Hi,

I would like to have a Google-like snippet of 2-3 lines for each doc in my 
search results.


Something like:

TITLE - full title of doc
Description - extracts the sentence or some text before and after the 
keywords, with highlighting, and merges a couple of these extracted pieces 
together


Thanks for your help,

Rosa


Sorting - bad performance

2011-02-22 Thread Jon Drukman
The performance factors wiki says:
"If you do a lot of field based sorting, it is advantageous to add explicitly
warming queries to the newSearcher and firstSearcher event listeners in your
solrconfig which sort on those fields, so the FieldCache is populated prior to
any queries being executed by your users."

I've got an index with 24+ million docs of forum posts from users.  I want to be
able to get a given user's posts sorted by date.  It's taking 20 seconds right
now.  What would I put in the newSearcher/firstSearcher to make that quicker?  Is
there any other general approach I can use to speed up sorting?

The schema looks like

 <fields>
   <field name="type_id" type="string" indexed="true" stored="true" required="true" />
   <field name="subhead" type="text" indexed="true" stored="true"/>
   <field name="post_date" type="date" indexed="true" stored="true" />
   <field name="author" type="cistring" indexed="true" stored="true" />
   <field name="parent_author" type="cistring" indexed="true" stored="true" />
 </fields>

cistring is a case-insensitive string type I created:

   <fieldType name="cistring" class="solr.StrField" sortMissingLast="true" omitNorms="true">
     <analyzer type="index">
       <tokenizer class="solr.LowerCaseTokenizerFactory"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.LowerCaseTokenizerFactory"/>
     </analyzer>
   </fieldType>



UpdateProcessor and copyField

2011-02-22 Thread Teruhiko Kurosaka
Can fields created by copyField instructions be processed by
UpdateProcessors?
Or can only raw input fields be processed?

So far my experiments suggest the latter.


T. Kuro Kurosaka






Indexing languages, dataimporthandler

2011-02-22 Thread Greg Georges
Hello all,

I have just gone through the mailing list and have set up my different field 
type analysers for my 6 different languages in my schema.xml. Here is my 
question. I am using the dataimporthandler to import data from my database into 
my index. In my table, the documentname column's data can be in any of the 6 
languages. Let's say I want to index this data and apply the different language 
analysers for certain cases; what would be the best way in my case? The real 
problem is that I do not know the language of the string in the documentname 
column when I create my index, so I cannot apply the correct field type. 
Should I create a custom transformer?

Thanks

Greg


Re: Snippet in results

2011-02-22 Thread Leonardo Souza
http://wiki.apache.org/solr/HighlightingParameters

[ ]'s
Leonardo Souza
 °v°   Linux user #375225
 /(_)\   http://counter.li.org/
 ^ ^






DIH and updating specific record

2011-02-22 Thread Olson, Ron
Hi all-

I am trying to determine if there is a way to tell Solr to update its index 
for a specific record in the database, identified by ID. All the examples and 
documentation seem to discuss using a last-updated date/time field, but in 
this case modifying the table is not an option. Instead, I'd like to 
invoke Solr's DIH delta query with a specific ID to say "here's something new 
or updated, please update your index with it."

I apologize if this is a trivial thing, but I can't seem to find any 
documentation on how to do it.

Thanks,

Ron
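
One approach that may fit this is to pass the ID as an HTTP request parameter and reference it from the entity query; the DataImportHandler wiki has a section on accessing request parameters. The sketch below only builds the request URL. It assumes, hypothetically, that the query in data-config.xml has been written to filter on ${dataimporter.request.id}, that the handler is registered at /dataimport on a local Solr, and that the parameter is named id. clean=false is included so the import does not wipe the existing index first.

```python
from urllib.parse import urlencode


def dih_single_record_url(handler_url, record_id):
    """Build a DIH request that imports one record without cleaning the index.

    Hypothetical helper; the 'id' parameter name must match whatever the
    entity query in data-config.xml reads from the request.
    """
    params = {
        "command": "full-import",
        "clean": "false",  # keep existing documents
        "id": str(record_id),
    }
    return handler_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(dih_single_record_url("http://localhost:8983/solr/dataimport", 42))
```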


DISCLAIMER: This electronic message, including any attachments, files or 
documents, is intended only for the addressee and may contain CONFIDENTIAL, 
PROPRIETARY or LEGALLY PRIVILEGED information. If you are not the intended 
recipient, you are hereby notified that any use, disclosure, copying or 
distribution of this message or any of the information included in or with it 
is unauthorized and strictly prohibited. If you have received this message in 
error, please notify the sender immediately by reply e-mail and permanently 
delete and destroy this message and its attachments, along with any copies 
thereof. This message does not create any contractual obligation on behalf of 
the sender or Law Bulletin Publishing Company.
Thank you.


RE: XML Stripping from DIH

2011-02-22 Thread Olson, Ron
Thanks a lot! I thought I'd looked on this page but didn't see this one, not 
sure why.

I greatly appreciate it!

Ron

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
Sent: Sunday, February 20, 2011 5:59 AM
To: solr-user@lucene.apache.org
Subject: Re: XML Stripping from DIH

Ron,

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message -----
 From: Olson, Ron rol...@lbpc.com
 To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 Sent: Fri, February 18, 2011 4:05:15 PM
 Subject: XML Stripping from DIH

 Hi all-

 I have some XML in a database that I am trying to index and store; I am
 interested in the various pieces of text, but none of the tags. I've been
 trying to figure out a way to strip all the tags out, but haven't found
 anything within Solr to do so; the XML parser seems to want XPath to get the
 various element values, when all I want is to turn the whole thing into one
 blob of text, regardless of whether it makes any contextual sense.

 Is there something in Solr to do this, or is it something I'd have to write
 myself (which I'm willing to do if necessary)?

 Thanks for any  info,

 Ron






Re: Indexing languages, dataimporthandler

2011-02-22 Thread Teruhiko Kurosaka
Greg,

You could use copyField to copy the column in question to 6 fields, one
for each of your 6 languages,
and hope that the analyzers do something reasonable without
crashing.
Or apply the white-space tokenizer and hope for the best?

If the column has long enough text, you could try a language detector.
My company, Basis Technology, sells one, and it can plug into Solr easily.
http://www.basistech.com/language-identification/


On 2/22/11 11:50 AM, Greg Georges greg.geor...@biztree.com wrote:

Hello all,

I have just gone through the mailing list and have set up my different
field type analysers for my 6 different languages in my schema.xml. Here
is my question. I am using the dataimporthandler to import data from my
database into my index. In my table, the documentname column's data can
be in any of the 6 languages. Let's say I want to index this data and
apply the different language analysers for certain cases; what would be
the best way in my case? The real problem is that I do not know the
language of the string in the documentname column when I create my index,
therefore I cannot apply the correct field type. Should I create a custom
transformer?

Thanks

Greg


T. Kuro Kurosaka, 415-227-9600x122, 617-386-7122(direct)





Re: Passing parameters to DataImportHandler

2011-02-22 Thread Chris Hostetter

: It'd be nice to be able to pass HTTP parameters into DataImportHandler
: that'd be passed into the SQL as parameters, is this possible?

there is a specific sub-section about this in the docs...

http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters


-Hoss


Sort Stability With Date Boosting and Rounding

2011-02-22 Thread Stephen Duncan Jr
I'm trying to use
http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
as
a bf parameter to my dismax handler.  The problem is, the value of NOW can
cause documents in a similar range (date value within a few seconds of each
other) to sometimes round to be equal, and sometimes not, changing their
sort order (when equal, falling back to a secondary sort).  This, in turn,
screws up paging.

The problem is that score is rounded to a lower level of precision than what
the suggested formula produces as a difference between two values within
seconds of each other.  It seems to me if I could round the value to minutes
or hours, where the difference will be large enough to not be rounded-out,
then I wouldn't have problems with order changing on me.  But it's not legal
syntax to specify something like:
recip(ms(NOW,manufacturedate_dt/HOUR),3.16e-11,1,1)

Is this a problem anyone has faced and solved?  Anyone have suggested
solutions, other than indexing a copy of the date field that's rounded to
the hour?

--
Stephen Duncan Jr
www.stephenduncanjr.com
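
The rounding effect described above can be reproduced outside Solr. The following is a minimal sketch, not Solr code: it assumes the boost ends up in a 32-bit float (Lucene scores are floats), models only the recip term rather than the full dismax score, and uses made-up ages (two documents one second apart, roughly one day old). The helper names are hypothetical.

```python
import struct

def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]

def recip(x, m, a, b):
    """recip(x,m,a,b) = a / (m*x + b), the shape of the boost formula."""
    return a / (m * x + b)

def count_ties(age_ms, gap_ms, steps, step_ms):
    """Count the NOW values at which two documents gap_ms apart get equal 32-bit boosts."""
    ties = 0
    for i in range(steps):
        older = age_ms + i * step_ms   # ms(NOW, date) for the older document
        newer = older - gap_ms         # ms(NOW, date) for the newer document
        boost_older = to_float32(recip(older, 3.16e-11, 1, 1))
        boost_newer = to_float32(recip(newer, 3.16e-11, 1, 1))
        if boost_older == boost_newer:
            ties += 1
    return ties

DAY_MS = 24 * 60 * 60 * 1000
steps = 1000
ties = count_ties(age_ms=DAY_MS, gap_ms=1000, steps=steps, step_ms=37)
print(f"{ties} of {steps} NOW values produce a tie")
```

Under these assumptions some values of NOW give the two documents the same boost and others do not, which is the instability that makes the secondary sort apply only some of the time.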


RE: DIH and updating specific record

2011-02-22 Thread David Yang
Chris Hostetter answered this just recently:
http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters

My addition:
Pass a parameter like command=delta-import&idz=31415
And access it via 'sql where id=${dataimporter.request.idz}'

If the idz is a string you might need to prequote the idz value.
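
A minimal sketch of building such a request from a script. The host, port and handler path (/solr/dataimport) are assumptions about a typical setup, and idz is just the illustrative parameter name used above; adjust both to your installation.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and DIH handler path; change to match your install.
SOLR_DIH_URL = "http://localhost:8983/solr/dataimport"

def delta_import_url(record_id):
    """Build a DIH URL that passes the record id as the extra request parameter 'idz'."""
    params = {"command": "delta-import", "idz": record_id}
    # urlencode joins the pairs with '&' and escapes any unsafe characters in the id.
    return SOLR_DIH_URL + "?" + urlencode(params)

print(delta_import_url(31415))
```

Fetching that URL (for example with urllib.request.urlopen) would trigger the import; the id is then available to the DIH query as ${dataimporter.request.idz}.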

-----Original Message-----
From: Olson, Ron [mailto:rol...@lbpc.com] 
Sent: Tuesday, February 22, 2011 3:18 PM
To: solr-user@lucene.apache.org
Subject: DIH and updating specific record

Hi all-

I am trying to determine if there is a way to tell Solr to update its
index for a specific record ID in the database. All the examples
and documentation seem to discuss using a last updated date/time
field, but in this case modifying the table would not be an option.
Instead, I'd like to invoke Solr's DIH delta query with a specific ID to
say "here's something new or updated, please update your index with it".

I apologize if this is a trivial thing, but I can't seem to find any
documentation on how to do it.

Thanks,

Ron




Re: UpdateProcessor and copyField

2011-02-22 Thread Markus Jelsma
Yes. But did you actually search the mailing list or Solr's wiki? I guess not.

Here it is:
http://wiki.apache.org/solr/UpdateRequestProcessor

 Can fields created by copyField instructions be processed by
 UpdateProcessors?
 Or can only raw input fields be processed?
 
 So far my experiment is suggesting the latter.
 
 
 T. Kuro Kurosaka


RE: Sort Stability With Date Boosting and Rounding

2011-02-22 Thread David Yang
One suggestion: use logarithms to compress the large time range into something
easier to compare: 1/log(ms(NOW,date))

-----Original Message-----
From: Stephen Duncan Jr [mailto:stephen.dun...@gmail.com] 
Sent: Tuesday, February 22, 2011 6:03 PM
To: solr-user@lucene.apache.org
Subject: Sort Stability With Date Boosting and Rounding

I'm trying to use
http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
as
a bf parameter to my dismax handler.  The problem is, the value of NOW can
cause documents in a similar range (date value within a few seconds of each
other) to sometimes round to be equal, and sometimes not, changing their
sort order (when equal, falling back to a secondary sort).  This, in turn,
screws up paging.

The problem is that score is rounded to a lower level of precision than what
the suggested formula produces as a difference between two values within
seconds of each other.  It seems to me if I could round the value to minutes
or hours, where the difference will be large enough to not be rounded-out,
then I wouldn't have problems with order changing on me.  But it's not legal
syntax to specify something like:
recip(ms(NOW,manufacturedate_dt/HOUR),3.16e-11,1,1)

Is this a problem anyone has faced and solved?  Anyone have suggested
solutions, other than indexing a copy of the date field that's rounded to
the hour?

--
Stephen Duncan Jr
www.stephenduncanjr.com


Re: Sort Stability With Date Boosting and Rounding

2011-02-22 Thread Geert-Jan Brits
You could always use a secondary sort as a tie-breaker, e.g. something
unique like 'documentid'. That would ensure a stable sort.

2011/2/23 Stephen Duncan Jr stephen.dun...@gmail.com

 I'm trying to use
 http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
 as a bf parameter to my dismax handler.  The problem is, the value of NOW can
 cause documents in a similar range (date value within a few seconds of each
 other) to sometimes round to be equal, and sometimes not, changing their
 sort order (when equal, falling back to a secondary sort).  This, in turn,
 screws up paging.

 The problem is that score is rounded to a lower level of precision than
 what the suggested formula produces as a difference between two values
 within seconds of each other.  It seems to me if I could round the value
 to minutes or hours, where the difference will be large enough to not be
 rounded-out, then I wouldn't have problems with order changing on me.  But
 it's not legal syntax to specify something like:
 recip(ms(NOW,manufacturedate_dt/HOUR),3.16e-11,1,1)

 Is this a problem anyone has faced and solved?  Anyone have suggested
 solutions, other than indexing a copy of the date field that's rounded to
 the hour?

 --
 Stephen Duncan Jr
 www.stephenduncanjr.com



Re: Sort Stability With Date Boosting and Rounding

2011-02-22 Thread Markus Jelsma
Hi,

You're right, it's illegal syntax to use other functions in the ms function,
which is a pity indeed.

However, you reduce the score by 50% for each year. Therefore paging through
the results shouldn't make that much of a difference, because the difference in
score with NOW+2 minutes has a negligible impact on the total score.

I had some thoughts on this issue as well, but I decided the impact was too
small to bother about.

Cheers,

 I'm trying to use
 http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
 as
 a bf parameter to my dismax handler.  The problem is, the value of NOW can
 cause documents in a similar range (date value within a few seconds of each
 other) to sometimes round to be equal, and sometimes not, changing their
 sort order (when equal, falling back to a secondary sort).  This, in turn,
 screws up paging.
 
 The problem is that score is rounded to a lower level of precision than
 what the suggested formula produces as a difference between two values
 within seconds of each other.  It seems to me if I could round the value
 to minutes or hours, where the difference will be large enough to not be
 rounded-out, then I wouldn't have problems with order changing on me.  But
 it's not legal syntax to specify something like:
 recip(ms(NOW,manufacturedate_dt/HOUR),3.16e-11,1,1)
 
 Is this a problem anyone has faced and solved?  Anyone have suggested
 solutions, other than indexing a copy of the date field that's rounded to
 the hour?
 
 --
 Stephen Duncan Jr
 www.stephenduncanjr.com


hierarchical faceting, SOLR-792 - confused on config

2011-02-22 Thread kmf

I'm using solr 4.0 and trying to implement a hierarchical faceting example. 
The example I'm trying to implement is taken from the webcast Mastering the
Power of Faceted Search.
(http://www.lucidimagination.com/solutions/webcasts/faceting)  Around minute
30, Chris Hostetter gives a very nice tips & tricks example he described
as Taxonomy facets.  Where I'm confused is how to get the data
indexed/organized into the taxonomy facets (0/NonFic, 1/NonFic/Law,
0/NonFic, 1/NonFic/Sci, 0/NonFic, 1/NonFic/Hist, 1/NonFic/Sci,
2/NonFic/Sci/Phys).  Since I'm using DIH to import my data from a DB, do I
create a TemplateTransformer to produce the indexed data?  Do I have to do
something special within schema.xml and/or solrconfig.xml?  

Once I figure out the correct config setup, I assume it's simply a matter of
creating the correct solr query like he describes in the video?

Thanks,
kmf
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/hierarchical-faceting-SOLR-792-confused-on-config-tp2556394p2556394.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: hierarchical faceting, SOLR-792 - confused on config

2011-02-22 Thread Koji Sekiguchi

(11/02/23 8:26), kmf wrote:


I'm using solr 4.0 and trying to implement a hierarchical faceting example.
The example I'm trying to implement is taken from the webcast Mastering the
Power of Faceted Search.
(http://www.lucidimagination.com/solutions/webcasts/faceting)  Around minute
30, Chris Hostetter gives a very nice tips & tricks example he described
as Taxonomy facets.  Where I'm confused is how to get the data
indexed/organized into the taxonomy facets (0/NonFic, 1/NonFic/Law,
0/NonFic, 1/NonFic/Sci, 0/NonFic, 1/NonFic/Hist, 1/NonFic/Sci,
2/NonFic/Sci/Phys).  Since I'm using DIH to import my data from a DB, do I
create a TemplateTransformer to produce the indexed data?  Do I have to do
something special within schema.xml and/or solrconfig.xml?

Once I figure out the correct config setup, I assume it's simply a matter of
creating the correct solr query like he describes in the video?

Thanks,
kmf


kmf,

disclaimer: I haven't seen the webcast yet.

First, SOLR-792 is not for hierarchical faceting. Please see SOLR-64.
Second, please take a look at PathHierarchyTokenizer in trunk and 3x.
It cannot output the depth factor (0/, 1/, ...), though.
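
Until something in Solr emits the depth prefix, the tokens can be produced before indexing. A minimal sketch, assuming each document's category arrives as a single slash-separated path such as NonFic/Sci/Phys; the function name is hypothetical and this is not part of Solr or DIH:

```python
def taxonomy_tokens(path, separator="/"):
    """Expand 'A/B/C' into depth-prefixed tokens: '0/A', '1/A/B', '2/A/B/C'."""
    parts = path.split(separator)
    return [
        f"{depth}{separator}{separator.join(parts[:depth + 1])}"
        for depth in range(len(parts))
    ]

print(taxonomy_tokens("NonFic/Sci/Phys"))
```

Each token would be indexed as a separate value of a multi-valued string field, so faceting with a prefix such as 1/NonFic/ lists the children of NonFic.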

Hmm, does anyone think it would be better if it output the depth
factors as the token type, as a payload, or somewhere else?

Koji
--
http://www.rondhuit.com/en/


Re: UpdateProcessor and copyField

2011-02-22 Thread Teruhiko Kurosaka
Markus,

I searched, but I couldn't find a definite answer, so I posted this
question.
The article you quoted talks about implementing a copyField-like operation
using UpdateProcessor.  It doesn't talk about the relationship between
the copyField operation proper and UpdateProcessors.

Kuro

On 2/22/11 3:00 PM, Markus Jelsma markus.jel...@openindex.io wrote:

Yes. But did you actually search the mailing list or Solr's wiki? I guess
not.

Here it is:
http://wiki.apache.org/solr/UpdateRequestProcessor

 Can fields created by copyField instructions be processed by
 UpdateProcessors?
 Or can only raw input fields be processed?
 
 So far my experiment is suggesting the latter.
 
 
 T. Kuro Kurosaka



Re: Date Math

2011-02-22 Thread Chris Hostetter

: org.apache.lucene.queryParser.ParseException: Cannot parse 'last_modified:-DAY':
...
: Are they not supported as a short-cut for NOW-1DAY?  I'm using Solr 1.4.

No, -1DAY is a valid DateMath string (to the DateMathParser), but as a
field value you must specify a valid date string, which can *end* with a
DateMath string.  So NOW-1DAY is legal, as is
2011-02-22T12:34:56Z-1DAY

Note also: you didn't use -1DAY, you tried -DAY, which isn't valid
anywhere.
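
The rule above (a date or NOW first, DateMath only as a suffix) can be approximated with a pattern check. This is a simplified sketch and not Solr's DateMathParser: the unit list and the regular expression are assumptions that cover only the common cases.

```python
import re

# Assumed subset of DateMath units; Solr's parser is the authority here.
UNITS = r"(?:YEARS?|MONTHS?|DAYS?|DATE|HOURS?|MINUTES?|SECONDS?|MILLIS?|MILLISECONDS?)"
# A date value must start with NOW or a full UTC timestamp.
BASE = r"(?:NOW|\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z)"
# Each math step is either a rounding (/DAY) or an offset with a number (+1DAY, -2HOURS).
MATH = rf"(?:/{UNITS}|[+-]\d+{UNITS})"
DATE_VALUE = re.compile(rf"{BASE}{MATH}*")

def looks_like_date_value(text):
    """Return True if text is a base date optionally followed by DateMath steps."""
    return DATE_VALUE.fullmatch(text) is not None

for candidate in ["NOW-1DAY", "2011-02-22T12:34:56Z-1DAY", "-1DAY", "-DAY"]:
    print(candidate, looks_like_date_value(candidate))
```

Under this approximation the first two candidates pass and the last two fail: -1DAY has no base date, and -DAY additionally lacks a number.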


-Hoss


Re: Sort Stability With Date Boosting and Rounding

2011-02-22 Thread Stephen Duncan Jr
The problem comes when you have results that are all the same natural score
(because you've filtered them, with no primary search, for instance), and
are very close together in time.  Then, as you page through, the order
changes.  So the user experience is that they see duplicate documents, and
miss out on some of the docs in the overall set.  It's not something
negligible that I can ignore.  I either have to come up with a fix for this,
or get rid of the boost function altogether.

Stephen Duncan Jr
www.stephenduncanjr.com
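
For reference, the workaround mentioned in the original question (indexing a copy of the date rounded to the hour) could look like this at indexing time. A minimal sketch: the field name manufacturedate_hour_dt is hypothetical, and the boost function would then have to reference that field instead of the full-precision one.

```python
from datetime import datetime, timezone

def solr_date(value):
    """Format a timezone-aware datetime as a Solr UTC date string."""
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def hour_rounded(value):
    """Convert to UTC, then truncate to the start of the hour."""
    utc_value = value.astimezone(timezone.utc)
    return utc_value.replace(minute=0, second=0, microsecond=0)

made = datetime(2011, 2, 22, 18, 3, 27, tzinfo=timezone.utc)
doc = {
    "manufacturedate_dt": solr_date(made),
    # Hypothetical extra field, used only by the boost function.
    "manufacturedate_hour_dt": solr_date(hour_rounded(made)),
}
print(doc)
```

Documents within the same hour then share one boost value whatever NOW is, so they always tie and the secondary sort decides their order consistently.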


On Tue, Feb 22, 2011 at 6:09 PM, Markus Jelsma
markus.jel...@openindex.io wrote:

 Hi,

 You're right, it's illegal syntax to use other functions in the ms
 function, which is a pity indeed.

 However, you reduce the score by 50% for each year. Therefore paging
 through the results shouldn't make that much of a difference, because the
 difference in score with NOW+2 minutes has a negligible impact on the
 total score.

 I had some thoughts on this issue as well, but I decided the impact was too
 small to bother about.

 Cheers,

  I'm trying to use
  http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
  as a bf parameter to my dismax handler.  The problem is, the value of NOW
  can cause documents in a similar range (date value within a few seconds of
  each other) to sometimes round to be equal, and sometimes not, changing
  their sort order (when equal, falling back to a secondary sort).  This, in
  turn, screws up paging.

  The problem is that score is rounded to a lower level of precision than
  what the suggested formula produces as a difference between two values
  within seconds of each other.  It seems to me if I could round the value
  to minutes or hours, where the difference will be large enough to not be
  rounded-out, then I wouldn't have problems with order changing on me.
  But it's not legal syntax to specify something like:
  recip(ms(NOW,manufacturedate_dt/HOUR),3.16e-11,1,1)
 
  Is this a problem anyone has faced and solved?  Anyone have suggested
  solutions, other than indexing a copy of the date field that's rounded to
  the hour?
 
  --
  Stephen Duncan Jr
  www.stephenduncanjr.com



Re: Sort Stability With Date Boosting and Rounding

2011-02-22 Thread Stephen Duncan Jr
No, the problem is that, due to rounding, sometimes the docs ARE considered
ties, and therefore the secondary sort is used, but sometimes they don't
round to exactly equal, and the tiebreaker isn't used, and the results get
shuffled.

Stephen Duncan Jr
www.stephenduncanjr.com


On Tue, Feb 22, 2011 at 6:09 PM, Geert-Jan Brits gbr...@gmail.com wrote:

 You could always use a secondary sort as a tie-breaker, e.g. something
 unique like 'documentid'. That would ensure a stable sort.

 2011/2/23 Stephen Duncan Jr stephen.dun...@gmail.com

  I'm trying to use
  http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
  as a bf parameter to my dismax handler.  The problem is, the value of NOW
  can cause documents in a similar range (date value within a few seconds of
  each other) to sometimes round to be equal, and sometimes not, changing
  their sort order (when equal, falling back to a secondary sort).  This, in
  turn, screws up paging.

  The problem is that score is rounded to a lower level of precision than
  what the suggested formula produces as a difference between two values
  within seconds of each other.  It seems to me if I could round the value
  to minutes or hours, where the difference will be large enough to not be
  rounded-out, then I wouldn't have problems with order changing on me.
  But it's not legal syntax to specify something like:
  recip(ms(NOW,manufacturedate_dt/HOUR),3.16e-11,1,1)
 
  Is this a problem anyone has faced and solved?  Anyone have suggested
  solutions, other than indexing a copy of the date field that's rounded to
  the hour?
 
  --
  Stephen Duncan Jr
  www.stephenduncanjr.com