Re: Input raw log file

2011-01-12 Thread Gora Mohanty
On Wed, Jan 12, 2011 at 11:50 AM, Dinesh mdineshkuma...@karunya.edu.in wrote:

 I have installed and tested the sample xml file and tried indexing..
 everything went successfully, but when i tried with log files i got an error..

Please provide details of what you are doing, and of the error messages.
How exactly are you sending the data files to Solr for indexing? Also,
note that you will most likely need to change the default schema.xml.

 i tried reading the schema.xml and didn't get a clear idea.. can you please
 help..

It is very difficult to try to help you, given the scarce details that you
provide. I would again suggest that you look for someone local to help
you out. Alternatively, read carefully through the extensive documentation
on the Solr Wiki, or get a copy of the Solr book:
https://www.packtpub.com/solr-1-4-enterprise-search-server/book

Regards,
Gora


Re: Input raw log file

2011-01-12 Thread Gora Mohanty
On Wed, Jan 12, 2011 at 12:10 PM, Dinesh mdineshkuma...@karunya.edu.in wrote:

 if i convert it to CSV or XML then it will be time consuming, because the
 indexing and getting data out of it should be real time.. is there any other
 way i can do this? if not, what are the ways i can convert them to CSV
 and XML.. and lastly, which is the doc folder of solr?
[...]

What is "real time" for you? Conversion should be pretty fast.

Also, you could use a FileDataSource, LineEntityProcessor,
and a RegexTransformer to pick up data right from the text
files. This is why I recommended this link to you originally:
http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/
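For reference, a minimal data-config.xml along the lines Gora describes might look like the sketch below (the file path, field names, and regexes are placeholders, not taken from Dinesh's actual logs):

```xml
<dataConfig>
  <!-- FileDataSource + LineEntityProcessor read the log file line by line. -->
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="logline"
            processor="LineEntityProcessor"
            url="/var/log/myapp/app.log"
            rootEntity="true"
            transformer="RegexTransformer">
      <!-- LineEntityProcessor puts each raw line into the 'rawLine' column;
           the RegexTransformer then splits it into fields. -->
      <field column="timestamp" regex="^(\S+ \S+)\s" sourceColName="rawLine" />
      <field column="message"   regex="^\S+ \S+\s+(.*)$" sourceColName="rawLine" />
    </entity>
  </document>
</dataConfig>
```

Each log line would then become one Solr document with timestamp and message fields, provided matching fields are declared in schema.xml.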

Regards,
Gora


Re: Grouping - not sure where to handle (solr or outside of)

2011-01-12 Thread Stefan Matheis
kmf,

after a first read .. I would say that sounds a bit like
http://wiki.apache.org/solr/FieldCollapsing ? But that depends mainly on
your current schema; take a look and let us know if it helps :)

Regards
Stefan
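For context, the FieldCollapsing page describes query parameters roughly like the following (at the time of this thread, grouping required a trunk build or the SOLR-236 patch; group_name is a hypothetical field holding the group a condition belongs to):

```
/select?q=*:*&group=true&group.field=group_name&group.limit=10
```

Each group value (e.g. huckleberry) would then come back with its top member documents nested under it, which is close to the transposed view kmf describes below.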

On Tue, Jan 11, 2011 at 8:06 PM, kmf kfole...@gmail.com wrote:


 I currently have a DIH that is working in terms of being able to
 search/filter on various facets, but I'm struggling to figure out how to
 take it to the next level of what I'd like ideally.

 We have a database where the atomic unit is a condition (like an
 environment description - temp, light, high salt, etc) and these conditions
 can be in groups.

 For example, conditionA may belong to groups huckleberry, star wars and
 some group.

 When I search/filter on a facet I'm currently able to see the conditions
 and
 the information about the conditions (like which group(s) it belongs to),
 but what I'm wanting to do is be able to return group names and their
 member
 conditions along with the conditions' respective info when I search/filter
 on a facet.

 So instead of seeing:

 - conditionA
description: some description
groups:  huckleberry, star wars, some group

 What I would like to see is:
 - huckleberry
  conditionA   temp: 78, light: 12hrs, NaCl: 35g/L
  condition35  control, temp: 65, NaCl: 25g/L

 - star wars
  conditionA   temp: 78, light: 12hrs, NaCl: 35g/L
  conditionDE  temp: 78, light: 24hrs, NaCl: 0


 Is this doable?  My DIH has one entity that is conditions with all of its
 sub entities, would I need to change the DIH to achieve what I want to do?
 And/or do I need to configure the solrconfig and schema files to be able to
 do what I want to do?

 I realize that part of the problem is presentation which is not solr, but
 I'm struggling with figuring out how to transpose from condition to group
 in
 the index, if that makes sense?  Assuming that's what I need to do.

 Or am I totally wrong in thinking I would handle this in the index?

 Thanks,
 kmf

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Grouping-not-sure-where-to-handle-solr-or-outside-of-tp2236108p2236108.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Input raw log file

2011-01-12 Thread Dinesh

i got some idea, like creating a DIH and working with that.. thanks everyone
for the help.. i'll create a regex DIH, i guess that's right..
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Input-raw-log-file-tp2210043p2239947.html
Sent from the Solr - User mailing list archive at Nabble.com.


other index input in Solr

2011-01-12 Thread Jörg Agatz
Good morning Solr users,

(it is morning here in Germany, hence the greeting)

I have a small problem with Solr.
I have an index that was created by another program; it is a Lucene
index and can be read by Luke without any problems.

Now I would like to search it with Solr.
Solr starts without problems and I can also view the index in the admin
interface, but I cannot run a search against it: I get no results,
only error messages.

Unfortunately I am not sure what to make of this. It seems as if Solr is
complaining about empty fields in the index; I just don't know how to change that.

ERROR:

HTTP Status 500 - null java.lang.NullPointerException at
org.apache.solr.request.XMLWriter.writePrim(XMLWriter.java:761) at
org.apache.solr.request.XMLWriter.writeStr(XMLWriter.java:619) at
org.apache.solr.schema.StrField.write(StrField.java:46) at
org.apache.solr.schema.SchemaField.write(SchemaField.java:108) at
org.apache.solr.request.XMLWriter.writeDoc(XMLWriter.java:307) at
org.apache.solr.request.XMLWriter$3.writeDocs(XMLWriter.java:483) at
org.apache.solr.request.XMLWriter.writeDocuments(XMLWriter.java:420) at
org.apache.solr.request.XMLWriter.writeDocList(XMLWriter.java:457) at
org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:520) at
org.apache.solr.request.XMLWriter.writeResponse(XMLWriter.java:130) at
org.apache.solr.request.XMLResponseWriter.write(XMLResponseWriter.java:34)
at
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:325)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:636)

Re: spell suggest response

2011-01-12 Thread Stefan Matheis
satya,

nice to hear that it works :)

On your question about similar words: I would say no. Suggestions are only
generated based on available records, and AFAIK only if the given
word/phrase is misspelled. Perhaps MoreLikeThis could help you, but I'm not
sure about that, especially because you're talking about single words and not
similar documents :/

Stefan

On Wed, Jan 12, 2011 at 6:14 AM, satya swaroop satya.yada...@gmail.com wrote:

 Hi Stefan,
  Ya it works :). Thanks...
  But i have a question... can it be done only getting spell
 suggestions even if the spelled word is correct... I mean near words to
 it...
   ex:-

 http://localhost:8080/solr/spellcheckCompRH?q=java&rows=0&spellcheck=true&spellcheck.count=10

   In the o/p the suggestions will not be coming, as
 java is a word that is spelt correctly...
  But can't we get near suggestions such as javax, javac, etc.?

 Regards,
 satya



Regex DataImportHandler

2011-01-12 Thread Dinesh

Can anyone explain to me how to create a regex DataImportHandler?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Regex-DataImportHandler-tp2240084p2240084.html
Sent from the Solr - User mailing list archive at Nabble.com.


DataImportHandler on Websphere 6.1 NullPointer Exception from SRTServletResponse.setContentType

2011-01-12 Thread chrisclark

Has anyone had any success using the DataImportHandler on Websphere 6.1?

I am getting the following exception from Websphere when viewing the
DataImport Development Console in the browser. The ajax call to retrieve the
dataconfig.xml fails. The odd thing is that if you do an import, the import
succeeds.

[1/11/11 15:38:10:194 GMT] 0042 SolrDispatchF I
org.apache.solr.servlet.SolrDispatchFilter init SolrDispatchFilter.init()
done
[1/11/11 15:38:10:381 GMT] 0042 SolrCore  I
org.apache.solr.core.SolrCore execute [] webapp=/solr path=/select
params={command=show-config&qt=/dataimport} status=0 QTime=47 
[1/11/11 15:38:10:428 GMT] 0042 SolrDispatchF E
org.apache.solr.common.SolrException log java.lang.NullPointerException
at
com.ibm.ws.webcontainer.srt.SRTServletResponse.setContentType(SRTServletResponse.java:1017)
at
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:318)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)

This may be a symptom of what's causing problems when my app tries to do a
dataimport using SolrJ, so that is why I put the stack trace here.

What's happening in my app is that SolrJ sends an Http request to the Solr
instance to do a dataimport. The dataimport succeeds, but the response comes
back as a 404 page not found. This causes SolrJ to throw an exception, and
so the rest of my application fails and reports an error. When doing this
call there is no stack trace in the logs, just an error saying page not
found.

The app works fine on JBoss but doesn't work on Websphere.
The version of Solr is 1.4.1
Websphere is: 
version 6.1.0.0 
Build Number: b0620.14
Build Date: 5/16/06


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DataImportHandler-on-Websphere-6-1-NullPointer-Exception-from-SRTServletResponse-setContentType-tp2240281p2240281.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Regex DataImportHandler

2011-01-12 Thread Gora Mohanty
On Wed, Jan 12, 2011 at 3:07 PM, Dinesh mdineshkuma...@karunya.edu.in wrote:

 Can anyone explain to me how to create a regex DataImportHandler?
[...]

Dear Dinesh,

No offence, but please do some basic legwork on your own
first, and then ask more specific questions.

Did you read the Hathi trust blog that I have now referenced
twice, and try out ideas from that?

Alternatively, as also asked before, please post a short excerpt
from your log files, indicating the parts of the data that you
want to extract. Maybe someone can help you then.

Regards,
Gora


Re: spell suggest response

2011-01-12 Thread satya swaroop
Hi Stefan,
I need the words from the index itself. If "java" is given,
then the relevant, similar, or near words in the index should be shown,
even if the given keyword is spelled correctly. Is that possible?


ex:-

http://localhost:8080/solr/spellcheckCompRH?q=java&rows=0&spellcheck=true&spellcheck.count=10
   In the o/p the suggestions will not be coming, as
java is a word that is spelt correctly...
  But can't we get near suggestions such as javax, javac, etc. (the
terms in the index)?

I read about the Suggester in the Solr wiki at
http://wiki.apache.org/solr/Suggester . I tried to implement it but got
errors such as

*error loading class org.apache.solr.spelling.suggest.suggester*

Regards,
satya
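One possibly relevant alternative not mentioned in the thread: the TermsComponent in Solr 1.4 can return raw indexed terms by prefix, which is closer to "near words for a correctly spelled word" than the spellchecker. A sketch, assuming a field named content and a handler wired up roughly like this in solrconfig.xml:

```xml
<!-- Expose the TermsComponent through a /terms request handler. -->
<searchComponent name="terms" class="solr.TermsComponent" />

<requestHandler name="/terms" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="terms">true</bool>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>
```

A query such as http://localhost:8080/solr/terms?terms.fl=content&terms.prefix=java&terms.limit=10 would then return indexed terms starting with "java" (e.g. javax, javac), without requiring the input to be misspelled.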


Re: Regex DataImportHandler

2011-01-12 Thread Dinesh

ya i did.. i'm trying it.. i asked in case there is a better solution..
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Regex-DataImportHandler-tp2240084p2240295.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: What can cause segment corruption?

2011-01-12 Thread Michael McCandless
Corruption should only happen if 1) we have a bug in Lucene (but we
work hard to fix such bugs, though, LUCENE-2593, fixed in 2.9.4, is a
recent case) or 2) there are hardware problems on the machine.

Mike
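As an aside, the CheckIndex tool that produced the report quoted below can be run standalone to diagnose (and, destructively, repair) a corrupt index; the jar name and index path here are placeholders:

```
java -ea -cp lucene-core-2.9.3.jar \
  org.apache.lucene.index.CheckIndex /path/to/solr/data/index

# Adding -fix rewrites the segments file to drop corrupt segments;
# all documents in those segments are lost, so back up the index first.
```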

On Tue, Jan 11, 2011 at 10:02 AM, Stéphane Delprat
stephane.delp...@blogspirit.com wrote:
 Thanks for your answer,

 It's not a disk space problem here:

 # df -h
 Filesystem            Size  Used Avail Use% Mounted on
 /dev/sda4             280G   22G  244G   9% /


 We will try to install solr on a different server (We just need a little
 time for that)


 Stéphane


 Le 11/01/2011 15:42, Jason Rutherglen a écrit :

 Stéphane,

 I've only seen production index corruption when during merge the
 process ran out of disk space, or there is an underlying hardware
 related issue.

 On Tue, Jan 11, 2011 at 5:06 AM, Stéphane Delprat
 stephane.delp...@blogspirit.com  wrote:

 Hi,


 I'm using Solr 1.4.1 (Lucene 2.9.3)

 And some segments get corrupted:

  4 of 11: name=_p40 docCount=470035
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=1,946.747
    diagnostics = {optimize=true, mergeFactor=6,
 os.version=2.6.26-2-amd64,
 os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
 java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_p40_bj.del]
    test: open reader.OK [9299 deleted docs]
    test: fields..OK [51 fields]
    test: field norms.OK [51 fields]
    test: terms, freq, prox...ERROR [term source:margolisphil docFreq=1 !=
 num docs seen 0 + num docs deleted 0]
 java.lang.RuntimeException: term source:margolisphil docFreq=1 != num
 docs
 seen 0 + num docs deleted 0
        at
 org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
        at
 org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
    test: stored fields...OK [15454281 total field count; avg 33.543
 fields per doc]
    test: term vectorsOK [0 total vector count; avg 0 term/freq
 vector fields per doc]
 FAILED
    WARNING: fixIndex() would remove reference to this segment; full
 exception:
 java.lang.RuntimeException: Term Index test failed
        at
 org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)


 What might cause this corruption?


 I detailed my configuration here:


 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201101.mbox/%3c4d2ae506.7070...@blogspirit.com%3e

 Thanks,





Not storing, but highlighting from document sentences

2011-01-12 Thread Otis Gospodnetic
Hello,

I'm indexing some content (articles) whose text I cannot store in its original 
form for copyright reason.  So I can index the content, but cannot store it.  
However, I need snippets and search term highlighting.  


Any way to accomplish this elegantly?  Or even not so elegantly?

Here is one idea:

* Create 2 indices: main index for indexing (but not storing) the original 
content, the secondary index for storing individual sentences from the original 
article.

* That is, before indexing an article, split it into sentences.  Then index the 
article in the main index, and index+store each sentence in the secondary 
index.  So for each doc in the main index there will be multiple docs in the 
secondary index with individual sentences.  Each sentence doc includes an ID of 
the parent document.

* Then run queries against the main index, and pull individual sentences from 
the secondary index for snippet+highlight purposes.


The problem I see with this approach (and there may be other ones that I am not 
seeing yet) is with queries like "foo AND bar".  In this case "foo" may be a 
match from sentence #1, and "bar" may be a match from sentence #7.  Or maybe 
"foo" is a match in sentence #1, and "bar" is a match in multiple sentences: 
#7 and #10 and #23.

Regardless, when a query is run against the main index, you don't know where 
the 
match was, so you don't know which sentences to go get from the secondary index.

Does anyone have any suggestions for how to handle this?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
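As an illustration of the splitting step Otis describes (not his actual code; the ID scheme and the naive regex sentence splitter are assumptions), the per-sentence docs for the secondary index could be built like this:

```python
import re

def split_into_sentence_docs(article_id, text):
    """Split an article into one 'sentence doc' per sentence, each
    carrying the parent article's ID so that snippets can be looked
    up in the secondary index after a hit in the main index."""
    # Naive sentence boundary: ./!/? followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    return [
        {"id": f"{article_id}_s{i}", "parent_id": article_id, "sentence": s}
        for i, s in enumerate(sentences, start=1)
    ]

docs = split_into_sentence_docs("art42", "Foo is here. Bar appears later! Done.")
for d in docs:
    print(d["id"], "->", d["sentence"])
```

At query time, the parent_id field is what lets you join hits in the main index back to their stored sentences, which is exactly where the "foo AND bar across sentences" problem below comes in.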



DataImportHandler on Websphere - http response 404

2011-01-12 Thread chrisclark

Has anyone had any success using the DataImportHandler on Websphere 6.1?

Below are the logs for a call to reload-config.
I have turned on debug and stepped through the code and the
dataImportHandler correctly reloads the config and the response gets written
out to the http response without any errors being thrown from the Solr code.
However, in Websphere the response is returned as a 404 page not found. So
this is happening somewhere in the Websphere code.
There are no errors reported in any of the Websphere logs.

This all works fine on JBoss but doesn't work on Websphere.
The version of Solr is 1.4.1
Websphere is: 
version 6.1.0.0 
Build Number: b0620.14
Build Date: 5/16/06


This is the log file snippet.

12-Jan-2011 10:54:15,320 -  -  DEBUG header:70 -  GET
/solr/dataimport?optimize=true&clean=false&commit=true&command=reload-config&qt=%2Fdataimport&omitHeader=true&wt=javabin&version=1
HTTP/1.1[\r][\n]
12-Jan-2011 10:54:15,352 -  -  DEBUG header:70 -  User-Agent:
Solr[org.apache.solr.client.solrj.impl.CommonsHttpSolrServer] 1.0[\r][\n]
12-Jan-2011 10:54:15,367 -  -  DEBUG header:70 -  Host:
10.101.41.1:10012[\r][\n]
12-Jan-2011 10:54:15,398 -  -  DEBUG header:70 -  [\r][\n]
12-Jan-2011 10:55:11,403 -  -  DEBUG header:70 -  HTTP/1.1 404 Not
Found[\r][\n]
12-Jan-2011 10:55:11,418 -  -  DEBUG header:70 -  HTTP/1.1 404 Not
Found[\r][\n]
12-Jan-2011 10:55:11,434 -  -  DEBUG header:70 -  Last-Modified: Wed, 12
Jan 2011 10:54:15 GMT[\r][\n]
12-Jan-2011 10:55:11,465 -  -  DEBUG header:70 -  ETag:
12d79dc9632[\r][\n]
12-Jan-2011 10:55:11,481 -  -  DEBUG header:70 -  Cache-Control:
no-cache, no-store[\r][\n]
12-Jan-2011 10:55:11,497 -  -  DEBUG header:70 -  Pragma:
no-cache[\r][\n]
12-Jan-2011 10:55:11,528 -  -  DEBUG header:70 -  Expires: Sat, 01 Jan
2000 01:00:00 GMT[\r][\n]
12-Jan-2011 10:55:11,543 -  -  DEBUG header:70 -  Content-Type:
text/html;charset=ISO-8859-1[\r][\n]
12-Jan-2011 10:55:11,559 -  -  DEBUG header:70 -  $WSEP: [\r][\n]
12-Jan-2011 10:55:11,575 -  -  DEBUG header:70 -  Content-Language:
en-US[\r][\n]
12-Jan-2011 10:55:11,606 -  -  DEBUG header:70 -  Content-Length:
51[\r][\n]
12-Jan-2011 10:55:11,622 -  -  DEBUG header:70 -  Connection:
Close[\r][\n]
12-Jan-2011 10:55:11,637 -  -  DEBUG header:70 -  Date: Wed, 12 Jan 2011
10:55:10 GMT[\r][\n]
12-Jan-2011 10:55:11,653 -  -  DEBUG header:70 -  Server: WebSphere
Application Server/6.1[\r][\n]
12-Jan-2011 10:55:11,684 -  -  DEBUG header:70 -  [\r][\n]
12-Jan-2011 10:55:11,700 -  -  DEBUG content:70 -  Error 404: SRVE0190E:
File not found: /dataimport[\r][\n]
12-Jan-2011 10:55:11,715 -  -  ERROR SolrSearchEngine:422 - Failed to
perform reload-config
org.apache.solr.client.solrj.SolrServerException: Error executing query
at
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
at
com.norkom.search.business.engine.solr.SolrSearchEngine.executeDataImportCommand(SolrSearchEngine.java:415)
at
com.norkom.search.business.engine.solr.SolrSearchEngine.reloadDataImportConfig(SolrSearchEngine.java:374)
at
com.norkom.search.business.engine.solr.SolrSearchEngine.buildIndex(SolrSearchEngine.java:314)
at
com.norkom.search.business.jobs.FtsBuildIndexJob.executeJob(FtsBuildIndexJob.java:62)
at com.norkom.base.business.jobs.ThreadedJob.run(ThreadedJob.java:52)
at java.lang.Thread.run(Thread.java:797)
Caused by: 
org.apache.solr.common.SolrException: Not Found

Not Found




-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DataImportHandler-on-Websphere-http-response-404-tp2240440p2240440.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Input raw log file

2011-01-12 Thread Peter Karich
Dinesh,

it will stay 'real time' even if you convert it. Converting should be
done in the millisecond range, if measurable at all (e.g. if you apply
streaming).
Beware: to use the real-time features you'll need the latest trunk of Solr, IMHO.

I've done similar log-feeding stuff here (with code!):
http://karussell.wordpress.com/2010/10/27/feeding-solr-with-its-own-logs/
(not with a realtime solr!)
You'll have to adapt the parser/matcher to fit your needs.

Regards,
Peter.

 if i convert it to CSV or XML then it will be time consuming cause the
 indexing and getting data out of it should be real time.. is there any way i
 can do other than this.. if not what are the ways i can convert them to CSV
 and XML.. and lastly which is the doc folder of solr


-- 
http://jetwick.com open twitter search



Re: Not storing, but highlighting from document sentences

2011-01-12 Thread Stefan Matheis
Otis,

just interested: storing the full text is not allowed, but splitting it up
into separate sentences is okay?

While you think about using the sentences only as a secondary/additional
source, maybe it would help to search in the sentences themselves, or would
that give misleading results in your case?

Stefan

On Wed, Jan 12, 2011 at 12:02 PM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:

 Hello,

 I'm indexing some content (articles) whose text I cannot store in its
 original
 form for copyright reason.  So I can index the content, but cannot store
 it.
 However, I need snippets and search term highlighting.


 Any way to accomplish this elegantly?  Or even not so elegantly?

 Here is one idea:

 * Create 2 indices: main index for indexing (but not storing) the original
 content, the secondary index for storing individual sentences from the
 original
 article.

 * That is, before indexing an article, split it into sentences.  Then index
 the
 article in the main index, and index+store each sentence in the secondary
 index.  So for each doc in the main index there will be multiple docs in
 the
 secondary index with individual sentences.  Each sentence doc includes an
 ID of
 the parent document.

 * Then run queries against the main index, and pull individual sentences
 from
 the secondary index for snippet+highlight purposes.


 The problem I see with this approach (and there may be other ones that I am
 not
 seeing yet) is with queries like foo AND bar.  In this case foo may be a
 match
 from sentence #1, and bar may be a match from sentence #7.  Or maybe
 foo is
 a match in sentence #1, and bar is a match in multiple sentences: #7 and
 #10
 and #23.

 Regardless, when a query is run against the main index, you don't know
 where the
 match was, so you don't know which sentences to go get from the secondary
 index.

 Does anyone have any suggestions for how to handle this?

 Thanks,
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/




Re: issue with the spatial search with solr

2011-01-12 Thread ur lops
Hi Dennis, thanks a lot for pointing out the problem. It works.


On Tue, Jan 11, 2011 at 11:50 PM, Dennis Gearon gear...@sbcglobal.net wrote:

 You didn't happen to notice that you have one field named
 RestaurantLocation and
 another named RestaurantName, did you?

 You must be submitting 'RestaurantName', and it's being applied to a geo
 field.

  Dennis Gearon


 Signature Warning
 
 It is always a good idea to learn from your own mistakes. It is usually a
 better
 idea to learn from others’ mistakes, so you do not have to make them
 yourself.
 from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


 EARTH has a Right To Life,
 otherwise we all die.
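With Dennis's fix applied, the filter query would reference the location field rather than the name field, i.e. something like:

```
select?wt=json&indent=true&fl=name,store&q=*:*&fq={!geofilt%20sfield=restaurantLocation}&pt=45.15,-93.85&d=5
```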



 - Original Message 
 From: ur lops urlop...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Tue, January 11, 2011 11:13:36 PM
 Subject: issue with the spatial search with solr

 Hi,
 I took the latest build from the hudson and installed on my computer. I
 have done the following changes in my schema.xml

 <fieldType name="latLon" class="solr.LatLonType"
  subFieldSuffix="_latLon" />
 <dynamicField name="*_latLon" type="tdouble" indexed="true"
  stored="false" />
 <field name="restaurantLocation" type="latLon" indexed="true"
  stored="true" />

 When I run the query, I get:
 HTTP ERROR 500

 Problem accessing /solr/select. Reason:

The field restaurantName does not support spatial filtering

 org.apache.solr.common.SolrException: The field restaurantName does
 not support spatial filtering
at

 org.apache.solr.search.SpatialFilterQParser.parse(SpatialFilterQParser.java:86)
at org.apache.solr.search.QParser.getQuery(QParser.java:143)
at

 org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:112)

at

 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:210)

at

 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)

at org.apache.solr.core.SolrCore.execute(SolrCore.java:1296)
at

 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
at

 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)

at
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at

 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)

at

 org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at
 org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at

 org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)

at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at

 org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at

 org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)



 This is my solr query:


 select?wt=json&indent=true&fl=name,store&q=*:*&fq={!geofilt%20sfield=restaurantName}&pt=45.15,-93.85&d=5



 Any help will be highly appreciated.

 Thanks




Re: Not storing, but highlighting from document sentences

2011-01-12 Thread Otis Gospodnetic
Hi Stefan,

Yes, splitting in separate sentences (and storing them) is OK because with a 
bunch of sentences you can't really reconstruct the original article unless you 
know which order to put them in.

Searching against the sentences won't work for queries like "foo AND bar" because 
this should match original articles even if "foo" and "bar" are in different 
sentences.

Otis



- Original Message 
 From: Stefan Matheis matheis.ste...@googlemail.com
 To: solr-user@lucene.apache.org
 Sent: Wed, January 12, 2011 7:02:46 AM
 Subject: Re: Not storing, but highlighting from document sentences
 
 Otis,

 just interested in .. storing the full text is not allowed, but splitting up
 in separate sentences is okay?

 while you think about using the sentences only as secondary/additional
 source, maybe it would help to search in the sentences itself, or would that
 give misleading results in your case?

 Stefan
 
 On Wed, Jan 12, 2011 at 12:02 PM, Otis Gospodnetic
 otis_gospodne...@yahoo.com wrote:

  [...]


Re: solr wildcard queries and analyzers

2011-01-12 Thread Kári Hreinsson
Have you made any progress?  Since the AnalyzingQueryParser doesn't inherit 
from QParserPlugin, Solr doesn't want to use it, but I guess we could implement a 
similar parser that does inherit from QParserPlugin?

Switching parsers seems to be what is needed.  Has really no one solved this 
before?

- Kári

- Original Message -
From: Matti Oinas matti.oi...@gmail.com
To: solr-user@lucene.apache.org
Sent: Tuesday, 11 January, 2011 12:47:52 PM
Subject: Re: solr wildcard queries and analyzers

This might be the solution.

http://lucene.apache.org/java/3_0_2/api/contrib-misc/org/apache/lucene/queryParser/analyzing/AnalyzingQueryParser.html

2011/1/11 Matti Oinas matti.oi...@gmail.com:
 Sorry, the message was not meant to be sent here. We are struggling
 with the same problem here.

 2011/1/11 Matti Oinas matti.oi...@gmail.com:
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Analyzers

 On wildcard and fuzzy searches, no text analysis is performed on the
 search word.

 2011/1/11 Kári Hreinsson k...@gagnavarslan.is:
 Hi,

 I am having a problem with the fact that no text analysis are performed on 
 wildcard queries.  I have the following field type (a bit simplified):
     <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
       <analyzer>
         <tokenizer class="solr.WhitespaceTokenizerFactory" />
         <filter class="solr.TrimFilterFactory" />
         <filter class="solr.LowerCaseFilterFactory" />
         <filter class="solr.ASCIIFoldingFilterFactory" />
       </analyzer>
     </fieldType>

 My problem has to do with Icelandic characters, when I index a document 
 with a text field including the word sjálfsögðu it gets indexed as 
 sjalfsogdu (because of the ASCIIFoldingFilterFactory which replaces the 
 Icelandic characters with their English equivalents).  Then, when I search 
 (without a wildcard) for sjálfsögðu or sjalfsogdu I get that document 
 as a result.  This is convenient since it enables people to search without 
 using accented characters and yet get the results they want (e.g. if they 
 are working on computers with English keyboards).

 However this all falls apart when using wildcard searches, then the search 
 string isn't passed through the filters, and even if I search for sjálf* 
 I don't get any results because the index doesn't contain the original 
 words (I get result if I search for sjalf*).  I know people have been 
 having a similar problem with the case sensitivity of wildcard queries and 
 most often the solution seems to be to lowercase the string before passing 
 it on to solr, which is not exactly an optimal solution (yet a simple one 
 in that case).  The Icelandic characters complicate things a bit and 
 applying the same solution (doing the lowercasing and character mapping) in 
 my application seems like unnecessary duplication of code already part of 
 solr, not to mention complication of my application and possible 
 maintenance down the road.
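For reference, the client-side preprocessing discussed above could look like this minimal sketch. This is a hypothetical helper, not part of Solr; the character mappings simply mirror what ASCIIFoldingFilterFactory does for the Icelandic characters mentioned (e.g. þ folds to "th"):

```java
import java.util.LinkedHashMap;
import java.util.Map;

class WildcardFolder {
    // Mappings mirroring ASCIIFoldingFilter's output for Icelandic characters
    private static final Map<Character, String> FOLD = new LinkedHashMap<Character, String>();
    static {
        FOLD.put('á', "a"); FOLD.put('é', "e"); FOLD.put('í', "i");
        FOLD.put('ó', "o"); FOLD.put('ú', "u"); FOLD.put('ý', "y");
        FOLD.put('ð', "d"); FOLD.put('þ', "th");
        FOLD.put('æ', "ae"); FOLD.put('ö', "o");
    }

    // Lowercase and fold the user's input the same way the index-time
    // analyzer chain does, so a wildcard can be appended safely.
    static String foldForWildcard(String userInput) {
        StringBuilder sb = new StringBuilder();
        for (char c : userInput.toLowerCase().toCharArray()) {
            String mapped = FOLD.get(c);
            sb.append(mapped != null ? mapped : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "sjálf" becomes "sjalf", matching the folded index terms
        System.out.println(foldForWildcard("sjálf") + "*");
    }
}
```

This duplicates analysis logic outside Solr, which is exactly the maintenance concern raised above, but it keeps the ASCIIFoldingFilterFactory benefits for non-wildcard queries.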

 Is there any way around this?  How are people solving this?  Is there a way 
 to apply the filters to wildcard queries?  I guess removing the 
 ASCIIFoldingFilterFactory is the simplest solution but this 
 normalization (of the text done by the filter) is often very useful.

 I hope I'm not overlooking some obvious explanation. :/

 Thanks in advance,
 Kári Hreinsson





Re: FunctionQuery plugin propieties

2011-01-12 Thread dante stroe
Never mind, I found it. You can add XML children to your plugin declaration
in solrconfig.xml and then retrieve them by casting the NamedList argument
received by your plugin at initialization to SolrParams.
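What dante describes would look roughly like this in solrconfig.xml (the names here are made up for illustration):

```xml
<valueSourceParser name="myfunc" class="com.example.MyFunctionParser">
  <str name="someProperty">42</str>
</valueSourceParser>
```

Inside the plugin's init(NamedList args) method, something like SolrParams params = SolrParams.toSolrParams(args) followed by params.get("someProperty") should retrieve the value.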

On Tue, Jan 11, 2011 at 10:28 AM, dante stroe dante.st...@gmail.com wrote:

 Hi,

Is there any way one can define proprieties for a function plugin
 extending the ValueSourceParser inside solrconfig.xml (as one can do with
 the defaults attribute for a query parser plugin inside the request
 handler)?

 Thanks,
 Dante



Can't find source or jar for Solr class JaspellTernarySearchTrie

2011-01-12 Thread Larry White
Hi,

I'm trying to find the source code for class: JaspellTernarySearchTrie. It's
supposed to be used for spelling suggestions.

It's referenced in the javadoc:
http://lucene.apache.org/solr/api/org/apache/solr/spelling/suggest/jaspell/JaspellTernarySearchTrie.html

I realize this is a dumb question, but I've been looking through the
downloads for several hours.  I can't actually find the
package org/apache/solr/spelling/suggest/ that it's supposed to be under.

So if you would be so kind...
What jar is it compiled into?
Where is the source in the downloaded source tree?

thanks.


RE: Not storing, but highlighting from document sentences

2011-01-12 Thread Steven A Rowe
Hi Otis,

I think you can get what you want by doing the first stage retrieval, and then 
in the second stage, add required constraint(s) to the query for the matching 
docid(s), and change the AND operators in the original query to OR.  
Coordination will cause the best snippet(s) to rise to the top, no?

Hmm, you'll want to run the second stage once for each hit from the first 
stage, though, unless you can afford to collect *all* hits and pull out each 
first stage's hit from the intermixed second stage results...

Steve

 -Original Message-
 From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
 Sent: Wednesday, January 12, 2011 7:29 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Not storing, but highlighting from document sentences
 
 Hi Stefan,
 
 Yes, splitting in separate sentences (and storing them) is OK because with
 a
 bunch of sentences you can't really reconstruct the original article
 unless you
 know which order to put them in.
 
 Searching against the sentence won't work for queries like foo AND bar
 because
 this should match original articles even if foo and bar are in different
 sentences.
 
 Otis
 
 
 
 - Original Message 
  From: Stefan Matheis matheis.ste...@googlemail.com
  To: solr-user@lucene.apache.org
  Sent: Wed, January 12, 2011 7:02:46 AM
  Subject: Re: Not storing, but highlighting from document sentences
 
  Otis,
 
  just interested in .. storing the full text is not allowed, but
 splitting up
  in separate sentences is okay?
 
  while you think about  using the sentences only as secondary/additional
  source, maybe it would help  to search in the sentences itself, or would
 that
  give misleading results in  your case?
 
  Stefan
 
  On Wed, Jan 12, 2011 at 12:02 PM, Otis  Gospodnetic 
  otis_gospodne...@yahoo.com  wrote:
 
   Hello,
  
   I'm indexing some content (articles)  whose text I cannot store in its
   original
   form for copyright  reason.  So I can index the content, but cannot
 store
   it.
However, I need snippets and search term highlighting.
  
  
Any way to accomplish this elegantly?  Or even not so  elegantly?
  
   Here is one idea:
  
   * Create 2 indices:  main index for indexing (but not storing) the
 original
   content, the  secondary index for storing individual sentences from
 the
original
   article.
  
   * That is, before indexing an article,  split it into sentences.  Then
 index
   the
   article in the  main index, and index+store each sentence in the
 secondary
   index.   So for each doc in the main index there will be multiple docs
 in
the
   secondary index with individual sentences.  Each sentence doc
 includes an
   ID of
   the parent document.
  
   * Then  run queries against the main index, and pull individual
 sentences
from
   the secondary index for snippet+highlight  purposes.
  
  
   The problem I see with this approach (and  there may be other ones
 that I am
   not
   seeing yet) is with  queries like foo AND bar.  In this case foo may
 be a
match
   from sentence #1, and bar may be a match from sentence #7.   Or
 maybe
   foo is
   a match in sentence #1, and bar is a match  in multiple sentences:
 #7 and
   #10
   and #23.
  
Regardless, when a query is run against the main index, you don't
 know
where the
   match was, so you don't know which sentences to go get from  the
 secondary
   index.
  
   Does anyone have any suggestions  for how to handle this?
  
   Thanks,
   Otis

   Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
   Lucene ecosystem  search :: http://search-lucene.com/
  
  
 


Re: schema.xml in other than conf folder

2011-01-12 Thread Shanmugavel SRD

Hi,
  These two links helped me to solve the problem.
https://issues.apache.org/jira/browse/SOLR-1154
http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node
Thanks,
SRD
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/schema-xml-in-other-than-conf-folder-tp2206587p2241266.html
Sent from the Solr - User mailing list archive at Nabble.com.


Term frequency across multiple documents

2011-01-12 Thread Aaron Bycoffe
I'm attempting to calculate term frequency across multiple documents
in Solr. I've been able to use TermVectorComponent to get this data on
a per-document basis but have been unable to find a way to do it for
multiple documents -- that is, get a list of terms appearing in the
documents and how many times each one appears. I'd also like to be
able to filter the list of terms to be able to see how many times a
specific term appears, though this is less important.

Is there a way to do this in Solr?


Aaron


Re: Solr trunk for production

2011-01-12 Thread Ron Mayer
Otis Gospodnetic wrote:
 Are people using Solr trunk in serious production environments?  I suspect 
 the 
 answer is yes, just want to see if there are any gotchas/warnings.

Yes, since it seemed the best way to get edismax with this patch[1]; and to get
the more update-friendly MergePolicy[2].

Main gotcha I noticed so far is trying to figure out appropriate times
to sync with trunk's newer patches; and whether or not we need to rebuild
our kinda big (> 1TB) indexes when we do.

[1] the patch I needed: https://issues.apache.org/jira/browse/SOLR-2058
[2] nicer MergePolicy https://issues.apache.org/jira/browse/LUCENE-2602


Re: Resolve a DataImportHandler datasource based on previous entity

2011-01-12 Thread alexei

Hi Gora,

Unfortunately reorganizing the data is not an option for me.
Multiple databases exist and a third party is taking care of
populating them. Once a database reaches a certain size, a switch
occurs and a new database is created with the same table structure.


Gora Mohanty-3 wrote:
 
 I meant a script that runs the query that defines the datasources for all
 fields, writes a Solr DIH configuration file, and then initiates a
 dataimport.
 
Ok, so the query would select only the articles for which the data is 
sitting in a specific datasource. Then, only that one datasource would be
indexed.
For each additional datasource, would the script initiate another full-import
with the clean attribute set to false?


I tried to make some changes to the DIH that comes with Solr 1.4.1.
The getResolvedEntityAttribute("dataSource") method seems to do the trick.
Here is the modified code. It feels awkward, but it seems to work.

org.apache.solr.handler.dataimport.ContextImpl

  public DataSource getDataSource() {
    if (ds != null) return ds;
    if (entity == null) return null;

    // Resolve the dataSource attribute against the current variable context
    String dataSourceResolved = this.getResolvedEntityAttribute("dataSource");

    if (entity.dataSrc == null) {
        entity.dataSrc = dataImporter.getDataSourceInstance(entity,
            dataSourceResolved, this);
        entity.dataSource = dataSourceResolved;
    } else if (!dataSourceResolved.equals(entity.dataSource)) {
        // The resolved datasource changed: close the old one, open the new one
        entity.dataSrc.close();
        entity.dataSrc = dataImporter.getDataSourceInstance(entity,
            dataSourceResolved, this);
        entity.dataSource = dataSourceResolved;
    }
    if (entity.dataSrc != null && docBuilder != null
        && docBuilder.verboseDebug
        && Context.FULL_DUMP.equals(currentProcess())) {
      // debug is not yet implemented properly for deltas
      entity.dataSrc =
          docBuilder.writer.getDebugLogger().wrapDs(entity.dataSrc);
    }
    return entity.dataSrc;
  }

I hope I am not breaking any other functionality... 
Would it be possible to add something like this to a future release?

Regards,
Alex



-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Resolve-a-DataImportHandler-datasource-based-on-previous-entity-tp2235573p2241653.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr trunk for production

2011-01-12 Thread Dennis Gearon
What's the syntax for spatial for that version of Solr?

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Ron Mayer r...@0ape.com
To: solr-user@lucene.apache.org
Sent: Wed, January 12, 2011 7:18:10 AM
Subject: Re: Solr trunk for production

Otis Gospodnetic wrote:
 Are people using Solr trunk in serious production environments?  I suspect 
 the 

 answer is yes, just want to see if there are any gotchas/warnings.

Yes, since it seemed the best way to get edismax with this patch[1]; and to get
the more update-friendly MergePolicy[2].

Main gotcha I noticed so far is trying to figure out appropriate times
to sync with trunk's newer patches; and whether or not we need to rebuild
our kinda big (> 1TB) indexes when we do.

[1] the patch I needed: https://issues.apache.org/jira/browse/SOLR-2058
[2] nicer MergePolicy https://issues.apache.org/jira/browse/LUCENE-2602



Re: segment gets corrupted (after background merge ?)

2011-01-12 Thread Stéphane Delprat

I got another corruption.

It sure looks like it's the same type of error. (on a different field)

It's also not linked to a merge, since the segment size did not change.


*** good segment :

  1 of 9: name=_ncc docCount=1841685
compound=false
hasProx=true
numFiles=9
size (MB)=6,683.447
diagnostics = {optimize=false, mergeFactor=10, 
os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, 
lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, 
os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_ncc_22s.del]
test: open reader.OK [275881 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...OK [17952652 terms; 174113812 terms/docs 
pairs; 204561440 tokens]
test: stored fields...OK [45511958 total field count; avg 
29.066 fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq 
vector fields per doc]



a few hours later:

*** broken segment :

  1 of 17: name=_ncc docCount=1841685
compound=false
hasProx=true
numFiles=9
size (MB)=6,683.447
diagnostics = {optimize=false, mergeFactor=10, 
os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, 
lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, 
os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_ncc_24f.del]
test: open reader.OK [278167 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...ERROR [term post_id:1599104 docFreq=1 != 
num docs seen 0 + num docs deleted 0]
java.lang.RuntimeException: term post_id:1599104 docFreq=1 != num docs 
seen 0 + num docs deleted 0
at 
org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
at 
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)

at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
test: stored fields...OK [45429565 total field count; avg 
29.056 fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq 
vector fields per doc]

FAILED
WARNING: fixIndex() would remove reference to this segment; full 
exception:

java.lang.RuntimeException: Term Index test failed
at 
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)

at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)


I'll activate infoStream for next time.


Thanks,


Le 12/01/2011 00:49, Michael McCandless a écrit :

When you hit corruption is it always this same problem?:

   java.lang.RuntimeException: term source:margolisphil docFreq=1 !=
num docs seen 0 + num docs deleted 0

Can you run with Lucene's IndexWriter infoStream turned on, and catch
the output leading to the corruption?  If something is somehow messing
up the bits in the deletes file that could cause this.

Mike

On Mon, Jan 10, 2011 at 5:52 AM, Stéphane Delprat
stephane.delp...@blogspirit.com  wrote:

Hi,

We are using :
Solr Specification Version: 1.4.1
Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
Lucene Specification Version: 2.9.3
Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55

# java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

We want to index 4M docs in one core (and when it works fine we will add
other cores with 2M on the same server) (1 doc ~= 1kB)

We use SOLR replication every 5 minutes to update the slave server (queries
are executed on the slave only)

Documents are changing very quickly, during a normal day we will have approx
:
* 200 000 updated docs
* 1000 new docs
* 200 deleted docs


I attached the last good checkIndex : solr20110107.txt
And the corrupted one : solr20110110.txt


This is not the first time a segment gets corrupted on this server, that's
why I ran frequent checkIndex. (but as you can see the first segment is
1.800.000 docs and it works fine!)


I can't find any SEVERE/FATAL messages or exceptions in the Solr logs.


I also attached my schema.xml and solrconfig.xml


Is there something wrong with what we are doing ? Do you need other info ?


Thanks,





Re: Tuning StatsComponent

2011-01-12 Thread stockii

i try this: 

http://host:port/solr/select?q=YOUR_QUERY&stats=on&stats.field=amount&f.amount.stats.facet=currency&rows=0

and this:

http://host:port/solr/select?q=amount_us:*+OR+amount_eur:*[+OR+amount_...:*]&stats=on&stats.field=amount_usd&stats.field=amount_eur[&stats.field=amount_...]&rows=0
 

of my index. 

but however I change my request, every request has a QTime of ~10 seconds...

My conclusion: the Solr StatsComponent cannot be fast on 31 million documents =(
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Tuning-StatsComponent-tp2225809p2241793.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: segment gets corrupted (after background merge ?)

2011-01-12 Thread Michael McCandless
Curious... is it always a docFreq=1 != num docs seen 0 + num docs deleted 0?

It looks like new deletions were flushed against the segment (del file
changed from _ncc_22s.del to _ncc_24f.del).

Are you hitting any exceptions during indexing?

Mike

On Wed, Jan 12, 2011 at 10:33 AM, Stéphane Delprat
stephane.delp...@blogspirit.com wrote:
 I got another corruption.

 It sure looks like it's the same type of error. (on a different field)

 It's also not linked to a merge, since the segment size did not change.


 *** good segment :

  1 of 9: name=_ncc docCount=1841685
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=6,683.447
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
 os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0
 _20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_ncc_22s.del]
    test: open reader.OK [275881 deleted docs]
    test: fields..OK [51 fields]
    test: field norms.OK [51 fields]
    test: terms, freq, prox...OK [17952652 terms; 174113812 terms/docs pairs;
 204561440 tokens]
    test: stored fields...OK [45511958 total field count; avg 29.066
 fields per doc]
    test: term vectorsOK [0 total vector count; avg 0 term/freq
 vector fields per doc]


 a few hours later:

 *** broken segment :

  1 of 17: name=_ncc docCount=1841685
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=6,683.447
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
 os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0
 _20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_ncc_24f.del]
    test: open reader.OK [278167 deleted docs]
    test: fields..OK [51 fields]
    test: field norms.OK [51 fields]
    test: terms, freq, prox...ERROR [term post_id:1599104 docFreq=1 != num
 docs seen 0 + num docs deleted 0]
 java.lang.RuntimeException: term post_id:1599104 docFreq=1 != num docs seen
 0 + num docs deleted 0
        at
 org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
    test: stored fields...OK [45429565 total field count; avg 29.056
 fields per doc]
    test: term vectorsOK [0 total vector count; avg 0 term/freq
 vector fields per doc]
 FAILED
    WARNING: fixIndex() would remove reference to this segment; full
 exception:
 java.lang.RuntimeException: Term Index test failed
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)


 I'll activate infoStream for next time.


 Thanks,


 Le 12/01/2011 00:49, Michael McCandless a écrit :

 When you hit corruption is it always this same problem?:

   java.lang.RuntimeException: term source:margolisphil docFreq=1 !=
 num docs seen 0 + num docs deleted 0

 Can you run with Lucene's IndexWriter infoStream turned on, and catch
 the output leading to the corruption?  If something is somehow messing
 up the bits in the deletes file that could cause this.

 Mike

 On Mon, Jan 10, 2011 at 5:52 AM, Stéphane Delprat
 stephane.delp...@blogspirit.com  wrote:

 Hi,

 We are using :
 Solr Specification Version: 1.4.1
 Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
 Lucene Specification Version: 2.9.3
 Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55

 # java -version
 java version 1.6.0_20
 Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
 Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

 We want to index 4M docs in one core (and when it works fine we will add
 other cores with 2M on the same server) (1 doc ~= 1kB)

 We use SOLR replication every 5 minutes to update the slave server
 (queries
 are executed on the slave only)

 Documents are changing very quickly, during a normal day we will have
 approx
 :
 * 200 000 updated docs
 * 1000 new docs
 * 200 deleted docs


 I attached the last good checkIndex : solr20110107.txt
 And the corrupted one : solr20110110.txt


 This is not the first time a segment gets corrupted on this server,
 that's
 why I ran frequent checkIndex. (but as you can see the first segment is
 1.800.000 docs and it works fine!)


 I can't find any SEVERE/FATAL messages or exceptions in the Solr logs.


 I also attached my schema.xml and solrconfig.xml


 Is there something wrong with what we are doing ? Do you need other info
 ?


 Thanks,





Re: Tuning StatsComponent

2011-01-12 Thread stockii

My field type is double. Maybe sint would be better? But I need double... =(
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Tuning-StatsComponent-tp2225809p2241903.html
Sent from the Solr - User mailing list archive at Nabble.com.


Where does admin UI visually distinguish between master and slave?

2011-01-12 Thread Will Milspec
Hi all,

I'm getting started with a master/slave configuration for two Solr
instances.  To distinguish between 'master' and 'slave', I've set the system
properties (e.g. -Denable.master) and am using the same 'solrconfig.xml'.

I can see via the system properties admin UI that the jvm (and thus solr)
sees correct values, i.e.:
enable.master = false
enable.slave = true
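For reference, the single-solrconfig.xml pattern from the SolrReplication wiki looks roughly like this (the masterUrl value and confFiles list are placeholders):

```xml
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="enable">${enable.master:false}</str>
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
  <lst name="slave">
    <str name="enable">${enable.slave:false}</str>
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```

With this layout, only the `enable.master`/`enable.slave` system properties differ between the two nodes.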

However, the replication admin UI is identical for both 'master' and
'slave'. (i.e.
http://localhost:8983/solr/production/admin/replication/index.jsp)

I'd like a clearer visual confirmation that the master node is indeed a
master and the slave is a slave.

Summary question:
Does the admin UI  distinguish betwen master and slave?

thanks

will


Re: Not storing, but highlighting from document sentences

2011-01-12 Thread Otis Gospodnetic
Hi Steve,



- Original Message 
 From: Steven A Rowe sar...@syr.edu
 Subject: RE: Not storing, but highlighting from document sentences
 
 I think you can get what you want by doing the first stage  retrieval, and 
 then 
in the second stage, add required constraint(s) to the query  for the matching 
docid(s), and change the AND operators in the original query to  OR.  
Coordination will cause the best snippet(s) to rise to the top,  no?

Right, right.
So if the original query is: foo AND bar, I'd run it against the main index, 
get 
top N hits, say N=10.
Then I'd create another query: +(foo OR bar) +articleID:(ORed list of top N 
article IDs from main results)
And then I'd use that to get enough sentence docs to have at least 1 of them 
for each hit from the main index.
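The second-stage query described above could be assembled like this sketch ("articleID" is just the example field name from this thread, and real query terms would need Lucene-syntax escaping):

```java
import java.util.Arrays;
import java.util.List;

class SnippetQueryBuilder {
    // Relax the original AND query to OR, and restrict to the article IDs
    // returned by the first-stage search against the main index.
    static String build(List<String> terms, List<String> articleIds) {
        StringBuilder q = new StringBuilder("+(");
        for (int i = 0; i < terms.size(); i++) {
            if (i > 0) q.append(" OR ");
            q.append(terms.get(i));
        }
        q.append(") +articleID:(");
        for (int i = 0; i < articleIds.size(); i++) {
            if (i > 0) q.append(" OR ");
            q.append(articleIds.get(i));
        }
        q.append(")");
        return q.toString();
    }

    public static void main(String[] args) {
        // prints: +(foo OR bar) +articleID:(12 OR 34)
        System.out.println(build(Arrays.asList("foo", "bar"),
                                 Arrays.asList("12", "34")));
    }
}
```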

Hm, I wonder what happens when instead of simple foo AND bar you have a more 
complex query with more elaborate grouping and such...


 Hmm, you'll want to run the second stage once for each hit from the  first 
stage, though, unless you can afford to collect *all* hits and pull out  each 
first stage's hit from the intermixed second stage  results...

Wouldn't the above get me all sentences I need for top N hits from the main 
result in a single shot, assuming I use high enough rows=NNN to minimize the 
possibility of not getting even 1 sentence for any one of those top N hits?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/ 

 Steve
 
  -Original Message-
  From:  Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
   Sent: Wednesday, January 12, 2011 7:29 AM
  To: solr-user@lucene.apache.org
   Subject: Re: Not storing, but highlighting from document sentences
  
  Hi Stefan,
  
  Yes, splitting in separate sentences (and  storing them) is OK because with
  a
  bunch of sentences you can't  really reconstruct the original article
  unless you
  know which  order to put them in.
  
  Searching against the sentence won't work  for queries like foo AND bar
  because
  this should match original  articles even if foo and bar are in different
  sentences.
  
  Otis
  
  
  
  - Original Message  
   From: Stefan Matheis matheis.ste...@googlemail.com
To: solr-user@lucene.apache.org
Sent: Wed, January 12, 2011 7:02:46 AM
   Subject: Re: Not  storing, but highlighting from document sentences
  
Otis,
  
   just interested in .. storing the full text is  not allowed, but
  splitting up
   in separate sentences is  okay?
  
   while you think about  using the sentences  only as secondary/additional
   source, maybe it would help  to  search in the sentences itself, or would
  that
   give  misleading results in  your case?
  
   Stefan
   
   On Wed, Jan 12, 2011 at 12:02 PM, Otis  Gospodnetic  
   otis_gospodne...@yahoo.com   wrote:
  
Hello,
   
 I'm indexing some content (articles)  whose text I cannot store in  its
original
form for copyright   reason.  So I can index the content, but cannot
  store
 it.
 However, I need snippets and search term  highlighting.
   
   
 Any  way to accomplish this elegantly?  Or even not so  elegantly?

Here is one idea:
   
 * Create 2 indices:  main index for indexing (but not storing)  the
  original
content, the  secondary index for  storing individual sentences from
  the
  original
article.
   
* That  is, before indexing an article,  split it into sentences.   Then
  index
the
article in the   main index, and index+store each sentence in the
  secondary
 index.   So for each doc in the main index there will be multiple  docs
  in
 the
secondary index  with individual sentences.  Each sentence doc
  includes an
 ID of
the parent document.

* Then  run queries against the main index, and pull  individual
  sentences
 from
the  secondary index for snippet+highlight  purposes.
   

The problem I see with this approach (and   there may be other ones
  that I am
not
 seeing yet) is with  queries like foo AND bar.  In this case  foo may
  be a
 match
from  sentence #1, and bar may be a match from sentence #7.   Or
   maybe
foo is
a match in sentence #1, and  bar is a match  in multiple sentences:
  #7 and
 #10
and #23.
   
  Regardless, when a query is run against the main index, you don't
   know
 where the
match was, so you don't  know which sentences to go get from  the
  secondary
 index.
   
Does anyone have any  suggestions  for how to handle this?
   
 Thanks,
Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene  ecosystem  search :: http://search-lucene.com/ 
   
   
   
 


Re: Multiple Solr instances common core possible ?

2011-01-12 Thread Otis Gospodnetic
That's correct.  Only 1 instance should be writing.  You should be able to 
point 
multiple Solr read-only instances to the same physical read-only index.  I 
don't 
recall trying this recently, though.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Dennis Gearon gear...@sbcglobal.net
 To: solr-user@lucene.apache.org
 Sent: Tue, January 11, 2011 12:29:54 PM
 Subject: Re: Multiple Solr instances common core possible ?
 
 NOT sure about any of it, but THINK that READ ONLY, with one solr instance 
doing 

 writes is possible. I've heard that it's NEVER possible to do multiple Solr 
 Instances writing.
 
  Dennis Gearon
 
 
 Signature  Warning
 
 It is always a good idea to learn from your own  mistakes. It is usually a 
better 

 idea to learn from others’ mistakes, so you  do not have to make them 
 yourself. 

 from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'
 
 
 EARTH  has a Right To Life,
 otherwise we all die.
 
 
 
 - Original  Message 
 From: Ravi Kiran ravi.bhas...@gmail.com
 To: solr-user@lucene.apache.org
 Sent:  Tue, January 11, 2011 9:15:06 AM
 Subject: Multiple Solr instances common  core possible ?
 
 Hello,
 Is it possible to  deploy multiple solr instances with different
 context roots pointing to the  same solr core ? If I do this will there be
 any deadlocks or file handle  issues ? The reason I need this setup is
 because I want to expose solr to an  third party vendor via a different
 context root. My solr instance is deployed  on Glassfish. Alternately, if
 there is a configurable way to setup multiple  context roots for the same
 solr instance that will suffice at this point of  time.
 
 Ravi Kiran
 



Re: DataImportHandler on Websphere - http response 404

2011-01-12 Thread chrisclark

I have found a workaround for this.

1. Change the entry in solrconfig.xml for the DataImportHandler by removing
the slash from the name, like this: <requestHandler name="dataimport" ...>

2. When making the request through SolrJ, don't use a slash in the qt
parameter, i.e.
solrParameters.set("qt", "dataimport");

If you use slashes, the url generated by SolrJ will be like
'/solr/dataimport?...&qt=%2Fdataimport'
Removing the slashes will change the url to something like
'/solr/select?...&qt=dataimport'
(Solr will use the 'qt' parameter to find the right handler).

The resulting url will be something like this:
/solr/select?optimize=true&clean=false&commit=true&command=reload-config&qt=dataimport&wt=javabin&version=1

My guess is that '/select' is mapped in the web.xml of Solr to a servlet,
whereas '/dataimport' is not and that Websphere will complain about that,
whereas JBoss doesn't care.
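The step-1 registration without the slash would look roughly like this in solrconfig.xml (the config file name is an assumption):

```xml
<requestHandler name="dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```

A request then goes through the select servlet, e.g. /solr/select?qt=dataimport&command=full-import, which sidesteps the Websphere path-mapping issue.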
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DataImportHandler-on-Websphere-http-response-404-tp2240440p2242162.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: default RegexFragmenter

2011-01-12 Thread Otis Gospodnetic
Sebastian,

If I remember my regular expressions, the - and / are really just that: literal 
characters.  The character class means any of the characters between [ and ].  - 
and / are just two of those characters, along with newline, space, comma, etc.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Sebastian M mihais...@yahoo.com
 To: solr-user@lucene.apache.org
 Sent: Tue, January 11, 2011 11:22:01 AM
 Subject: default RegexFragmenter
 
 
 Hello,
 
 I'm investigating an issue where spellcheck queries are  tokenized without
 being explicitly told to do so, resulting in suggestions  such as
 www.www.product4sale.com.com for the queries such  as
 www.product4sale.com.
 
 The default RegexFragmenter fragmenter  (name=regex) uses the regular
 expression:
 
 [-\w ,/\n\']{20,200}
 
 I understand parts of it, but I'm not sure about the -  sign, or the slash
 midway through it.
 I would like to perhaps tailor this  regular expression to not cause query
 terms such as www.product4sale.com to  be broken down on the period marks,
 but just be kept as they are.
 
 Any  suggestions or answers are highly appreciated!
 
 Sebastian
 -- 
 View  this message in context: 
http://lucene.472066.n3.nabble.com/default-RegexFragmenter-tp2235106p2235106.html

 Sent  from the Solr - User mailing list archive at Nabble.com.
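Otis's reading of the pattern is easy to verify outside Solr. The sketch below exercises the default fragmenter pattern directly with java.util.regex; it is a standalone check of the regex itself, not Solr's highlighter code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FragmenterRegexDemo {
    // The default RegexFragmenter pattern quoted above: a run of 20-200
    // characters drawn from word chars, space, comma, slash, newline,
    // hyphen and apostrophe. '-' (first in the class) and '/' are literal.
    static final Pattern FRAG = Pattern.compile("[-\\w ,/\\n']{20,200}");

    /** Returns the first fragment found, or null if no run is long enough. */
    public static String firstFragment(String text) {
        Matcher m = FRAG.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // '.' is NOT in the class, so a URL-like token is split at the periods;
        // the first 20+ char run starts after the last period.
        System.out.println(firstFragment("visit www.product4sale.com for more details today"));
        // '-' and '/' are matched literally, so this stays in one fragment.
        System.out.println(firstFragment("twenty-two/seven plus more"));
    }
}
```

This confirms the behavior Sebastian is seeing: periods always break fragments with the default pattern, so keeping URLs whole would require adding `.` to the character class.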
 


Re: Where does admin UI visually distinguish between master and slave?

2011-01-12 Thread Otis Gospodnetic
Hi Will,

I don't think we have a clean master or slave label anywhere in the Admin 
UI.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Will Milspec will.mils...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wed, January 12, 2011 11:18:17 AM
 Subject: Where does admin UI visually distinguish between master and 
slave?
 
 Hi all,
 
 I'm getting started with a master/slave configuration for two  solr
 instances.  To distinguish between 'master' and 'slave', I've set the system
 properties (e.g. -Dmaster.enabled) and am using the same 'solrconfig.xml'.
 
 I can see via the system properties admin UI that the  jvm (and thus solr)
 sees correct values, i.e.:
 enable.master =  false
 enable.slave = true
 
 However, the replication admin UI is  identical for both 'master' and
 'slave'. (i.e.
 http://localhost:8983/solr/production/admin/replication/index.jsp)
 
 I'd  like a clearer visual confirmation that the master node is indeed a
 master  and the slave is a slave.
 
 Summary question:
  Does the admin UI distinguish between master and slave?
 
 thanks
 
 will
 


Re: Where does admin UI visually distinguish between master and slave?

2011-01-12 Thread Markus Jelsma
Well, slaves do show different things on the replication.jsp page.

Master  http://10cc:8080/solr/replication
Poll Interval   00:00:10
Local Index Index Version: 1294666552434, Generation: 2515
Location: /var/lib/solr/data/index
Size: 4.65 GB
Times Replicated Since Startup: 934 

Where master nodes (or slaves where enabled=false) show:

Local Index Index Version: 1294666552449, Generation: 2530
Location: /var/lib/solr/data/index
Size: 4.65 GB 

On Wednesday 12 January 2011 17:24:57 Otis Gospodnetic wrote:
 Hi Will,
 
 I don't think we have a clean master or slave label anywhere in the
 Admin UI.
 
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
 
 
 - Original Message 
 
  From: Will Milspec will.mils...@gmail.com
  To: solr-user@lucene.apache.org
  Sent: Wed, January 12, 2011 11:18:17 AM
  Subject: Where does admin UI visually distinguish between master and
 
 slave?
 
  Hi all,
  
  I'm getting started with a master/slave configuration for two  solr
  instances.  To distinguish between 'master' and 'slave', I've set the
  system properties (e.g. -Dmaster.enabled) and am using the same 
  'solrconfig.xml'.
  
  I can see via the system properties admin UI that the  jvm (and thus
  solr) sees correct values, i.e.:
  enable.master =  false
  enable.slave = true
  
  However, the replication admin UI is  identical for both 'master' and
  'slave'. (i.e.
  http://localhost:8983/solr/production/admin/replication/index.jsp)
  
  I'd  like a clearer visual confirmation that the master node is indeed a
  master  and the slave is a slave.
  
  Summary question:
   Does the admin UI distinguish between master and slave?
  
  thanks
  
  will

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: icq or other 'instant gratification' communication forums for Solr

2011-01-12 Thread Otis Gospodnetic
Dennis,

Join #solr on Freenode.

But it's not necessarily any livelier than this ML.  It depends who's actively 
on.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Dennis Gearon gear...@sbcglobal.net
 To: solr-user@lucene.apache.org
 Sent: Tue, January 11, 2011 12:09:22 AM
 Subject: icq or other 'instant gratification' communication forums for Solr
 
 Are there any chatrooms or ICQ rooms to ask questions late at night to people 
 who stay up or are on other side of planet?
 
  Dennis  Gearon
 
 
 Signature Warning
 
 It is always a good  idea to learn from your own mistakes. It is usually a 
better 

 idea to learn  from others’ mistakes, so you do not have to make them 
 yourself. 

 from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
 
 
 EARTH  has a Right To Life,
 otherwise we all die.
 



Re: spell suggest response

2011-01-12 Thread Juan Grande
It isn't exactly what you want, but did you try with the onlyMorePopular
parameter?

http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.onlyMorePopular

Regards,

Juan Grande

On Wed, Jan 12, 2011 at 7:29 AM, satya swaroop satya.yada...@gmail.com wrote:

 Hi Stefan,
I need the words from the index itself. If 'java' is given,
 then relevant, similar, or nearby words in the index should be shown,
 even when the given keyword is spelled correctly. Is that possible?


 ex:-


 http://localhost:8080/solr/spellcheckCompRH?q=java&rows=0&spellcheck=true&spellcheck.count=10
   In the output, no suggestions come back, since
 'java' is a word that is spelt correctly.
   But can't we get nearby suggestions such as javax, javac, etc. (the
 terms in the index)?

 I read  about  suggester in solr wiki at
 http://wiki.apache.org/solr/Suggester . But i tried to implement it but
 got
 errors as

 *error loading class org.apache.solr.spelling.suggest.suggester*

 Regards,
 satya



Re: pruning search result with search score gradient

2011-01-12 Thread Erick Erickson
What's the use-case you're trying to solve? Because if you're
still showing results to the user, you're taking information away
from them. Where are you expecting to get the list? If you try
to return the entire list, you're going to pay the penalty
of creating the entire list and transmitting it across the wire rather
than just a pages' worth.

And if you're paging, the user will do this for you by deciding for
herself when she's getting less relevant results.

So I don't understand what value you're trying to provide to the end
user; perhaps if you elaborate on that I'll have a more useful
response

Best
Erick

On Tue, Jan 11, 2011 at 3:12 AM, Julien Piquot julien.piq...@arisem.com wrote:

 Hi everyone,

 I would like to be able to prune my search result by removing the less
 relevant documents. I'm thinking about using the search score : I use the
  search scores of the document set (I assume they are sorted in descending
  order), normalise them (0 would be the lowest value and 1 the greatest
 value) and then calculate the gradient of the normalised scores. The
 documents with a gradient below a threshold value would be rejected.
 If the scores are linearly decreasing, then no document is rejected.
 However, if there is a brutal score drop, then the documents below the drop
 are rejected.
 The threshold value would still have to be tuned but I believe it would
 make a much stronger metric than an absolute search score.

 What do you think about this approach? Do you see any problem with it? Is
 there any SOLR tools that could help me dealing with that?

 Thanks for your answer.

 Julien
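For what it's worth, Julien's gradient idea is cheap to prototype client-side on the scores of a returned page before deciding whether it is useful. A minimal sketch of the description above (the threshold value and the normalisation scheme are assumptions, not anything Solr provides):

```java
public class ScoreGradientPruner {
    /**
     * Returns the number of leading documents to keep. Scores must be
     * sorted in descending order. As soon as the normalised score curve
     * drops more steeply than `threshold` between two neighbours, the
     * remaining documents are pruned.
     */
    public static int keepCount(float[] scores, float threshold) {
        if (scores.length < 2) return scores.length;
        float range = scores[0] - scores[scores.length - 1];
        if (range == 0f) return scores.length;   // flat scores: keep all
        for (int i = 1; i < scores.length; i++) {
            // gradient of the normalised score curve between i-1 and i
            float drop = (scores[i - 1] - scores[i]) / range;
            if (drop > threshold) return i;      // brutal drop: cut here
        }
        return scores.length;                    // roughly linear: keep all
    }

    public static void main(String[] args) {
        // The drop from 9.1 to 3.0 dominates the range, so only the
        // first three documents survive.
        float[] scores = {9.8f, 9.5f, 9.1f, 3.0f, 2.9f};
        System.out.println(keepCount(scores, 0.5f));
    }
}
```

Note this shares the weakness Erick points out: to find the drop you must first fetch all the scores you might prune.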



Re: Not storing, but highlighting from document sentences

2011-01-12 Thread Tomislav Poljak
Hi Steven,
if I understand correctly, you are suggesting query execution in two
phases: first execute query on whole article index core (where whole
articles are indexed, but not stored) to get article IDs (for articles
which match original query).  Then for each match in article core:
change the AND operators from the original query to OR and add
articleID condition/filter and execute such query on sentence based
index (with assumption each sentence based doc has articleID set).

Is this correct, and is this what "you'll want to run the second
stage once for each hit from the first stage, though" is referring to?

Example for this scenario would be for original query q=apples and
oranges, execute q=apples and orange with fl=articleId on article
core and for each articleIdX result execute q=(apples OR orange) AND
articleId:articleIdX on sentence based core.

Same thing (with the same results) should be doable with only a single
query in second phase, for previous example that single query for
second phase would be for all articleId1,...,articleIdN something
like:

q=((apples OR orange) AND articleId:articleId1) OR ((apples OR orange)
AND articleId:articleId2) OR ... OR ((apples OR orange) AND
articleId:articleIdN)

But, here in the second case results are ordered by sentence scoring
instead of article scoring, and results should be re-ordered. Is this what
"unless you can afford to collect *all* hits and pull out each first
stage's hit from the intermixed second stage results" is referring to?

My actual question after this really long intro is: couldn't this be
done with single second level query approach, but on each topN
start/row chunk as user iterates through first level results?

For example, user executes query q=apples and oranges and this
results in 1000 results, but first page display only for example 20
results which means proposed solution would:

1. phase: execute q=apples and orange with fl=articleId on
article core, but with start=0&rows=20
2. phase: q=((apples OR orange) AND articleId:articleId1) OR ((apples
OR orange) AND articleId:articleId2) OR ... OR ((apples OR orange) AND
articleId:articleId20)
3. Reorder sentence results to match order defined by article matching
scores and return to user

Only, the results here would need to be collapsed on unique articleID,
so only 20 results are provided in result set (because multiple
sentence based doc can be returned for a single unique articleID)

Would this work?

Thanks,
Tomislav

2011/1/12 Steven A Rowe sar...@syr.edu:
 Hi Otis,

 I think you can get what you want by doing the first stage retrieval, and 
 then in the second stage, add required constraint(s) to the query for the 
 matching docid(s), and change the AND operators in the original query to OR.  
 Coordination will cause the best snippet(s) to rise to the top, no?

 Hmm, you'll want to run the second stage once for each hit from the first 
 stage, though, unless you can afford to collect *all* hits and pull out each 
 first stage's hit from the intermixed second stage results...

 Steve

 -Original Message-
 From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
 Sent: Wednesday, January 12, 2011 7:29 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Not storing, but highlighting from document sentences

 Hi Stefan,

 Yes, splitting in separate sentences (and storing them) is OK because with
 a
 bunch of sentences you can't really reconstruct the original article
 unless you
 know which order to put them in.

 Searching against the sentence won't work for queries like foo AND bar
 because
 this should match original articles even if foo and bar are in different
 sentences.

 Otis



 - Original Message 
  From: Stefan Matheis matheis.ste...@googlemail.com
  To: solr-user@lucene.apache.org
  Sent: Wed, January 12, 2011 7:02:46 AM
  Subject: Re: Not storing, but highlighting from document sentences
 
  Otis,
 
  just interested in .. storing the full text is not allowed, but
 splitting up
  in separate sentences is okay?
 
  while you think about  using the sentences only as secondary/additional
  source, maybe it would help  to search in the sentences itself, or would
 that
  give misleading results in  your case?
 
  Stefan
 
  On Wed, Jan 12, 2011 at 12:02 PM, Otis  Gospodnetic 
  otis_gospodne...@yahoo.com  wrote:
 
   Hello,
  
   I'm indexing some content (articles)  whose text I cannot store in its
   original
   form for copyright  reason.  So I can index the content, but cannot
 store
   it.
    However, I need snippets and search term highlighting.
  
  
    Any way to accomplish this elegantly?  Or even not so  elegantly?
  
   Here is one idea:
  
   * Create 2 indices:  main index for indexing (but not storing) the
 original
   content, the  secondary index for storing individual sentences from
 the
    original
   article.
  
   * That is, before indexing an article,  split it into sentences.  Then
 index
   the
   article in the  main index, and index+store 

Re: Term frequency across multiple documents

2011-01-12 Thread Juan Grande
Maybe there is a better solution, but I think that you can solve this
problem using facets. You will get the number of documents where each term
appears. Also, you can filter a specific set of terms by entering a query
like +field:term1 OR +field:term2 OR ..., or using the facet.query
parameter.

Regards,

Juan Grande

On Wed, Jan 12, 2011 at 11:08 AM, Aaron Bycoffe 
abyco...@sunlightfoundation.com wrote:

 I'm attempting to calculate term frequency across multiple documents
 in Solr. I've been able to use TermVectorComponent to get this data on
 a per-document basis but have been unable to find a way to do it for
 multiple documents -- that is, get a list of terms appearing in the
 documents and how many times each one appears. I'd also like to be
 able to filter the list of terms to be able to see how many times a
 specific term appears, though this is less important.

 Is there a way to do this in Solr?


 Aaron
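If document counts from faceting are not enough and total occurrences are needed, one client-side option is to request TermVectorComponent output per document and sum the tf values. A sketch of just the merge step (the per-document maps here stand in for tf values parsed from the response):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermFreqMerger {
    /** Sums per-document term frequencies into corpus-wide totals. */
    public static Map<String, Integer> merge(List<Map<String, Integer>> perDoc) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> doc : perDoc) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                // add this document's tf to the running total for the term
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> docs = List.of(
                Map.of("solr", 3, "index", 1),
                Map.of("solr", 2, "query", 4));
        System.out.println(merge(docs).get("solr")); // 3 + 2 across documents
    }
}
```

Filtering to a specific set of terms is then just a lookup into the merged map, which matches Aaron's secondary requirement.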



Re: pruning search result with search score gradient

2011-01-12 Thread Jonathan Rochkind
Sometimes I've _considered_ trying to do this (but generally decided it
wasn't worth it), when I didn't want those documents below the
threshold to show up in the facet values.  In my application the facet 
counts are sometimes very pertinent information, that are sometimes not 
quite as useful as they could be when they include barely-relevant hits.


On 1/12/2011 11:42 AM, Erick Erickson wrote:

What's the use-case you're trying to solve? Because if you're
still showing results to the user, you're taking information away
from them. Where are you expecting to get the list? If you try
to return the entire list, you're going to pay the penalty
of creating the entire list and transmitting it across the wire rather
than just a pages' worth.

And if you're paging, the user will do this for you by deciding for
herself when she's getting less relevant results.

So I don't understand what value you're trying to provide to the end
user; perhaps if you elaborate on that I'll have a more useful
response

Best
Erick

On Tue, Jan 11, 2011 at 3:12 AM, Julien Piquot julien.piq...@arisem.com wrote:


Hi everyone,

I would like to be able to prune my search result by removing the less
relevant documents. I'm thinking about using the search score : I use the
search scores of the document set (I assume they are sorted in descending
order), normalise them (0 would be the lowest value and 1 the greatest
value) and then calculate the gradient of the normalised scores. The
documents with a gradient below a threshold value would be rejected.
If the scores are linearly decreasing, then no document is rejected.
However, if there is a brutal score drop, then the documents below the drop
are rejected.
The threshold value would still have to be tuned but I believe it would
make a much stronger metric than an absolute search score.

What do you think about this approach? Do you see any problem with it? Is
there any SOLR tools that could help me dealing with that?

Thanks for your answer.

Julien



RE: Not storing, but highlighting from document sentences

2011-01-12 Thread Steven A Rowe
  I think you can get what you want by doing the first stage  retrieval,
  and then in the second stage, add required constraint(s) to the query
  for the matching docid(s), and change the AND operators in the
  original query to OR.  Coordination will cause the best snippet(s) to
  rise to the top,  no?
 
 Right, right.
 So if the original query is: foo AND bar, I'd run it against the main
 index, get top N hits, say N=10.
 Then I'd create another query: +(foo OR bar) +articleID:(ORed list of top
 N article IDs from main results)
 And then I'd use that to get enough sentence docs to have at least 1 of
 them for each hit from the main index.
 
 Hm, I wonder what happens when instead of simple foo AND bar you have a
 more complex query with more elaborate grouping and such...

:) I was hoping that you could limit the query language to exclude grouping...  
If not, you could walk the boolean query, trim all clauses that are PROHIBITED, 
then flatten all of the remaining terms to a single OR'd query?

  Hmm, you'll want to run the second stage once for each hit from the
  first stage, though, unless you can afford to collect *all* hits and pull
  out each first stage's hit from the intermixed second stage  results...
 
 Wouldn't the above get me all sentences I need for top N hits from the
 main result in a single shot, assuming I use high enough rows=NNN to
 minimize the possibility of not getting even 1 sentence for any one of
 those top N hits?

Yes, but the problem is that the worst case is that you have to retrieve *all* 
second-stage hits to get at least one for each of the first-stage hits.  So if 
you're okay with NNN = numDocs, then no problem.

Steve
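The flattened second-stage query Steve describes (all terms OR'd together, restricted to the first-stage article IDs) is mechanical to build once the stage-one IDs are in hand. A sketch, using the field name from the example above:

```java
import java.util.List;
import java.util.StringJoiner;

public class SecondStageQuery {
    /**
     * Builds the flattened second-stage query: all original terms OR'd
     * together (default operator), restricted to the article IDs
     * returned by stage one.
     */
    public static String build(List<String> terms, List<String> articleIds) {
        StringJoiner termClause = new StringJoiner(" ", "+(", ")");
        terms.forEach(termClause::add);
        StringJoiner idClause = new StringJoiner(" ", "+articleId:(", ")");
        articleIds.forEach(idClause::add);
        return termClause + " " + idClause;
    }

    public static void main(String[] args) {
        // e.g. first-stage hits were articles 17 and 42
        System.out.println(build(List.of("apples", "orange"),
                                 List.of("17", "42")));
    }
}
```

As Steve notes, this only flattens simple AND queries cleanly; grouped or prohibited clauses would need to be walked and trimmed first.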



Re: DIH - Closing ResultSet in JdbcDataSource

2011-01-12 Thread Shane Perry
I have found where a root entity has completed processing and added the
logic to clear the entity's cache at that point (didn't change any of the
logic for clearing all entity caches once the import has completed).  I have
also created an enhancement request found at
https://issues.apache.org/jira/browse/SOLR-2313.

On Tue, Jan 11, 2011 at 2:54 PM, Shane Perry thry...@gmail.com wrote:

 By placing some strategic debug messages, I have found that the JDBC
 connections are not being closed until all entity elements have been
 processed (in the entire config file).  A simplified example would be:

 <dataConfig>
   <dataSource name="ds1" driver="org.postgresql.Driver"
 url="jdbc:postgresql://localhost:5432/db1" user="..." password="..." />
   <dataSource name="ds2" driver="org.postgresql.Driver"
 url="jdbc:postgresql://localhost:5432/db2" user="..." password="..." />

   <document>
 <entity name="entity1" datasource="ds1" ...>
   ... field list ...
   <entity name="entity1a" datasource="ds1" ...>
 ... field list ...
   </entity>
 </entity>
 <entity name="entity2" datasource="ds2" ...>
   ... field list ...
   <entity name="entity2a" datasource="ds2" ...>
 ... field list ...
   </entity>
 </entity>
   </document>
 </dataConfig>

 The behavior is:

 JDBC connection opened for entity1 and entity1a - Applicable queries run
 and ResultSet objects processed
 All open ResultSet and Statement objects closed for entity1 and entity1a
 JDBC connection opened for entity2 and entity2a - Applicable queries run
 and ResultSet objects processed
 All open ResultSet and Statement objects closed for entity2 and entity2a
 All JDBC connections (none are closed at this point) are closed.

 In my instance, I have some 95 unique entity elements (19 parents with 5
 children each), resulting in 95 open JDBC connections.  If I understand the
 process correctly, it should be safe to close the JDBC connection for a
 root entity (immediate children of document) and all descendant
 entity elements once the parent has been successfully completed.  I have
 been digging around the code, but due to my unfamiliarity with the code, I'm
 not sure where this would occur.

 Is this a valid solution?  It's looking like I should probably open a
 defect and I'm willing to do so along with submitting a patch, but need a
 little more direction on where the fix would best reside.

 Thanks,

 Shane



 On Mon, Jan 10, 2011 at 7:14 AM, Shane Perry thry...@gmail.com wrote:

 Gora,

 Thanks for the response.  After taking another look, you are correct about
 the hasnext() closing the ResultSet object (1.4.1 as well as 1.4.0).  I
 didn't recognize the case difference in the two function calls, so missed
 it.  I'll keep looking into the original issue and reply if I find a
 cause/solution.

 Shane


 On Sat, Jan 8, 2011 at 4:04 AM, Gora Mohanty g...@mimirtech.com wrote:

 On Sat, Jan 8, 2011 at 1:10 AM, Shane Perry thry...@gmail.com wrote:
  Hi,
 
  I am in the process of migrating our system from Postgres 8.4 to Solr
  1.4.1.  Our system is fairly complex and as a result, I have had to
 define
  19 base entities in the data-config.xml definition file.  Each of these
  entities executes 5 queries.  When doing a full-import, as each entity
  completes, the server hosting Postgres shows 5 idle in transaction
 for the
  entity.
 
  In digging through the code, I found that the JdbcDataSource wraps the
  ResultSet object in a custom ResultSetIterator object, leaving the
 ResultSet
  open.  Walking through the code I can't find a close() call anywhere on
 the
  ResultSet.  I believe this results in the idle in transaction
 processes.
 [...]

 Have not examined the idle in transaction issue that you
 mention, but the ResultSet object in a ResultSetIterator is
 closed in the private hasnext() method, when there are no
 more results, or if there is an exception. hasnext() is called
 by the public hasNext() method that should be used in
 iterating over the results, so I see no issue there.

 Regards,
 Gora

 P.S. This is from Solr 1.4.0 code, but I would not think that
this part of the code would have changed.
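The fix being discussed boils down to closing each entity's JDBC resources as soon as that root entity finishes, rather than at the end of the whole import. In modern Java that is what try-with-resources gives for free: resources close in reverse order of acquisition the moment the block exits. A standalone illustration with stand-in resources (not DIH code; Connection/Statement/ResultSet are replaced by a logging fake):

```java
import java.util.ArrayList;
import java.util.List;

public class CloseOrderDemo {
    // Stand-in for Connection/Statement/ResultSet that records when it closes.
    static class Resource implements AutoCloseable {
        final String name;
        final List<String> log;
        Resource(String name, List<String> log) { this.name = name; this.log = log; }
        @Override public void close() { log.add("closed " + name); }
    }

    /** Simulates processing one root entity; returns the event log. */
    public static List<String> processEntity() {
        List<String> log = new ArrayList<>();
        // try-with-resources closes in reverse order of acquisition,
        // as soon as this entity's work is done -- per entity, not per import.
        try (Resource conn = new Resource("connection", log);
             Resource stmt = new Resource("statement", log);
             Resource rs = new Resource("resultSet", log)) {
            log.add("processed rows");
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(processEntity());
    }
}
```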






RE: Not storing, but highlighting from document sentences

2011-01-12 Thread Steven A Rowe
Hi Tomislav,

 if I understand correctly, you are suggesting query execution in two
 phases: first execute query on whole article index core (where whole
 articles are indexed, but not stored) to get article IDs (for articles
 which match original query).  Then for each match in article core:
 change the AND operators from the original query to OR and add
 articleID condition/filter and execute such query on sentence based
 index (with assumption each sentence based doc has articleID set).

Yes.

 Is this correct, and is this what "you'll want to run the second
 stage once for each hit from the first stage, though" is referring to?
 
 Example for this scenario would be for original query q=apples and
 oranges, execute q=apples and orange with fl=articleId on article
 core and for each articleIdX result execute q=(apples OR orange) AND
 articleId:articleIdX on sentence based core.
 
 Same thing (with the same results) should be doable with only a single
 query in second phase, for previous example that single query for
 second phase would be for all articleId1,...,articleIdN something
 like:
 
 q=((apples OR orange) AND articleId:articleId1) OR ((apples OR orange)
 AND articleId:articleId2) OR ... OR ((apples OR orange) AND
 articleId:articleIdN)
 
 But, here in the second case results are ordered by sentence scoring
 instead of article scoring, and results should be re-ordered. Is this what
 "unless you can afford to collect *all* hits and pull out each first
 stage's hit from the intermixed second stage results" is referring to?

Yes.

 My actual question after this really long intro is: couldn't this be
 done with single second level query approach, but on each topN
 start/row chunk as user iterates through first level results?
 
 For example, user executes query q=apples and oranges and this
 results in 1000 results, but first page display only for example 20
 results which means proposed solution would:
 
 1. phase: execute q=apples and orange with fl=articleId on
 article core, but with start=0&rows=20
 2. phase: q=((apples OR orange) AND articleId:articleId1) OR ((apples
 OR orange) AND articleId:articleId2) OR ... OR ((apples OR orange) AND
 articleId:articleId20)
 3. Reorder sentence results to match order defined by article matching
 scores and return to user
 
 Only, the results here would need to be collapsed on unique articleID,
 so only 20 results are provided in result set (because multiple
 sentence based doc can be returned for a single unique articleID)
 
 Would this work?

I think so, but I don't have any experience using collapsing, so I can't say 
for sure.

BTW, Otis' rearrangement of your phase #2 would also work, and would be 
theoretically faster to evaluate: q=+(apples orange) +articleId:(articleId1 ... 
articleId20)

Steve


Re: Where does admin UI visually distinguish between master and slave?

2011-01-12 Thread Will Milspec
Hi all,

Thanks for the feedback. I've checked the code with a few different inputs
and believe I have found a bug.

Could someone comment as to whether I'm missing something? I will go
ahead and file it if someone can confirm it looks like a bug.

Bug Summary:
==
- Admin UI replication/index.jsp checks for master or slave with the
following code:
   if ("true".equals(detailsMap.get("isSlave")))
-  if slave, replication/index.jsp displays the Master and Poll
Intervals, etc. sections (everything up to Cores)
- if false, replication/index.jsp does not display the Master, Poll
Intervals section
-This slave check/UI difference works correctly if the solrconfig.xml has
a  slave but not master section or vice versa

Expected results:
==
Same UI difference would occur in the following scenario:
   a) solrconfig.xml has both master and slave entries
   b) use java.properties (-Dsolr.enable.master -Dsolr.enable.slave) to set
master or slave at runtime

*OR*
c) use solrcore.properties  to set master and slave at runtime

Actual results:
==
If solrconfig.xml has both master and slave entries, replication/index.jsp
shows both master and slave section regardless of system.properties

On Wed, Jan 12, 2011 at 10:35 AM, Markus Jelsma
markus.jel...@openindex.io wrote:

 Well, slaves do show different things on the replication.jsp page.

 Master  http://10cc:8080/solr/replication
 Poll Interval   00:00:10
 Local Index Index Version: 1294666552434, Generation: 2515
Location: /var/lib/solr/data/index
Size: 4.65 GB
Times Replicated Since Startup: 934

 Where master nodes (or slaves where enabled=false) show:

 Local Index Index Version: 1294666552449, Generation: 2530
Location: /var/lib/solr/data/index
Size: 4.65 GB

 On Wednesday 12 January 2011 17:24:57 Otis Gospodnetic wrote:
  Hi Will,
 
  I don't think we have a clean master or slave label anywhere in the
  Admin UI.
 
  Otis
  
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
  Lucene ecosystem search :: http://search-lucene.com/
 
 
 
  - Original Message 
 
   From: Will Milspec will.mils...@gmail.com
   To: solr-user@lucene.apache.org
   Sent: Wed, January 12, 2011 11:18:17 AM
   Subject: Where does admin UI visually distinguish between master and
 
  slave?
 
   Hi all,
  
   I'm getting started with a master/slave configuration for two  solr
   instances.  To distinguish between 'master' and 'slave', I've set the
   system properties (e.g. -Dmaster.enabled) and am using the same
   'solrconfig.xml'.
  
   I can see via the system properties admin UI that the  jvm (and thus
   solr) sees correct values, i.e.:
   enable.master =  false
   enable.slave = true
  
   However, the replication admin UI is  identical for both 'master' and
   'slave'. (i.e.
   http://localhost:8983/solr/production/admin/replication/index.jsp)
  
   I'd  like a clearer visual confirmation that the master node is indeed
 a
   master  and the slave is a slave.
  
   Summary question:
   Does the admin UI distinguish between master and slave?
  
   thanks
  
   will

 --
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536620 / 06-50258350



Re: Resolve a DataImportHandler datasource based on previous entity

2011-01-12 Thread Gora Mohanty
On Wed, Jan 12, 2011 at 8:49 PM, alexei achugu...@gmail.com wrote:
[...]
 Unfortunately reorganizing the data is not an option for me.
 Multiple databases exist and a third party is taking care of
 populating them. Once a database reaches a certain size, a switch
 occurs and a new database is created with the same table structure.

OK, I understand.

 Gora Mohanty-3 wrote:

 I meant a script that runs the query that defines the datasources for all
 fields, writes a Solr DIH configuration file, and then initiates a
 dataimport.

 Ok, so the query would select only the articles for which the data is
 sitting in a specific datasource. Then, only that one datasource would be
 indexed.
 For each additional datasource would the script initiate another full-import
 with the clean attribute set to false?

I do not think that I am completely understanding your use case.
Would it be possible for you to describe it in detail? Here is my
current view of it:
* From some SELECT statement, it is possible for you to tell
  which datasource each field should come from in the next import.
* If so, before the start of a data import, a script can run that same
  SELECT statement, and figure out what belongs where.
* In that case, the script can do the following:
  - Write a DIH configuration file from its knowledge of where the
fields in the next import are coming from.
  - Do a reload-config to get the new DIH configuration.
  - Initiate a data import
* It is not clear to me how a delta import, and similar things fit
  into this scenario. I.e., are you also going to be dealing with
  updates of documents that already exist in the Solr index?
  However, we can cross that bridge when we come to it.

 I tried to make some changes to DIH that comes with Solr 1.4.1
 The getResolvedEntityAttribute(dataSource) method seems to do the trick.
 Here is the modified code. It feels awkward but it seems to work.
[...]
 I hope I am not breaking any other functionality...
 Would it be possible to add something like this to a future release?

I am sorry. As things stand, while I do want to be able to get the
time to become a contributor to Solr code, it is beyond my current
understanding of it to be able to comment on the above. I think that
you have the right idea, but am unable to say for sure. Maybe someone
more well-versed in Solr can chip in. I would definitely recommend
that you open a JIRA ticket, and attach this patch. That way, at least
it remains on record. Please include a description of your use case
in the ticket.

Regards,
Gora


Specifying returned fields

2011-01-12 Thread Dmitriy Shvadskiy
Hello,

I know you can explicitly specify list of fields returned via
fl=field1,field2,field3

Is there a way to specify return all fields but field1 and field2?

Thanks,
Dmitriy


Re: Specifying returned fields

2011-01-12 Thread Gora Mohanty
On Thu, Jan 13, 2011 at 1:11 AM, Dmitriy Shvadskiy dshvads...@gmail.com wrote:
 Hello,

 I know you can explicitly specify list of fields returned via
 fl=field1,field2,field3

 Is there a way to specify return all fields but field1 and field2?

Not that I know of, but below is an earlier discussion thread
on this subject. Please take a look at the links referenced
there. IMHO, this would be a desirable feature.

http://osdir.com/ml/solr-user.lucene.apache.org/2010-12/msg00171.html

Regards,
Gora


Re: Specifying returned fields

2011-01-12 Thread Dmitriy Shvadskiy

Thanks Gora
The workaround of loading fields via LukeRequestHandler and building fl from
it will work for what we need. However it takes 15 seconds per core and we
have 15 cores. 
The query I'm running is /admin/luke?show=schema
Is there a way to limit query to return just fields?

Thanks,
Dmitriy
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Specifying-returned-fields-tp2243423p2243923.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Specifying returned fields

2011-01-12 Thread Erik Hatcher

On Jan 12, 2011, at 12:53 , Dmitriy Shvadskiy wrote:

 
 Thanks Gora
 The workaround of loading fields via LukeRequestHandler and building fl from
 it will work for what we need. However it takes 15 seconds per core and we
 have 15 cores. 
 The query I'm running is /admin/luke?show=schema
 Is there a way to limit query to return just fields?

Yes, add numTerms=0 and it'll speed up the luke request handler dramatically.

Erik
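
For reference, the Luke-based workaround could look roughly like the sketch below. The endpoint and JSON shape are assumptions based on the stock /admin/luke handler; only the list-filtering helper is shown as runnable code.

```python
# Hypothetical sketch: build an fl value meaning "all fields except these"
# from a field list obtained via the Luke request handler
# (/admin/luke?show=schema&numTerms=0&wt=json -- numTerms=0, per Erik's
# tip, keeps the call fast).
def build_fl(all_fields, exclude):
    """Return a comma-separated fl parameter with the excluded fields removed."""
    excluded = set(exclude)
    return ",".join(f for f in all_fields if f not in excluded)


# In practice, all_fields would be parsed out of the Luke JSON response.
fields = ["id", "title", "body", "internal_notes"]
print(build_fl(fields, ["internal_notes"]))  # id,title,body
```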



verifying that an index contains ONLY utf-8

2011-01-12 Thread Paul
We've created an index from a number of different documents that are
supplied by third parties. We want the index to only contain UTF-8
encoded characters. I have a couple questions about this:

1) Is there any way to be sure during indexing (by setting something
in the solr configuration?) that the documents that we index will
always be stored in utf-8? Can solr convert documents that need
converting on the fly, or can solr reject documents containing illegal
characters?

2) Is there a way to scan the existing index to find any string
containing non-utf8 characters? Or is there another way that I can
discover if any crept into my index?
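
One way to address the first question outside Solr is to validate the raw bytes before posting documents: a strict UTF-8 decode fails on any invalid byte sequence. A minimal client-side sketch (this is not a Solr feature):

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if `raw` is a valid UTF-8 byte sequence.
    Strict decoding raises UnicodeDecodeError on any invalid byte."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


assert is_valid_utf8("sjálfsögðu".encode("utf-8"))
assert not is_valid_utf8(b"caf\xe9")  # Latin-1 bytes, not valid UTF-8
```

A document that fails the check could be rejected, or transcoded from its declared (or detected) source encoding before indexing.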


StopFilterFactory and qf containing some fields that use it and some that do not

2011-01-12 Thread Dyer, James
I'm running into a problem with StopFilterFactory in conjunction with (e)dismax 
queries that have a mix of fields, only some of which use StopFilterFactory.  
It seems that if even 1 field on the qf parameter does not use 
StopFilterFactory, then stop words are not removed when searching any fields.  
Here's an example of what I mean:

- I have 2 fields indexed:
   Title is textStemmed, which includes StopFilterFactory (see below).
   Contributor is textSimple, which does not include StopFilterFactory (see 
below).
- "The" is a stop word in stopwords.txt
- q=life&defType=edismax&qf=Title  ... returns 277,635 results
- q=the life&defType=edismax&qf=Title ... returns 277,635 results
- q=life&defType=edismax&qf=Title Contributor  ... returns 277,635 results
- q=the life&defType=edismax&qf=Title Contributor ... returns 0 results

It seems as if the stop words are not being stripped from the query because 
qf contains a field that doesn't use StopFilterFactory.  I tested combining 
stemmed fields with non-stemmed fields in qf, and stemming seems to be applied 
regardless.  But stop word removal is not.

Does anyone have ideas on what is going on?  Is this a feature or possibly a 
bug?  Any known workarounds?  Any advice is appreciated.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

<fieldType name="textSimple" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="textStemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="0" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
            stemEnglishPossessive="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="0" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
            stemEnglishPossessive="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>


Re: StopFilterFactory and qf containing some fields that use it and some that do not

2011-01-12 Thread Markus Jelsma
I haven't used edismax but I can imagine it's a feature. This is because 
inconsistent use of stopwords in the analyzers of the fields specified in qf can 
yield really unexpected results because of the mm parameter.

In dismax, if one analyzer removed stopwords and the other doesn't the mm 
parameter goes crazy.

 I'm running into a problem with StopFilterFactory in conjunction with
 (e)dismax queries that have a mix of fields, only some of which use
 StopFilterFactory.  It seems that if even 1 field on the qf parameter
 does not use StopFilterFactory, then stop words are not removed when
 searching any fields.  Here's an example of what I mean:
 
 - I have 2 fields indexed:
Title is textStemmed, which includes StopFilterFactory (see below).
Contributor is textSimple, which does not include StopFilterFactory
(see below).
 
 - "The" is a stop word in stopwords.txt
 - q=life&defType=edismax&qf=Title  ... returns 277,635 results
 - q=the life&defType=edismax&qf=Title ... returns 277,635 results
 - q=life&defType=edismax&qf=Title Contributor  ... returns 277,635 results
 - q=the life&defType=edismax&qf=Title Contributor ... returns 0 results
 
 It seems as if the stop words are not being stripped from the query because
 qf contains a field that doesn't use StopFilterFactory.  I did testing
 with combining Stemmed fields with not Stemmed fields in qf and it seems
 as if stemming gets applied regardless.  But stop words do not.
 
 Does anyone have ideas on what is going on?  Is this a feature or possibly
 a bug?  Any known workarounds?  Any advice is appreciated.
 
 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311
 
 <fieldType name="textSimple" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>
 
 <fieldType name="textStemmed" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
             enablePositionIncrements="true"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
             generateNumberParts="0" catenateWords="0" catenateNumbers="0"
             catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
             stemEnglishPossessive="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
             enablePositionIncrements="true"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
             generateNumberParts="0" catenateWords="0" catenateNumbers="0"
             catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
             stemEnglishPossessive="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
 </fieldType>


Re: verifying that an index contains ONLY utf-8

2011-01-12 Thread Markus Jelsma
This is supposed to be dealt with outside the index. All input must be UTF-8 
encoded. Failing to do so will give unexpected results.

 We've created an index from a number of different documents that are
 supplied by third parties. We want the index to only contain UTF-8
 encoded characters. I have a couple questions about this:
 
 1) Is there any way to be sure during indexing (by setting something
 in the solr configuration?) that the documents that we index will
 always be stored in utf-8? Can solr convert documents that need
 converting on the fly, or can solr reject documents containing illegal
 characters?
 
 2) Is there a way to scan the existing index to find any string
 containing non-utf8 characters? Or is there another way that I can
 discover if any crept into my index?


Re: StopFilterFactory and qf containing some fields that use it and some that do not

2011-01-12 Thread Jayendra Patil
I have used edismax and stopword filters as well, but usually use the fq
parameter, e.g. fq=title:"the life", and never had any issues.

Can you turn on debugQuery and check what query is formed for each of the
combinations you mentioned?

Regards,
Jayendra

On Wed, Jan 12, 2011 at 5:19 PM, Dyer, James james.d...@ingrambook.comwrote:

 I'm running into a problem with StopFilterFactory in conjunction with
 (e)dismax queries that have a mix of fields, only some of which use
 StopFilterFactory.  It seems that if even 1 field on the qf parameter does
 not use StopFilterFactory, then stop words are not removed when searching
 any fields.  Here's an example of what I mean:

 - I have 2 fields indexed:
   Title is textStemmed, which includes StopFilterFactory (see below).
   Contributor is textSimple, which does not include StopFilterFactory
 (see below).
 - "The" is a stop word in stopwords.txt
 - q=life&defType=edismax&qf=Title  ... returns 277,635 results
 - q=the life&defType=edismax&qf=Title ... returns 277,635 results
 - q=life&defType=edismax&qf=Title Contributor  ... returns 277,635 results
 - q=the life&defType=edismax&qf=Title Contributor ... returns 0 results

 It seems as if the stop words are not being stripped from the query because
 qf contains a field that doesn't use StopFilterFactory.  I did testing
 with combining Stemmed fields with not Stemmed fields in qf and it seems
 as if stemming gets applied regardless.  But stop words do not.

 Does anyone have ideas on what is going on?  Is this a feature or possibly
 a bug?  Any known workarounds?  Any advice is appreciated.

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311
 
 <fieldType name="textSimple" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>
 
 <fieldType name="textStemmed" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
             enablePositionIncrements="true"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
             generateNumberParts="0" catenateWords="0" catenateNumbers="0"
             catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
             stemEnglishPossessive="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
             enablePositionIncrements="true"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
             generateNumberParts="0" catenateWords="0" catenateNumbers="0"
             catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
             stemEnglishPossessive="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
 </fieldType>



Re: StopFilterFactory and qf containing some fields that use it and some that do not

2011-01-12 Thread Markus Jelsma

 Have used edismax and Stopword filters as well. But usually use the fq
 parameter e.g. fq=title:"the life" and never had any issues.

That is because filter queries are not relevant for the mm parameter which is 
being used for the main query.

 
 Can you turn on the debugQuery and check whats the Query formed for all the
 combinations you mentioned.
 
 Regards,
 Jayendra
 
 On Wed, Jan 12, 2011 at 5:19 PM, Dyer, James 
james.d...@ingrambook.comwrote:
  I'm running into a problem with StopFilterFactory in conjunction with
  (e)dismax queries that have a mix of fields, only some of which use
  StopFilterFactory.  It seems that if even 1 field on the qf parameter
  does not use StopFilterFactory, then stop words are not removed when
  searching any fields.  Here's an example of what I mean:
  
  - I have 2 fields indexed:
Title is textStemmed, which includes StopFilterFactory (see below).
Contributor is textSimple, which does not include StopFilterFactory
  
  (see below).
  - "The" is a stop word in stopwords.txt
  - q=life&defType=edismax&qf=Title  ... returns 277,635 results
  - q=the life&defType=edismax&qf=Title ... returns 277,635 results
  - q=life&defType=edismax&qf=Title Contributor  ... returns 277,635 results
  - q=the life&defType=edismax&qf=Title Contributor ... returns 0 results
  
  It seems as if the stop words are not being stripped from the query
  because qf contains a field that doesn't use StopFilterFactory.  I did
  testing with combining Stemmed fields with not Stemmed fields in qf
  and it seems as if stemming gets applied regardless.  But stop words do
  not.
  
  Does anyone have ideas on what is going on?  Is this a feature or
  possibly a bug?  Any known workarounds?  Any advice is appreciated.
  
  James Dyer
  E-Commerce Systems
  Ingram Content Group
  (615) 213-4311
  
  <fieldType name="textSimple" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  
  <fieldType name="textStemmed" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
              enablePositionIncrements="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
              generateNumberParts="0" catenateWords="0" catenateNumbers="0"
              catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
              stemEnglishPossessive="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
              enablePositionIncrements="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
              generateNumberParts="0" catenateWords="0" catenateNumbers="0"
              catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
              stemEnglishPossessive="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>


PHP app not communicating with Solr

2011-01-12 Thread Eric
Web page returns the following message:
Fatal error: Uncaught exception 'Exception' with message '0 Status: 
Communication Error'

This happens in a dev environment, everything on one machine: Windows 7, WAMP, 
CakePHP, Tomcat, Solr, and SolrPHPClient. Error message also references line 
334 of the Service.php file, which is part of the SolrPHPClient.

Everything works perfectly on a different machine so this problem is probably 
related to configuration. On the problem machine, I can reach solr at 
http://localhost:8080/solr/admin and it looks correct (AFAIK). I am documenting 
the setup procedures this time around but don't know what's different between 
the two machines.

Google search on the error message shows the message is not uncommon so the 
answer might be helpful to others as well.

Thanks,
Eric


  


Re: PHP app not communicating with Solr

2011-01-12 Thread Lukas Kahwe Smith

On 12.01.2011, at 23:50, Eric wrote:

 Web page returns the following message:
 Fatal error: Uncaught exception 'Exception' with message '0 Status: 
 Communication Error'
 
 This happens in a dev environment, everything on one machine: Windows 7, 
 WAMP, CakePHP, Tomcat, Solr, and SolrPHPClient. Error message also references 
 line 334 of the Service.php file, which is part of the SolrPHPClient.
 
 Everything works perfectly on a different machine so this problem is probably 
 related to configuration. On the problem machine, I can reach solr at 
 http://localhost:8080/solr/admin and it looks correct (AFAIK). I am 
 documenting the setup procedures this time around but don't know what's 
 different between the two machines.
 
 Google search on the error message shows the message is not uncommon so the 
 answer might be helpful to others as well.


I ran into this issue compiling PHP with --curl-wrappers.

regards,
Lukas Kahwe Smith
m...@pooteeweet.org





Re: solr wildcard queries and analyzers

2011-01-12 Thread Jayendra Patil
Had the same issues with international characters and wildcard searches.

One workaround we implemented was to index the field both with and without the
ASCIIFoldingFilterFactory.
You would have the original field and an ASCII-folded copy to be used
during searching.

Wildcard searches with either the ASCII equivalents or the international terms
would then match one of those fields.
Also, lower-case the search terms if you are using a lowercase filter during
indexing.

Regards,
Jayendra
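
The client-side preprocessing Jayendra suggests (lower-case plus ASCII-fold the term before adding the wildcard) might look like the sketch below. Unicode NFKD decomposition with combining marks stripped only approximates Lucene's ASCIIFoldingFilterFactory, and the explicit map for letters like 'ð' (which has no decomposition) is an assumption you would extend for your data.

```python
import unicodedata

# Letters with no NFKD decomposition need explicit mappings (assumed set).
FOLD_MAP = {"ð": "d", "þ": "th", "æ": "ae", "ø": "o"}


def fold_term(term: str) -> str:
    """Lower-case and ASCII-fold a search term the way the index side
    folds it, so a wildcard query like sjálf* becomes sjalf*."""
    term = term.lower()
    term = "".join(FOLD_MAP.get(ch, ch) for ch in term)
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_term("sjálf") + "*")  # sjalf*
```

Applying this in one place in the client keeps the duplication of Solr's analysis logic small and contained.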

On Wed, Jan 12, 2011 at 7:46 AM, Kári Hreinsson k...@gagnavarslan.iswrote:

 Have you made any progress?  Since the AnalyzingQueryParser doesn't inherit
 from QParserPlugin, Solr doesn't want to use it, but I guess we could
 implement a similar parser that does inherit from QParserPlugin?

 Switching parser seems to be what is needed?  Has really no one solved this
 before?

 - Kári

 - Original Message -
 From: Matti Oinas matti.oi...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Tuesday, 11 January, 2011 12:47:52 PM
 Subject: Re: solr wildcard queries and analyzers

 This might be the solution.


 http://lucene.apache.org/java/3_0_2/api/contrib-misc/org/apache/lucene/queryParser/analyzing/AnalyzingQueryParser.html

 2011/1/11 Matti Oinas matti.oi...@gmail.com:
  Sorry, the message was not meant to be sent here. We are struggling
  with the same problem here.
 
  2011/1/11 Matti Oinas matti.oi...@gmail.com:
  http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Analyzers
 
  On wildcard and fuzzy searches, no text analysis is performed on the
  search word.
 
  2011/1/11 Kári Hreinsson k...@gagnavarslan.is:
  Hi,
 
  I am having a problem with the fact that no text analysis is performed
 on wildcard queries.  I have the following field type (a bit simplified):

 <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.TrimFilterFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
   </analyzer>
 </fieldType>
 
  My problem has to do with Icelandic characters, when I index a document
 with a text field including the word sjálfsögðu it gets indexed as
 sjalfsogdu (because of the ASCIIFoldingFilterFactory which replaces the
 Icelandic characters with their English equivalents).  Then, when I search
 (without a wildcard) for sjálfsögðu or sjalfsogdu I get that document as
 a result.  This is convenient since it enables people to search without
 using accented characters and yet get the results they want (e.g. if they
 are working on computers with English keyboards).
 
  However this all falls apart when using wildcard searches, then the
 search string isn't passed through the filters, and even if I search for
 sjálf* I don't get any results because the index doesn't contain the
 original words (I get result if I search for sjalf*).  I know people have
 been having a similar problem with the case sensitivity of wildcard queries
 and most often the solution seems to be to lowercase the string before
 passing it on to solr, which is not exactly an optimal solution (yet a
 simple one in that case).  The Icelandic characters complicate things a bit
 and applying the same solution (doing the lowercasing and character mapping)
 in my application seems like unnecessary duplication of code already part of
 solr, not to mention complication of my application and possible maintenance
 down the road.
 
  Is there any way around this?  How are people solving this?  Is there a
 way to apply the filters to wildcard queries?  I guess removing the
 ASCIIFoldingFilterFactory is the simplest solution but this
 normalization (of the text done by the filter) is often very useful.
 
  I hope I'm not overlooking some obvious explanation. :/
 
  Thanks in advance,
  Kári Hreinsson
 
 
 



Re: Can't find source or jar for Solr class JaspellTernarySearchTrie

2011-01-12 Thread Jayendra Patil
Check out and build the code from -
https://svn.apache.org/repos/asf/lucene/dev/trunk/

Class -
https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/java/org/apache/solr/spelling/suggest/jaspell/JaspellTernarySearchTrie.java

Regards,
Jayendra

On Wed, Jan 12, 2011 at 8:46 AM, Larry White ljw1...@gmail.com wrote:

 Hi,

 I'm trying to find the source code for class: JaspellTernarySearchTrie.
 It's
 supposed to be used for spelling suggestions.

 It's referenced in the javadoc:

 http://lucene.apache.org/solr/api/org/apache/solr/spelling/suggest/jaspell/JaspellTernarySearchTrie.html

 I realize this is a dumb question, but i've been looking through the
 downloads for several hours.  I can't actually find the
 package org/apache/solr/spelling/suggest/ that it's supposed to be under.

 So if you would be so kind...
 What jar is it compiled into?
 Where is the source in the downloaded source tree?

 thanks.



RE: StopFilterFactory and qf containing some fields that use it and some that do not

2011-01-12 Thread Dyer, James
Here is what debug says each of these queries parses to:

1. q=life&defType=edismax&qf=Title  ... returns 277,635 results
2. q=the life&defType=edismax&qf=Title ... returns 277,635 results
3. q=life&defType=edismax&qf=Title Contributor  ... returns 277,635 results
4. q=the life&defType=edismax&qf=Title Contributor ... returns 0 results

1. +DisjunctionMaxQuery((Title:life))
2. +((DisjunctionMaxQuery((Title:life)))~1)
3. +DisjunctionMaxQuery((CTBR_SEARCH:life | Title:life))
4. +((DisjunctionMaxQuery((Contributor:the)) 
DisjunctionMaxQuery((Contributor:life | Title:life)))~2)

I see what's going on here.  Because "the" is a stop word for Title, it gets 
removed from the first part of the expression.  This means that Contributor is 
required to contain "the".  dismax does the same thing too.  I guess I should 
have run debug before asking the mailing list!

It looks like the only workarounds I have are to either filter out the stopwords 
in the client when this happens, or to enable stop words for all the fields that 
are used in qf alongside stopword-enabled fields.  Unless... someone has a better 
idea??

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311
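
The first workaround James mentions (filtering the stopwords out in the client) could be sketched as below. The stopword set is a stand-in for the contents of stopwords.txt, and keeping the original query when every term is a stop word is one possible policy, not the only one.

```python
# Assumed to mirror the Title field's stopwords.txt.
STOPWORDS = {"the", "a", "an", "of"}


def strip_stopwords(q: str) -> str:
    """Remove Title-field stopwords from q before sending it to Solr,
    so the stopword-free Contributor field is never solely responsible
    for matching them under the mm rules."""
    kept = [t for t in q.split() if t.lower() not in STOPWORDS]
    # If everything was a stopword, keep the original query rather than
    # sending an empty q.
    return " ".join(kept) if kept else q


print(strip_stopwords("the life"))  # life
```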

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Wednesday, January 12, 2011 4:44 PM
To: solr-user@lucene.apache.org
Cc: Jayendra Patil
Subject: Re: StopFilterFactory and qf containing some fields that use it and 
some that do not


 Have used edismax and Stopword filters as well. But usually use the fq
 parameter e.g. fq=title:the life and never had any issues.

That is because filter queries are not relevant for the mm parameter which is 
being used for the main query.

 
 Can you turn on the debugQuery and check whats the Query formed for all the
 combinations you mentioned.
 
 Regards,
 Jayendra
 
 On Wed, Jan 12, 2011 at 5:19 PM, Dyer, James 
james.d...@ingrambook.comwrote:
  I'm running into a problem with StopFilterFactory in conjunction with
  (e)dismax queries that have a mix of fields, only some of which use
  StopFilterFactory.  It seems that if even 1 field on the qf parameter
  does not use StopFilterFactory, then stop words are not removed when
  searching any fields.  Here's an example of what I mean:
  
  - I have 2 fields indexed:
Title is textStemmed, which includes StopFilterFactory (see below).
Contributor is textSimple, which does not include StopFilterFactory
  
  (see below).
  - "The" is a stop word in stopwords.txt
  - q=life&defType=edismax&qf=Title  ... returns 277,635 results
  - q=the life&defType=edismax&qf=Title ... returns 277,635 results
  - q=life&defType=edismax&qf=Title Contributor  ... returns 277,635 results
  - q=the life&defType=edismax&qf=Title Contributor ... returns 0 results
  
  It seems as if the stop words are not being stripped from the query
  because qf contains a field that doesn't use StopFilterFactory.  I did
  testing with combining Stemmed fields with not Stemmed fields in qf
  and it seems as if stemming gets applied regardless.  But stop words do
  not.
  
  Does anyone have ideas on what is going on?  Is this a feature or
  possibly a bug?  Any known workarounds?  Any advice is appreciated.
  
  James Dyer
  E-Commerce Systems
  Ingram Content Group
  (615) 213-4311
  
  <fieldType name="textSimple" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  
  <fieldType name="textStemmed" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
              enablePositionIncrements="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
              generateNumberParts="0" catenateWords="0" catenateNumbers="0"
              catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
              stemEnglishPossessive="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
              enablePositionIncrements="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
              generateNumberParts="0" catenateWords="0" catenateNumbers="0"
              catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
              stemEnglishPossessive="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>


Re: verifying that an index contains ONLY utf-8

2011-01-12 Thread Peter Karich

Converting on the fly is not supported by Solr, but should be relatively
easy in Java.
Scanning is also relatively simple (accept only the valid range). Detection too:
http://www.mozilla.org/projects/intl/chardet.html

 We've created an index from a number of different documents that are
 supplied by third parties. We want the index to only contain UTF-8
 encoded characters. I have a couple questions about this:

 1) Is there any way to be sure during indexing (by setting something
 in the solr configuration?) that the documents that we index will
 always be stored in utf-8? Can solr convert documents that need
 converting on the fly, or can solr reject documents containing illegal
 characters?

 2) Is there a way to scan the existing index to find any string
 containing non-utf8 characters? Or is there another way that I can
 discover if any crept into my index?



-- 
http://jetwick.com open twitter search
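
For scanning an existing index, note that Lucene stores Java strings, so truly invalid byte sequences cannot survive in stored fields; what does survive is mis-decoded input. A rough heuristic sketch that checks retrieved documents for likely mojibake (the hint list is illustrative, not exhaustive):

```python
# Replacement characters, plus classic "UTF-8 read as Latin-1" sequences.
MOJIBAKE_HINTS = ("\ufffd", "Ã©", "Ã¨", "Ã¤", "â€™")


def looks_mojibake(value: str) -> bool:
    """Heuristic: flag strings containing signs of mis-decoded input."""
    return any(hint in value for hint in MOJIBAKE_HINTS)


# docs would normally be paged out of /select?q=*:*&fl=*&wt=json
docs = [{"title": "café"}, {"title": "cafÃ©"}]
suspect = [d for d in docs if any(looks_mojibake(v) for v in d.values())]
print(len(suspect))  # 1
```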



Exciting Solr Use Cases

2011-01-12 Thread Peter Karich
 Hi all!

Would you mind to write about your Solr project if it has an uncommon
approach or if it is somehow exciting?
I would like to extend my list for a new blog post.

Examples I have in mind at the moment are:
loggly (real time + big index),
Solandra (nice Solr + Cassandra combination),
HathiTrust (extreme index size),
...

Kind Regards,
Peter.


Re: PHP app not communicating with Solr

2011-01-12 Thread Dennis Gearon
I was unable to get it to compile. From the author, I got one reply about the 
benefits of the compiled version. After submitting my errors to him, I have not 
yet received a reply.

##Weird thing 'on the way to the forum' today.##

I remember reading an article a couple of days ago which said the compiled 
version is 10-15% faster than the 'pure PHP' Solr library out there (and it has 
a lot more capability, that's for sure!).

Turns out, this slower pure PHP version uses file_get_contents() (FGC) to do 
the actual query of the Solr instance. 


http://stackoverflow.com/questions/23/file-get-contents-vs-curl-what-has-better-performance

The article above shows that FGC is on average 22% slower than using cURL in 
basic usage, so modifying the 'pure PHP' library to use cURL would make up for 
all of the speed that the compiled SolrPHP extension has.

 Dennis Gearon





- Original Message 
From: Lukas Kahwe Smith m...@pooteeweet.org
To: solr-user@lucene.apache.org
Sent: Wed, January 12, 2011 2:52:46 PM
Subject: Re: PHP app not communicating with Solr


On 12.01.2011, at 23:50, Eric wrote:

 Web page returns the following message:
 Fatal error: Uncaught exception 'Exception' with message '0 Status: 
Communication Error'
 
 This happens in a dev environment, everything on one machine: Windows 7, 
 WAMP, 
CakePHP, Tomcat, Solr, and SolrPHPClient. Error message also references line 
334 
of the Service.php file, which is part of the SolrPHPClient.
 
 Everything works perfectly on a different machine so this problem is probably 
related to configuration. On the problem machine, I can reach solr at 
http://localhost:8080/solr/admin and it looks correct (AFAIK). I am 
documenting 
the setup procedures this time around but don't know what's different between 
the two machines.
 
 Google search on the error message shows the message is not uncommon so the 
answer might be helpful to others as well.


I ran into this issue compiling PHP with --curl-wrappers.

regards,
Lukas Kahwe Smith
m...@pooteeweet.org


Re: StopFilterFactory and qf containing some fields that use it and some that do not

2011-01-12 Thread Markus Jelsma
Here's another thread on the subject:
http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html

And slightly off topic: you'd also might want to look at using common grams, 
they are really useful for phrase queries that contain stopwords.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.CommonGramsFilterFactory


 Here is what debug says each of these queries parse to:
 
 1. q=life&defType=edismax&qf=Title  ... returns 277,635 results
 2. q=the life&defType=edismax&qf=Title ... returns 277,635 results
 3. q=life&defType=edismax&qf=Title Contributor  ... returns 277,635 results
 4. q=the life&defType=edismax&qf=Title Contributor ... returns 0 results
 
 1. +DisjunctionMaxQuery((Title:life))
 2. +((DisjunctionMaxQuery((Title:life)))~1)
 3. +DisjunctionMaxQuery((CTBR_SEARCH:life | Title:life))
 4. +((DisjunctionMaxQuery((Contributor:the))
 DisjunctionMaxQuery((Contributor:life | Title:life)))~2)
 
 I see what's going on here.  Because "the" is a stop word for Title, it
 gets removed from the first part of the expression.  This means that
 Contributor is required to contain "the".  dismax does the same thing
 too.  I guess I should have run debug before asking the mailing list!
 
 It looks like the only workarounds I have are to either filter out the
 stopwords in the client when this happens, or enable stop words for all
 the fields that are used in qf alongside stopword-enabled fields. 
 Unless...someone has a better idea??
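A sketch of that second workaround (the type name is illustrative): give every field used in qf an analyzer that applies the same stop filter, so the mm calculation sees identical token counts across fields.

```xml
<fieldType name="textSimpleStop" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- same stopword list as the stemmed type, so "the" drops out of every qf field -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```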
 
 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311
 
 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io]
 Sent: Wednesday, January 12, 2011 4:44 PM
 To: solr-user@lucene.apache.org
 Cc: Jayendra Patil
 Subject: Re: StopFilterFactory and qf containing some fields that use it
 and some that do not
 
  Have used edismax and Stopword filters as well. But usually use the fq
  parameter e.g. fq=title:the life and never had any issues.
 
 That is because filter queries are not relevant for the mm parameter which
 is being used for the main query.
 
  Can you turn on debugQuery and check what's the query formed for all
  the combinations you mentioned.
  
  Regards,
  Jayendra
  
  On Wed, Jan 12, 2011 at 5:19 PM, Dyer, James
 
 james.d...@ingrambook.comwrote:
   I'm running into a problem with StopFilterFactory in conjunction with
   (e)dismax queries that have a mix of fields, only some of which use
   StopFilterFactory.  It seems that if even 1 field on the qf parameter
   does not use StopFilterFactory, then stop words are not removed when
   searching any fields.  Here's an example of what I mean:
   
   - I have 2 fields indexed:
 Title is textStemmed, which includes StopFilterFactory (see
 below). Contributor is textSimple, which does not include
 StopFilterFactory
   
   (see below).
    - "The" is a stop word in stopwords.txt
    - q=life&defType=edismax&qf=Title  ... returns 277,635 results
    - q=the life&defType=edismax&qf=Title ... returns 277,635 results
    - q=life&defType=edismax&qf=Title Contributor  ... returns 277,635 results
    - q=the life&defType=edismax&qf=Title Contributor ... returns 0 results
   
   It seems as if the stop words are not being stripped from the query
   because qf contains a field that doesn't use StopFilterFactory.  I
   did testing with combining Stemmed fields with not Stemmed fields in
   qf and it seems as if stemming gets applied regardless.  But stop
   words do not.
   
   Does anyone have ideas on what is going on?  Is this a feature or
   possibly a bug?  Any known workarounds?  Any advice is appreciated.
   
   James Dyer
   E-Commerce Systems
   Ingram Content Group
   (615) 213-4311
   
    <fieldType name="textSimple" class="solr.TextField"
        positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="textStemmed" class="solr.TextField"
        positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="0" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
            stemEnglishPossessive="1" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory

Re: PHP app not communicating with Solr

2011-01-12 Thread Eric
Resolved! In a rare flash of clarity, I removed the @ preceding the 
file_get_contents call. Doing so made it apparent that my app was passing an 
incorrect Solr service port number to the SolrPHPClient code. Correcting the 
port number fixed the issue.

The lesson is... suppressed errors are hard to find.

--- On Wed, 1/12/11, Dennis Gearon gear...@sbcglobal.net wrote:

 From: Dennis Gearon gear...@sbcglobal.net
 Subject: Re: PHP app not communicating with Solr
 To: solr-user@lucene.apache.org
 Date: Wednesday, January 12, 2011, 3:37 PM
 I was unable to get it to compile. From the author, I got one reply
 about the benefits of the compiled version. After submitting my errors
 to him, I have not yet received a reply.
 
 ##Weird thing 'on the way to the forum' today.##
 
 I remember reading an article a couple of days ago which
 said the compiled 
 version is 10-15% faster than the 'pure PHP' Solr library
 out there, (and it has 
 a lot more capability,that's for sure!)
 
 Turns out, this slower pure PHP version uses
 'file_get_contents()' (FGC) to do 
 the actual query of the Solr instance. 
 
 
 http://stackoverflow.com/questions/23/file-get-contents-vs-curl-what-has-better-performance
 
 The article above shows that FGC is on average 22% slower
 than using cURL in 
 basic usage, so modifying the 'pure PHP' library to use cURL
 would make up for all 
 of the speed advantage that the compiled SolrPHP has.
 
  Dennis Gearon
 
 
 
 
 
 - Original Message 
 From: Lukas Kahwe Smith m...@pooteeweet.org
 To: solr-user@lucene.apache.org
 Sent: Wed, January 12, 2011 2:52:46 PM
 Subject: Re: PHP app not communicating with Solr
 
 
 On 12.01.2011, at 23:50, Eric wrote:
 
  Web page returns the following message:
  Fatal error: Uncaught exception 'Exception' with
 message '0 Status: 
 Communication Error'
  
  This happens in a dev environment, everything on one
 machine: Windows 7, WAMP, 
 CakePHP, Tomcat, Solr, and SolrPHPClient. Error message
 also references line 334 
 of the Service.php file, which is part of the
 SolrPHPClient.
  
  Everything works perfectly on a different machine so
 this problem is probably 
 related to configuration. On the problem machine, I can
 reach solr at 
 http://localhost:8080/solr/admin and it
 looks correct (AFAIK). I am documenting 
 the setup procedures this time around but don't know
 what's different between 
 the two machines.
  
  Google search on the error message shows the message
 is not uncommon so the 
 answer might be helpful to others as well.
 
 
 I ran into this issue compiling PHP with --curl-wrappers.
 
 regards,
 Lukas Kahwe Smith
 m...@pooteeweet.org
 





Solr 4.0 = Spatial Search - How to

2011-01-12 Thread caman

Ok, this could be very easy to do, but I was not able to do it.
I need to enable location search, i.e. if someone searches for location 'New
York' => show results for New York and results within 50 miles of New York.
We do have latitude/longitude stored in the database for each record, but am not
sure how to index these values to enable spatial search.
Any help would be much appreciated.

thanks
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2245592.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 4.0 = Spatial Search - How to

2011-01-12 Thread Adam Estrada
I believe this is what you are looking for. I renamed the field called
"store" to "coords" in the schema.xml file. The tricky part is building out
the query. I am using SolrNet to do this, though, and have not yet cracked the
problem.

http://localhost:8983/solr/select?q=*:*+AND+eventdate:[2006-01-21T00:00:000Z+TO+2007-01-21T00:00:000Z]&fq={!bbox}&sfield=coords&pt=32.15,-93.85&d=500

Adam

On Wed, Jan 12, 2011 at 8:01 PM, caman aboxfortheotherst...@gmail.comwrote:


 Ok, this could be very easy to do, but I was not able to do it.
 I need to enable location search, i.e. if someone searches for location 'New
 York' => show results for New York and results within 50 miles of New York.
 We do have latitude/longitude stored in the database for each record, but am not
 sure how to index these values to enable spatial search.
 Any help would be much appreciated.

 thanks
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2245592.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr 4.0 = Spatial Search - How to

2011-01-12 Thread caman

Adam,

thanks. Yes that helps,
but how does the coords field get populated? All I have is 

<field name="lat" type="tdouble" indexed="true" stored="true" />
<field name="lng" type="tdouble" indexed="true" stored="true" />

<field name="coord" type="location" indexed="true" stored="true" />

fields 'lat' and 'lng' get populated by dataimporthandler, but coord, am not
sure?

Thanks
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2245709.html
Sent from the Solr - User mailing list archive at Nabble.com.


Anyone seen measurable performance improvement using Apache Portable Runtime (APR) with Solr and Tomcat

2011-01-12 Thread Will Milspec
Hi all,

Has anyone seen used Apache Portable Runtime (APR) in conjunction with  Solr
and Tomcat? Has anyone seen (or better, measured) performance improvements
when using APR?

APR is a library that implements some functionality using Native C  (see
http://apr.apache.org/ and
http://en.wikipedia.org/wiki/Apache_Portable_Runtime)

From wikipedia entry:
quote
The range of platform-independent functionality provided by APR includes:
* Memory allocation and memory pool functionality
* Atomic operations
* Dynamic library handling
* File I/O
* Command argument parsing
* Locking
* Hash tables and arrays
* Mmap functionality
* Network sockets and protocols
* Thread, process and mutex functionality
* Shared memory functionality
* Time routines
* User and group ID services
/endquote

I could imagine benefits in file IO  as network IO. But that's pure
conjecture.

Comments?

thanks in advance


Re: Solr 4.0 = Spatial Search - How to

2011-01-12 Thread Adam Estrada
Actually, by looking at the results from the geofilt filter, it would
appear that it's not giving me the results I'm looking for. Or maybe it
is...I need to convert my results to KML to see if it is actually performing
a proper radius query.

http://localhost:8983/solr/select?q=*:*&fq={!geofilt pt=39.0914154052734,-84.517822265625 sfield=coords d=5000}

Please let me know what you find.

Adam

On Wed, Jan 12, 2011 at 8:24 PM, Adam Estrada estrada.adam.gro...@gmail.com
 wrote:

 I believe this is what you are looking for. I renamed the field called
 "store" to "coords" in the schema.xml file. The tricky part is building out
 the query. I am using SolrNet to do this, though, and have not yet cracked the
 problem.


 http://localhost:8983/solr/select?q=*:*+AND+eventdate:[2006-01-21T00:00:000Z+TO+2007-01-21T00:00:000Z]&fq={!bbox}&sfield=coords&pt=32.15,-93.85&d=500

 Adam

 On Wed, Jan 12, 2011 at 8:01 PM, caman aboxfortheotherst...@gmail.comwrote:


 Ok, this could be very easy to do, but I was not able to do it.
 I need to enable location search, i.e. if someone searches for location 'New
 York' => show results for New York and results within 50 miles of New
 York.
 We do have latitude/longitude stored in the database for each record, but am not
 sure how to index these values to enable spatial search.
 Any help would be much appreciated.

 thanks
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2245592.html
 Sent from the Solr - User mailing list archive at Nabble.com.





Re: Solr 4.0 = Spatial Search - How to

2011-01-12 Thread Adam Estrada
In my case, I am getting data from a database and am able to concatenate the
lat/long as a coordinate pair to store in my coords field. To test this, I
randomized the lat/long values and generated about 6000 documents.
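One way to do that concatenation inside DataImportHandler itself is a TemplateTransformer (the entity name, query, and column names below are illustrative, not from this thread):

```xml
<entity name="place" transformer="TemplateTransformer"
        query="SELECT id, lat, lng FROM places">
  <!-- builds the "lat,lng" pair the location field type expects -->
  <field column="coord" template="${place.lat},${place.lng}"/>
</entity>
```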

Adam

On Wed, Jan 12, 2011 at 8:29 PM, caman aboxfortheotherst...@gmail.comwrote:


 Adam,

 thanks. Yes that helps
 but how does coords fields get populated? All I have is

  <field name="lat" type="tdouble" indexed="true" stored="true" />
  <field name="lng" type="tdouble" indexed="true" stored="true" />

  <field name="coord" type="location" indexed="true" stored="true" />

 fields 'lat' and  'lng' get populated by dataimporthandler but coord, am
 not
 sure?

 Thanks
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2245709.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Multi-word exact keyword case-insensitive search suggestions

2011-01-12 Thread Chamnap Chhorn
Hi all,

I've been stuck on exact keyword matching for several days. Hope you guys could
help me. Here is the scenario:

   1. It needs to match a multi-word keyword, case-insensitively
   2. Partial-word or single-word matching with this field is not allowed

I want to know the field type definition for this field and sample solr
query. I need to combine this search with my full text search which uses
dismax query.
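The usual sketch for this kind of requirement (not from this thread, just the common approach) is to treat the whole field value as a single token and lower-case it, so only a full, case-insensitive match can succeed:

```xml
<fieldType name="exactKeyword" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- the entire field value becomes one token, so partial/single-word matches fail -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```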

Thanks
-- 
Chhorn Chamnap
http://chamnapchhorn.blogspot.com/


Re: Exciting Solr Use Cases

2011-01-12 Thread Dennis Gearon
When I have it running with a permission system (through both API and front 
end), I will share it with everyone. It's beginning to happen.

The search is fairly primitive for now. But we hope to learn or hire skills to 
better match it to the business model as we grow/get funding.

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Peter Karich peat...@yahoo.de
To: solr-user@lucene.apache.org
Sent: Wed, January 12, 2011 3:37:12 PM
Subject: Exciting Solr Use Cases

Hi all!

Would you mind to write about your Solr project if it has an uncommon
approach or if it is somehow exciting?
I would like to extend my list for a new blog post.

Examples I have in mind at the moment are:
loggly (real time + big index),
Solandra (nice Solr + Cassandra combination),
HathiTrust (extreme index size),
...

Kind Regards,
Peter.



Re: spell suggest response

2011-01-12 Thread satya swaroop
Hi Juan,
 Yeah, I tried onlyMorePopular and got some results, but they are
not similar words or near words to the word I have given in the query.
Here I state the output:

http://localhost:8080/solr/spellcheckCompRH?q=java&rows=0&spellcheck=true&spellcheck.collate=true&spellcheck.onlyMorePopular=true&spellcheck.count=20

The output I get is:
<arr name="suggestion">
  <str>data</str>
  <str>have</str>
  <str>can</str>
  <str>any</str>
  <str>all</str>
  <str>has</str>
  <str>each</str>
  <str>part</str>
  <str>make</str>
  <str>than</str>
  <str>also</str>
</arr>



But these words are not similar to the given word 'java'; the near words
would be javac, javax, data, java.io... etc. The stated words are present in
the index.


Regards,
satya


Question on deleting all rows for an index

2011-01-12 Thread Wilson, Robert
We are just starting with Solr and have a multi-core implementation and need to 
delete all the rows in the index to clean things up.

When running an update via a url we are using something like the following 
which works fine:
http://localhost:8983/solr/template/update/csv?commit=true&escape=\&stream.file=/opt/TEMPLATE_DATA.csv

Not clear on how to delete all the rows in this index. The documentation gives 
this example:
<delete><query>timestamp:[* TO NOW-12HOUR]</query></delete>

I'm not clear on the context of this command - is this through the Solr admin 
or can you run this via the restful call?

Trying to add this to a restful call does not work, like this attempt:
http://localhost:8983/solr/template/<delete><query>timestamp:[* TO NOW-12HOUR]</query></delete>

Any thoughts appreciated.

Bob



Re: Question on deleting all rows for an index

2011-01-12 Thread Gora Mohanty
On Thu, Jan 13, 2011 at 6:08 AM, Wilson, Robert
rwil...@constantcontact.com wrote:
 We are just starting with Solr and have a multi-core implementation and need 
 to delete all the rows in the index to clean things up.

 When running an update via a url we are using something like the following 
 which works fine:
 http://localhost:8983/solr/template/update/csv?commit=true&escape=\&stream.file=/opt/TEMPLATE_DATA.csv

 Not clear on how to delete all the rows in this index. The documentation 
 gives this example:
 <delete><query>timestamp:[* TO NOW-12HOUR]</query></delete>
[...]

Not sure where you got that from. The proper delete query to delete
*all* records would be:
<delete><query>*:*</query></delete>

Please note that you have to follow the delete with a commit. You
can use curl to call Solr for both the delete and commit. Please see
http://wiki.apache.org/solr/UpdateXmlMessages for details.
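For reference, the two update messages would look like this (the core name in the URL is whatever your core is called):

```xml
<!-- POST each message to http://localhost:8983/solr/<core>/update with Content-Type: text/xml -->
<delete><query>*:*</query></delete>
<commit/>
```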

Regards,
Gora


Re: Question on deleting all rows for an index

2011-01-12 Thread Daniel Ostermeier

Hi Robert,

You can find an example of something similar to this in the examples 
that are part of the solr distribution.  The tutorial ( 
http://lucene.apache.org/solr/tutorial.html) describes how to post data 
to the solr server via the post.jar


user:~/solr/example/exampledocs$ java -jar post.jar solr.xml monitor.xml

If you take a look at the solr.xml file, you will see

<add>
  <doc>
    <field name="id">SOLR1000</field>
    <field name="name">Solr, the Enterprise Search Server</field>
  </doc>
</add>

I think you can post your delete query to the server in the same way.

Hope this helps.

-Daniel



We are just starting with Solr and have a multi-core implementation and need to 
delete all the rows in the index to clean things up.

When running an update via a url we are using something like the following 
which works fine:
http://localhost:8983/solr/template/update/csv?commit=true&escape=\&stream.file=/opt/TEMPLATE_DATA.csv

Not clear on how to delete all the rows in this index. The documentation gives 
this example:
<delete><query>timestamp:[* TO NOW-12HOUR]</query></delete>

I'm not clear on the context of this command - is this through the Solr admin 
or can you run this via the restful call?

Trying to add this to a restful call does not work, like this attempt:
http://localhost:8983/solr/template/<delete><query>timestamp:[* TO NOW-12HOUR]</query></delete>

Any thoughts appreciated.

Bob


   




basic document crud in an index

2011-01-12 Thread Dennis Gearon
OK, getting ready to be more interactive with my index (she likes me).
These are pretty much boolean-answered questions to help my understanding.
I think having these in the mailing list records might help others too.

A/ Is there a query that updates all the fields automatically on a record that 
has a unique id?
   
B/ Does it leave the old document and new document in the index?

C/ Will a query immediately following see both documents?

D/ Merging does not get rid of any old documents if there are any, but optimize 
does?

E/ Is optimize invoked on the whole index, not individual segments?


 

Thanks for a great product, ya'll. 

I have a 64K document index, small by many standards. But I did a search on it 
for a test, and started at row 16,000 of the (broad) results, and it was almost 
not noticeably slower than starting at 0. And it's on the lowest-cost Amazon 
server that will run it. Of course, no one but me is hitting that box yet :-)

Dennis Gearon





Re: solr wildcard queries and analyzers

2011-01-12 Thread Matti Oinas
I'm a little busy right now, but I'm going to try to find a suitable
parser, and if none is found then I think the only solution is to write
a new one.

2011/1/13 Jayendra Patil jayendra.patil@gmail.com:
 Had the same issues with international characters and wildcard searches.

 One workaround we implemented was to index the field with and without the
 ASCIIFoldingFilterFactory.
 You would have an original field and one with the English equivalent to be used
 during searching.

 Wildcard searches with English-equivalent or international terms would match
 either of those.
 Also, lower-case the search terms if you are using a lowercase filter during
 indexing.

 Regards,
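That workaround can be sketched in schema.xml as two fields fed by a copyField (field and type names here are illustrative; "text" is the folding type from later in this thread, and "textNoFolding" would be the same chain without the ASCII folding filter):

```xml
<field name="text_orig"   type="textNoFolding" indexed="true" stored="true"/>
<field name="text_folded" type="text"          indexed="true" stored="false"/>
<!-- index the same content both ways; search (and wildcard) across both fields -->
<copyField source="text_orig" dest="text_folded"/>
```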
 Jayendra

 On Wed, Jan 12, 2011 at 7:46 AM, Kári Hreinsson k...@gagnavarslan.iswrote:

  Have you made any progress?  Since the AnalyzingQueryParser doesn't inherit
  from QParserPlugin, Solr doesn't want to use it, but I guess we could
  implement a similar parser that does inherit from QParserPlugin?

 Switching parser seems to be what is needed?  Has really no one solved this
 before?

 - Kári

 - Original Message -
 From: Matti Oinas matti.oi...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Tuesday, 11 January, 2011 12:47:52 PM
 Subject: Re: solr wildcard queries and analyzers

 This might be the solution.


 http://lucene.apache.org/java/3_0_2/api/contrib-misc/org/apache/lucene/queryParser/analyzing/AnalyzingQueryParser.html

 2011/1/11 Matti Oinas matti.oi...@gmail.com:
  Sorry, the message was not meant to be sent here. We are struggling
  with the same problem here.
 
  2011/1/11 Matti Oinas matti.oi...@gmail.com:
  http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Analyzers
 
  On wildcard and fuzzy searches, no text analysis is performed on the
  search word.
 
  2011/1/11 Kári Hreinsson k...@gagnavarslan.is:
  Hi,
 
  I am having a problem with the fact that no text analysis are performed
 on wildcard queries.  I have the following field type (a bit simplified):
      <fieldType name="text" class="solr.TextField"
          positionIncrementGap="100">
        <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory" />
          <filter class="solr.TrimFilterFactory" />
          <filter class="solr.LowerCaseFilterFactory" />
          <filter class="solr.ASCIIFoldingFilterFactory" />
        </analyzer>
      </fieldType>
 
  My problem has to do with Icelandic characters, when I index a document
 with a text field including the word sjálfsögðu it gets indexed as
 sjalfsogdu (because of the ASCIIFoldingFilterFactory which replaces the
 Icelandic characters with their English equivalents).  Then, when I search
 (without a wildcard) for sjálfsögðu or sjalfsogdu I get that document as
 a result.  This is convenient since it enables people to search without
 using accented characters and yet get the results they want (e.g. if they
 are working on computers with English keyboards).
 
  However this all falls apart when using wildcard searches, then the
 search string isn't passed through the filters, and even if I search for
 sjálf* I don't get any results because the index doesn't contain the
 original words (I get result if I search for sjalf*).  I know people have
 been having a similar problem with the case sensitivity of wildcard queries
 and most often the solution seems to be to lowercase the string before
 passing it on to solr, which is not exactly an optimal solution (yet a
 simple one in that case).  The Icelandic characters complicate things a bit
 and applying the same solution (doing the lowercasing and character mapping)
 in my application seems like unnecessary duplication of code already part of
 solr, not to mention complication of my application and possible maintenance
 down the road.
 
  Is there any way around this?  How are people solving this?  Is there a
 way to apply the filters to wildcard queries?  I guess removing the
 ASCIIFoldingFilterFactory is the simplest solution but this
 normalization (of the text done by the filter) is often very useful.
 
  I hope I'm not overlooking some obvious explanation. :/
 
  Thanks in advance,
  Kári Hreinsson
 
 
 




Re: Question on deleting all rows for an index

2011-01-12 Thread Grijesh.singh

Use this type of URL to delete all data, followed by a commit:
http://localhost:8983/solr/update/?stream.body=<delete><query>*:*</query></delete>&commit=true

-
Grijesh
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Question-on-deleting-all-rows-for-an-index-tp2246726p2246948.html
Sent from the Solr - User mailing list archive at Nabble.com.