Re: Is it possible to exclude results from other languages?

2010-02-09 Thread Shalin Shekhar Mangar
On Wed, Feb 10, 2010 at 10:09 AM, Lance Norskog  wrote:

>
> Thanks for the pointer to ngramj (LGPL license), which then leads to
> another contender, http://tcatng.sourceforge.net/ (BSD license). The
> latter would make a great DIH Transformer that could go into contrib/
> (hint hint).
>
>
SOLR-1768 :)

-- 
Regards,
Shalin Shekhar Mangar.


Re: Question on Tokenizing email address

2010-02-09 Thread abhishes

Thank you! It works very well.

I think the field type you suggested will also index words like DOT, AT, and
com.

In order to prevent these words from getting indexed, I have changed the
field type and added the words dot and com to the stoplist file (at was
already there).

Is this correct?
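For reference, a minimal sketch of a fieldtype along those lines (type, filter
options and file name are illustrative, not the poster's actual config), with
at, dot and com listed in stopwords.txt:

  <fieldType name="email_text" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- split on ".", "@" and "-" so abc, def, alpha-xyz become searchable -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" catenateAll="1" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- drop the noise words listed in the stoplist (at, dot, com, ...) -->
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    </analyzer>
  </fieldType>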

-- 
View this message in context: 
http://old.nabble.com/Question-on-Tokenizing-email-address-tp27518673p27527033.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: "after flush: fdx size mismatch" on query durring writes

2010-02-09 Thread Lance Norskog
We need more information. How big is the index in disk space? How many
documents? How many fields? What's the schema? What OS? What Java
version?

Do you run this on a local hard disk or is it over an NFS mount?

Does this software commit before shutting down?

If you run with asserts on, do you get errors before this happens? Use
-ea:org.apache.lucene... as a JVM argument.
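For the stock Jetty example setup (assuming the usual start.jar layout; adjust
for your own container) that would look something like:

  java -ea:org.apache.lucene... -Xmx1024m -jar start.jar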

On Tue, Feb 9, 2010 at 5:08 PM, Acadaca  wrote:
>
> We are using Solr 1.4 in a multi-core setup with replication.
>
> Whenever we write to the master we get the following exception:
>
> java.lang.RuntimeException: after flush: fdx size mismatch: 1285 docs vs 0
> length in bytes of _gqg.fdx file exists?=false
> at
> org.apache.lucene.index.StoredFieldsWriter.closeDocStore(StoredFieldsWriter.java:97)
> at
> org.apache.lucene.index.DocFieldProcessor.closeDocStore(DocFieldProcessor.java:50)
>
> Has anyone had any success debugging this one?
>
> thx.
> --
> View this message in context: 
> http://old.nabble.com/%22after-flush%3A-fdx-size-mismatch%22-on-query-durring-writes-tp27524755p27524755.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Solr/Drupal Integration - Query Question

2010-02-09 Thread Lance Norskog
The admin/form.jsp is supposed to prepopulate fl= with '*,score' which
means bring back all fields and the calculated relevance score.

This is the Drupal search, decoded. I changed the %2B to + signs for
readability. Have a look at the filter query fq= and the facet date
range.

Also, in Solr 1.4 the 'rord' function has become very slow. So the
Drupal integration needs some updating anyway.

INFO: [] webapp=/solr path=/select
params={spellcheck=true&
spellcheck.q=video&
fl=id,nid,title,comment_count,type,created,changed,score,path,url,uid,name,ss_image_relative&

bf=recip(rord(created),4,19,19)^200.0&

&hl.simple.post=&
hl.simple.pre=&hl=&version=1.2&
hl.fragsize=&
hl.fl=&
hl.snippets=&

facet=true&facet.limit=20&
facet.field=uid&facet.field=type&facet.field=language&
facet.mincount=1&

fq=(nodeaccess_all:0+OR+hash:c13a544eb3ac)&
qf=name^3.0&facet.date=changed&
json.nl=map&wt=json&

f.changed.facet.date.start=2010-02-09T07:01:14Z/HOUR&
f.changed.facet.date.end=2010-02-09T17:44:16Z+1HOUR/HOUR&
f.changed.facet.date.gap=+1HOUR

rows=10&start=0&facet.sort=true&
q=video}
hits=0 status=0 QTime=0
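On the rord note above: a commonly suggested 1.4-era replacement for the
rord(created) boost is a date-based recip over ms(), as sketched on the
FunctionQuery wiki (this is not what the Drupal module currently emits):

  bf=recip(ms(NOW,created),3.16e-11,1,1)^200.0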

On Tue, Feb 9, 2010 at 1:28 PM, jaybytez  wrote:
>
> I know this is not Drupal, but I thought this question may be more about the
> Solr query.
>
> For instance, I pulled down LucidImagination's Solr install, just like the
> Apache Solr install, and ran the example Solr and loaded the documents from
> the exampledocs.
>
> I can go to:
>
> http://localhost:8983/solr/admin/
>
> And search for video and get responses
>
> But on my solr if I go to the full interface and use the defaults, I get no
> results back because of search fields, etc.
>
> http://localhost:8983/solr/admin/form.jsp
>
> So my admin Solr search query looks like this when searching "video":
>
> Feb 9, 2010 1:25:49 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/select
> params={explainOther=&fl=&indent=on&start=0&q=video&hl.fl=&qt=&wt=&fq=&version=2.2&rows=10}
> hits=2 status=0 QTime=0
>
> But if I go into Drupal and search "video", this is the query and no results
> come back:
>
> Feb 9, 2010 1:27:33 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/select
> params={spellcheck=true&f.changed.facet.date.start=2010-02-09T07:01:14Z/HOUR&facet=true&facet.limit=20&spellcheck.q=video&hl.simple.pre=&hl=&version=1.2&fl=id,nid,title,comment_count,type,created,changed,score,path,url,uid,name,ss_image_relative&bf=recip(rord(created),4,19,19)^200.0&f.changed.facet.date.gap=%2B1HOUR&hl.simple.post=&facet.field=uid&facet.field=type&facet.field=language&fq=(nodeaccess_all:0+OR+hash:c13a544eb3ac)&hl.fragsize=&facet.mincount=1&qf=name^3.0&facet.date=changed&hl.fl=&json.nl=map&wt=json&f.changed.facet.date.end=2010-02-09T17:44:16Z%2B1HOUR/HOUR&rows=10&hl.snippets=&start=0&facet.sort=true&q=video}
> hits=0 status=0 QTime=0
>
> Any thoughts on the search query that gets generated by the Drupal/Solr
> module?
>
> Thanks...jay
> --
> View this message in context: 
> http://old.nabble.com/Solr-Drupal-Integration---Query-Question-tp27522362p27522362.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Is it possible to exclude results from other languages?

2010-02-09 Thread Lance Norskog
That's what I was going to look up :)

The nutch thing works reasonably well. It comes with a training
database from various languages. It had some UTF-8 problems in the
files. The trick here is to come up with a balanced volume of text for
all languages so that one language's patterns do not overwhelm the others.

Thanks for the pointer to ngramj (LGPL license), which then leads to
another contender, http://tcatng.sourceforge.net/ (BSD license). The
latter would make a great DIH Transformer that could go into contrib/
(hint hint).
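A rough sketch of how such a transformer might be wired into a DIH config (the
class and field names here are hypothetical illustrations, not part of any
existing contrib or of SOLR-1768):

  <entity name="doc" transformer="com.example.LanguageIdTransformer"
          query="select id, title, body from docs">
    <field column="body" name="body"/>
    <!-- the hypothetical transformer adds a "language" column to each row -->
    <field column="language" name="language"/>
  </entity>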

On Tue, Feb 9, 2010 at 7:21 AM, Jan Høydahl / Cominvent
 wrote:
> Much more efficient to tag documents with language at index time. Look for 
> language identification tools such as 
> http://www.sematext.com/products/language-identifier/index.html or 
> http://ngramj.sourceforge.net/ or 
> http://lucene.apache.org/nutch/apidocs-1.0/org/apache/nutch/analysis/lang/LanguageIdentifier.html
>
> --
> Jan Høydahl  - search architect
> Cominvent AS - www.cominvent.com
>
> On 9. feb. 2010, at 05.19, Lance Norskog wrote:
>
>> There is
>>
>> On Thu, Feb 4, 2010 at 10:07 AM, Raimon Bosch  wrote:
>>>
>>>
>>> Yes, It's true that we could do it in index time if we had a way to know. I
>>> was thinking in some solution in search time, maybe measuring the % of
>>> stopwords of each document. Normally, a document of another language won't
>>> have any stopword of its main language.
>>>
>>> If you know some external software to detect the language of a source text,
>>> it would be useful too.
>>>
>>> Thanks,
>>> Raimon Bosch.
>>>
>>>
>>>
>>> Ahmet Arslan wrote:


> In our indexes, sometimes we have some documents written in
> other languages
> different to the most common index's language. Is there any
> way to give less
> boosting to this documents?

 If you are aware of those documents, at index time you can boost those
 documents with a value less than 1.0:

 <add>
   <doc boost="0.5">  // document written in other languages
     ...
     ...
   </doc>
 </add>

 http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_on_.22doc.22





>>>
>>> --
>>> View this message in context: 
>>> http://old.nabble.com/Is-it-posible-to-exclude-results-from-other-languages--tp27455759p27457165.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Autosuggest and highlighting

2010-02-09 Thread Lance Norskog
To select the whole string, I think you want hl.fragmenter=regex and
to create a regex pattern for your entire strings:

http://www.lucidimagination.com/search/document/CDRG_ch07_7.9?q=highlighter+multi-valued

This will let you select the entire string field. But I don't know how
to avoid the non-matching prefixes. That's a really interesting quirk
of highlighting.
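A sketch of the relevant parameters for the autosuggest core (the pattern and
sizes would need tuning for the actual field contents):

  q=rho&hl=true&hl.fl=name,names&hl.snippets=5&hl.fragmenter=regex&hl.regex.pattern=.*&hl.fragsize=512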

On Tue, Feb 9, 2010 at 6:18 AM, gwk  wrote:
> On 2/9/2010 2:57 PM, Ahmet Arslan wrote:
>>>
>>> I'm trying to improve the search box on our website by
>>> adding an autosuggest field. The dataset is a set of
>>> properties in the world (mostly europe) and the searchbox is
>>> intended to be filled with a country-, region- or city name.
>>> To do this I've created a separate, simple core with one
>>> document per geographic location, for example the document
>>> for the country "France" contains several fields including
>>> the number of properties (so we can show the approximate
>>> amount of results in the autosuggest box) and the name of
>>> the country France in several languages and some other
>>> bookkeeping information. The name of the property is stored
>>> in two fields: "name" which simple contains the canonical
>>> name of the country, region or city and "names" which is a
>>> multivalued field containing the name in several different
>>> languages. Both fields use an EdgeNGramFilter during
>>> analysis so the query "Fr" can match "France".
>>>
>>> This all seems to work, the autosuggest box gives
>>> appropriate suggestions. But when I turn on highlighting the
>>> results are less than desirable, for example the query "rho"
>>> using dismax  (and hl.snippets=5) returns the
>>> following:
>>>
>>> 
>>> 
>>> Région
>>> Rhône-Alpes
>>> Rhône-Alpes
>>> Rhône-Alpes
>>> Rhône-Alpes
>>> Rhône-Alpes
>>> 
>>> 
>>> Région
>>> Rhône-Alpes
>>> 
>>> 
>>> 
>>> 
>>> Département du
>>> Rhône
>>> Département du
>>> Rhône
>>> Rhône
>>> Département du
>>> Rhône
>>> Rhône
>>> 
>>> 
>>> Département du
>>> Rhône
>>> 
>>> 
>>>
>>> As you can see, no matter where the match is, the first 3
>>> characters are highlighted. Obviously not correct for many
>>> of the fields. Is this because of the NGramFilterFactory or
>>> am I doing something wrong?
>>>
>> I used https://issues.apache.org/jira/browse/SOLR-357 for this sometime
>> ago. It was giving correct highlights.
>>
> I just ran a test with the NGramFilter removed (and reindexing) which did
> give correct highlighting results but I had to query using the whole word.
> I'll try the PrefixingFilterFactory next although according to the comments
> it's nothing but a subset of the EdgeNGramFilterFactory so unless I'm
> configuring it wrong it should yield the same results...
>
>> However we are now using
>> http://www.ajaxupdates.com/mootools-autocomplete-ajax-script/ It
>> automatically makes bold matching characters without using solr
>> highlighting.
>>
> Using a pure javascript based solution isn't really an option for us as that
> wouldn't work for the diacritical marks without a lot of transliteration
> brouhaha.
>
> Regards,
>
> gwk
>



-- 
Lance Norskog
goks...@gmail.com


analysing wild carded terms

2010-02-09 Thread Joe Calderon
Hello *, quick question: what would I have to change in the query
parser to allow wildcarded terms to go through text analysis?


Re: Distributed search and haproxy and connection build up

2010-02-09 Thread Lance Norskog
This goes through the Apache Commons HTTP client library:
http://hc.apache.org/httpclient-3.x/

We used 'balance' at another project and did not have any problems.

On Tue, Feb 9, 2010 at 5:54 AM, Ian Connor  wrote:
> I have been using distributed search with haproxy but noticed that I am
> suffering a little from tcp connections building up waiting for the OS level
> closing/time out:
>
> netstat -a
> ...
> tcp6       1      0 10.0.16.170%34654:53789 10.0.16.181%363574:8893
> CLOSE_WAIT
> tcp6       1      0 10.0.16.170%34654:43932 10.0.16.181%363574:8890
> CLOSE_WAIT
> tcp6       1      0 10.0.16.170%34654:43190 10.0.16.181%363574:8895
> CLOSE_WAIT
> tcp6       0      0 10.0.16.170%346547:8984 10.0.16.181%36357:53770
> TIME_WAIT
> tcp6       1      0 10.0.16.170%34654:41782 10.0.16.181%363574:
> CLOSE_WAIT
> tcp6       1      0 10.0.16.170%34654:52169 10.0.16.181%363574:8890
> CLOSE_WAIT
> tcp6       1      0 10.0.16.170%34654:55947 10.0.16.181%363574:8887
> CLOSE_WAIT
> tcp6       0      0 10.0.16.170%346547:8984 10.0.16.181%36357:54040
> TIME_WAIT
> tcp6       1      0 10.0.16.170%34654:40030 10.0.16.160%363574:8984
> CLOSE_WAIT
> ...
>
> Digging a little into the haproxy documentation, it seems that they do not
> support persistent connections.
>
> Does solr normally persist the connections between shards (would this
> problem happen even without haproxy)?
>
> Ian.
>



-- 
Lance Norskog
goks...@gmail.com


Re: Posting pdf file and posting from remote

2010-02-09 Thread Lance Norskog
stream.file= means read a local file from the server that solr runs
on. It has to be a complete path that works from that server. To load
the file over HTTP you have to use @filename to have curl open it.
This path has to work from the program you run curl on, and relative
paths work.
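For example, to push the file over HTTP with curl instead of reading it from
the server's own disk (a sketch against the example server; the literal.id and
commit parameters are just illustrative):

  curl "http://localhost:8983/solr/update/extract?literal.id=8514&commit=true" \
       -F "myfile=@attach-8514.pdf"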

Also, tika does not save the PDF binary, it only pulls words out of
the PDF and stores those.

There's a tika example in solr/trunk/example/exampleDIH in the current
solr trunk. (I don't remember if it's in the solr 1.4 release.) With
this you can save the pdf binary in one field and save the extracted
text in another field. I'm doing this now with html.

On Tue, Feb 9, 2010 at 2:08 AM, alendo  wrote:
>
> Ok I'm going ahead (may be:).
> I tried another curl command to send the file from remote:
>
> http://mysolr:/solr/update/extract?literal.id=8514&stream.file=files/attach-8514.pdf&stream.contentType=application/pdf
>
> and the behaviour has been changed: now I get an error in solr log file:
>
> HTTP Status 500 - files/attach-8514.pdf (No such file or directory)
> java.io.FileNotFoundException: files/attach-8514.pdf (No such file or
> directory) at java.io.FileInputStream.open(Native Method) at
> java.io.FileInputStream.(FileInputStream.java:106) at
> org.apache.solr.common.util.ContentStreamBase$FileStream.getStream(ContentStreamBase.java:108)
> at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:158)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> at
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at
>
> etc etc...
>
> --
> View this message in context: 
> http://old.nabble.com/Posting-pdf-file-and-posting-from-remote-tp27512455p27512952.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Indexing / querying multiple data types

2010-02-09 Thread Lance Norskog
A couple of minor problems:

The qt parameter (query type) selects the request handler that processes the
q (query) parameter. I think you mean 'qf':

http://wiki.apache.org/solr/DisMaxRequestHandler#qf_.28Query_Fields.29

Another problem with atomID, atomId, atomid: Solr field names are
case-sensitive. I don't know how this plays out.

Now, to the main part: the <entity name="name1"> part does not create
a column named name1.
The two queries only populate the same namespace of four fields: id,
atomID, name, description.

If you want data from each entity to have a constant field
distinguishing it, you have to create a new field with a constant
value. You do this with the TemplateTransformer.

http://wiki.apache.org/solr/DataImportHandler#TemplateTransformer

Add this as an entity attribute to both entities:
transformer="TemplateTransformer"
and add this as a column to each entity:
 and then "name2".

You may have to do something else for these to appear in the document.
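Putting that together, a minimal sketch (the column name "entity" and the
queries are only placeholders; a matching stored field is also needed in
schema.xml):

  <entity name="name1" transformer="TemplateTransformer"
          query="select id, atomID, name, description from ...">
    <!-- constant value distinguishing rows that came from this entity -->
    <field column="entity" template="name1"/>
  </entity>
  <entity name="name2" transformer="TemplateTransformer"
          query="select id, atomID, name, description from ...">
    <field column="entity" template="name2"/>
  </entity>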

On Tue, Feb 9, 2010 at 12:41 AM,   wrote:
> Sven
>
> In my data-config.xml I have the following
>        
>                
>                
>        
>
> In my schema.xml I have
>    />
>   
>    required="true" />
>   
>
> And in my solrconfig.xml I have
>          class="org.apache.solr.handler.dataimport.DataImportHandler">
>    
>                data-config.xml
>    
>  
>
>        
>                
>                        dismax
>                        explicit
>                        0.01
>                        name^1.5 description^1.0
>                
>        
>
>        
>                
>                        dismax
>                        explicit
>                        0.01
>                        name^1.5 description^1.0
>                
>        
>
> And the
>  
> Has been untouched
>
> So when I run
> http://localhost:7001/solr/select/?q=food&qt=name1
> I was expecting to get results from the data that had been indexed by <entity name="name1">
>
>
> Regards
> Stefan Maric
>



-- 
Lance Norskog
goks...@gmail.com


Copying dynamic fields into default text field messing up fieldNorm?

2010-02-09 Thread Yu-Shan Fung
Hi All,

I'm trying to create an index of documents, where for each document, I am
trying to associate with it a set of related keywords, each with individual
boost values that I compute externally.

eg:
Document Title: Democrats
  related keywords:
liberal: 4.0
politics: 1.5
obama: 2.0
etc. (hundreds of related keywords)

Since boosts in Solr are per field instead of per field-instance, I am trying
to get around this by creating dynamic fields for each related keyword, and
setting boost values accordingly. To be able to surface this document by
searching the related keywords, I have the schema setup to copy these
related keyword fields into the default text field.
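For reference, the kind of setup being described might look like this (field
names and boost values are illustrative):

  <!-- schema.xml -->
  <dynamicField name="kw_*" type="text" indexed="true" stored="false"/>
  <copyField source="kw_*" dest="text"/>

  <!-- update message -->
  <doc>
    <field name="title">Democrats</field>
    <field name="kw_liberal" boost="4.0">liberal</field>
    <field name="kw_politics" boost="1.5">politics</field>
  </doc>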

But when I query any of these related keywords, I get back fieldNorms with
the max value:

  1.5409492E10 = (MATCH) weight(text:liberal in 11), product of:
0.8608541 = queryWeight(text:liberal), product of:
  1.6840147 = idf(docFreq=109, maxDocs=218)
  0.51119155 = queryNorm
1.79002368E10 = (MATCH) fieldWeight(text:liberal in 11), product of:
  1.4142135 = tf(termFreq(text:liberal)=2)
  1.6840147 = idf(docFreq=109, maxDocs=218)

According to this email exchange between Koji and Mat Brown,

http://www.mail-archive.com/solr-user@lucene.apache.org/msg23759.html

The boost value from copyFields shouldn't be accumulated into the boost for
the text field; can anyone else verify this? This seems to go against what
I'm observing. When I turn off copyField, the fieldNorm goes back to normal
(in the single digit range).

Any idea what could be causing this? I'm running Solr 1.4 in case that
matters.

Any pointers/advice would be greatly appreciated! Thanks,
Yu-Shan


Re: Bigram term vectors and weights possible with Solr?

2010-02-09 Thread Mike Hughes
Thank you Ahmet, this is exactly what I was looking for.  Looks like
the shingle filter can produce 3+-gram terms as well, that's great.
I'm going to try this with both western and CJK language tokenizers
and see how it turns out.

On Tue, Feb 9, 2010 at 5:07 PM, Ahmet Arslan  wrote:
>> I've been looking at the Solr TermVectorComponent
>> (http://wiki.apache.org/solr/TermVectorComponent) and it
>> seems to have
>> something similar to this, but it looks to me like this is
>> a component
>> that is processed at query time (?) and is limited to
>> 1-gram terms.
>
> If you use <filter class="solr.ShingleFilterFactory" outputUnigrams="false"/> it can give you info about 2-gram terms.
>
>> Also, the tf/idf scores are a little different as they come
>> back in integer values as separate components.
>
> In the wiki example output, only tf and df values - which are integers - are
> displayed. You can calculate tf*idf (double) with these parameters:
>
> &qt=tvrh&tv=true&fl=yourFieldName&tv.tf=true&tv.df=true&tv.tf_idf=true
>
>
>
>


Re: Solr usage with Auctions/Classifieds?

2010-02-09 Thread Lance Norskog
The class was added in 2007 and hasn't changed. I don't know if anyone uses it.

Presumably sort-by-function will use it.

On Tue, Feb 9, 2010 at 5:59 AM, Jan Høydahl / Cominvent
 wrote:
> With the new sort by function in 1.5 
> (http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function), will it now be 
> possible to include the ExternalFileField value in the sort formula? If so, 
> we could sort on last bid price or last bid time without updating the 
> document itself.
>
> However, to display the result with the fresh values, we need to go to DB, or 
> is there someone working on the possibility to return ExternalFileField 
> values for result view?
>
> --
> Jan Høydahl  - search architect
> Cominvent AS - www.cominvent.com
>
> On 4. feb. 2010, at 06.25, Lance Norskog wrote:
>
>> Oops, forgot to add the link:
>>
>> http://www.lucidimagination.com/search/document/CDRG_ch04_4.4.4
>>
>> On Wed, Feb 3, 2010 at 9:17 PM, Andy  wrote:
>>> How do I set up and use this external file?
>>>
>>> Can I still use such a field in fq or boost?
>>>
>>> Can you point me to the right documentation? Thanks
>>>
>>> --- On Wed, 2/3/10, Lance Norskog  wrote:
>>>
>>> From: Lance Norskog 
>>> Subject: Re: Solr usage with Auctions/Classifieds?
>>> To: solr-user@lucene.apache.org
>>> Date: Wednesday, February 3, 2010, 10:03 PM
>>>
>>> This field type allows you to have an external file that gives a float
>>> value for a field. You can only use functions on it.
>>>
>>> On Sat, Jan 30, 2010 at 7:05 AM, Jan Høydahl / Cominvent
>>>  wrote:
 A follow-up on the auction use case.

 How do you handle the need for frequent updates of only one field, such as 
 the last bid field (needed for sort on price, facets or range)?
 For high traffic sites, the document update rate becomes very high if you 
 re-send the whole document every time the bid price changes.

 --
 Jan Høydahl  - search architect
 Cominvent AS - www.cominvent.com

 On 10. des. 2009, at 19.52, Grant Ingersoll wrote:

>
> On Dec 8, 2009, at 6:37 PM, regany wrote:
>
>>
>> hello!
>>
>> just wondering if anyone is using Solr as their search for an auction /
>> classified site, and if so how have you managed your setup in general? 
>> ie.
>> searching against listings that may have expired etc.
>
>
> I know several companies using Solr for classifieds/auctions.  Some 
> remove the old listings while others leave them in and filter them or 
> even allow users to see old stuff (but often for reasons other than users 
> finding them, i.e. SEO).  For those that remove, it's typically a batch 
> operation that takes place at night.
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using 
> Solr/Lucene:
> http://www.lucidimagination.com/search
>


>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>
>



-- 
Lance Norskog
goks...@gmail.com


"after flush: fdx size mismatch" on query durring writes

2010-02-09 Thread Acadaca

We are using Solr 1.4 in a multi-core setup with replication.

Whenever we write to the master we get the following exception:

java.lang.RuntimeException: after flush: fdx size mismatch: 1285 docs vs 0
length in bytes of _gqg.fdx file exists?=false
at
org.apache.lucene.index.StoredFieldsWriter.closeDocStore(StoredFieldsWriter.java:97)
at
org.apache.lucene.index.DocFieldProcessor.closeDocStore(DocFieldProcessor.java:50)

Has anyone had any success debugging this one?

thx.
-- 
View this message in context: 
http://old.nabble.com/%22after-flush%3A-fdx-size-mismatch%22-on-query-durring-writes-tp27524755p27524755.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Bigram term vectors and weights possible with Solr?

2010-02-09 Thread Ahmet Arslan
> I've been looking at the Solr TermVectorComponent
> (http://wiki.apache.org/solr/TermVectorComponent) and it
> seems to have
> something similar to this, but it looks to me like this is
> a component
> that is processed at query time (?) and is limited to
> 1-gram terms.

If you use <filter class="solr.ShingleFilterFactory" outputUnigrams="false"/> it can give you info about 2-gram terms.

> Also, the tf/idf scores are a little different as they come
> back in integer values as separate components.

In the wiki example output, only tf and df values - which are integers - are
displayed. You can calculate tf*idf (double) with these parameters:

&qt=tvrh&tv=true&fl=yourFieldName&tv.tf=true&tv.df=true&tv.tf_idf=true


  


Bigram term vectors and weights possible with Solr?

2010-02-09 Thread Mike Hughes
Hello,

One of the commercial search platforms I work with has the concept of
'document vectors', which are 1-gram and 2-gram phrases and their
associated tf/idf weights on a 0-1 scale, i.e. ["banana pie", 0.99]
means banana pie is very relevant for this document.

During the ingest/indexing process you can configure the engine to
store the top N vectors (those with the highest weights) from a
document into a field that is indexed along with the original content
and is returned in a result set.  This is great for reporting and
other statistical analysis, and even some basic result clustering at
query time.

I've been looking at the Solr TermVectorComponent
(http://wiki.apache.org/solr/TermVectorComponent) and it seems to have
something similar to this, but it looks to me like this is a component
that is processed at query time (?) and is limited to 1-gram terms.
Also, the tf/idf scores are a little different as they come back in
integer values as separate components.

Does anyone know if Solr/Lucene has anything like what the commercial
platform has as I described above?

Thanks, appreciate any responses.

Michael Hughes
Lightcrest LLC


How to add SpellCheckResponse to Solritas?

2010-02-09 Thread Jan Høydahl / Cominvent
Hi,

I'm using the /itas requestHandler, and would like to add spell-check 
suggestions to the output.
I have spell-check configured and working in the XML response writer, but
nothing is output in Velocity. Debugging the JSON $response object, I cannot 
find any representation of spellcheck response in there.

Where do I plug that in?

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com



Re: Question on Tokenizing email address

2010-02-09 Thread Jan Høydahl / Cominvent
Hi,

To match 1, 2, 3, 4 below you could use a fieldtype based on TextField, with 
just a simple WordDelimiterFactory. However, this would also match abc-def, 
def.alpha, xyz-com and a...@def, because all punctuation is treated the same. 
To avoid this, you could do some custom handling of "-", "." and "@":



  





  


You will see that this splits "foo.bar@apache.org" into "foo DOT bar AT apache
DOT org" on both index and query side, and thus avoids false matches as above.

To support the "must match" case, you could use the "lowercase" fieldtype, 
which will give a case insensitive match for the whole content of the field 
only.
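A sketch of one way to get the DOT/AT behaviour described above (the char
filter, mapping file and filter options are assumptions on my part, not
necessarily what the original message contained):

  <fieldType name="email_split" class="solr.TextField">
    <analyzer>
      <!-- rewrite "." and "@" into word tokens before tokenizing -->
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-email.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

where mapping-email.txt would contain lines such as "." => " DOT " and
"@" => " AT ".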

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 9. feb. 2010, at 18.13, Abhishek Srivastava wrote:

> Hello Everyone,
> 
> I have a field in my solr schema which stores emails. The way I want the
> emails to be tokenized is like this.
> if the email address is abc@alpha-xyz.com
> User should be able to search on
> 
> 1. abc@alpha-xyz.com  (whole address)
> 2. abc
> 3. def
> 4. alpha-xyz
> 
> Which tokenizer should I use?
> 
> Also, is there a feature like "Must Match" in solr? in my schema there is
> field called "from" which contains the email address of the person who sent
> an email. For this field, I don't want any tokenization. When the user
> issues a search, the user's email ID must exactly match the "from" field
> value for that document/record to be returned.
> How can I do this?
> 
> Regards,
> Abhishek



Solr/Drupal Integration - Query Question

2010-02-09 Thread jaybytez

I know this is not Drupal, but I thought this question may be more about the
Solr query.

For instance, I pulled down LucidImagination's Solr install, just like the
Apache Solr install, and ran the example Solr and loaded the documents from
the exampledocs.

I can go to:

http://localhost:8983/solr/admin/

And search for video and get responses

But on my solr if I go to the full interface and use the defaults, I get no
results back because of search fields, etc.

http://localhost:8983/solr/admin/form.jsp

So my admin Solr search query looks like this when searching "video":

Feb 9, 2010 1:25:49 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select
params={explainOther=&fl=&indent=on&start=0&q=video&hl.fl=&qt=&wt=&fq=&version=2.2&rows=10}
hits=2 status=0 QTime=0

But if I go into Drupal and search "video", this is the query and no results
come back:

Feb 9, 2010 1:27:33 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select
params={spellcheck=true&f.changed.facet.date.start=2010-02-09T07:01:14Z/HOUR&facet=true&facet.limit=20&spellcheck.q=video&hl.simple.pre=&hl=&version=1.2&fl=id,nid,title,comment_count,type,created,changed,score,path,url,uid,name,ss_image_relative&bf=recip(rord(created),4,19,19)^200.0&f.changed.facet.date.gap=%2B1HOUR&hl.simple.post=&facet.field=uid&facet.field=type&facet.field=language&fq=(nodeaccess_all:0+OR+hash:c13a544eb3ac)&hl.fragsize=&facet.mincount=1&qf=name^3.0&facet.date=changed&hl.fl=&json.nl=map&wt=json&f.changed.facet.date.end=2010-02-09T17:44:16Z%2B1HOUR/HOUR&rows=10&hl.snippets=&start=0&facet.sort=true&q=video}
hits=0 status=0 QTime=0

Any thoughts on the search query that gets generated by the Drupal/Solr
module?

Thanks...jay
-- 
View this message in context: 
http://old.nabble.com/Solr-Drupal-Integration---Query-Question-tp27522362p27522362.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: TermInfosReader.get ArrayIndexOutOfBoundsException

2010-02-09 Thread Michael McCandless
On Tue, Feb 9, 2010 at 2:56 PM, Tom Burton-West  wrote:

> I'm not sure I understand.  CheckIndex reported a negative number:
> -16777214.

Right, we are overflowing the positive ints, which wraps around to the
smallest int (-2.1 billion), and then dividing by 128 gives ~ -16777216.

Lucene has an array of the indexed (every 128th) terms, keyed by int,
and it has an API to seek to any of those indexed terms.  The problem
is, in setting the position (a long) in the term enum, it multiplies
128 by the index term, but fails to do this as a long multiply, so it
overflows.
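In Java terms, the difference is roughly this (a schematic illustration, not
the actual Lucene code):

  public class OverflowDemo {
    public static void main(String[] args) {
      int termIndexInterval = 128;
      int indexedTermIndex = 20000000;   // past ~16.7M indexed terms

      // int multiply overflows before the widening to long happens
      long broken = indexedTermIndex * termIndexInterval;

      // doing the multiply in long arithmetic avoids the overflow
      long fixed = (long) indexedTermIndex * termIndexInterval;

      System.out.println(broken + " vs " + fixed);  // -1734967296 vs 2560000000
    }
  }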

I think your index isn't actually corrupt... it's just a limitation in
Lucene that hopefully the patch will fix.

Mike


Re: TermInfosReader.get ArrayIndexOutOfBoundsException

2010-02-09 Thread Michael McCandless
I attached a patch to the issue that may fix it.

Maybe start by running CheckIndex first?

Mike

On Tue, Feb 9, 2010 at 2:56 PM, Tom Burton-West  wrote:
>
> Thanks Michael,
>
> I'm not sure I understand.  CheckIndex reported a negative number:
> -16777214.
>
> But in any case we can certainly try running CheckIndex from a patched
> Lucene. We could also run a patched Lucene on our dev server.
>
> Tom
>
>
>
> Yes, the term count reported by CheckIndex is the total number of unique
> terms.
>
> It indeed looks like you are exceeding the unique term count limit --
> 16777214 * 128 (= the default term index interval) is 2147483392 which
> is mighty close to max/min 32 bit int value.  This makes sense,
> because CheckIndex steps through the terms in order, one by one.  So
> the first term just over the limit triggered the exception.
>
> Hmm -- can you try a patched Lucene in your area?  I have one small
> change to try that may increase the limit to termIndexInterval
> (default 128) * 2.1 billion.
>
> Mike
>
>
> --
> View this message in context: 
> http://old.nabble.com/TermInfosReader.get-ArrayIndexOutOfBoundsException-tp27506243p2752.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: TermInfosReader.get ArrayIndexOutOfBoundsException

2010-02-09 Thread Tom Burton-West

Thanks Michael,

I'm not sure I understand.  CheckIndex reported a negative number:
-16777214. 

But in any case we can certainly try running CheckIndex from a patched
Lucene. We could also run a patched Lucene on our dev server.

Tom



Yes, the term count reported by CheckIndex is the total number of unique
terms.

It indeed looks like you are exceeding the unique term count limit --
16777214 * 128 (= the default term index interval) is 2147483392 which
is mighty close to max/min 32 bit int value.  This makes sense,
because CheckIndex steps through the terms in order, one by one.  So
the first term just over the limit triggered the exception.

Hmm -- can you try a patched Lucene in your area?  I have one small
change to try that may increase the limit to termIndexInterval
(default 128) * 2.1 billion.

Mike


-- 
View this message in context: 
http://old.nabble.com/TermInfosReader.get-ArrayIndexOutOfBoundsException-tp27506243p2752.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: TermInfosReader.get ArrayIndexOutOfBoundsException

2010-02-09 Thread Michael McCandless
I opened a Lucene issue w/ patch to try:

   https://issues.apache.org/jira/browse/LUCENE-2257

Tom let me know if you're able to test this... thanks!

Mike

On Tue, Feb 9, 2010 at 2:09 PM, Michael McCandless
 wrote:
> Yes, the term count reported by CheckIndex is the total number of unique 
> terms.
>
> It indeed looks like you are exceeding the unique term count limit --
> 16777214 * 128 (= the default term index interval) is 2147483392 which
> is mighty close to max/min 32 bit int value.  This makes sense,
> because CheckIndex steps through the terms in order, one by one.  So
> the first term just over the limit triggered the exception.
>
> Hmm -- can you try a patched Lucene in your area?  I have one small
> change to try that may increase the limit to termIndexInterval
> (default 128) * 2.1 billion.
>
> Mike
>
> On Tue, Feb 9, 2010 at 12:23 PM, Tom Burton-West  
> wrote:
>>
>> Thanks Lance and Michael,
>>
>>
>> We are running Solr 1.3.0.2009.09.03.11.14.39  (Complete version info from
>> Solr admin panel appended below)
>>
>> I tried running CheckIndex (with the -ea:  switch ) on one of the shards.
>> CheckIndex also produced an ArrayIndexOutOfBoundsException on the larger
>> segment containing 500K+ documents. (Complete CheckIndex output appended
>> below)
>>
>> Is it likely that all 10 shards are corrupted?  Is it possible that we have
>> simply exceeded some lucene limit?
>>
>> I'm wondering if we could have exceeded the lucene limit of unique terms of
>> 2.1 billion as mentioned towards the end of the Lucene Index File Formats
>> document.  If the small 731 document index has nine million unique terms as
>> reported by check index, then even though many terms are repeated, it is
>> conceivable that the 500,000 document index could have more than 2.1 billion
>> terms.
>>
>> Do you know if  the number of terms reported by CheckIndex is the number of
>> unique terms?
>>
>> On the other hand, we previously optimized a 1 million document index down
>> to 1 segment and had no problems.  That was with an earlier version of Solr
>> and did not include CommonGrams which could conceivably increase the number
>> of terms in the index by 2 or 3 times.
>>
>>
>> Tom
>> ---
>>
>>        Solr Specification Version: 1.3.0.2009.09.03.11.14.39
>>        Solr Implementation Version: 1.4-dev 793569 - root - 2009-09-03 
>> 11:14:39
>>        Lucene Specification Version: 2.9-dev
>>        Lucene Implementation Version: 2.9-dev 779312 - 2009-05-27 17:19:55
>>
>>
>> [tburt...@slurm-4 ~]$  java -Xmx4096m  -Xms4096m -cp
>> /l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib/lucene-core-2.9-dev.jar:/l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib
>> -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex
>> /l/solrs/1/.snapshot/serve-2010-02-07/data/index
>>
>> Opening index @ /l/solrs/1/.snapshot/serve-2010-02-07/data/index
>>
>> Segments file=segments_zo numSegments=2 version=FORMAT_DIAGNOSTICS [Lucene
>> 2.9]
>>  1 of 2: name=_29dn docCount=554799
>>    compound=false
>>    hasProx=true
>>    numFiles=9
>>    size (MB)=267,131.261
>>    diagnostics = {optimize=true, mergeFactor=2,
>> os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true,
>> lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge,
>> os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>    has deletions [delFileName=_29dn_7.del]
>>    test: open reader.OK [184 deleted docs]
>>    test: fields, norms...OK [6 fields]
>>    test: terms, freq, prox...FAILED
>>    WARNING: fixIndex() would remove reference to this segment; full
>> exception:
>> java.lang.ArrayIndexOutOfBoundsException: -16777214
>>        at
>> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:246)
>>        at
>> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:218)
>>        at
>> org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:57)
>>        at
>> org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:474)
>>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:715)
>>
>>  2 of 2: name=_29im docCount=731
>>    compound=false
>>    hasProx=true
>>    numFiles=8
>>    size (MB)=421.261
>>    diagnostics = {optimize=true, mergeFactor=3,
>> os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true,
>> lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge,
>> os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>    no deletions
>>    test: open reader.OK
>>    test: fields, norms...OK [6 fields]
>>    test: terms, freq, prox...OK [9504552 terms; 34864047 terms/docs pairs;
>> 144869629 tokens]
>>    test: stored fields...OK [3550 total field count; avg 4.856 fields
>> per doc]
>>    test: term vectorsOK [0 total vector count; avg 0 term/freq
>> vector fields per doc]
>>
>> WARNING: 1 broken 

Re: TermInfosReader.get ArrayIndexOutOfBoundsException

2010-02-09 Thread Michael McCandless
Yes, the term count reported by CheckIndex is the total number of unique terms.

It indeed looks like you are exceeding the unique term count limit --
16777214 * 128 (= the default term index interval) is 2147483392 which
is mighty close to max/min 32 bit int value.  This makes sense,
because CheckIndex steps through the terms in order, one by one.  So
the first term just over the limit triggered the exception.

Hmm -- can you try a patched Lucene in your area?  I have one small
change to try that may increase the limit to termIndexInterval
(default 128) * 2.1 billion.

Mike

On Tue, Feb 9, 2010 at 12:23 PM, Tom Burton-West  wrote:
>
> Thanks Lance and Michael,
>
>
> We are running Solr 1.3.0.2009.09.03.11.14.39  (Complete version info from
> Solr admin panel appended below)
>
> I tried running CheckIndex (with the -ea:  switch ) on one of the shards.
> CheckIndex also produced an ArrayIndexOutOfBoundsException on the larger
> segment containing 500K+ documents. (Complete CheckIndex output appended
> below)
>
> Is it likely that all 10 shards are corrupted?  Is it possible that we have
> simply exceeded some lucene limit?
>
> I'm wondering if we could have exceeded the lucene limit of unique terms of
> 2.1 billion as mentioned towards the end of the Lucene Index File Formats
> document.  If the small 731 document index has nine million unique terms as
> reported by check index, then even though many terms are repeated, it is
> conceivable that the 500,000 document index could have more than 2.1 billion
> terms.
>
> Do you know if  the number of terms reported by CheckIndex is the number of
> unique terms?
>
> On the other hand, we previously optimized a 1 million document index down
> to 1 segment and had no problems.  That was with an earlier version of Solr
> and did not include CommonGrams which could conceivably increase the number
> of terms in the index by 2 or 3 times.
>
>
> Tom
> ---
>
>        Solr Specification Version: 1.3.0.2009.09.03.11.14.39
>        Solr Implementation Version: 1.4-dev 793569 - root - 2009-09-03 
> 11:14:39
>        Lucene Specification Version: 2.9-dev
>        Lucene Implementation Version: 2.9-dev 779312 - 2009-05-27 17:19:55
>
>
> [tburt...@slurm-4 ~]$  java -Xmx4096m  -Xms4096m -cp
> /l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib/lucene-core-2.9-dev.jar:/l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib
> -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex
> /l/solrs/1/.snapshot/serve-2010-02-07/data/index
>
> Opening index @ /l/solrs/1/.snapshot/serve-2010-02-07/data/index
>
> Segments file=segments_zo numSegments=2 version=FORMAT_DIAGNOSTICS [Lucene
> 2.9]
>  1 of 2: name=_29dn docCount=554799
>    compound=false
>    hasProx=true
>    numFiles=9
>    size (MB)=267,131.261
>    diagnostics = {optimize=true, mergeFactor=2,
> os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true,
> lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge,
> os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>    has deletions [delFileName=_29dn_7.del]
>    test: open reader.OK [184 deleted docs]
>    test: fields, norms...OK [6 fields]
>    test: terms, freq, prox...FAILED
>    WARNING: fixIndex() would remove reference to this segment; full
> exception:
> java.lang.ArrayIndexOutOfBoundsException: -16777214
>        at
> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:246)
>        at
> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:218)
>        at
> org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:57)
>        at
> org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:474)
>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:715)
>
>  2 of 2: name=_29im docCount=731
>    compound=false
>    hasProx=true
>    numFiles=8
>    size (MB)=421.261
>    diagnostics = {optimize=true, mergeFactor=3,
> os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true,
> lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge,
> os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>    no deletions
>    test: open reader.OK
>    test: fields, norms...OK [6 fields]
>    test: terms, freq, prox...OK [9504552 terms; 34864047 terms/docs pairs;
> 144869629 tokens]
>    test: stored fields...OK [3550 total field count; avg 4.856 fields
> per doc]
>    test: term vectorsOK [0 total vector count; avg 0 term/freq
> vector fields per doc]
>
> WARNING: 1 broken segments (containing 554615 documents) detected
> WARNING: would write new segments file, and 554615 documents would be lost,
> if -fix were specified
>
>
> [tburt...@slurm-4 ~]$
>
>
> The index is corrupted. In some places ArrayIndex and NPE are not
> wrapped as CorruptIndexException.
>
> Try running your code with the Lucene assertions on. A

RE: HTTP caching and distributed search

2010-02-09 Thread Charlie Jackson
I tried your suggestion, Hoss, but committing to the new coordinator
core doesn't change the indexVersion and therefore the ETag value isn't
changed.

I opened a new JIRA issue for this
http://issues.apache.org/jira/browse/SOLR-1765


Thanks,
Charlie


-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Thursday, February 04, 2010 2:16 PM
To: solr-user@lucene.apache.org
Subject: Re: HTTP caching and distributed search


: >
http://localhost:8080/solr/core1/select/?q=google&start=0&rows=10&shards
: > =localhost:8080/solr/core1,localhost:8080/solr/core2

: You are right, etag is calculated using the searcher on core1 only and
it
: does not take other shards into account. Can you open a Jira issue?

...as a possible workaround I would suggest creating a separate
"coordinator" core that is neither core1 nor core2 ... it doesn't have to
have any docs in it, it just has to have consistent schemas with the other
two cores.  That way you can use distinct <httpCaching> settings on
the coordinator core (perhaps never304="true", but with an explicit
<cacheControl> setting? ... or lastModifiedFrom="openTime"), and then you
could send an explicit "commit" to the (empty) coordinator core anytime
you modify one of the shards.
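For example, the coordinator core's solrconfig.xml could carry something like
the stock httpCaching example (values are illustrative):

  <httpCaching lastModifiedFrom="openTime" etagSeed="Solr">
    <cacheControl>max-age=30, public</cacheControl>
  </httpCaching>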



-Hoss



Re: TermInfosReader.get ArrayIndexOutOfBoundsException

2010-02-09 Thread Tom Burton-West

Thanks Lance and Michael,


We are running Solr 1.3.0.2009.09.03.11.14.39  (Complete version info from
Solr admin panel appended below)

I tried running CheckIndex (with the -ea:  switch ) on one of the shards.
CheckIndex also produced an ArrayIndexOutOfBoundsException on the larger
segment containing 500K+ documents. (Complete CheckIndex output appended
below)

Is it likely that all 10 shards are corrupted?  Is it possible that we have
simply exceeded some lucene limit?

I'm wondering if we could have exceeded the lucene limit of unique terms of
2.1 billion as mentioned towards the end of the Lucene Index File Formats
document.  If the small 731 document index has nine million unique terms as
reported by check index, then even though many terms are repeated, it is
conceivable that the 500,000 document index could have more than 2.1 billion
terms.

Do you know if  the number of terms reported by CheckIndex is the number of
unique terms?

On the other hand, we previously optimized a 1 million document index down
to 1 segment and had no problems.  That was with an earlier version of Solr
and did not include CommonGrams which could conceivably increase the number
of terms in the index by 2 or 3 times.


Tom
---

Solr Specification Version: 1.3.0.2009.09.03.11.14.39
Solr Implementation Version: 1.4-dev 793569 - root - 2009-09-03 11:14:39
Lucene Specification Version: 2.9-dev
Lucene Implementation Version: 2.9-dev 779312 - 2009-05-27 17:19:55


[tburt...@slurm-4 ~]$  java -Xmx4096m  -Xms4096m -cp
/l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib/lucene-core-2.9-dev.jar:/l/local/apache-tomcat-serve/webapps/solr-sdr-search/serve-10/WEB-INF/lib
-ea:org.apache.lucene... org.apache.lucene.index.CheckIndex
/l/solrs/1/.snapshot/serve-2010-02-07/data/index 

Opening index @ /l/solrs/1/.snapshot/serve-2010-02-07/data/index

Segments file=segments_zo numSegments=2 version=FORMAT_DIAGNOSTICS [Lucene
2.9]
  1 of 2: name=_29dn docCount=554799
compound=false
hasProx=true
numFiles=9
size (MB)=267,131.261
diagnostics = {optimize=true, mergeFactor=2,
os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true,
lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge,
os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_29dn_7.del]
test: open reader.OK [184 deleted docs]
test: fields, norms...OK [6 fields]
test: terms, freq, prox...FAILED
WARNING: fixIndex() would remove reference to this segment; full
exception:
java.lang.ArrayIndexOutOfBoundsException: -16777214
at
org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:246)
at
org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:218)
at
org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:57)
at
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:474)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:715)

  2 of 2: name=_29im docCount=731
compound=false
hasProx=true
numFiles=8
size (MB)=421.261
diagnostics = {optimize=true, mergeFactor=3,
os.version=2.6.18-164.6.1.el5, os=Linux, mergeDocStores=true,
lucene.version=2.9-dev 779312 - 2009-05-27 17:19:55, source=merge,
os.arch=amd64, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.OK
test: fields, norms...OK [6 fields]
test: terms, freq, prox...OK [9504552 terms; 34864047 terms/docs pairs;
144869629 tokens]
test: stored fields...OK [3550 total field count; avg 4.856 fields
per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq
vector fields per doc]

WARNING: 1 broken segments (containing 554615 documents) detected
WARNING: would write new segments file, and 554615 documents would be lost,
if -fix were specified


[tburt...@slurm-4 ~]$ 


The index is corrupted. In some places ArrayIndex and NPE are not
wrapped as CorruptIndexException.

Try running your code with the Lucene assertions on. Add this to the
JVM arguments:  -ea:org.apache.lucene...


-- 
View this message in context: 
http://old.nabble.com/TermInfosReader.get-ArrayIndexOutOfBoundsException-tp27506243p27518800.html
Sent from the Solr - User mailing list archive at Nabble.com.



Question on Tokenizing email address

2010-02-09 Thread Abhishek Srivastava
Hello Everyone,

I have a field in my solr schema which stores emails. The way I want the
emails to be tokenized is like this.
if the email address is abc@alpha-xyz.com
User should be able to search on

1. abc@alpha-xyz.com  (whole address)
2. abc
3. def
4. alpha-xyz

Which tokenizer should I use?

Also, is there a feature like "Must Match" in solr? in my schema there is
field called "from" which contains the email address of the person who sent
an email. For this field, I don't want any tokenization. When the user
issues a search, the user's email ID must exactly match the "from" field
value for that document/record to be returned.
How can I do this?

Regards,
Abhishek


Re: unloading a solr core doesn't free any memory

2010-02-09 Thread Jason Rutherglen
Tim,

The GC just automagically works, right?

:)

There have been issues around thread locals in Lucene.  The main code for
core management is CoreContainer, which I believe is fairly easy to
digest.  If there's an issue you may find it there.

Jason

2010/2/9 Tim Terlegård :
> If I unload the core and then click "Perform GC" in jconsole nothing
> happens. The 8 GB RAM is still used.
>
> If I load the core again and then run the query with the sort fields,
> then jconsole shows that the memory usage immediately drops to 1 GB
> and then rises to 8 GB again as it caches the stuff.
>
> So my suspicion is that the sort cache still references all these
> objects even after the core is unloaded. But somehow it knows that the
> current sort cache is obsolete. After loading the core again and
> executing the query with sort fields the sort cache references a new
> object and the memory usage drops.
>
> Bug? I could check the source code, but don't know where to look. Any hints?
>
> /Tim
>
> 2010/2/9 Lance Norskog :
>> The 'jconsole' program lets you monitor GC operation in real-time.
>>
>> http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html
>>
>> On Mon, Feb 8, 2010 at 8:44 AM, Simon Rosenthal
>>  wrote:
>>> What Garbage Collection parameters is the JVM using ?   the memory will not
>>> always be freed immediately after an event like unloading a core or starting
>>> a new searcher.
>>>
>>> 2010/2/8 Tim Terlegård 
>>>
 To me it doesn't look like unloading a Solr Core frees the memory that
 the core has used. Is this how it should be?

 I have a big index with 50 million documents. After loading a core it
 takes 300 MB RAM. After a query with a couple of sort fields Solr
 takes about 8 GB RAM. Then I unload (CoreAdminRequest.unloadCore) the
 core. The core is not shown in /solr/ anymore. Solr still takes 8 GB
 RAM. Creating new cores is super slow because I have hardly any memory
 left. Do I need to free the memory explicitly somehow?

 /Tim

>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>
>


Re: Replication and querying

2010-02-09 Thread Jan Høydahl / Cominvent
Hi,

Index replication in Solr makes an exact copy of the original index.
Is it not possible to add the 6 extra fields to both instances?
An alternative to replication is to feed two independent Solr instances -> full 
control :)
Please elaborate on your specific use case if this is not a useful answer for you.

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 9. feb. 2010, at 13.21, Julian Hille wrote:

> Hi,
> 
> I'd like to know if it's possible to have a Solr server with a schema and,
> let's say, 10 fields indexed.
> I now want to replicate this whole index to another Solr server which has a
> slightly different schema.
> There are 6 additional fields; these fields change the sort order for a
> product whose base is our Solr database.
> 
> Is this kind of replication possible?
> 
> Is there another way to interact with data in Solr? We'd like to calculate
> some fields when they are added.
> I can't seem to find good documentation about the possible calls in the
> query itself, nor documentation about the queries/calculations which should
> be done on update.
> 
> 
> so far,
> Julian Hille
> 
> 
> ---
> NetImpact KG
> Altonaer Straße 8
> 20357 Hamburg
> 
> Tel: 040 / 6738363 2
> Mail: jul...@netimpact.de
> 
> Geschäftsführer: Tarek Müller
> 



Re: joining two field for query

2010-02-09 Thread Jan Høydahl / Cominvent
You may also want to play with other highlighting parameters to select how much 
text to do highlighting on, how many fragments etc. See 
http://wiki.apache.org/solr/HighlightingParameters
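For example (parameter values are just a starting point and would need tuning):

  q=body:(nokia samsung)&fq=id:1&hl=true&hl.fl=body&hl.snippets=3&hl.fragsize=200&hl.mergeContiguous=true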

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 9. feb. 2010, at 13.08, Ahmet Arslan wrote:

> 
>> I am searching by "nokia" and resulting (listing) 1,2,3
>> field with short
>> description.
>> There is link on search list(like google), by clicking on
>> link performing
>> new search (opening doc from index), for this search
>> 
>> I want to join two fields:
>> id:1 + queryString ("nokia samsung") to return only id:1
>> record and want to
>> highlight the field "nokia samsung".
>> something like : "q=id:1 + body:nokia samsung"
>> 
>> basically I want to highlight the query string when
>> clicking on link and
>> opening the new windows (like google cache).
> 
> When the user clicks document (id=1), you can use these parameters:
> q=body:(nokia samsung)&fq=id:1&hl=true&hl.fl=body
> 
> 
> 
> 



Re: Is it possible to exclude results from other languages?

2010-02-09 Thread Jan Høydahl / Cominvent
Much more efficient to tag documents with language at index time. Look for 
language identification tools such as 
http://www.sematext.com/products/language-identifier/index.html or 
http://ngramj.sourceforge.net/ or 
http://lucene.apache.org/nutch/apidocs-1.0/org/apache/nutch/analysis/lang/LanguageIdentifier.html

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 9. feb. 2010, at 05.19, Lance Norskog wrote:

> There is
> 
> On Thu, Feb 4, 2010 at 10:07 AM, Raimon Bosch  wrote:
>> 
>> 
>> Yes, It's true that we could do it in index time if we had a way to know. I
>> was thinking in some solution in search time, maybe measuring the % of
>> stopwords of each document. Normally, a document of another language won't
>> have any stopword of its main language.
>> 
>> If you know some external software to detect the language of a source text,
>> it would be useful too.
>> 
>> Thanks,
>> Raimon Bosch.
>> 
>> 
>> 
>> Ahmet Arslan wrote:
>>> 
>>> 
 In our indexes, sometimes we have some documents written in
 other languages
 different to the most common index's language. Is there any
 way to give less
 boosting to this documents?
>>> 
>>> If you are aware of those documents, at index time you can boost those
>>> documents with a value less than 1.0:
>>> 
>>> <add>
>>>   <doc boost="0.5">  // document written in other languages
>>>     ...
>>>     ...
>>>   </doc>
>>> </add>
>>> 
>>> http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_on_.22doc.22
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> --
>> View this message in context: 
>> http://old.nabble.com/Is-it-posible-to-exclude-results-from-other-languages--tp27455759p27457165.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>> 
>> 
> 
> 
> 
> -- 
> Lance Norskog
> goks...@gmail.com



Re: DIH: delta-import not working

2010-02-09 Thread Jorg Heymans
Indeed, that made it work. Looking back at the documentation, it's all there,
but one needs to read every single line with care :-)
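For anyone hitting the same thing, a minimal shape of such an entity (table,
column names and date format are placeholders, not the poster's actual config):

  <entity name="attachment" pk="ID"
          query="select id, bytes from attachment where application = 'MYAPP'"
          deltaQuery="select id from attachment where application = 'MYAPP'
                      and modified_on > to_date('${dataimporter.attachment.last_index_time}',
                                                'yyyy-mm-dd hh24:mi:ss')"
          deltaImportQuery="select id, bytes from attachment
                            where application = 'MYAPP' and id = '${dataimporter.delta.id}'">
  </entity>

Per Noble's note, use ${dataimporter.delta.ID} instead if the driver returns
the primary-key column name in upper case.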

2010/2/9 Noble Paul നോബിള്‍ नोब्ळ् 

> try this
>
> deltaImportQuery="select id, bytes from attachment where application =
>  'MYAPP' and id = '${dataimporter.delta.id}'"
>
> be aware that the names are case sensitive . if the id comes as 'ID'
> this will not work
>
>
>
> On Tue, Feb 9, 2010 at 3:15 PM, Jorg Heymans 
> wrote:
> > Hi,
> >
> > I am having problems getting the delta-import to work for my schema.
> > Following what i have found in the list, jira and the wiki below
> > configuration should just work but it doesn't.
> >
> > <dataConfig>
> >   <dataSource url="jdbc:oracle:thin:@." user="" password=""/>
> >   <document>
> >     <entity name="attachment"
> >       deltaImportQuery="select id, bytes from attachment where application = 'MYAPP' and id = '${dataimporter.attachment.id}'"
> >       deltaQuery="select id from attachment where application = 'MYAPP' and modified_on > to_date('${dataimporter.attachment.last_index_time}', 'yyyy-mm-dd hh24:mi:ss')">
> >       <entity url="bytes" dataField="attachment.bytes">
> >         ...
> >       </entity>
> >     </entity>
> >   </document>
> > </dataConfig>
> >
> > The sql generated in the deltaquery is correct, the timestamp is passed
> > correctly. When i execute that query manually in the DB it returns the pk
> of
> > the rows that were added. However no documents are added to the index.
> What
> > am i missing here ?? I'm using a build snapshot from 03/02.
> >
> >
> > Thanks
> > Jorg
> >
>
>
>
> --
> -
> Noble Paul | Systems Architect| AOL | http://aol.com
>


Re: Faceting

2010-02-09 Thread Jan Høydahl / Cominvent
NOTE: Please start a new email thread for a new topic (See 
http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking)

Your strategy could work. You might want to look into dedicated entity 
extraction frameworks like
http://opennlp.sourceforge.net/
http://nlp.stanford.edu/software/CRF-NER.shtml
http://incubator.apache.org/uima/index.html

Or if that is too much work, look at 
http://issues.apache.org/jira/browse/SOLR-1725 for a way to plug in your entity 
extraction code into Solr itself using a scripting language.
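For the faceting half of the question, a rough sketch of José's own plan (the field name "tags" and its type are assumptions): put the labels produced by the regexp step into a multiValued string field,

    <field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>

and then facet and filter on it, e.g. q=*:*&facet=true&facet.field=tags&fq=tags:email returns per-label counts and restricts the results to documents tagged "email".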

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 5. feb. 2010, at 20.10, José Moreira wrote:

> Hello,
> 
> I'm planning to index a 'content' field for search and from that
> fields text content i would like to facet (probably) according to if
> the content has e-mails, urls and within urls, url's to pictures,
> videos and others.
> 
> As i'm a relatively new user to Solr, my plan was to regexp the
> content in my application and add tags to a Solr field according to
> the content, so for example the content "m...@email.com
> http://www.site.com"; would have the tags "email, link".
> 
> If i follow this path can i then facet on "email" and/or "link" ? For
> example combining facet field with facet value params?
> 
> Best
> 
> -- 
> http://pt.linkedin.com/in/josemoreira
> josemore...@irc.freenode.net
> http://djangopeople.net/josemoreira/



Re: Autosuggest and highlighting

2010-02-09 Thread gwk

On 2/9/2010 2:57 PM, Ahmet Arslan wrote:

I'm trying to improve the search box on our website by
adding an autosuggest field. The dataset is a set of
properties in the world (mostly europe) and the searchbox is
intended to be filled with a country-, region- or city name.
To do this I've created a separate, simple core with one
document per geographic location, for example the document
for the country "France" contains several fields including
the number of properties (so we can show the approximate
amount of results in the autosuggest box) and the name of
the country France in several languages and some other
bookkeeping information. The name of the property is stored
in two fields: "name" which simple contains the canonical
name of the country, region or city and "names" which is a
multivalued field containing the name in several different
languages. Both fields use an EdgeNGramFilter during
analysis so the query "Fr" can match "France".

This all seems to work, the autosuggest box gives
appropriate suggestions. But when I turn on highlighting the
results are less than desirable, for example the query "rho"
using dismax  (and hl.snippets=5) returns the
following:



Région
Rhône-Alpes
Rhône-Alpes
Rhône-Alpes
Rhône-Alpes
Rhône-Alpes


Région
Rhône-Alpes




Département du
Rhône
Département du
Rhône
Rhône
Département du
Rhône
Rhône


Département du
Rhône



As you can see, no matter where the match is, the first 3
characters are highlighted. Obviously not correct for many
of the fields. Is this because of the NGramFilterFactory or
am I doing something wrong?


I used https://issues.apache.org/jira/browse/SOLR-357 for this sometime ago. It 
was giving correct highlights.

I just ran a test with the NGramFilter removed (and reindexing) which 
did give correct highlighting results but I had to query using the whole 
word. I'll try the PrefixingFilterFactory next although according to the 
comments it's nothing but a subset of the EdgeNGramFilterFactory so 
unless I'm configuring it wrong it should yield the same results...



However we are now using 
http://www.ajaxupdates.com/mootools-autocomplete-ajax-script/ It automatically 
makes bold matching characters without using solr highlighting.

Using a pure javascript based solution isn't really an option for us as 
that wouldn't work for the diacritical marks without a lot of 
transliteration brouhaha.


Regards,

gwk


Re: How to send web pages(urls) to solr cell via solrj?

2010-02-09 Thread Jan Høydahl / Cominvent
Hi,

I did not try this, but could you not read the URL client side and pass it to 
SolrJ as a ContentStream?

ContentStream urlStream = 
ContentStreamBase.URLStream("http://my.site/file.html";);
req.addContentStream(urlStream);
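A slightly fuller, untested sketch of that idea; note that URLStream takes a java.net.URL, and the literal/uprefix/fmap parameters are just the ones from the curl example below:

    import java.net.URL;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    import org.apache.solr.common.util.ContentStreamBase;

    public class RemoteUrlExtract {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            // Fetch the page client-side and stream it to Solr Cell.
            req.addContentStream(new ContentStreamBase.URLStream(
                    new URL("http://wiki.apache.org/solr/SolrConfigXml")));
            req.setParam("literal.id", "url-doc-1");
            req.setParam("uprefix", "attr_");
            req.setParam("fmap.content", "attr_content");
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

            server.request(req);
        }
    }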

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 4. feb. 2010, at 10.47, dhamu wrote:

> 
> Hi,
> I am newbie to solr and exploring solr last few days.
> I am using solr cell with tika for parsing, indexing and searching
> Posting the rich text documents via Solrj.
> My actual requirement is instead of using local documents(pdf, doc & docx),
> i want to use webpages(urls for eg..,(http://www.apache.org)). 
> 
> eg..,
> req.addFile(new File("docs/mailing_lists.html"));
> instead
> req.url(new urlconnection("http://www.apache.org";)
> anything like the above is there in solrj.
> 
> Actually i am using curl for testing. it works fine
> 
> curl
> "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true";
> -F "stream.url=http://wiki.apache.org/solr/SolrConfigXml"; 
> 
> but i am in need to use otherthan curl.
> Below code works fine for local document indexing and searching. But instead
> i want to post urls.
> 
> here is my code.,
> 
>String url = "http://localhost:8983/solr";;
>SolrServer server = new CommonsHttpSolrServer(url);
>   ContentStreamUpdateRequest req = new ContentStreamUpdateRequest(
>   "/update/extract");
>   req.addFile(new File("docs/mailing_lists.html"));
>   req.setParam("literal.id", "index1");
>   req.setParam("uprefix", "attr_");
>   req.setParam("fmap.content", "attr_content");
>   req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>   NamedList result = server.request(req);
>   assertNotNull("Couldn't upload index.pdf", result);
>   QueryResponse rsp = server.query(new SolrQuery("*:*"));
>   Assert.assertEquals(1, rsp.getResults().getNumFound());
> 
> any suggestion or answer will be appreciated.
> 
> 
> -- 
> View this message in context: 
> http://old.nabble.com/How-to-send-web-pages%28urls%29-to-solr-cell-via-solrj--tp27450083p27450083.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 



Re: Autosuggest and highlighting

2010-02-09 Thread Ahmet Arslan
> I'm trying to improve the search box on our website by
> adding an autosuggest field. The dataset is a set of
> properties in the world (mostly europe) and the searchbox is
> intended to be filled with a country-, region- or city name.
> To do this I've created a separate, simple core with one
> document per geographic location, for example the document
> for the country "France" contains several fields including
> the number of properties (so we can show the approximate
> amount of results in the autosuggest box) and the name of
> the country France in several languages and some other
> bookkeeping information. The name of the property is stored
> in two fields: "name" which simple contains the canonical
> name of the country, region or city and "names" which is a
> multivalued field containing the name in several different
> languages. Both fields use an EdgeNGramFilter during
> analysis so the query "Fr" can match "France".
> 
> This all seems to work, the autosuggest box gives
> appropriate suggestions. But when I turn on highlighting the
> results are less than desirable, for example the query "rho"
> using dismax  (and hl.snippets=5) returns the
> following:
> 
> 
> 
> Région
> Rhône-Alpes
> Rhône-Alpes
> Rhône-Alpes
> Rhône-Alpes
> Rhône-Alpes
> 
> 
> Région
> Rhône-Alpes
> 
> 
> 
> 
> Département du
> Rhône
> Département du
> Rhône
> Rhône
> Département du
> Rhône
> Rhône
> 
> 
> Département du
> Rhône
> 
> 
> 
> As you can see, no matter where the match is, the first 3
> characters are highlighted. Obviously not correct for many
> of the fields. Is this because of the NGramFilterFactory or
> am I doing something wrong?

I used https://issues.apache.org/jira/browse/SOLR-357 for this sometime ago. It 
was giving correct highlights. 

However we are now using 
http://www.ajaxupdates.com/mootools-autocomplete-ajax-script/ It automatically 
makes bold matching characters without using solr highlighting.






Distributed search and haproxy and connection build up

2010-02-09 Thread Ian Connor
I have been using distributed search with haproxy but noticed that I am
suffering a little from tcp connections building up waiting for the OS level
closing/time out:

netstat -a
...
tcp6   1  0 10.0.16.170%34654:53789 10.0.16.181%363574:8893
CLOSE_WAIT
tcp6   1  0 10.0.16.170%34654:43932 10.0.16.181%363574:8890
CLOSE_WAIT
tcp6   1  0 10.0.16.170%34654:43190 10.0.16.181%363574:8895
CLOSE_WAIT
tcp6   0  0 10.0.16.170%346547:8984 10.0.16.181%36357:53770
TIME_WAIT
tcp6   1  0 10.0.16.170%34654:41782 10.0.16.181%363574:
CLOSE_WAIT
tcp6   1  0 10.0.16.170%34654:52169 10.0.16.181%363574:8890
CLOSE_WAIT
tcp6   1  0 10.0.16.170%34654:55947 10.0.16.181%363574:8887
CLOSE_WAIT
tcp6   0  0 10.0.16.170%346547:8984 10.0.16.181%36357:54040
TIME_WAIT
tcp6   1  0 10.0.16.170%34654:40030 10.0.16.160%363574:8984
CLOSE_WAIT
...

Digging a little into the haproxy documentation, it seems that they do not
support persistent connections.

Does solr normally persist the connections between shards (would this
problem happen even without haproxy)?

Ian.


Re: Solr usage with Auctions/Classifieds?

2010-02-09 Thread Jan Høydahl / Cominvent
With the new sort by function in 1.5 
(http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function), will it now be 
possible to include the ExternalFileField value in the sort formula? If so, we 
could sort on last bid price or last bid time without updating the document 
itself.

However, to display the result with the fresh values, we need to go to DB, or 
is there someone working on the possibility to return ExternalFileField values 
for result view?
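For readers following along, a sketch of the ExternalFileField setup being discussed (names and values are illustrative, modelled on the commented example in the stock schema.xml):

    <!-- schema.xml -->
    <fieldType name="file" class="solr.ExternalFileField"
               keyField="id" defVal="0" stored="false" indexed="false"
               valType="pfloat"/>
    <field name="last_bid" type="file"/>

The float values live outside the index in a file named external_last_bid, one id=value line per document, re-read when a new searcher opens, so bids can be refreshed without re-sending documents. Today the field is only usable in function queries, which is why sort-by-function and returning the value at result time matter here.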

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 4. feb. 2010, at 06.25, Lance Norskog wrote:

> Oops, forgot to add the link:
> 
> http://www.lucidimagination.com/search/document/CDRG_ch04_4.4.4
> 
> On Wed, Feb 3, 2010 at 9:17 PM, Andy  wrote:
>> How do I set up and use this external file?
>> 
>> Can I still use such a field in fq or boost?
>> 
>> Can you point me to the right documentation? Thanks
>> 
>> --- On Wed, 2/3/10, Lance Norskog  wrote:
>> 
>> From: Lance Norskog 
>> Subject: Re: Solr usage with Auctions/Classifieds?
>> To: solr-user@lucene.apache.org
>> Date: Wednesday, February 3, 2010, 10:03 PM
>> 
>> This field type allows you to have an external file that gives a float
>> value for a field. You can only use functions on it.
>> 
>> On Sat, Jan 30, 2010 at 7:05 AM, Jan Høydahl / Cominvent
>>  wrote:
>>> A follow-up on the auction use case.
>>> 
>>> How do you handle the need for frequent updates of only one field, such as 
>>> the last bid field (needed for sort on price, facets or range)?
>>> For high traffic sites, the document update rate becomes very high if you 
>>> re-send the whole document every time the bid price changes.
>>> 
>>> --
>>> Jan Høydahl  - search architect
>>> Cominvent AS - www.cominvent.com
>>> 
>>> On 10. des. 2009, at 19.52, Grant Ingersoll wrote:
>>> 
 
 On Dec 8, 2009, at 6:37 PM, regany wrote:
 
> 
> hello!
> 
> just wondering if anyone is using Solr as their search for an auction /
> classified site, and if so how have you managed your setup in general? ie.
> searching against listings that may have expired etc.
 
 
 I know several companies using Solr for classifieds/auctions.  Some remove 
 the old listings while others leave them in and filter them or even allow 
 users to see old stuff (but often for reasons other than users finding 
 them, i.e. SEO).  For those that remove, it's typically a batch operation 
 that takes place at night.
 
 --
 Grant Ingersoll
 http://www.lucidimagination.com/
 
 Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using 
 Solr/Lucene:
 http://www.lucidimagination.com/search
 
>>> 
>>> 
>> 
>> 
>> 
>> --
>> Lance Norskog
>> goks...@gmail.com
>> 
>> 
>> 
>> 
> 
> 
> 
> -- 
> Lance Norskog
> goks...@gmail.com



Autosuggest and highlighting

2010-02-09 Thread gwk

Hi,

I'm trying to improve the search box on our website by adding an 
autosuggest field. The dataset is a set of properties in the world 
(mostly europe) and the searchbox is intended to be filled with a 
country-, region- or city name. To do this I've created a separate, 
simple core with one document per geographic location, for example the 
document for the country "France" contains several fields including the 
number of properties (so we can show the approximate amount of results 
in the autosuggest box) and the name of the country France in several 
languages and some other bookkeeping information. The name of the 
property is stored in two fields: "name", which simply contains the 
canonical name of the country, region or city and "names" which is a 
multivalued field containing the name in several different languages. 
Both fields use an EdgeNGramFilter during analysis so the query "Fr" can 
match "France".


This all seems to work, the autosuggest box gives appropriate 
suggestions. But when I turn on highlighting the results are less than 
desirable, for example the query "rho" using dismax  (and hl.snippets=5) 
returns the following:




Région Rhône-Alpes
Rhône-Alpes
Rhône-Alpes
Rhône-Alpes
Rhône-Alpes


Région Rhône-Alpes




Département du Rhône
Département du Rhône
Rhône
Département du Rhône
Rhône


Département du Rhône



As you can see, no matter where the match is, the first 3 characters are 
highlighted. Obviously not correct for many of the fields. Is this 
because of the NGramFilterFactory or am I doing something wrong?


The field definition for 'name' and 'names' is:

<fieldType name="..." class="solr.TextField">
  <analyzer type="index">
    ...
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>
    ...
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    ...
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="0" catenateNumbers="0"
        catenateAll="0" splitOnCaseChange="1"/>
    ...
  </analyzer>
</fieldType>

<field name="name" type="..." indexed="true" stored="true"/>
<field name="names" type="..." indexed="true" stored="true" multiValued="true"/>


Regards,

gwk


Re: joining two field for query (Solved)

2010-02-09 Thread Ranveer Kumar
Hi Ahmet,

Thank you very much..
my problem solved..

with regards


On Tue, Feb 9, 2010 at 5:38 PM, Ahmet Arslan  wrote:

>
> > I am searching by "nokia" and resulting (listing) 1,2,3
> > field with short
> > description.
> > There is link on search list(like google), by clicking on
> > link performing
> > new search (opening doc from index), for this search
> >
> > I want to join two fields:
> > id:1 + queryString ("nokia samsung") to return only id:1
> > record and want to
> > highlight the field "nokia samsung".
> > something like : "q=id:1 + body:nokia samsung"
> >
> > basically I want to highlight the query string when
> > clicking on link and
> > opening the new windows (like google cache).
>
> When the user clicks document (id=1), you can use these parameters:
> q=body:(nokia samsung)&fq=id:1&hl=true&hl.fl=body
>
>
>
>
>


Replication and querying

2010-02-09 Thread Julian Hille
Hi,

I'd like to know if it's possible to have a Solr server with a schema and, let's
say, 10 fields indexed.
I now want to replicate this whole index to another Solr server which has a
slightly different schema.
There are 6 additional fields; these fields change the sort order for a product
whose base is our Solr database.

Is this kind of replication possible?

Is there another way to interact with data in Solr? We'd like to calculate some
fields when they are added.
I can't seem to find good documentation about the possible calls in the query
itself, nor documentation about queries/calculations which should be done on
update.


so far,
Julian Hille


---
NetImpact KG
Altonaer Straße 8
20357 Hamburg

Tel: 040 / 6738363 2
Mail: jul...@netimpact.de

Geschäftsführer: Tarek Müller



Re: joining two field for query

2010-02-09 Thread Ahmet Arslan
 
> I am searching by "nokia" and resulting (listing) 1,2,3
> field with short
> description.
> There is link on search list(like google), by clicking on
> link performing
> new search (opening doc from index), for this search
> 
> I want to join two fields:
> id:1 + queryString ("nokia samsung") to return only id:1
> record and want to
> highlight the field "nokia samsung".
> something like : "q=id:1 + body:nokia samsung"
> 
> basically I want to highlight the query string when
> clicking on link and
> opening the new windows (like google cache).

When the user clicks document (id=1), you can use these parameters:
q=body:(nokia samsung)&fq=id:1&hl=true&hl.fl=body
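The same request expressed through SolrJ, as a rough sketch (the core URL is a placeholder; the field names are the ones used in this thread):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class HighlightClickedDoc {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery query = new SolrQuery("body:(nokia samsung)");
            query.addFilterQuery("id:1");   // restrict the search to the clicked document
            query.setHighlight(true);
            query.addHighlightField("body");
            QueryResponse rsp = server.query(query);
            // highlighted fragments for doc 1:
            System.out.println(rsp.getHighlighting().get("1").get("body"));
        }
    }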



  


Re: Call URL, simply parse the results using SolrJ

2010-02-09 Thread Noble Paul നോബിള്‍ नोब्ळ्
you can also try

URL urlo = new URL(url);// ensure that the url has wt=javabin in that
NamedList namedList = new
JavaBinCodec().unmarshal(urlo.openConnection().getInputStream());
QueryResponse response = new QueryResponse(namedList, null);
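A self-contained variant of the same approach, as a sketch (the URL and query are placeholders; unmarshal() returns Object, so a cast is needed):

    import java.io.InputStream;
    import java.net.URL;

    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.util.JavaBinCodec;
    import org.apache.solr.common.util.NamedList;

    public class JavabinUrlQuery {
        @SuppressWarnings("unchecked")
        public static void main(String[] args) throws Exception {
            // The URL must ask for the javabin response writer (wt=javabin).
            URL url = new URL("http://localhost:8983/solr/select?q=*:*&wt=javabin");
            InputStream in = url.openConnection().getInputStream();
            try {
                NamedList<Object> namedList =
                        (NamedList<Object>) new JavaBinCodec().unmarshal(in);
                QueryResponse response = new QueryResponse(namedList, null);
                System.out.println("numFound: " + response.getResults().getNumFound());
            } finally {
                in.close();
            }
        }
    }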


On Mon, Feb 8, 2010 at 11:49 PM, Jason Rutherglen
 wrote:
> Here's what I did to resolve this:
>
> XMLResponseParser parser = new XMLResponseParser();
> URL urlo = new URL(url);
> InputStreamReader isr = new
> InputStreamReader(urlo.openConnection().getInputStream());
> NamedList namedList = parser.processResponse(isr);
> QueryResponse response = new QueryResponse(namedList, null);
>
> On Mon, Feb 8, 2010 at 10:03 AM, Jason Rutherglen
>  wrote:
>> So here's what happens if I pass in a URL with parameters, SolrJ chokes:
>>
>> Exception in thread "main" java.lang.RuntimeException: Invalid base
>> url for solrj.  The base URL must not contain parameters:
>> http://locahost:8080/solr/main/select?q=video&qt=dismax
>>        at 
>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.<init>(CommonsHttpSolrServer.java:205)
>>        at 
>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.<init>(CommonsHttpSolrServer.java:180)
>>        at 
>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.<init>(CommonsHttpSolrServer.java:152)
>>        at org.apache.solr.util.QueryTime.main(QueryTime.java:20)
>>
>>
>> On Mon, Feb 8, 2010 at 9:32 AM, Jason Rutherglen
>>  wrote:
>>> Sorry for the poorly worded title... For SOLR-1761 I want to pass in a
>>> URL and parse the query response... However it's non-obvious to me how
>>> to do this using the SolrJ API, hence asking the experts here. :)
>>>
>>
>



-- 
-
Noble Paul | Systems Architect| AOL | http://aol.com


Re: DIH: delta-import not working

2010-02-09 Thread Noble Paul നോബിള്‍ नोब्ळ्
try this

deltaImportQuery="select id, bytes from attachment where application =
 'MYAPP' and id = '${dataimporter.delta.id}'"

be aware that the names are case sensitive . if the id comes as 'ID'
this will not work
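For anyone hitting the same thing, a minimal sketch of the corrected entity (table and column names are taken from the post below; the full-import query attribute is only an assumption):

    <entity name="attachment"
            query="select id, bytes from attachment where application = 'MYAPP'"
            deltaQuery="select id from attachment where application = 'MYAPP'
                        and modified_on > to_date('${dataimporter.attachment.last_index_time}', 'yyyy-mm-dd hh24:mi:ss')"
            deltaImportQuery="select id, bytes from attachment where application = 'MYAPP'
                              and id = '${dataimporter.delta.id}'">
      ...
    </entity>

The only change from the original configuration is '${dataimporter.delta.id}' in deltaImportQuery, and the column name must match the case in which the driver returns it.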



On Tue, Feb 9, 2010 at 3:15 PM, Jorg Heymans  wrote:
> Hi,
>
> I am having problems getting the delta-import to work for my schema.
> Following what i have found in the list, jira and the wiki below
> configuration should just work but it doesn't.
>
> <dataConfig>
>   <dataSource url="jdbc:oracle:thin:@." user="" password=""/>
>   <document>
>     <entity name="attachment"
>       deltaImportQuery="select id, bytes from attachment where application = 'MYAPP' and id = '${dataimporter.attachment.id}'"
>       deltaQuery="select id from attachment where application = 'MYAPP' and modified_on > to_date('${dataimporter.attachment.last_index_time}', 'yyyy-mm-dd hh24:mi:ss')">
>       <entity url="bytes" dataField="attachment.bytes">
>         ...
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
>
> The sql generated in the deltaquery is correct, the timestamp is passed
> correctly. When i execute that query manually in the DB it returns the pk of
> the rows that were added. However no documents are added to the index. What
> am i missing here ?? I'm using a build snapshot from 03/02.
>
>
> Thanks
> Jorg
>



-- 
-
Noble Paul | Systems Architect| AOL | http://aol.com


joining two field for query

2010-02-09 Thread Ranveer Kumar
Hi all,

I need logic in Solr to join two fields in a query;
I have indexed two fields: id and body (text type).

5 rows are indexed:
id=1 : text= nokia samsung
id=2 : text= sony vaio nokia samsung
id=3 : text= vaio nokia
etc..

Searching with "q=id:1" works perfectly and returns "nokia samsung".

Searching for "nokia" lists documents 1, 2 and 3 with a short description.
There is a link on the search listing (like Google); clicking the link performs
a new search (opening the doc from the index). For this search I want to join
two fields:
id:1 + queryString ("nokia samsung"), to return only the id:1 record and to
highlight the terms "nokia samsung".
Something like: "q=id:1 + body:nokia samsung"

Basically I want to highlight the query string when clicking the link and
opening the new window (like the Google cache).

please help..
thanks


Re: TermInfosReader.get ArrayIndexOutOfBoundsException

2010-02-09 Thread Michael McCandless
Which version of Solr/Lucene are you using?

Can you run Lucene's CheckIndex tool (java -ea:org.apache.lucene
org.apache.lucene.index.CheckIndex /path/to/index) and then post the
output?

Have you altered any of IndexWriter's defaults (via solrconfig.xml)?
Eg the termIndexInterval?
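For reference, a sketch of the kind of override being asked about, assuming the indexDefaults section of solrconfig.xml (128 is Lucene's stock default and is shown only as an illustration):

    <indexDefaults>
      ...
      <termIndexInterval>128</termIndexInterval>
      ...
    </indexDefaults>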

Mike

On Mon, Feb 8, 2010 at 4:02 PM, Burton-West, Tom  wrote:
> Hello all,
>
> After optimizing rather large indexes on 10 shards (each index holds about 
> 500,000 documents and is  about 270-300 GB in size) we started getting  
> intermittent TermInfosReader.get()  ArrayIndexOutOfBounds exceptions.  The 
> exceptions sometimes seem to occur on all 10 shards at the same time and 
> sometimes on one shard but not the others.   We also sometimes get an 
> "Internal Server Error" but that might be either a cause or an effect of the 
> array index out of bounds.  Here is the top part of the message:
>
>
> java.lang.ArrayIndexOutOfBoundsException: -14127432
>        at 
> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:246)
>
> Any suggestions for troubleshooting would be appreciated.
>
> Trace from tomcat logs appended below.
>
> Tom Burton-West
>
> ---
>
> Feb 5, 2010 8:09:02 AM org.apache.solr.common.SolrException log
> SEVERE: java.lang.ArrayIndexOutOfBoundsException: -14127432
>        at 
> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:246)
>        at 
> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:218)
>        at 
> org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:943)
>        at 
> org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:308)
>        at 
> org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:144)
>        at org.apache.lucene.search.Similarity.idf(Similarity.java:481)
>        at 
> org.apache.lucene.search.TermQuery$TermWeight.<init>(TermQuery.java:44)
>        at org.apache.lucene.search.TermQuery.createWeight(TermQuery.java:146)
>        at 
> org.apache.lucene.search.BooleanQuery$BooleanWeight.<init>(BooleanQuery.java:186)
>        at 
> org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:366)
>        at org.apache.lucene.search.Query.weight(Query.java:95)
>        at org.apache.lucene.search.Searcher.createWeight(Searcher.java:230)
>        at org.apache.lucene.search.Searcher.search(Searcher.java:171)
>        at 
> org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:651)
>        at 
> org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:545)
>        at 
> org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:581)
>        at 
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:903)
>        at 
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:884)
>        at 
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
>        at 
> org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:176)
>        at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
>        at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299)
>        at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>        at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>        at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
>        at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
>        at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>        at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
>        at 
> org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:548)
>        at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>        at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
>        at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
>        at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
>        at 
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)
>        at 
> org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
>        at 
> org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
>        at 
> org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
>        at 
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
>   

Re: unloading a solr core doesn't free any memory

2010-02-09 Thread Tim Terlegård
If I unload the core and then click "Perform GC" in jconsole nothing
happens. The 8 GB RAM is still used.

If I load the core again and then run the query with the sort fields,
then jconsole shows that the memory usage immediately drops to 1 GB
and then rises to 8 GB again as it caches the stuff.

So my suspicion is that the sort cache still references all these
objects even after the core is unloaded. But somehow it knows that the
current sort cache is obsolete. After loading the core again and
executing the query with sort fields the sort cache references a new
object and the memory usage drops.

Bug? I could check the source code, but don't know where to look. Any hints?

/Tim

2010/2/9 Lance Norskog :
> The 'jconsole' program lets you monitor GC operation in real-time.
>
> http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html
>
> On Mon, Feb 8, 2010 at 8:44 AM, Simon Rosenthal
>  wrote:
>> What Garbage Collection parameters is the JVM using ?   the memory will not
>> always be freed immediately after an event like unloading a core or starting
>> a new searcher.
>>
>> 2010/2/8 Tim Terlegård 
>>
>>> To me it doesn't look like unloading a Solr Core frees the memory that
>>> the core has used. Is this how it should be?
>>>
>>> I have a big index with 50 million documents. After loading a core it
>>> takes 300 MB RAM. After a query with a couple of sort fields Solr
>>> takes about 8 GB RAM. Then I unload (CoreAdminRequest.unloadCore) the
>>> core. The core is not shown in /solr/ anymore. Solr still takes 8 GB
>>> RAM. Creating new cores is super slow because I have hardly any memory
>>> left. Do I need to free the memory explicitly somehow?
>>>
>>> /Tim
>>>
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: Posting pdf file and posting from remote

2010-02-09 Thread alendo

OK, I'm making progress (maybe :).
I tried another curl command to send the file from remote:

http://mysolr:/solr/update/extract?literal.id=8514&stream.file=files/attach-8514.pdf&stream.contentType=application/pdf
 

and the behaviour has changed: now I get an error in the Solr log file:

HTTP Status 500 - files/attach-8514.pdf (No such file or directory)
java.io.FileNotFoundException: files/attach-8514.pdf (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:106)
    at org.apache.solr.common.util.ContentStreamBase$FileStream.getStream(ContentStreamBase.java:108)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:158)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at 

etc etc...
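For reference, stream.file is opened on the Solr server's filesystem relative to the server's working directory, so a hedged variant uses an absolute path that exists on the Solr host, or stream.url to let Solr fetch the file over HTTP; host, port and paths below are only placeholders, and both forms need enableRemoteStreaming="true" in solrconfig.xml's requestParsers:

    curl "http://localhost:8983/solr/update/extract?literal.id=8514&stream.file=/data/attachments/attach-8514.pdf&stream.contentType=application/pdf&commit=true"

    curl "http://localhost:8983/solr/update/extract?literal.id=8514&stream.url=http://some.host/attach-8514.pdf&stream.contentType=application/pdf&commit=true"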

-- 
View this message in context: 
http://old.nabble.com/Posting-pdf-file-and-posting-from-remote-tp27512455p27512952.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: unloading a solr core doesn't free any memory

2010-02-09 Thread Tim Terlegård
I don't use any garbage collection parameters.

/Tim

2010/2/8 Simon Rosenthal :
> What Garbage Collection parameters is the JVM using ?   the memory will not
> always be freed immediately after an event like unloading a core or starting
> a new searcher.
>
> 2010/2/8 Tim Terlegård 
>
>> To me it doesn't look like unloading a Solr Core frees the memory that
>> the core has used. Is this how it should be?
>>
>> I have a big index with 50 million documents. After loading a core it
>> takes 300 MB RAM. After a query with a couple of sort fields Solr
>> takes about 8 GB RAM. Then I unload (CoreAdminRequest.unloadCore) the
>> core. The core is not shown in /solr/ anymore. Solr still takes 8 GB
>> RAM. Creating new cores is super slow because I have hardly any memory
>> left. Do I need to free the memory explicitly somehow?
>>
>> /Tim
>>
>


DIH: delta-import not working

2010-02-09 Thread Jorg Heymans
Hi,

I am having problems getting the delta-import to work for my schema.
Following what i have found in the list, jira and the wiki below
configuration should just work but it doesn't.

<dataConfig>
  <dataSource url="jdbc:oracle:thin:@." user="" password=""/>
  <document>
    <entity name="attachment"
      deltaImportQuery="select id, bytes from attachment where application = 'MYAPP' and id = '${dataimporter.attachment.id}'"
      deltaQuery="select id from attachment where application = 'MYAPP' and modified_on > to_date('${dataimporter.attachment.last_index_time}', 'yyyy-mm-dd hh24:mi:ss')">
      <entity url="bytes" dataField="attachment.bytes">
        ...
      </entity>
    </entity>
  </document>
</dataConfig>

The sql generated in the deltaquery is correct, the timestamp is passed
correctly. When i execute that query manually in the DB it returns the pk of
the rows that were added. However no documents are added to the index. What
am i missing here ?? I'm using a build snapshot from 03/02.


Thanks
Jorg


Re: Dynamic fields with more than 100 fields inside

2010-02-09 Thread Xavier Schepler

Shalin Shekhar Mangar wrote:

On Tue, Feb 9, 2010 at 2:43 PM, Xavier Schepler <
xavier.schep...@sciences-po.fr> wrote:

  

Shalin Shekhar Mangar wrote:

 On Mon, Feb 8, 2010 at 9:47 PM, Xavier Schepler <


xavier.schep...@sciences-po.fr> wrote:



  

Hey,

I'm thinking about using dynamic fields.

I need one or more user specific field in my schema, for example,
"concept_user_*", and I will have maybe more than 200 users using this
feature.
One user will send and retrieve values from its field. It will then be
used
to filter result.

How would it impact query performance ?






Can you give an example of such a query?



  

Hi,

it could be queries such as :

allFr: état-unis AND concept_researcher_99 = 303

modalitiesFr: exactement AND questionFr: correspond AND
concept_researcher_2 = 101

and facetting like this :


q=%2A%3A%2A&fl=variableXMLFr,lang&start=0&rows=10&facet=true&facet.field=concept_researcher_2&facet.field=studyDateAndStudyTitle&facet.sort=lex




It doesn't impact query performance any more than filtering on other fields.
Is there a performance problem or were you just asking generally?

  

I was asking generally, thanks for your response.




Re: Dynamic fields with more than 100 fields inside

2010-02-09 Thread Shalin Shekhar Mangar
On Tue, Feb 9, 2010 at 2:43 PM, Xavier Schepler <
xavier.schep...@sciences-po.fr> wrote:

> Shalin Shekhar Mangar a écrit :
>
>  On Mon, Feb 8, 2010 at 9:47 PM, Xavier Schepler <
>> xavier.schep...@sciences-po.fr> wrote:
>>
>>
>>
>>> Hey,
>>>
>>> I'm thinking about using dynamic fields.
>>>
>>> I need one or more user specific field in my schema, for example,
>>> "concept_user_*", and I will have maybe more than 200 users using this
>>> feature.
>>> One user will send and retrieve values from its field. It will then be
>>> used
>>> to filter result.
>>>
>>> How would it impact query performance ?
>>>
>>>
>>>
>>>
>> Can you give an example of such a query?
>>
>>
>>
> Hi,
>
> it could be queries such as :
>
> allFr: état-unis AND concept_researcher_99 = 303
>
> modalitiesFr: exactement AND questionFr: correspond AND
> concept_researcher_2 = 101
>
> and facetting like this :
>
>
> q=%2A%3A%2A&fl=variableXMLFr,lang&start=0&rows=10&facet=true&facet.field=concept_researcher_2&facet.field=studyDateAndStudyTitle&facet.sort=lex
>
>
It doesn't impact query performance any more than filtering on other fields.
Is there a performance problem or were you just asking generally?
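For reference, a minimal sketch of the schema side of this (the "int" type and the exact name pattern are assumptions):

    <dynamicField name="concept_researcher_*" type="int" indexed="true" stored="true"/>

Each per-user field then behaves like any other indexed field, so the filters above would normally be written with a colon, e.g. fq=concept_researcher_2:101, and the field can be used directly in facet.field as shown.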

-- 
Regards,
Shalin Shekhar Mangar.


Posting pdf file and posting from remote

2010-02-09 Thread alendo

I understand that Tika is able to index PDF content: is that true? I tried to
post a PDF from local and I've seen another document in the solr/admin schema
browser, but when I search only the document id is available; the document's
content doesn't seem to be indexed. Do I need other products to index PDF content?

Moreover, I want to send a file from remote: it seems I must configure Tika
with a tika-config.xml file, enabling remote streaming as in the following:



but I'm not able to find a tika-config.xml example... 
thanks a lot
Alessandra
-- 
View this message in context: 
http://old.nabble.com/Posting-pdf-file-and-posting-from-remote-tp27512455p27512455.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Dynamic fields with more than 100 fields inside

2010-02-09 Thread Xavier Schepler

Shalin Shekhar Mangar wrote:

On Mon, Feb 8, 2010 at 9:47 PM, Xavier Schepler <
xavier.schep...@sciences-po.fr> wrote:

  

Hey,

I'm thinking about using dynamic fields.

I need one or more user specific field in my schema, for example,
"concept_user_*", and I will have maybe more than 200 users using this
feature.
One user will send and retrieve values from its field. It will then be used
to filter result.

How would it impact query performance ?




Can you give an example of such a query?

  

Hi,

it could be queries such as :

allFr: état-unis AND concept_researcher_99 = 303

modalitiesFr: exactement AND questionFr: correspond AND 
concept_researcher_2 = 101


and facetting like this :

q=%2A%3A%2A&fl=variableXMLFr,lang&start=0&rows=10&facet=true&facet.field=concept_researcher_2&facet.field=studyDateAndStudyTitle&facet.sort=lex

Thanks in advance,

Xavier S.


Unsubscribe from mailing list

2010-02-09 Thread Abin Mathew
Please unsubscribe me from Mailing list


RE: Indexing / querying multiple data types

2010-02-09 Thread stefan.maric
Sven

In my data-config.xml I have the following 





In my schema.xml I have

   

   

And in my solrconfig.xml I have

  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>

  <requestHandler name="name1" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <str name="echoParams">explicit</str>
      <float name="tie">0.01</float>
      <str name="qf">name^1.5 description^1.0</str>
    </lst>
  </requestHandler>

  <requestHandler name="..." class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <str name="echoParams">explicit</str>
      <float name="tie">0.01</float>
      <str name="qf">name^1.5 description^1.0</str>
    </lst>
  </requestHandler>

And the 
  
Has been untouched

So when I run
http://localhost:7001/solr/select/?q=food&qt=name1
I was expecting to get results from the data that had been indexed by