Re: Want to support "did you mean xxx" but is Chinese

2011-10-23 Thread Floyd Wu
Hi Li Li,

Thanks for your detailed explanation. Basically I have a similar
implementation to yours. I just want to know if there is a better,
more complete solution. I'll keep trying and see if I have any
improvements that I can share with you and the community.

Any ideas or advice are welcome.

Floyd



2011/10/21 Li Li :
> We have implemented one supporting "did you mean" and prefix suggestion
> for Chinese. But we based our work on Solr 1.4 and we made many
> modifications, so it will take time to integrate it into current Solr/Lucene.
>
> Here is our solution; glad to hear any advice.
>
> 1. Offline word and phrase discovery.
>   We discover new words and new phrases by mining query logs.
>
> 2. Online matching algorithm.
>   For each word, e.g. 贝多芬,
>   we convert it to the pinyin bei duo fen, then we index it using
> n-grams, which means gram3:bei gram3:eid ...
>   To get the "did you mean" result, we convert the query 背朵分 into n-grams;
> it's a boolean OR query, so there are many results (the words whose pinyin
> is similar to the query will be ranked on top).
>  Then we rerank the top 500 results with a fine-grained algorithm:
>  we use edit distance to align query and result, and we also take the
> characters into consideration. E.g. for the query 十度, the matches are 十渡 and 是度; their
> pinyins are exactly the same, but 十渡 is better than 是度 because 十 occurs in
> both the query and the match.
>  You also need to consider the hotness (popularity) of different
> words/phrases, which can be learned from query logs.
>
>  Another issue is converting Chinese into pinyin, because some
> characters have more than one pinyin.
> E.g. 长沙 vs 长大: 长's pinyin is chang in 长沙, so you should segment the query and
> the words/phrases first. Word segmentation is a basic problem in Chinese IR.
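
As an illustration of the n-gram step Li Li describes above, here is a minimal, hedged Java sketch; the field name "gram3" and the example pinyin string are taken from the mail, everything else is illustrative and not Li Li's actual code.

import java.util.ArrayList;
import java.util.List;

public class PinyinGrams {

    // "beiduofen" -> [bei, eid, idu, duo, uof, ofe, fen]
    static List<String> trigrams(String pinyin) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + 3 <= pinyin.length(); i++) {
            grams.add(pinyin.substring(i, i + 3));
        }
        return grams;
    }

    public static void main(String[] args) {
        // At index time each gram becomes a term in the "gram3" field; at query
        // time the same grams are OR'ed together, so words whose pinyin is close
        // to the query surface near the top and can then be re-ranked by edit
        // distance and character overlap, as described above.
        for (String g : trigrams("beiduofen")) {
            System.out.println("gram3:" + g);
        }
    }
}
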
>
>
> 2011/10/21 Floyd Wu 
>
>> Does anybody know how to implement this idea in Solr? Please kindly
>> point me in a direction.
>>
>> For example, a user enters a keyword in Chinese, "贝多芬" (this is
>> Beethoven in Chinese),
>> but keys in a wrong combination of characters, "背多分" (this is
>> pronounced the same as the previous keyword "贝多芬").
>>
>> The token "贝多芬" actually exists in the Solr index. How can documents
>> where "贝多芬" exists be hit when "背多分" is entered?
>>
>> This is a basic function of commercial search engines, especially in
>> Chinese processing. I wonder how to implement it in Solr and where the
>> starting point is.
>>
>> Floyd
>>
>


Re: Dismax and phrases

2011-10-23 Thread Hyttinen Lauri

On 10/23/2011 09:34 PM, Erick Erickson wrote:

Hmmm dismax is, indeed, different. Note that dismax doesn't respect
the default operator at all, so don't be misled there.

Could you paste the debug output for both the queries? Perhaps something
will jump out at us.

Best
Erick

Thank you Erick. I've tried to paste the query results here.
The first one is the query with quotes around the terms; it returns 6888 results.
I've hidden the explain parts of most of the results (and the timing) just to
keep the email reasonably short.

If you need to see them let me know.
+ designates hidden "subtree".

Best regards,
Lauri



0
91



on

standard
2.2
10
*,score
on
0
"asuntojen hinnat"
dismax





+



asuntojenhinnat

"asuntojen hinnat"
"asuntojen hinnat"

+DisjunctionMaxQuery((table.title_t:"asuntojen 
hinnat"^2.0 | title_t:"asuntojen hinnat"^2.0 | ingress_t:"asuntojen 
hinnat" | (text_fi:asunto text_fi:hinta) | (table.description_fi:asunto 
table.description_fi:hinta) | table.description_t:"asuntojen hinnat" | 
graphic.title_t:"asuntojen hinnat"^2.0 | ((graphic.title_fi:asunto 
graphic.title_fi:hinta)^2.0) | ((table.title_fi:asunto 
table.title_fi:hinta)^2.0) | table.contents_t:"asuntojen hinnat" | 
text_t:"asuntojen hinnat" | (ingress_fi:asunto ingress_fi:hinta) | 
(table.contents_fi:asunto table.contents_fi:hinta) | ((title_fi:asunto 
title_fi:hinta)^2.0))~0.01) () type:tie^6.0 type:kuv^2.0 type:tau^2.0 
FunctionQuery((1.0/(3.16E-11*float(ms(const(1319437912691),date(date.modified_dt)))+1.0))^100.0)


+(table.title_t:"asuntojen hinnat"^2.0 
| title_t:"asuntojen hinnat"^2.0 | ingress_t:"asuntojen hinnat" | 
(text_fi:asunto text_fi:hinta) | (table.description_fi:asunto 
table.description_fi:hinta) | table.description_t:"asuntojen hinnat" | 
graphic.title_t:"asuntojen hinnat"^2.0 | ((graphic.title_fi:asunto 
graphic.title_fi:hinta)^2.0) | ((table.title_fi:asunto 
table.title_fi:hinta)^2.0) | table.contents_t:"asuntojen hinnat" | 
text_t:"asuntojen hinnat" | (ingress_fi:asunto ingress_fi:hinta) | 
(table.contents_fi:asunto table.contents_fi:hinta) | ((title_fi:asunto 
title_fi:hinta)^2.0))~0.01 () type:tie^6.0 type:kuv^2.0 type:tau^2.0 
(1.0/(3.16E-11*float(ms(const(1319437912691),date(date.modified_dt)))+1.0))^100.0



name="/media/nss/DATA2/data/wwwprod/til/ashi/2011/07/ashi_2011_07_2011-08-26_tie_001_fi.html">

3.1653805 = (MATCH) sum of:
  1.9299976 = (MATCH) max plus 0.01 times others of:
1.9211313 = weight(title_t:"asuntojen hinnat"^2.0 in 5891), product of:
  0.26658234 = queryWeight(title_t:"asuntojen hinnat"^2.0), product of:
2.0 = boost
14.413042 = idf(title_t: asuntojen=250 hinnat=329)
0.009247955 = queryNorm
  7.206521 = fieldWeight(title_t:"asuntojen hinnat" in 5891), 
product of:

1.0 = tf(phraseFreq=1.0)
14.413042 = idf(title_t: asuntojen=250 hinnat=329)
0.5 = fieldNorm(field=title_t, doc=5891)
0.03292808 = (MATCH) sum of:
  0.016520109 = (MATCH) weight(text_fi:asunto in 5891), product of:
0.044221584 = queryWeight(text_fi:asunto), product of:
  4.781769 = idf(docFreq=3251, maxDocs=142742)
  0.009247955 = queryNorm
0.3735757 = (MATCH) fieldWeight(text_fi:asunto in 5891), 
product of:

  1.0 = tf(termFreq(text_fi:asunto)=1)
  4.781769 = idf(docFreq=3251, maxDocs=142742)
  0.078125 = fieldNorm(field=text_fi, doc=5891)
  0.016407972 = (MATCH) weight(text_fi:hinta in 5891), product of:
0.03705935 = queryWeight(text_fi:hinta), product of:
  4.0073023 = idf(docFreq=7054, maxDocs=142742)
  0.009247955 = queryNorm
0.44274852 = (MATCH) fieldWeight(text_fi:hinta in 5891), 
product of:

  1.4142135 = tf(termFreq(text_fi:hinta)=2)
  4.0073023 = idf(docFreq=7054, maxDocs=142742)
  0.078125 = fieldNorm(field=text_fi, doc=5891)
0.34379265 = (MATCH) sum of:
  0.19207533 = (MATCH) weight(graphic.title_fi:asunto in 5891), 
product of:

0.10662244 = queryWeight(graphic.title_fi:asunto), product of:
  5.76465 = idf(docFreq=1216, maxDocs=142742)
  0.01849591 = queryNorm
1.8014531 = (MATCH) fieldWeight(graphic.title_fi:asunto in 
5891), product of:

  1.0 = tf(termFreq(graphic.title_fi:asunto)=1)
  5.76465 = idf(docFreq=1216, maxDocs=142742)
  0.3125 = fieldNorm(field=graphic.title_fi, doc=5891)
  0.15171732 = (MATCH) weight(graphic.title_fi:hinta in 5891), 
product of:

0.09476117 = queryWeight(graphic.title_fi:hinta), product of:
  5.1233582 = idf(docFreq=2310, maxDocs=142742)
  0.01849591 = queryNorm
1.6010494 = (MATCH) fieldWeight(graphic.title_fi:hinta in 
5891), product of:

  1.0 = tf(termFreq(graphic.title_fi:hinta)=1)
  5.1233582 = idf(docFreq=2310, maxDocs=142742)
  0.3125 = fieldNorm(field=graphic.title_fi, doc=5891)
0.5099132 = (MATCH) sum of:
  0.302103 = (MATCH) weight(title_fi:asunto in 5891), product of:
 

Re: questions on query format

2011-10-23 Thread Ahmet Arslan
> 2. If send solr the following query:
>   q=*:*
> 
>   I get nothing just:
>     name="response" numFound="0" start="0"
> maxScore="0.0"/> name="highlighting"/>
> 
> Would appreciate some insight into what is going on.

If you are using dismax as the query parser, then *:* won't function as a
match-all-docs query. To retrieve all docs with dismax, use the q.alt=*:*
parameter. Also, adding debugQuery=on will display information about the parsed query.
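
For example, a match-all request with dismax would look something like this (host and core are illustrative):

  http://localhost:8983/solr/select?defType=dismax&q.alt=*:*&rows=10&debugQuery=on

With q.alt supplied, the q parameter can be left empty or omitted entirely.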


data-import problem

2011-10-23 Thread Radha Krishna Reddy
Hi,

I am trying to configure Solr on an AWS Ubuntu instance. I have MySQL on a
different server, so I created an SSH tunnel for MySQL on port 3309.

I downloaded the MySQL JDBC driver and copied it to the lib folder.

*I edited the example/solr/conf/solrconfig.xml*



  data-config.xml




*example/solr/conf/data-config.xml*


  
  

   

  


*started the server*

java -Djetty.port=80 -jar start.jar

*When I tried to import data:*

http:///solr/dataimport?command=fullimport

*I am getting the following response:*



05data-config.xmlfullimportidleThis response format is experimental.  It is likely to
change in the future.



Can someone help me with this? Also, where can I find the logs?

Thanks and Regards,
Radha Krishna.


Re: schema.xml bloat?

2011-10-23 Thread Erik Hatcher

On Oct 23, 2011, at 20:23 , Fred Zimmerman wrote:

> So, basically, yes, it is a real problem and there is no designed solution?

Hmmm problem?  Not terribly so, is it?  

Certainly I'm more for a de-XMLification of configuration myself though.  And 
we probably should bake-in all the basic field types so they aren't explicitly 
declared (but could still be overridden if desired).

> e.g. optional sub-schema files that can be turned off and on?

Hmmm... you can use XInclude stuff.  Not sure that gives you the "optional" 
part of things exactly, but there is also the ${sys.prop.name[:default_value]} 
syntax that can be used in the configuration to pull off conditional types of 
tricks in some cases.

Right, so no designed solution to the problem, I suppose.  It's what it is at 
this point.

I'm curious to hear more elaboration on the specifics of how this is a problem 
though.  Certainly there is much room for improvement in most things.  

Erik
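
To illustrate the XInclude route Erik mentions, a hedged sketch (the file name is made up, and this relies on the XML parser's XInclude support being available):

  <config xmlns:xi="http://www.w3.org/2001/XInclude">
    ...
    <xi:include href="extra-requesthandlers.xml"/>
    <!-- and the system-property syntax, with an optional default value -->
    <dataDir>${solr.data.dir:./solr/data}</dataDir>
  </config>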

> 
> On Sun, Oct 23, 2011 at 6:38 PM, Erik Hatcher wrote:
> 
>> 
>> On Oct 23, 2011, at 19:34 , Fred Zimmerman wrote:
>>> it seems from my limited experience thus far that as new data types are
>>> added, schema.xml will tend to become bloated with many different field
>> and
>>> fieldtype definitions.  Is this a problem in real life, and if so, what
>>> strategies are used to address it?
>> 
>> ... by keeping your schema lean and clean, only with what YOU need in it.
>> Granted, I'd personally keep all the built-in Solr primitive field types
>> defined even if I didn't use them, but there aren't very many and don't
>> really clutter things up.
>> 
>> Defined fields should ONLY be what you need for your application, and
>> generally that should be a tractable (and necessary) reasonably sized set.
>> 
>>   Erik
>> 



Re: schema.xml bloat?

2011-10-23 Thread Fred Zimmerman
So, basically, yes, it is a real problem and there is no designed solution?
 e.g. optional sub-schema files that can be turned off and on?

On Sun, Oct 23, 2011 at 6:38 PM, Erik Hatcher wrote:

>
> On Oct 23, 2011, at 19:34 , Fred Zimmerman wrote:
> > it seems from my limited experience thus far that as new data types are
> > added, schema.xml will tend to become bloated with many different field
> and
> > fieldtype definitions.  Is this a problem in real life, and if so, what
> > strategies are used to address it?
>
> ... by keeping your schema lean and clean, only with what YOU need in it.
>  Granted, I'd personally keep all the built-in Solr primitive field types
> defined even if I didn't use them, but there aren't very many and don't
> really clutter things up.
>
> Defined fields should ONLY be what you need for your application, and
> generally that should be a tractable (and necessary) reasonably sized set.
>
>Erik
>


questions on query format

2011-10-23 Thread Memory Makers
Hi,

I've spent quite some time reading up on the query format and can't seem to
solve this problem:

1. If send solr the following query:
  q={!lucene}profile_description:*

  I get what I would expect.

2. If send solr the following query:
  q=*:*

  I get nothing just:
   

Would appreciate some insight into what is going on.

Thanks.


Re: schema.xml bloat?

2011-10-23 Thread Erik Hatcher

On Oct 23, 2011, at 19:34 , Fred Zimmerman wrote:
> it seems from my limited experience thus far that as new data types are
> added, schema.xml will tend to become bloated with many different field and
> fieldtype definitions.  Is this a problem in real life, and if so, what
> strategies are used to address it?

... by keeping your schema lean and clean, only with what YOU need in it.  
Granted, I'd personally keep all the built-in Solr primitive field types 
defined even if I didn't use them, but there aren't very many and don't really 
clutter things up.

Defined fields should ONLY be what you need for your application, and generally 
that should be a tractable (and necessary) reasonably sized set.

Erik
 

schema.xml bloat?

2011-10-23 Thread Fred Zimmerman
Hi,

it seems from my limited experience thus far that as new data types are
added, schema.xml will tend to become bloated with many different field and
fieldtype definitions.  Is this a problem in real life, and if so, what
strategies are used to address it?

FredZ


Re: where is solr data import handler looking for my file?

2011-10-23 Thread Fred Zimmerman
Figured it out.  See step 12 in
http://business.zimzaz.com/wordpress/2011/10/how-to-clone-wikipedia-mirror-and-index-wikipedia-with-solr/.
 Thanks!

On Sun, Oct 23, 2011 at 1:31 PM, Erick Erickson wrote:

> I think you need to back up and state the problem you're trying to
> solve. Offhand, it looks as though you're trying to do something
> with DIH that it wasn't intended to do. But that's just a guess
> since the details of what you're trying to do are so sparse...
>
> Best
> Erick
>
> On Wed, Oct 19, 2011 at 10:49 PM, Fred Zimmerman 
> wrote:
> > Solr dataimport is reporting file not found when it looks for foo.xml.
> >
> > Where is it looking for /data? is this an url off the apache2/htdocs on
> the
> > server, or is it an URL within example/solr/...?
> >
> >
> >   >processor="XPathEntityProcessor"
> >stream="true"
> >forEach="/mediawiki/page/"
> >url="/data/foo.xml"
> >transformer="RegexTransformer,DateFormatTransformer"
> >>
> >
>


Re: Date boosting with dismax question

2011-10-23 Thread Erick Erickson
Define "not working". Show what you're getting and what you
expect to find. Show your data. Note that the example given
boosts on quite coarse dates, it *tends* to make documents
published in a particular *year* score higher.

You might review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Sun, Oct 23, 2011 at 11:08 PM, Craig Stadler  wrote:
> Yes I have, and I cannot get it to work. Perhaps something is mismatched in
> my setup's versions?
> I tried for 3 hours to get every example I could find to work.
>
> - Original Message - From: "Erick Erickson"
> 
> To: 
> Sent: Sunday, October 23, 2011 5:07 PM
> Subject: Re: Date boosting with dismax question
>
>
> Have you seen this?
>
> http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
>
> Best
> Erick
>
>
> On Sat, Oct 22, 2011 at 3:26 AM, Craig Stadler 
> wrote:
>>
>> Solr Specification Version: 1.4.0
>> Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06
>> 12:33:40
>> Lucene Specification Version: 2.9.1
>> Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25
>>
>> > precisionStep="6" positionIncrementGap="0"/>
>>
>> > stored="false" omitNorms="true" required="false"
>> omitTermFreqAndPositions="true" />
>>
>> I am using 'created' as the name of the date field.
>>
>> My dates are being populated as such :
>> 1980-01-01T00:00:00Z
>>
>> Search handler (solrconfig) :
>>
>> 
>> 
>> dismax
>> explicit
>> 0.1
>> name0^2 other ^1
>> name0^2 other ^1
>> 3
>> 3
>> *:*
>> 
>> 
>>
>> --
>>
>> Query :
>>
>> /solr/ftf/dismax/?q=libya
>> &debugQuery=off
>> &hl=true
>> &start=
>> &rows=10
>> --
>>
>> I am trying to factor in created to the SCORE. (boost) I have tried a
>> million ways to do this, no success. I know the dates are populating
>> correctly because I can sort by them. Can anyone help me implement date
>> boosting with dismax under this scenario???
>>
>> -Craig
>>
>
>


Re: questions about autocommit & committing documents

2011-10-23 Thread Erick Erickson
A full commit of all pending documents is performed whenever
the first trigger is reached.

So, maxdocs = 1000. Max time=1 minute.

Index a packet with 999 docs. Index another packet with
50 documents immediately after. One commit of 1049 documents
happens

Index a packet of 999 docs. Do nothing for a minute. One commit of
999 docs happens because of maxtime...

But I have to ask, "why do you care"? What high level problem
are you trying to handle?

Best
Erick
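
For reference, the block darul is asking about normally sits in solrconfig.xml like this; the values mirror Erick's example (1000 docs / one minute) and are only illustrative:

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>1000</maxDocs>   <!-- commit once 1000 docs are pending ... -->
      <maxTime>60000</maxTime>  <!-- ... or after 60000 ms, whichever comes first -->
    </autoCommit>
  </updateHandler>

Leaving out one of the two elements simply removes that trigger, so commits then happen only on the remaining condition (or on explicit commits).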

On Sun, Oct 23, 2011 at 3:03 PM, darul  wrote:
> Can someone explain to me the different use cases when both or only one of the
> autoCommit parameters is set?
>
> I really need to understand it.
>
> For example with these configurations :
>
> 
>  1
> 
>
> or
>
> 
>  1000
> 
>
> or
>
> 
>  1
>  1000
> 
>
> Thanks to everyone
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/questions-about-autocommit-committing-documents-tp1582487p3445607.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Update document field with solrj

2011-10-23 Thread Erick Erickson
You cannot update a single field in a document in Solr; you need to
replace the entire document. multiValued is irrelevant to this problem.

Or did I misunderstand your problem?

Best
Erick
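
As a hedged sketch of the full-document replacement Erick describes (SolrJ 3.x-era API; every field you need must be stored so the document can be rebuilt, and the URL, id, and field names are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class ReplaceDocument {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // 1. Fetch the existing document (only works if every field you need is stored).
        SolrDocument old = server.query(new SolrQuery("id:doc1")).getResults().get(0);

        // 2. Copy all stored fields into a new input document.
        SolrInputDocument doc = new SolrInputDocument();
        for (String name : old.getFieldNames()) {
            doc.addField(name, old.getFieldValues(name)); // getFieldValues covers multiValued fields
        }

        // 3. Overwrite the field you want to change, then re-add; Solr replaces the
        //    whole document that shares the same uniqueKey.
        doc.setField("author", "anaconda");
        server.add(doc);
        server.commit();
    }
}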

On Sun, Oct 23, 2011 at 1:32 PM, hadi  wrote:
> I want to edit a document field in Solr, for example edit the author name, so I
> use the following code in SolrJ:
>
> params.set("literal.author","anaconda")
>
> but the author field is multiValued="true" in the schema, and because of that
> "anaconda" does not replace the previous name but is appended to the end of the
> author values.
> Also, if I omit the multiValued attribute or set it to false, a bad request
> exception happens when re-indexing the file with the new author field. How can I
> solve this problem and delete or modify the previous document field in SolrJ? Or
> is there any config I am missing in the schema? Thanks.
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Update-document-field-with-solrj-tp3445488p3445488.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Date boosting with dismax question

2011-10-23 Thread Craig Stadler
Yes I have, and I cannot get it to work. Perhaps something is mismatched in
my setup's versions?

I tried for 3 hours to get every example I could find to work.

- Original Message - 
From: "Erick Erickson" 

To: 
Sent: Sunday, October 23, 2011 5:07 PM
Subject: Re: Date boosting with dismax question


Have you seen this?

http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents

Best
Erick


On Sat, Oct 22, 2011 at 3:26 AM, Craig Stadler  
wrote:

Solr Specification Version: 1.4.0
Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06
12:33:40
Lucene Specification Version: 2.9.1
Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25





I am using 'created' as the name of the date field.

My dates are being populated as such :
1980-01-01T00:00:00Z

Search handler (solrconfig) :



dismax
explicit
0.1
name0^2 other ^1
name0^2 other ^1
3
3
*:*



--

Query :

/solr/ftf/dismax/?q=libya
&debugQuery=off
&hl=true
&start=
&rows=10
--

I am trying to factor in created to the SCORE. (boost) I have tried a
million ways to do this, no success. I know the dates are populating
correctly because I can sort by them. Can anyone help me implement date
boosting with dismax under this scenario???

-Craig





Re: Date boosting with dismax question

2011-10-23 Thread Erick Erickson
Have you seen this?

http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents

Best
Erick
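
For the archives, the FAQ recipe adapted to Craig's 'created' field usually comes down to adding a boost function to the dismax handler; a hedged sketch (the constant 3.16e-11 is roughly 1/milliseconds-per-year, the ^10 weight is something to tune, and ms() needs a trie-based date field):

  <!-- in the dismax requestHandler defaults -->
  <str name="bf">recip(ms(NOW,created),3.16e-11,1,1)^10</str>

or per request:

  /solr/ftf/dismax/?q=libya&bf=recip(ms(NOW,created),3.16e-11,1,1)^10&debugQuery=on

With debugQuery=on, the extra FunctionQuery should show up in the score explain if it is actually being applied.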


On Sat, Oct 22, 2011 at 3:26 AM, Craig Stadler  wrote:
> Solr Specification Version: 1.4.0
> Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06
> 12:33:40
> Lucene Specification Version: 2.9.1
> Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25
>
>  precisionStep="6" positionIncrementGap="0"/>
>
>  stored="false" omitNorms="true"  required="false"
> omitTermFreqAndPositions="true" />
>
> I am using 'created' as the name of the date field.
>
> My dates are being populated as such :
> 1980-01-01T00:00:00Z
>
> Search handler (solrconfig) :
>
> 
> 
> dismax
> explicit
> 0.1
> name0^2 other ^1
> name0^2 other ^1
> 3
> 3
> *:*
> 
> 
>
> --
>
> Query :
>
> /solr/ftf/dismax/?q=libya
> &debugQuery=off
> &hl=true
> &start=
> &rows=10
> --
>
> I am trying to factor in created to the SCORE. (boost) I have tried a
> million ways to do this, no success. I know the dates are populating
> correctly because I can sort by them. Can anyone help me implement date
> boosting with dismax under this scenario???
>
> -Craig
>


Re: Can Solr handle large text files?

2011-10-23 Thread Erick Erickson
Also be aware that by default Solr is configured to only index the
first 10,000 tokens of each field.
See maxFieldLength in solrconfig.xml.

Best
Erick
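
For reference, the setting lives in solrconfig.xml (in the <indexDefaults>/<mainIndex> section of the 1.4/3.x example configs); the value below is just an illustration of lifting the cap:

  <maxFieldLength>2147483647</maxFieldLength>  <!-- effectively unlimited -->

Remember that documents have to be re-indexed after changing it.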

On Fri, Oct 21, 2011 at 7:34 PM, Peter Spam  wrote:
> Thanks for your note, Anand.  What was the maximum chunk size for you?  Could 
> you post the relevant portions of your configuration file?
>
>
> Thanks!
> Pete
>
> On Oct 21, 2011, at 4:20 AM, anand.ni...@rbs.com wrote:
>
>> Hi,
>>
>> I was also facing the issue of highlighting large text files. I applied
>> the solution proposed here and it worked. But I am getting the following error:
>>
>>
>> Basically 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where can I
>> get this file from? Its reference is present in browse.vm.
>>
>> 
>>  #if($response.response.get('grouped'))
>>    #foreach($grouping in $response.response.get('grouped'))
>>      #parse("hitGrouped.vm")
>>    #end
>>  #else
>>    #foreach($doc in $response.results)
>>      #parse("hit.vm")
>>    #end
>>  #end
>> 
>>
>>
>> HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or 
>> 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', 
>> cwd=C:\glassfish3\glassfish\domains\domain1\config 
>> java.lang.RuntimeException: Can't find resource 'hitGrouped.vm' in classpath 
>> or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', 
>> cwd=C:\glassfish3\glassfish\domains\domain1\config at 
>> org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268)
>>  at 
>> org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(SolrVelocityResourceLoader.java:42)
>>  at org.apache.velocity.Template.process(Template.java:98) at 
>> org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(ResourceManagerImpl.java:446)
>>  at
>>
>> Thanks & Regards,
>> Anand
>> Anand Nigam
>> RBS Global Banking & Markets
>> Office: +91 124 492 5506
>>
>>
>> -Original Message-
>> From: karsten-s...@gmx.de [mailto:karsten-s...@gmx.de]
>> Sent: 21 October 2011 14:58
>> To: solr-user@lucene.apache.org
>> Subject: Re: Can Solr handle large text files?
>>
>> Hi Peter,
>>
>> Highlighting in large text files cannot be fast without dividing the
>> original text into small pieces.
>> So take a look at
>> http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
>> and in
>> http://www.lucidimagination.com/blog/2010/09/16/2446/
>>
>> Which means that you should divide your files and use Result Grouping / 
>> Field Collapsing to list only one hit per original document.
>>
>> (xtf also would solve your problem "out of the box" but xtf does not use 
>> solr).
>>
>> Best regards
>>  Karsten
>>
>>  Original-Nachricht 
>>> Datum: Thu, 20 Oct 2011 17:59:04 -0700
>>> Von: Peter Spam 
>>> An: solr-user@lucene.apache.org
>>> Betreff: Can Solr handle large text files?
>>
>>> I have about 20k text files, some very small, but some up to 300MB,
>>> and would like to do text searching with highlighting.
>>>
>>> Imagine the text is the contents of your syslog.
>>>
>>> I would like to type in some terms, such as "error" and "mail", and
>>> have Solr return the syslog lines with those terms PLUS two lines of 
>>> context.
>>> Pretty much just like Google's highlighting.
>>>
>>> 1) Can Solr handle this?  I had extremely long query times when I
>>> tried this with Solr 1.4.1 (yes I was using TermVectors, etc.).  I
>>> tried breaking the files into 1MB pieces, but searching would be wonky
>>> => return the wrong number of documents (ie. if one file had a term 5
>>> times, and that was the only file that had the term, I want 1 result, not 5 
>>> results).
>>>
>>> 2) What sort of tokenizer would be best?  Here's what I'm using:
>>>
>>>   >> multiValued="false" termVectors="true" termPositions="true"
>>> termOffsets="true" />
>>>
>>>    
>>>      
>>>        
>>>        
>>>        >> generateWordParts="0" generateNumberParts="0" catenateWords="0" 
>>> catenateNumbers="0"
>>> catenateAll="0" splitOnCaseChange="0"/>
>>>      
>>>    
>>>
>>>
>>> Thanks!
>>> Pete
>>

Re: SOLRNET combine LocalParams with SolrMultipleCriteriaQuery?

2011-10-23 Thread Erick Erickson
Hmmm, this is the Java forum; you might get a faster response on the Solr .net
users list, especially since I don't find any reference to
SolrMultipleCriteriaQuery
in the Java 3.x code.

Best
Erick

On Fri, Oct 21, 2011 at 1:44 PM, Grüger, Joscha
 wrote:
> Hello,
>
> does anybody know how to combine SolrMultipleCriteriaQuery and LocalParams 
> (in SOLRnet)?
>
> I've tried things like this (don't worry about the bad code, it's just for
> testing)
>
>  var test = solr.Query(BuildQuery(parameters), new QueryOptions
>                    {
>                        FilterQueries = bq(),
>                        Facet = new FacetParameters
>                        {
>                            Queries = new[] {
>                new SolrFacetFieldQuery(new LocalParams {{"ex", "dt"}} + 
> "ju_success") , new SolrFacetFieldQuery(new LocalParams {{"ex", "dt"}} + 
> "dr_success")
>            }
>                        }
>                    });
> ...
>
>     public ICollection bq()
>            {
>                List i = new List();
>                i.Add(new LocalParams { { "tag", "dt" } } +   
> Query.Field("dr_success").Is("simple"));
>                List MultiListItems = new List();
>                var t = new SolrMultipleCriteriaQuery(i, "OR");
>                    MultiListItems.Add(t);
> return MultiListItems();
>    }
>
>
>
> What I try to do are multi-select-facets with a "OR" operator.
>
> Thanks for all the help!
>
> Grüger
>
>


Re: inconsistent results when faceting on multivalued field

2011-10-23 Thread Erick Erickson
I think the key here is you are a bit confused about what
the multiValued thing is all about. The fq clause says,
essentially, "restrict all my search results to the documents
where 1213206 occurs in sou_codeMetier."
That's *all* the fq clause does.

Now, by saying facet.field=sou_codeMetier you're asking Solr
to count the number of documents that exist for each unique
value in that field. A single document can be counted many
times. Each "bucket" is a unique value in the field.

On the other hand, saying
facet.query=sou_codeMetier:[1213206 TO
1213206] you're asking Solr to count all the documents
that make it through your query (*:* in this case) with
*any* value in the indicated range.

Facet queries really have nothing to do with filter queries. That is,
facet queries in no way restrict the documents that are returned,
they just indicate ways of counting documents into buckets

Best
Erick
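
To put that in request terms, using the parameters from Alain's mail (an editorial recap of the behavior Erick describes):

  fq=sou_codeMetier:1213206 & facet.field=sou_codeMetier
      -> restricts the result set to docs containing 1213206, then counts *every*
         value of sou_codeMetier found in those docs (hence the extra entries).

  fq=sou_codeMetier:1213206 & facet.query=sou_codeMetier:[1213206 TO 1213206]
      -> same restricted result set, but a single count of the docs matching the
         facet.query.

  adding f.sou_codeMetier.facet.prefix=1213206
      -> keeps facet.field but only reports values starting with 1213206.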

On Fri, Oct 21, 2011 at 10:01 AM, Darren Govoni  wrote:
> My interpretation of your results is that your fq found 1281 documents
> with the value 1213206 in the sou_codeMetier field. Of those results, 476 also
> had 1212104 as a value... and so on. Since ALL the results will have
> the field value in your fq, I would expect the "other" values to occur
> equally or less often in the result set, which they appear to do.
>
>
>
> On 10/21/2011 03:55 AM, Alain Rogister wrote:
>>
>> Pravesh,
>>
>> Not exactly. Here is the search I do, in more details (different field
>> name,
>> but same issue).
>>
>> I want to get a count for a specific value of the sou_codeMetier field,
>> which is multivalued. I expressed this by including a fq clause :
>>
>>
>> /select/?q=*:*&facet=true&facet.field=sou_codeMetier&fq=sou_codeMetier:1213206&rows=0
>>
>> The response (excerpt only):
>>
>> 
>> 
>> 1281
>> 476
>> 285
>> 260
>> 208
>> 171
>> 152
>> ...
>>
>> As you see, I get back both the expected results and extra results I would
>> expect to be filtered out by the fq clause.
>>
>> I can eliminate the extra results with a
>> 'f.sou_codeMetier.facet.prefix=1213206' clause.
>>
>> But I wonder if Solr's behavior is correct and how the fq filtering works
>> exactly.
>>
>> If I replace the facet.field clause with a facet.query clause, like this:
>>
>> /select/?q=*:*&facet=true&facet.query=sou_codeMetier:[1213206 TO
>> 1213206]&rows=0
>>
>> The results contain a single item:
>>
>> 
>> 1281
>> 
>>
>> The 'fq=sou_codeMetier:1213206' clause isn't necessary here and does not
>> affect the results.
>>
>> Thanks,
>>
>> Alain
>>
>> On Fri, Oct 21, 2011 at 9:18 AM, pravesh  wrote:
>>
>>> Could u clarify on below:
>
> When I make a search on facet.qua_code=1234567 ??
>>>
>>> Are u trying to say, when u fire a fresh search for a facet item, like;
>>> q=qua_code:1234567??
>>>
>>> This would fetch documents where the qua_code field contains either
>>> the term 1234567 OR both terms (1234567 & 9384738, and other terms).
>>> This is because it's a multivalued field, and hence if you look at the
>>> facet,
>>> it is shown for both terms.
>>>
> If I reword the query as 'facet.query=qua_code:1234567 TO 1234567', I
>>>
>>> only
>>> get the expected counts
>>>
>>> You will get facet for documents which have term 1234567 only
>>> (facet.query
>>> would apply to the facets,so as to which facet to be picked/shown)
>>>
>>> Regds
>>> Pravesh
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>>
>>> http://lucene.472066.n3.nabble.com/inconsistent-results-when-faceting-on-multivalued-field-tp3438991p3440128.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>
>


Question about dismax and score boost with date

2011-10-23 Thread Craig Stadler

Solr Specification Version: 1.4.0
Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06 
12:33:40

Lucene Specification Version: 2.9.1
Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25

precisionStep="6" positionIncrementGap="0"/>


stored="false" omitNorms="true"  required="false" 
omitTermFreqAndPositions="true" />


I am using 'created' as the name of the date field.

My dates are being populated as such :
1980-01-01T00:00:00Z

Search handler (solrconfig) :



dismax
explicit
0.1
name0^2 other ^1
name0^2 other ^1
3
3
*:*



--

Query :

/solr/ftf/dismax/?q=libya
&debugQuery=off
&hl=true
&start=
&rows=10
--

I am trying to factor in created to the SCORE. (boost) I have tried a 
million ways to do this, no success. I know the dates are populating 
correctly because I can sort by them. Can anyone help me implement date 
boosting with dismax under this scenario???


-Craig 



Re: OS Cache - Solr

2011-10-23 Thread Erick Erickson
Think about using cores rather than instances if you really must
have this kind of separation. Otherwise you might have much
better luck combining these into a single index.

Best
Erick

On Fri, Oct 21, 2011 at 7:07 AM, Sujatha Arun  wrote:
> Yes, it's the same; we have a base static schema, and wherever required we
> use dynamic fields.
>
> Regards,
> Sujatha
>
>
> On Thu, Oct 20, 2011 at 6:26 PM, Jaeger, Jay - DOT 
> wrote:
>
>> I wonder.  What if, instead of 200 instances, you had one instance, but
>> built a uniqueKey up out of whatever you have now plus whatever information
>> currently segregates the instances.  Then this would be much more
>> manageable.
>>
>> In other words, what is different about each of the 200 instances?  Is the
>> schema for each essentially the same, as I am guessing?
>>
>> JRJ
>>
>> -Original Message-
>> From: Sujatha Arun [mailto:suja.a...@gmail.com]
>> Sent: Thursday, October 20, 2011 12:21 AM
>> To: solr-user@lucene.apache.org
>> Cc: Otis Gospodnetic
>> Subject: Re: OS Cache - Solr
>>
>> Yes 200 Individual Solr Instances not solr cores.
>>
>> We get an avg response time of below 1 sec.
>>
>> The number of documents is not many for most of the instances; some of the
>> instances have about 5 lakh (500,000) documents on average.
>>
>> Regards
>> Sujahta
>>
>> On Thu, Oct 20, 2011 at 3:35 AM, Jaeger, Jay - DOT > >wrote:
>>
>> > 200 instances of what?  The Solr application with lucene, etc. per usual?
>> >  Solr cores? ???
>> >
>> > Either way, 200 seems to be very very very many: unusually so.  Why so
>> > many?
>> >
>> > If you have 200 instances of Solr in a 20 GB JVM, that would only be
>> 100MB
>> > per Solr instance.
>> >
>> > If you have 200 instances of Solr all accessing the same physical disk,
>> the
>> > results are not likely to be satisfactory - the disk head will go nuts
>> > trying to handle all of the requests.
>> >
>> > JRJ
>> >
>> > -Original Message-
>> > From: Sujatha Arun [mailto:suja.a...@gmail.com]
>> > Sent: Wednesday, October 19, 2011 12:25 AM
>> > To: solr-user@lucene.apache.org; Otis Gospodnetic
>> > Subject: Re: OS Cache - Solr
>> >
>> > Thanks ,Otis,
>> >
>> > This is our Solr Cache  Allocation.We have the same Cache allocation for
>> > all
>> > our *200+ instances* in the single Server.Is this too high?
>> >
>> > *Query Result Cache*:LRU Cache(maxSize=16384, initialSize=4096,
>> > autowarmCount=1024, )
>> >
>> > *Document Cache *:LRU Cache(maxSize=16384, initialSize=16384)
>> >
>> >
>> > *Filter Cache* LRU Cache(maxSize=16384, initialSize=4096,
>> > autowarmCount=4096, )
>> >
>> > Regards
>> > Sujatha
>> >
>> > On Wed, Oct 19, 2011 at 4:05 AM, Otis Gospodnetic <
>> > otis_gospodne...@yahoo.com> wrote:
>> >
>> > > Maybe your Solr Document cache is big and that's consuming a big part
>> of
>> > > that JVM heap?
>> > > If you want to be able to run with a smaller heap, consider making your
>> > > caches smaller.
>> > >
>> > > Otis
>> > > 
>> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> > > Lucene ecosystem search :: http://search-lucene.com/
>> > >
>> > >
>> > > >
>> > > >From: Sujatha Arun 
>> > > >To: solr-user@lucene.apache.org
>> > > >Sent: Tuesday, October 18, 2011 12:53 AM
>> > > >Subject: Re: OS Cache - Solr
>> > > >
>> > > >Hello Jan,
>> > > >
>> > > >Thanks for your response and  clarification.
>> > > >
>> > > >We are monitoring the JVM cache utilization and we are currently using
>> > > about
>> > > >18 GB of the 20 GB assigned to JVM. Out total index size being abt
>> 14GB
>> > > >
>> > > >Regards
>> > > >Sujatha
>> > > >
>> > > >On Tue, Oct 18, 2011 at 1:19 AM, Jan Høydahl 
>> > > wrote:
>> > > >
>> > > >> Hi Sujatha,
>> > > >>
>> > > >> Are you sure you need 20Gb for Tomcat? Have you profiled using
>> > JConsole
>> > > or
>> > > >> similar? Try with 15Gb and see how it goes. The reason why this is
>> > > >> beneficial is that you WANT your OS to have available memory for
>> disk
>> > > >> caching. If you have 17Gb free after starting Solr, your OS will be
>> > able
>> > > to
>> > > >> cache all index files in memory and you get very high search
>> > > performance.
>> > > >> With your current settings, there is only 12Gb free for both caching
>> > the
>> > > >> index and for your MySql activities.  Chances are that when you
>> backup
>> > > >> MySql, the cached part of your Solr index gets flushed from disk
>> > caches
>> > > and
>> > > >> need to be re-cached later.
>> > > >>
>> > > >> How to interpret memory stats vary between OSes, and seing 163Mb
>> free
>> > > may
>> > > >> simply mean that your OS has used most RAM for various caches and
>> > > paging,
>> > > >> but will flush it once an application asks for more memory. Have you
>> > > seen
>> > > >> http://wiki.apache.org/solr/SolrPerformanceFactors ?
>> > > >>
>> > > >> You should also slim down your index maximally by setting
>> stored=false
>> > > and
>> > > >> indexed=false wherever possible. I would also upgrade to a more
>> > curr

Re: how to handle large relational data in Solr

2011-10-23 Thread Erick Erickson
In addition to Otis' suggestion, think about using multivalued fields
with a position increment gap of,
say, 100 (assuming each accessory value has fewer than 100 terms). Then
proximity searches with a slop < 100 (e.g. "red swing"~90) will not match
across your multiple
entries.

If this is clear as mud, write back with what you've tried and maybe we can help

Best
Erick
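
A hedged sketch of that schema arrangement (type and field names are made up): give the field a positionIncrementGap and index each accessory as one value of the multivalued field.

  <fieldType name="text_acc" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="accessory" type="text_acc" indexed="true" stored="true" multiValued="true"/>

A query like accessory:"red swing"~90 then matches within a single accessory value but cannot straddle two values, because the 100-position gap between values exceeds the slop.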

On Thu, Oct 20, 2011 at 7:23 PM, Jonathan Carothers
 wrote:
> Actually, that's the root of my concern.  It looks like each product will
> average ~20,000 associated accessories, still workable, but starting to look
> painful.  Coming back the other way, I would guess each accessory would be
> associated with 100 products on average.
>
> Given that there would be searchable fields in both the product and accessory 
> data, I assume I would have to either split them  into separate indexes and 
> merge the results, or have one document per product/accessory combo so that I 
> don't get a mix of accessories matching the search term.  For example, if a 
> product had two accessories, one with the description of "Blue Swing" and 
> another with "Red Ball" and I did a search for "Red Swing" it would rank 
> about the same as a document that actually had a "Red Swing".
>
> So it sounds like you are suggesting the external map, in which case is there 
> a good way to merge the two searches?  Basically on search on product 
> attributes and a second search on the attributes of related accessories?
>
> many thanks,
> Jonathan
> 
> From: Robert Stewart [bstewart...@gmail.com]
> Sent: Thursday, October 20, 2011 12:05 PM
> To: solr-user@lucene.apache.org
> Subject: Re: how to handle large relational data in Solr
>
> If your "documents" are products, then 100,000 documents is a pretty small 
> index for solr.  Do you know approximately how many accessories are related 
> to each product on average?  If the # is relatively small (around 100 or less), 
> then it should be ok to create product documents with all the related 
> accessories as fields on the document, something like:
>
> 
>        PRODUCT_ID
>        PRODUCT_NAME
>        accessory one
>        accessory two
>        
>        accessory N
> 
>
>
> And then you can search for products by accessory, and show accessory facets 
> over products, etc.
>
> Even if # of accessories per product is large (1000 or more), you can still 
> do it this way, but it may be better to store some small accessory ID as 
> integers instead of larger names, and maybe use some external mapping to 
> resolve names for search and display.
>
> Bob
>
>
> On Oct 20, 2011, at 11:08 AM, Jonathan Carothers wrote:
>
>> Agreed, this will just be a read only view of the existing database for 
>> search purposes.  Sorry for the confusion.
>> 
>> From: Brandon Ramirez [brandon_rami...@elementk.com]
>> Sent: Thursday, October 20, 2011 10:50 AM
>> To: solr-user@lucene.apache.org
>> Subject: RE: how to handle large relational data in Solr
>>
>> I would not recommend removing your relational database altogether.  You 
>> should treat that as your system of record.  By replacing it, you are 
>> forcing Solr to store the unmodified value for everything even when not 
>> needed.  You also lose normalization.   And if you ever need to add some 
>> data to your system that isn't search-related, you have no choice but to add 
>> it to your search index.
>>
>>
>> Brandon Ramirez | Office: 585.214.5413 | Fax: 585.295.4848
>> Software Engineer II | Element K | www.elementk.com
>>
>>
>> -Original Message-
>> From: Jonathan Carothers [mailto:jonathan.caroth...@amentra.com]
>> Sent: Thursday, October 20, 2011 10:12 AM
>> To: solr-user@lucene.apache.org
>> Subject: how to handle large relational data in Solr
>>
>> All,
>>
>> We are attempting to convert a fairly large relational database into Solr 
>> index(es).
>>
>> There are ~100,000 products with ~1,000,000 accessories that can be related 
>> to any number of the products.  So if I include the search terms and the 
>> relationships in the same index, we're looking at a pretty huge index.
>>
>> If we break it out into three indexes, one for the product search, one for 
>> the accessories search, and one for their relationship, is there a good way 
>> to merge the results?
>>
>> Is there a better way to structure the indexes?
>>
>> We will have a relational database available if it makes sense to do some 
>> sort of a hybrid approach.
>>
>> many thanks,
>> Jonathan
>>
>
>


Re: Question about near query order

2011-10-23 Thread Erick Erickson
Just to chime in here... You will get different results
for "A B"~2 and "B A"~2. In the simple two-term case,
changing the order requires an extra move(s). There's
a very good explanation of this in Lucene In Action II.

Best
Erick
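
A quick worked example of that extra-move cost (values from the usual Lucene slop arithmetic; worth verifying on your own data): for a document containing "A B" (A at position 0, B at position 1), "A B"~0 matches, but the reversed query "B A" only matches once the slop reaches 2, because swapping two adjacent terms costs two position moves. So "B A"~1 misses that document while "B A"~2 hits it.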

On Thu, Oct 20, 2011 at 3:35 PM, Jason, Kim  wrote:
> Which gives better performance: setting inOrder=false in solrconfig.xml,
> or querying with "A B"~1 AND "B A"~1, if there is a performance difference?
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Question-about-near-query-order-tp3427312p3437701.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Dismax and phrases

2011-10-23 Thread Erick Erickson
Hmmm dismax is, indeed, different. Note that dismax doesn't respect
the default operator at all, so don't be misled there.

Could you paste the debug output for both the queries? Perhaps something
will jump out at us.

Best
Erick

On Thu, Oct 20, 2011 at 11:08 AM, Hyttinen Lauri  wrote:
> Thank you Otis for the answer.
>
> I've played around with the solr admin query interface and I've managed to
> confuse myself even more.
> If I query without the quotes solr seems to form two parsedqueries
>
> +((DisjunctionMaxQuery(( -first word stuff- )) DisjunctionMaxQuery(( -second
> word stuff- ))
>
> and then based on the query give out results which have -both- words.
> Default operator is OR in schema.xml.
>
> With quotes the query is different, with only one DisjunctionMaxQuery in the
> parsed query, but the results (of which there are more than double) include
> pages
> which have only one of the words (granted, these results are ranked much lower
> than the ones with both words).
>
> I set qs to 0. (and I even played with pf and ps before commenting them out
> since they relate to automaticed phrased queries?)
>
> Best regards,
> Lauri
>
> PS. I am not unhappy with the results so to speak but perplexed and don't
> know how to explain this number discrepancy to project members other than
> "Dismax is different."
>
>
> On 10/19/2011 04:28 PM, Otis Gospodnetic wrote:
>>
>> Lauri,
>>
>> Start with adding&debugQuery=true to your URL calls to Solr and look at
>> how the queries are getting rewritten to understand what is going on.  What
>> you are seeing is actually expected, so if you want your phrase query to be
>> a strict phrase query, just use standard request handler, not dismax.
>>
>> Otis
>> 
>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> Lucene ecosystem search :: http://search-lucene.com/
>>
>>
>>> 
>>> From: Hyttinen Lauri
>>> To: solr-user@lucene.apache.org
>>> Sent: Wednesday, October 19, 2011 5:02 AM
>>> Subject: Dismax and phrases
>>>
>>> Hello,
>>>
>>> I've inherited a solr-lucene project which I continue to develop. This
>>> particular SOLR (1.4.1) uses dismax for the queries but I am getting some
>>> results that I do not understand. Mainly when I search for two terms I get
>>> some results however when I put quotes around the two terms I get a lot more
>>> results which goes against my understanding of what should happen ie. a
>>> lesser set of results. Where should I start digging for the answer?
>>> solrconfiq.xql or some other place?
>>>
>>> Best regards,
>>> Lauri Hyttinen
>>>
>>>
>>>
>
>
> --
> Lauri Hyttinen
> Tietopalvelusuunnittelija
> Tilastokeskus
> Yksikkö
> Käyntiosoite: Työpajankatu 13, 00580 Helsinki
> Postiosoite: PL 3 A, 00022 Tilastokeskus
> puh. 09 1734 
> lauri.hytti...@tilastokeskus.fi
> www.tilastokeskus.fi
>
>


Re: where is solr data import handler looking for my file?

2011-10-23 Thread Erick Erickson
I think you need to back up and state the problem you're trying to
solve. Offhand, it looks as though you're trying to do something
with DIH that it wasn't intended to do. But that's just a guess
since the details of what you're trying to do are so sparse...

Best
Erick

On Wed, Oct 19, 2011 at 10:49 PM, Fred Zimmerman  wrote:
> Solr dataimport is reporting file not found when it looks for foo.xml.
>
> Where is it looking for /data? is this an url off the apache2/htdocs on the
> server, or is it an URL within example/solr/...?
>
>
>                  processor="XPathEntityProcessor"
>                stream="true"
>                forEach="/mediawiki/page/"
>                url="/data/foo.xml"
>                transformer="RegexTransformer,DateFormatTransformer"
>                >
>


Re: Find Documents with field = maxValue

2011-10-23 Thread Erick Erickson
Right, but consider the general case. You could potentially return every
document in your index in a single packet with this functionality. I suspect
this is an edge case, so you'll have to either
1> implement the two-or-more query solution, or
2> write your own component that investigates the terms in the field
 in question and accomplishes your task.

Best
Erick

On Wed, Oct 19, 2011 at 2:40 PM, Alireza Salimi
 wrote:
> What I'm looking for is to do everything in a single shot in Solr.
> I'm not even sure if it's possible or not.
> Finding the max value and then running another query is NOT my ideal
> solution.
>
> Thanks everybody
>
>
> On Tue, Oct 18, 2011 at 6:28 PM, Sujit Pal  wrote:
>
>> Hi Alireza,
>>
>> Would this work? Sort the results by age desc, then loop through the
>> results as long as age == age[0].
>>
>> -sujit
>>
>> On Tue, 2011-10-18 at 15:23 -0700, Otis Gospodnetic wrote:
>> > Hi,
>> >
>> > Are you just looking for:
>> >
>> > age:
>> >
>> > This will return all documents/records where age field is equal to target
>> age.
>> >
>> > But maybe you want
>> >
>> > age:[0 TO ]
>> >
>> > This will include people aged from 0 to target age.
>> >
>> > Otis
>> > 
>> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> > Lucene ecosystem search :: http://search-lucene.com/
>> >
>> >
>> > >
>> > >From: Alireza Salimi 
>> > >To: solr-user@lucene.apache.org
>> > >Sent: Tuesday, October 18, 2011 10:15 AM
>> > >Subject: Re: Find Documents with field = maxValue
>> > >
>> > >Hi Ahmet,
>> > >
>> > >Thanks for your reply, but I want ALL documents with age = max_age.
>> > >
>> > >
>> > >On Tue, Oct 18, 2011 at 9:59 AM, Ahmet Arslan 
>> wrote:
>> > >
>> > >>
>> > >>
>> > >> --- On Tue, 10/18/11, Alireza Salimi 
>> wrote:
>> > >>
>> > >> > From: Alireza Salimi 
>> > >> > Subject: Find Documents with field = maxValue
>> > >> > To: solr-user@lucene.apache.org
>> > >> > Date: Tuesday, October 18, 2011, 4:10 PM
>> > >> > Hi,
>> > >> >
>> > >> > It might be a naive question.
>> > >> > Assume we have a list of Document, each Document contains
>> > >> > the information of
>> > >> > a person,
>> > >> > there is a numeric field named 'age', how can we find those
>> > >> > Documents whose
>> > >> > *age* field
>> > >> > is *max(age) *in one query.
>> > >>
>> > >> May be http://wiki.apache.org/solr/StatsComponent?
>> > >>
>> > >> Or sort by age?  q=*:*&start=0&rows=1&sort=age desc
>> > >>
>> > >
>> > >
>> > >
>> > >--
>> > >Alireza Salimi
>> > >Java EE Developer
>> > >
>> > >
>> > >
>>
>>
>
>
> --
> Alireza Salimi
> Java EE Developer
>


Re: use lucene to create index(with synonym) and solr query index

2011-10-23 Thread Erick Erickson
I'm not quite sure what you're asking, but the values returned for
documents to the client are the *stored* values, not the indexed
values. So your synonyms will never be returned as part of a
document.

Does that help?

Best
Erick

On Wed, Oct 19, 2011 at 4:23 AM, cmd  wrote:
> 1. Use Lucene to create the index (with synonyms).
> 2. Configure Solr to enable the synonym functionality.
> 3. Use Solr to query the Lucene index, but the result is missing the synonym word.
> Why? And how can I make them work with each other? Thanks!
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/use-lucene-to-create-index-with-synonym-and-solr-query-index-tp3433124p3433124.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: multiple document types in a core

2011-10-23 Thread Erick Erickson
Yes, stored fields are placed verbatim for every doc. But I wonder
at the utility of trying to share stored information. The stored
info is put in certain files in the index, see:
http://lucene.apache.org/java/3_0_2/fileformats.html#file-names

and the files that store data are pretty much irrelevant to searching,
the data in them is only referenced when assembling the document
for return. So by adding this complexity you'll be saving a bit
on file transfers when replicating your index, but not much else.

Is it worth it? If so, why?

Best
Erick

On Mon, Oct 17, 2011 at 11:07 AM, lee carroll
 wrote:
> Just as a follow up
>
> it looks like stored fields are stored verbatim for every doc.
>
> hotel index and store dest attributes
> index size: 131M
> number of records 49147
>
> hotel index only dest attributes
>
> index size: 111m
> number of records 49147
>
>
> ~400 chars(bytes) of destination data * 49147 (number of hotel docs) = ~19m
>
> basically everything is being stored
>
> No difference in time to index (very rough and not scientific :-) )
>
> So it does seem an ok strategy to denormalise docs with index fields
> but normalise with stored fields ?
> Or have i missed some problems with this ?
>
> cheers lee c
>
>
>
> On 16 October 2011 11:54, lee carroll  wrote:
>> Hi Chris thanks for the response
>>
>>> It's an inverted index, so *tems* exist once (per segment) and those terms
>>> "point" to the documents -- so having the same terms (in the same fields)
>>> for multiple types of documents in one index is going to take up less
>>> overall space then having distinct collections for each type of document.
>>
>> I'm not asking about the indexed terms but rather the stored values.
>> By having two doc types are we gaining anything by "storing"
>> attributes only for that doc type
>>
>> cheers lee c
>>
>


Re: Solr indexing plugin: skip single faulty document?

2011-10-23 Thread Erick Erickson
Some work has been done in this general area, see SOLR-445. That
might give you some pointers

Best
Erick

On Mon, Oct 17, 2011 at 11:00 AM, samuele.mattiuzzo  wrote:
> Hi all, as far as I know, when Solr finds a faulty document (inside an XML file
> containing, let's say, 1000 docs) it skips the whole file and the indexing
> process exits with an exception (am I correct?)
>
> I'm using a custom indexing plugin, and I can trap the exception. Instead of
> using "default" values if that exception is raised, I would like to skip the
> document raising the error (example: sometimes I try to insert a string
> inside a "string" field, but Solr exits saying it's expecting a multiValued
> field... I guess it's because of some ASCII chars within the text, something
> like \n or the sort...), maybe logging it somewhere, and pass on to the next one.
> We're indexing millions of them, and we don't care much if we lose 10-20%
> of them, so the best solution is to skip the single faulty doc and continue
> with the rest.
>
> I guess I have to work on the super.processAdd() call, but I don't know
> where I can find info about it. Can anybody help me? Is there a book about
> advanced Solr plugin development I could read?
>
> Thanks!
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-indexing-plugin-skip-single-faulty-document-tp3427646p3427646.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Selective Result Grouping

2011-10-23 Thread Martijn v Groningen
> The current grouping functionality using group.field is basically
> all-or-nothing: all documents will be grouped by the field value or none
> will. So there would be no way to, for example, collapse just the videos or
> images like they do in google.
When using the group.field option, values must be the same, otherwise
they don't get grouped together. Maybe fuzzy grouping would be nice.
Grouping videos and images based on mimetype should be easy, right?
Videos have a mimetype that starts with video/ and images have a
mimetype that starts with image/. Storing the mime type's subtype and
type in separate fields and grouping on the type field would do the job.
Of course you need to know the mimetype during indexing, but
solutions like Apache Tika can do that for you.

-- 
Kind regards,

Martijn van Groningen
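
A hedged sketch of what that looks like in practice (field names are made up): at index time split the mime type, e.g. video/mp4 -> mime_type=video, mime_subtype=mp4 and image/png -> mime_type=image, mime_subtype=png, then group on the coarse field at query time:

  ...&group=true&group.field=mime_type&group.limit=3

Each group then collapses to its top few documents, so all videos fall into one group and all images into another.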


Re: questions about autocommit & committing documents

2011-10-23 Thread darul
Can someone explain to me the different use cases when both or only one of the
autoCommit parameters is set?

I really need to understand it.

For example with these configurations :

  
  1 


or 

  
  1000  


or 

  
  1
  1000  


Thanks to everyone

--
View this message in context: 
http://lucene.472066.n3.nabble.com/questions-about-autocommit-committing-documents-tp1582487p3445607.html
Sent from the Solr - User mailing list archive at Nabble.com.


Update document field with solrj

2011-10-23 Thread hadi
I want to edit a document field in Solr, for example edit the author name, so I
use the following code in SolrJ:

params.set("literal.author","anaconda")

but the author field is multiValued="true" in the schema, and because of that
"anaconda" does not replace the previous name but is appended to the end of the
author values.
Also, if I omit the multiValued attribute or set it to false, a bad request
exception happens when re-indexing the file with the new author field. How can I
solve this problem and delete or modify the previous document field in SolrJ? Or
is there any config I am missing in the schema? Thanks.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Update-document-field-with-solrj-tp3445488p3445488.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Implement Custom Soundex

2011-10-23 Thread Momo..Lelo ..

thank you for this information. 

> Subject: Re: Implement Custom Soundex
> From: p...@hoplahup.net
> Date: Sun, 23 Oct 2011 10:58:49 +0200
> To: solr-user@lucene.apache.org
> 
> Momo,
> 
> if you have the conversion from text to tokens, then all you need to do is
> implement a custom analyzer, deploy it inside the Solr webapp, and then plug it
> into the schema.
> 
> Is that the part that is hard?
> I thought the wiki was helpful there, but maybe some other issue is holding you up.
> One zoology of such analyzers is at:
>   http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
> 
> If that is the issue, here's a one sentence explanation: if you have a new 
> analyzer you want to declare a new field-type and field with that analyzer; 
> queries should be going through it as well as indexing. Matching word A with 
> word B will then happen if word A and B are converted by your analyzer to the 
> same token (this is how cat and cats match when using the PorterStemmer for 
> example).
> 
> paul
> 
> 
> On 16 Oct 2011, at 14:09, Momo..Lelo .. wrote:
> 
> > 
> > Dear Gora, 
> > 
> > Thank you for the quick response. 
> > 
> > Actually I 
> > need to do Soundex for Arabic language. The code is already done in Java. 
> > But I 
> > couldn't understand how can I implement it as Solr filter. 
> > 
> > Regards,
> > 
> > 
> > 
> >> From: g...@mimirtech.com
> >> Date: Sun, 16 Oct 2011 16:19:48 +0530
> >> Subject: Re: Implement Custom Soundex
> >> To: solr-user@lucene.apache.org
> >> 
> >> 2011/10/16 Momo..Lelo .. :
> >>> 
> >>> Dear,
> >>> 
> >>> Does anyone there has an experience of developing a custom Soundex.
> >>> 
> >>> If you have an experience doing this and can offer some help and share 
> >>> experience I'd really appreciate it.
> >> 
> >> I presume that this is in the context of Solr, and spell-checking.
> >> We did this as an exercise for Indian-language words transliterated
> >> into English, hooking into the open-source spell-checking library,
> >> aspell, which provided us  with a soundex-like algorithm (the actual
> >> algorithm is quite different, but works better than soundex, at
> >> least for our use case). We were quite satisfied with the results,
> >> though unfortunately this never went into production.
> >> 
> >> Would be glad to help, though I am going to be really busy the
> >> next few days. Please do provide us with more details on your
> >> requirements.
> >> 
> >> Regards,
> >> Gora
> >   
> 
  

Re: Implement Custom Soundex

2011-10-23 Thread Paul Libbrecht
Momo,

if you have the conversion from text to tokens, then all you need to do is
implement a custom analyzer, deploy it inside the Solr webapp, and then plug it
into the schema.

Is that the part that is hard?
I thought the wiki was helpful there, but maybe some other issue is holding you up.
One zoology of such analyzers is at:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

If that is the issue, here's a one sentence explanation: if you have a new 
analyzer you want to declare a new field-type and field with that analyzer; 
queries should be going through it as well as indexing. Matching word A with 
word B will then happen if word A and B are converted by your analyzer to the 
same token (this is how cat and cats match when using the PorterStemmer for 
example).

paul
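
A hedged skeleton of that wiring for the Lucene/Solr 3.x line (package, class names, and the encode() call are placeholders standing in for Momo's existing Java code):

  // --- ArabicSoundexFilter.java ---
  package com.example.analysis;

  import java.io.IOException;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public final class ArabicSoundexFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public ArabicSoundexFilter(TokenStream input) { super(input); }

    @Override
    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) return false;
      // replace the token text with its soundex code (your existing routine)
      String code = MyArabicSoundex.encode(termAtt.toString());
      termAtt.setEmpty().append(code);
      return true;
    }
  }

  // --- ArabicSoundexFilterFactory.java ---
  package com.example.analysis;

  import org.apache.lucene.analysis.TokenStream;
  import org.apache.solr.analysis.BaseTokenFilterFactory;

  public class ArabicSoundexFilterFactory extends BaseTokenFilterFactory {
    @Override
    public TokenStream create(TokenStream input) { return new ArabicSoundexFilter(input); }
  }

The factory is what a <filter class="com.example.analysis.ArabicSoundexFilterFactory"/> line inside a new fieldType in schema.xml points at; use that fieldType on both the indexing and the query side so, as Paul says, both ends produce the same tokens.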


On 16 Oct 2011, at 14:09, Momo..Lelo .. wrote:

> 
> Dear Gora, 
> 
> Thank you for the quick response. 
> 
> Actually I 
> need to do Soundex for Arabic language. The code is already done in Java. But 
> I 
> couldn't understand how can I implement it as Solr filter. 
> 
> Regards,
> 
> 
> 
>> From: g...@mimirtech.com
>> Date: Sun, 16 Oct 2011 16:19:48 +0530
>> Subject: Re: Implement Custom Soundex
>> To: solr-user@lucene.apache.org
>> 
>> 2011/10/16 Momo..Lelo .. :
>>> 
>>> Dear,
>>> 
>>> Does anyone there has an experience of developing a custom Soundex.
>>> 
>>> If you have an experience doing this and can offer some help and share 
>>> experience I'd really appreciate it.
>> 
>> I presume that this is in the context of Solr, and spell-checking.
>> We did this as an exercise for Indian-language words transliterated
>> into English, hooking into the open-source spell-checking library,
>> aspell, which provided us  with a soundex-like algorithm (the actual
>> algorithm is quite different, but works better than soundex, at
>> least for our use case). We were quite satisfied with the results,
>> though unfortunately this never went into production.
>> 
>> Would be glad to help, though I am going to be really busy the
>> next few days. Please do provide us with more details on your
>> requirements.
>> 
>> Regards,
>> Gora
> 



help needed on solr-uima integration

2011-10-23 Thread Xue-Feng Yang
Hi,

After googling online, some parts of the "puzzle" are still missing. The best
thing would be to find a simple example showing the whole process. Is there any
example like apache-uima/examples/descriptors/tutorial/ex3 RoomNumber and DateTime
integrated into Solr? In particular, how do I feed "text" into Solr for indexing
when it has at least two fields?

Thanks,

Xue-Feng