date:20110421

Hi Kevin,

I think you made OS06Y the uniqueKey, right?
So, in entity 1 you specify values for it, but in entity 2 you do so as
well. 
I am not absolutely sure about this, but: It seems like your two entities
create two documents and the second will overwrite the first.

Have a look at this page:
http://wiki.apache.org/solr/DIHQuickStart#Index_data_from_multiple_tables_into_Solr

I think it will help you in rewriting your queries to fit your usecase.

Regards,
Em

--
View this message in context: 
http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-database-using-Solr-DIH-tp2845318p2846296.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Apache Spam Filter Blocking Messages

This really helps at the mailinglists. 
If you send your mails with Thunderbird, be sure to check that you enforce
plain-text-emails. If not, it will often send HTML-mails.

Regards,
Em


Marvin Humphrey wrote:
 
 On Thu, Apr 21, 2011 at 12:30:29AM -0400, Trey Grainger wrote:
 (FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL
 
 Note the HTML_MESSAGE in the list of things SpamAssassin didn't like.
 
 Apparently I sound like spam when I write perfectly good English and
 include
 some xml and a link to a jira ticket in my e-mail (I tried a couple
 different variations).  Anyone know a way around this filter, or should I
 just respond to those involved in the e-mail chain directly and avoid the
 mailing list?
 
 Send plain text email instead of HTML.  That solves the problem 99% of the
 time.
 
 Marvin Humphrey
 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-Spam-Filter-Blocking-Messages-tp2845854p2846304.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to return score without using _val_

Hi,

I agree with Yonik here - I do not understand what you would like to do as
well.
But some additional note from my side:
Your FQs never influences the score! Of course you can specify the same
query twice, once as a filter - query and once as a regular query but I do
not see the reason to do so. It sounds like unnecessary effort without a
win. 

Regards,
Em 


Bill Bell wrote:
 
 I would like to influence the score but I would rather not mess with the
 q=
 field since I want the query to dismax for Q.
 
 Something like:
 
 fq={!type=dismax qf=$qqf v=$qspec}
 fq={!type=dismax qt=dismaxname v=$qname}
 q=_val_:{!type=dismax qf=$qqf  v=$qspec} _val_:{!type=dismax
 qt=dismaxname v=$qname}
 
 Is there a way to do a filter and add the FQ to the score by doing it
 another way? 
 
 Also does this do multiple queries? Is this the right way to do it?
 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-return-score-without-using-val-tp2841443p2846317.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: entity name issue

Hi Tjong,

seems like your XML was invalid.

Try the following and compare it to your original config:

entity name=e_a query=select myschema.table_a.aid as id,
myschema.table_a.aid as a_aid from myschema.table_a where
'${dataimporter.request.clean}' != 'false' and myschema.table_a.aid
entity name=e_b query=select col as c_col from myschema.table_b 
where
myschema.table_b.aid='${ea.a_aid}'/
/entity  

Regards,
Em

--
View this message in context: 
http://lucene.472066.n3.nabble.com/entity-name-issue-tp2843812p2846326.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: The issue of import data from database using Solr DIH

2011-04-21 Thread Kevin Xiang

Thanks Em.
Yes, OS06Y is the uniqueKey.
Table1 and Table2 is parallel in my example.
In the Url:
http://wiki.apache.org/solr/DIHQuickStart#Index_data_from_multiple_table
s_into_Solr
The tables don't have parallel relations in the above URL example
I want to know that can solr implement the case?
1.Get data from database table1;
2.Get data from database table2;
3.merge the fields of table1 and table2;

The configuration of db-data-config.xml is the following:
document name=allperf
entity name=PerformanceData1
dataSource=getTrailingTotalReturnForMonthEnd1 query=SELECT Perfo
rmanceId,Trailing1MonthReturn,Trailing2MonthReturn,Trailing3MonthReturn,
FROM  Table1
field column=PerformanceId name=OS06Y /
field column=Trailing1MonthReturn name=PM004 /
field column=Trailing2MonthReturn name=PM133 /
field column=Trailing3MonthReturn name=PM006 /
/entity
entity name=PerformanceData2
dataSource=getTrailingTotalReturnForMonthEnd2 query=SELECT Performan
ceId,Trailing10YearReturn,Trailing15YearReturn,TrailingYearToDateReturn,
SinceInceptionReturn FROM Table2
field column=PerformanceId name=OS06Y /
field column=Trailing10YearReturn name=PM00I /
field column=Trailing15YearReturn name=PM00K /
field column=TrailingYearToDateReturn name=PM00A /
field column=SinceInceptionReturn name=PM00M /
/entity
/document

Because I don't want to get one id and data from table1 and then get the
data by id from table2,it may met performance issue.

-Original Message-
From: Em [mailto:mailformailingli...@yahoo.de] 
Sent: Thursday, April 21, 2011 4:38 PM
To: solr-user@lucene.apache.org
Subject: Re: The issue of import data from database using Solr DIH

Hi Kevin,

I think you made OS06Y the uniqueKey, right?
So, in entity 1 you specify values for it, but in entity 2 you do so as
well. 
I am not absolutely sure about this, but: It seems like your two
entities
create two documents and the second will overwrite the first.

Have a look at this page:
http://wiki.apache.org/solr/DIHQuickStart#Index_data_from_multiple_table
s_into_Solr

I think it will help you in rewriting your queries to fit your usecase.

Regards,
Em

--
View this message in context:
http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-databas
e-using-Solr-DIH-tp2845318p2846296.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: stemming filter analyzers, any favorites?

Hi Robert,

we often ran into the same issue with stemmers. This is why we created more
than one field, each field with different stemmers. It adds some overhead
but worked quite well.

Regarding your off-topic-question:
Look at the debugging-output of your searches. Sometimes you configured your
tools, especially the WDF, wrong and the queryParser creates an unexpected
result which leads to unmatched but still relevant documents.

Please, show us your debugging-output and the field-definition so that we
can provide you some help!

Regards,
Em


Robert Petersen-3 wrote:
 
 I have been doing that, and for Bags example the trailing 's' is not being
 removed by the Kstemmer so if indexing the word bags and searching on bag
 you get no matches.  Why wouldn't the trailing 's' get stemmed off? 
 Kstemmer is dictionary based so bags isn't in the dictionary?   That
 trailing 's' should always be dropped no?  That seems like it would be
 better, we don't want to make synonyms for basic use cases like this.  I
 fear I will have to return to the Porter stemmer.  Are there other better
 ones is my main question.
 
 Off topic secondary question: sometimes I am puzzled by the output of the
 analysis page.  It seems like there should be a match, but I don't get the
 results during a search that I'd expect...  
 
 Like in the case if the WordDelimiterFilterFactory splits up a term into a
 bunch of terms before the K-stemmer is applied, sometimes if the matching
 term is in position two of the final analysis but the searcher had the
 partial term just alone and so thereby in position 1 in the analysis stack
 then when searching there wasn't a match.  Am I reading this correctly? 
 Is that right or should that match and I am misreading my analysis output?  
 
 Thanks!
 
 Robi
 
 PS  I have a category named Bags and am catching flack for it not coming
 up in a search for bag.  hah
 PPS the term is not in protwords.txt
 
 
 com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory
 {protected=protwords.txt}
 term position 1
 term text bags
 term type word
 source start,end  0,4
 payload   
 
 
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com] 
 Sent: Wednesday, April 20, 2011 10:55 AM
 To: solr-user@lucene.apache.org
 Subject: Re: stemming filter analyzers, any favorites?
 
 You can get a better sense of exactly what tranformations occur when
 if you look at the analysis page (be sure to check the verbose
 checkbox).
 
 I'm surprised that bags doesn't match bag, what does the analysis
 page say?
 
 Best
 Erick
 
 On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen lt;rober...@buy.comgt;
 wrote:
 Stemming filter analyzers... anyone have any favorites for particular
 search domains?  Just wondering what people are using.  I'm using Lucid
 K Stemmer and having issues.   Seems like it misses a lot of common
 stems.  We went to that because of excessively loose matches on the
 solr.PorterStemFilterFactory


 I understand K Stemmer is a dictionary based stemmer.  Seems to me like
 it is missing a lot of common stem reductions.  Ie   Bags does not match
 Bag in our searches.

 Here is my analyzer stack:

                fieldType name=text class=solr.TextField
 positionIncrementGap=100
                        analyzer type=index
                                tokenizer
 class=solr.WhitespaceTokenizerFactory/
                                filter
 class=solr.SynonymFilterFactory synonyms=index_synonyms.txt
 ignoreCase=true expand=true/
                                filter class=solr.StopFilterFactory
 ignoreCase=true words=stopwords.txt/
          filter class=solr.WordDelimiterFilterFactory
                generateWordParts=1
                generateNumberParts=1
                catenateWords=1
                catenateNumbers=1
                catenateAll=1
                preserveOriginal=1
                /                              filter
 class=solr.LowerCaseFilterFactory/
                                
                                filter
 class=com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory
 protected=protwords.txt/
                                filter
 class=solr.RemoveDuplicatesTokenFilterFactory/
                        /analyzer
                        analyzer type=query
                                tokenizer
 class=solr.WhitespaceTokenizerFactory/
                                filter
 class=solr.SynonymFilterFactory synonyms=query_synonyms.txt
 ignoreCase=true expand=true/
                                filter class=solr.StopFilterFactory
 ignoreCase=true words=stopwords.txt/
          filter class=solr.WordDelimiterFilterFactory
                generateWordParts=1
                generateNumberParts=1
                catenateWords=1
                catenateNumbers=1
                catenateAll=1
                preserveOriginal=1
                /                              filter
 class=solr.LowerCaseFilterFactory/

RE: The issue of import data from database using Solr DIH

Not sure I understood you correct:

You expect that OS06Y stores *two* different performanceIds? One from table1
and the other from table2?
I think this may be a problem.

If both OS06Y-keys are equal, than you can use the syntax as mentioned in
the wiki without any problems. You just have to rewrite your config to make
the second entity a sub-entity and to add a WHERE-clause.

If this is really not possible for you, just a guess, what happens if you
remove the OS06Y-field from your second entity?

Regards,
Em

--
View this message in context: 
http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-database-using-Solr-DIH-tp2845318p2846347.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Need to create dyanamic indexies base on different document workspaces

2011-04-21 Thread Gaurav Shingala


Is it possible to create solr core dyanamically?

In our case we want each workspace to have its own solr index.

 

Thanks
 
 From: chandan.tamra...@nepasoft.com
 Date: Thu, 21 Apr 2011 11:57:53 +0545
 Subject: Re: Need to create dyanamic indexies base on different document 
 workspaces
 To: solr-user@lucene.apache.org
 
 It depends on your application design how you want your index
 
 
 There is a feature called solr core . http://wiki.apache.org/solr/CoreAdmin
 You could still have a single index but a field to differentiate the items
 in index
 
 thanks
 
 
 On Thu, Apr 21, 2011 at 10:55 AM, Gaurav Shingala 
 gaurav.shing...@hotmail.com wrote:
 
 
 
 
 
  Hi,
 
  Is there a way to create different solr indexes for different categories?
  We have different document workspaces and ideally want each workspace to
  have its own solr index.
 
  Thanks,
  Gaurav
 
 
 
 
 
 -- 
 Chandan Tamrakar
 *
 *

Re: how to abort a running optimize

Hi Stockii,

how did you configured your segments-number in Solrconfig.xml?
Decrease the number to speed up things automatically.

Regards,
Em

--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-abort-a-running-optimize-tp2838721p2846369.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Need to create dyanamic indexies base on different document workspaces

Yes, have a look at the wiki-page. It explains some configurations and
REST-API-methods to create cores dynamically and if/how they are persisted.

Regards,
Em 


Gaurav Shingala wrote:
 
 Is it possible to create solr core dyanamically?
 
 In our case we want each workspace to have its own solr index.
 
  
 
 Thanks
  
 From: chandan.tamra...@nepasoft.com
 Date: Thu, 21 Apr 2011 11:57:53 +0545
 Subject: Re: Need to create dyanamic indexies base on different document
 workspaces
 To: solr-user@lucene.apache.org
 
 It depends on your application design how you want your index
 
 
 There is a feature called solr core .
 http://wiki.apache.org/solr/CoreAdmin
 You could still have a single index but a field to differentiate the
 items
 in index
 
 thanks
 
 
 On Thu, Apr 21, 2011 at 10:55 AM, Gaurav Shingala 
 gaurav.shing...@hotmail.com wrote:
 
 
 
 
 
  Hi,
 
  Is there a way to create different solr indexes for different
 categories?
  We have different document workspaces and ideally want each workspace
 to
  have its own solr index.
 
  Thanks,
  Gaurav
 
 
 
 
 
 -- 
 Chandan Tamrakar
 *
 *
 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Need-to-create-dyanamic-indexies-base-on-different-document-workspaces-tp2845919p2846371.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: The issue of import data from database using Solr DIH

2011-04-21 Thread Kevin Xiang

I try remove the OS06Y-field from your second entity ,import the
second entity failed.

Give a example:

Table1:
OS06Y=123,f1=100,f2=200,f3=300;
OS06Y=456,f1=100,f2=200,f3=300;

Table2:
OS06Y=123,f4=100,f5=200;
OS06Y=456,f4=100;
OS06Y=789,f4=100;

I want the result:
OS06Y=123,f1=100,f2=200,f3=300,f4=100,f5=200;
OS06Y=456,f1=100,f2=200,f3=300,f4=100;
OS06Y=789,f4=100;

Can solr implement it? if yes,how to configure dataconfig.xml in solr?

-Original Message-
From: Em [mailto:mailformailingli...@yahoo.de] 
Sent: Thursday, April 21, 2011 4:59 PM
To: solr-user@lucene.apache.org
Subject: RE: The issue of import data from database using Solr DIH

Not sure I understood you correct:

You expect that OS06Y stores *two* different performanceIds? One from
table1
and the other from table2?
I think this may be a problem.

If both OS06Y-keys are equal, than you can use the syntax as mentioned
in
the wiki without any problems. You just have to rewrite your config to
make
the second entity a sub-entity and to add a WHERE-clause.

If this is really not possible for you, just a guess, what happens if
you
remove the OS06Y-field from your second entity?

Regards,
Em

--
View this message in context:
http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-databas
e-using-Solr-DIH-tp2845318p2846347.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Need to create dyanamic indexies base on different document workspaces

2011-04-21 Thread Chandan Tamrakar

Actually you need to put  a file named *solr.xml* in the solr.home directory
to create the solr core .
you can do that programatically if you want to make it dynamic based on your
logic

pls check the solr core admin document.



On Thu, Apr 21, 2011 at 2:52 PM, Gaurav Shingala 
gaurav.shing...@hotmail.com wrote:


 Is it possible to create solr core dyanamically?

 In our case we want each workspace to have its own solr index.



 Thanks

  From: chandan.tamra...@nepasoft.com
  Date: Thu, 21 Apr 2011 11:57:53 +0545
  Subject: Re: Need to create dyanamic indexies base on different document
 workspaces
  To: solr-user@lucene.apache.org
 
  It depends on your application design how you want your index
 
 
  There is a feature called solr core .
 http://wiki.apache.org/solr/CoreAdmin
  You could still have a single index but a field to differentiate the
 items
  in index
 
  thanks
 
 
  On Thu, Apr 21, 2011 at 10:55 AM, Gaurav Shingala 
  gaurav.shing...@hotmail.com wrote:
 
  
  
  
  
   Hi,
  
   Is there a way to create different solr indexes for different
 categories?
   We have different document workspaces and ideally want each workspace
 to
   have its own solr index.
  
   Thanks,
   Gaurav
  
 
 
 
 
  --
  Chandan Tamrakar
  *
  *





-- 
Chandan Tamrakar
*
*

RE: The issue of import data from database using Solr DIH

2011-04-21 Thread lboutros

What you want to do is something like a left outer join, isn't it ?

something like : select table2.OS06Y, f1,f2,f3,f4,f5 from table2 left outer
join table1 on table2.OS06Y = table1.OS06Y where ...

could you prepare a view in your RDBMS ? That could be another solution ?

Ludovic.

-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-database-using-Solr-DIH-tp2845318p2846403.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: The issue of import data from database using Solr DIH

As Iboutrus mentioned, if you can summarize it in a query, than yes, Solr can
handle it. 

Make a step backward: Do not think of Solr. Write a query (one! query) that
shows exactly the output you exepct. Afterwards, implement this query as a
source for DIH.

Regards,
Em

--
View this message in context: 
http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-database-using-Solr-DIH-tp2845318p2846414.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Need to create dyanamic indexies base on different document workspaces

Additionally, there is an already set up example for a multicore-setup in the
example directory of your Solr-distribution.

Regards,
Em

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Need-to-create-dyanamic-indexies-base-on-different-document-workspaces-tp2845919p2846417.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: The issue of import data from database using Solr DIH

2011-04-21 Thread Kevin Xiang

Yes, it is like the left outer join.
In my example.the table may be table or view or stored procedure,I can
not change it in database.
If for every id in table1,we need search the fields by id from table2 in
database,it will met performance issue,especially the size of tables are
very big.

-Original Message-
From: lboutros [mailto:boutr...@gmail.com] 
Sent: Thursday, April 21, 2011 5:25 PM
To: solr-user@lucene.apache.org
Subject: RE: The issue of import data from database using Solr DIH

What you want to do is something like a left outer join, isn't it ?

something like : select table2.OS06Y, f1,f2,f3,f4,f5 from table2 left
outer
join table1 on table2.OS06Y = table1.OS06Y where ...

could you prepare a view in your RDBMS ? That could be another solution
?

Ludovic.

-
Jouve
France.
--
View this message in context:
http://lucene.472066.n3.nabble.com/The-issue-of-import-data-from-databas
e-using-Solr-DIH-tp2845318p2846403.html
Sent from the Solr - User mailing list archive at Nabble.com.

Unable to load EntityProcessor implementation for entity:16865747177753

2011-04-21 Thread vrpar...@gmail.com

hello i have one datasource - is sql server db and
second datasource - is file but dynamic means based on first
datasource db record i want to fetch one file that's why i try to use
tikaentityprocessor but got following error

org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to
load EntityProcessor implementation for entity:16865747177753 Processing
Document # 1
at
org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:576)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:314)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
Caused by: java.lang.ClassNotFoundException: Unable to load
TikaEntityProcessor or
org.apache.solr.handler.dataimport.TikaEntityProcessor
at
org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:738)
at
org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:573)
... 7 more
Caused by: org.apache.solr.common.SolrException: Error loading class
'TikaEntityProcessor'
at
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:375)
at
org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:728)
... 8 more
Caused by: java.lang.ClassNotFoundException: TikaEntityProcessor
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)...




data config file

dataConfig
dataSource name=ds1 driver=com.microsoft.sqlserver.jdbc.SQLServerDriver
url=jdbc:sqlserver://SQL2;databaseName=test user=test password=test /
dataSource type=BinURLDataSource name=bin/   
document name=Customer
entity name=Cus pk=CustomerID query=select * from customer
dataSource=ds1
field column=CustomerID /
field column=Name /
field column=Email /
field column=file /
field column=PhoneNo /
entity processor=TikaEntityProcessor 
tikaConfig=tikaconfig.xml
url=D:\Customer_Files\${Cus.file} dataSource=bin format=text 
  
  field column=text/
/entity
 /entity
/document
/dataConfig

please help me to solve this problem

Thanks

Vishal 




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unable-to-load-EntityProcessor-implementation-for-entity-16865747177753-tp2846513p2846513.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: PECL SOLR PHP extension, JSON output

2011-04-21 Thread Stefan Matheis

give it a try: http://php.net/manual/en/solrclient.setresponsewriter.php

On Thu, Apr 21, 2011 at 9:03 AM, roySolr royrutten1...@gmail.com wrote:
 Hello,

 I use the PECL php extension for SOLR. I want my output in JSON.

 This is not working:

 $query-set('wt', 'json');

 How do i solve this problem?



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/PECL-SOLR-PHP-extension-JSON-output-tp2846092p2846092.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: HTMLStripCharFilterFactory, highlighting and InvalidTokenOffsetsException

2011-04-21 Thread Robert Gründler


On 20.04.11 18:51, Robert Muir wrote:

Hi, there is a proposed patch uploaded to the issue. Maybe you can
help by reviewing/testing it?


if i succeed in compiling solr, i can test the patch. Is this the right 
starting point

for such an endeavour ? http://wiki.apache.org/solr/HackingSolr



-robert


2011/4/20 Robert Gründlerrob...@dubture.com:

Hi all,

i'm getting the following exception when using highlighting for a field
containing HTMLStripCharFilterFactory:

org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token ...
exceeds length of provided text sized 21

It seems this is a know issue:

https://issues.apache.org/jira/browse/LUCENE-2208

Does anyone know if there's a fix implemented yet in solr?


thanks!


-robert

Re: Unable to load EntityProcessor implementation for entity:16865747177753

2011-04-21 Thread firdous_kind86

can i see your tikaconfig.xml?

meanwhile have a look at this bug:
https://issues.apache.org/jira/browse/SOLR-2116
a similar thread also exists:
http://lucene.472066.n3.nabble.com/TikaEntityProcessor-td2839188.html

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Unable-to-load-EntityProcessor-implementation-for-entity-16865747177753-tp2846513p2846574.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: PECL SOLR PHP extension, JSON output

2011-04-21 Thread roySolr

I have tried that but it seems like JSON is not supported

Parameters

responseWriter

One of the following :

- xml
 - phpnative





--
View this message in context: 
http://lucene.472066.n3.nabble.com/PECL-SOLR-PHP-extension-JSON-output-tp2846092p2846728.html
Sent from the Solr - User mailing list archive at Nabble.com.

Can't determine Sort Order error when using sort by function

2011-04-21 Thread Otis Gospodnetic

Hello,

I'm trying out sorting by function with the new function queries and invariably 
getting this error:

  Can't determine Sort Order: 'termfreq(name,samsung)', pos=22

Here's an example call:
http://localhost:8983/solr/select/?q=*:*sort=termfreq%28name,samsung%29

What am I doing wrong?

Thanks,
Otis

Re: PECL SOLR PHP extension, JSON output

2011-04-21 Thread Stefan Matheis

Hm yes correct .. there is a explicit validation of response-writers in place.

if you want to modify it yourself, check the current trunk
(http://svn.php.net/repository/pecl/solr/trunk/) modify
solr_constants.h, define another response_writer and add another check
in solr_functions_helpers.c in function
solr_is_supported_response_writer

compile the module and go ahead :)

Regards
Stefan

On Thu, Apr 21, 2011 at 1:58 PM, roySolr royrutten1...@gmail.com wrote:
 I have tried that but it seems like JSON is not supported

 Parameters

 responseWriter

    One of the following :

    - xml
     - phpnative





 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/PECL-SOLR-PHP-extension-JSON-output-tp2846092p2846728.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: entity name issue

2011-04-21 Thread tjtong

Hi Em,

Thanks a lot! But it still does not work. Actually my where clause in my
query was '${dataimporter.request.clean}' != 'false' and
myschema.table_a.aid=${dataimporter.request.aid} which I used to pass a
value to the full import process, and it worked without the prefix
myschema. on sybase database, but did not work on oracle either with or
without the prefix. (It would complain table not existing without the
prefix). 

TJ

--
View this message in context: 
http://lucene.472066.n3.nabble.com/entity-name-issue-tp2843812p2846816.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Can't determine Sort Order error when using sort by function

On Thu, Apr 21, 2011 at 8:30 AM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 Hello,

 I'm trying out sorting by function with the new function queries and 
 invariably
 getting this error:

  Can't determine Sort Order: 'termfreq(name,samsung)', pos=22

 Here's an example call:
 http://localhost:8983/solr/select/?q=*:*sort=termfreq%28name,samsung%29

 What am I doing wrong?

Try adding the sort order asc or desc after the function.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco




 Thanks,
 Otis

Re: Highest frequency terms for a subset of documents

OK, so I copied my index and ran solr3.1 against it.
Qtime dropped, from about 40s to 17s! This is good news, but still longer
than i hoped for.
I tried to do the same text with 4.0, but i'm getting
IndexFormatTooOldException since my index was created using 1.4.1. Is my
only chance to test this is to reindex using 3.1 or 4.0?

Another strange behavior is that the Qtime seems pretty stable, no matter
how many object match my query. 200K and 20K both take about 17s.
I would have guessed that since the time is going over all the terms of all
the subset documents, would mean that the more documents, the more time.

Thanks for any insights

ofer



On Thu, Apr 21, 2011 at 3:07 AM, Ofer Fort o...@tra.cx wrote:

 my documents are user entries, so i'm guessing they vary a lot.
 Tomorrow i'll try 3.1 and also 4.0, and see if they have an improvement.
 thanks guys!


 On Thu, Apr 21, 2011 at 3:02 AM, Yonik Seeley 
 yo...@lucidimagination.comwrote:

 On Wed, Apr 20, 2011 at 7:45 PM, Ofer Fort o...@tra.cx wrote:
  Thanks
  but i've disabled the cache already, since my concern is speed and i'm
  willing to pay the price (memory)

 Then you should not disable the cache.

 , and my subset are not fixed.
  Does the facet search do any extra work that i don't need, that i might
 be
  able to disable (either by a flag or by a code change),
  Somehow i feel, or rather hope, that counting the terms of 200K
 documents
  and finding the top 500 should take less than 30 seconds.

 Using facet.enum.cache.minDf should be a little faster than just
 disabling the cache - it's a different code path.
 Using the cache selectively will speed things up, so try setting that
 minDf to 1000 or so for example.

 How many unique terms do you have in the index?
 Is this Solr 3.1 - there were some optimizations when there were many
 terms to iterate over?
 You could also try trunk, which has even more optimizations, or the
 bulkpostings branch if you really want to experiment.

 -Yonik

Re: Highest frequency terms for a subset of documents

On Thu, Apr 21, 2011 at 9:24 AM, Ofer Fort o...@tra.cx wrote:
 Another strange behavior is that the Qtime seems pretty stable, no matter
 how many object match my query. 200K and 20K both take about 17s.
 I would have guessed that since the time is going over all the terms of all
 the subset documents, would mean that the more documents, the more time.

facet.method=enum steps over all terms in the index for that field...
that takes time regardless of how many documents are in the base set.

There are also short-circuit methods that avoid looking at the docs
for a term if it's docfreq is low enough that it couldn't possibly
make it into the priority queue.  Because if this, it can actually be
faster to facet on a larger base set (try *:* as the base query).

Actually, it might be interesting to see the query time if you set
facet.mincount equal to the number of docs in the base set - that will
test pretty much just the time to enumerate over the terms without
doing any set intersections at all.  Be careful not to set mincount
greater than the number of docs in the base set though - solr will
short-circuit that too and skip enumeration altogether.

The work on the bulkpostings branch should definitely speed up your
case even more - but I have no idea when it will land on trunk.


-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco

Re: Highest frequency terms for a subset of documents

Not sure i fully understand,
If facet.method=enum steps over all terms in the index for that field,
than what does setting the q=field:subset do? if i set the q=*:*, than how
do i get the frequency only on my subset?
Ofer

On Thu, Apr 21, 2011 at 4:40 PM, Yonik Seeley yo...@lucidimagination.comwrote:

 On Thu, Apr 21, 2011 at 9:24 AM, Ofer Fort o...@tra.cx wrote:
  Another strange behavior is that the Qtime seems pretty stable, no matter
  how many object match my query. 200K and 20K both take about 17s.
  I would have guessed that since the time is going over all the terms of
 all
  the subset documents, would mean that the more documents, the more time.

 facet.method=enum steps over all terms in the index for that field...
 that takes time regardless of how many documents are in the base set.

 There are also short-circuit methods that avoid looking at the docs
 for a term if it's docfreq is low enough that it couldn't possibly
 make it into the priority queue.  Because if this, it can actually be
 faster to facet on a larger base set (try *:* as the base query).

 Actually, it might be interesting to see the query time if you set
 facet.mincount equal to the number of docs in the base set - that will
 test pretty much just the time to enumerate over the terms without
 doing any set intersections at all.  Be careful not to set mincount
 greater than the number of docs in the base set though - solr will
 short-circuit that too and skip enumeration altogether.

 The work on the bulkpostings branch should definitely speed up your
 case even more - but I have no idea when it will land on trunk.


 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco

Re: Highest frequency terms for a subset of documents

On Thu, Apr 21, 2011 at 9:44 AM, Ofer Fort o...@tra.cx wrote:
 Not sure i fully understand,
 If facet.method=enum steps over all terms in the index for that field,
 than what does setting the q=field:subset do? if i set the q=*:*, than how
 do i get the frequency only on my subset?

It's an implementation detail.  Faceting *does* just give you counts
that just match
q=field:subset.  How it does it is a different matter (i.e. for
facet.method=enum, it
must step over all terms in the field), so it's closer to O(nterms in
field) rather than O(ndocs in base set)

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco


 Ofer

 On Thu, Apr 21, 2011 at 4:40 PM, Yonik Seeley yo...@lucidimagination.com
 wrote:

 On Thu, Apr 21, 2011 at 9:24 AM, Ofer Fort o...@tra.cx wrote:
  Another strange behavior is that the Qtime seems pretty stable, no
  matter
  how many object match my query. 200K and 20K both take about 17s.
  I would have guessed that since the time is going over all the terms of
  all
  the subset documents, would mean that the more documents, the more time.

 facet.method=enum steps over all terms in the index for that field...
 that takes time regardless of how many documents are in the base set.

 There are also short-circuit methods that avoid looking at the docs
 for a term if it's docfreq is low enough that it couldn't possibly
 make it into the priority queue.  Because if this, it can actually be
 faster to facet on a larger base set (try *:* as the base query).

 Actually, it might be interesting to see the query time if you set
 facet.mincount equal to the number of docs in the base set - that will
 test pretty much just the time to enumerate over the terms without
 doing any set intersections at all.  Be careful not to set mincount
 greater than the number of docs in the base set though - solr will
 short-circuit that too and skip enumeration altogether.

 The work on the bulkpostings branch should definitely speed up your
 case even more - but I have no idea when it will land on trunk.


 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco

Re: Highest frequency terms for a subset of documents

I see, thanks.
So if I would want to implement something that would fit my needs, would
going through the subset of documents and counting all the terms in each
one, would be faster? and easier to implement?

On Thu, Apr 21, 2011 at 5:36 PM, Yonik Seeley yo...@lucidimagination.comwrote:

 On Thu, Apr 21, 2011 at 9:44 AM, Ofer Fort o...@tra.cx wrote:
  Not sure i fully understand,
  If facet.method=enum steps over all terms in the index for that field,
  than what does setting the q=field:subset do? if i set the q=*:*, than
 how
  do i get the frequency only on my subset?

 It's an implementation detail.  Faceting *does* just give you counts
 that just match
 q=field:subset.  How it does it is a different matter (i.e. for
 facet.method=enum, it
 must step over all terms in the field), so it's closer to O(nterms in
 field) rather than O(ndocs in base set)

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco


  Ofer
 
  On Thu, Apr 21, 2011 at 4:40 PM, Yonik Seeley 
 yo...@lucidimagination.com
  wrote:
 
  On Thu, Apr 21, 2011 at 9:24 AM, Ofer Fort o...@tra.cx wrote:
   Another strange behavior is that the Qtime seems pretty stable, no
   matter
   how many object match my query. 200K and 20K both take about 17s.
   I would have guessed that since the time is going over all the terms
 of
   all
   the subset documents, would mean that the more documents, the more
 time.
 
  facet.method=enum steps over all terms in the index for that field...
  that takes time regardless of how many documents are in the base set.
 
  There are also short-circuit methods that avoid looking at the docs
  for a term if it's docfreq is low enough that it couldn't possibly
  make it into the priority queue.  Because if this, it can actually be
  faster to facet on a larger base set (try *:* as the base query).
 
  Actually, it might be interesting to see the query time if you set
  facet.mincount equal to the number of docs in the base set - that will
  test pretty much just the time to enumerate over the terms without
  doing any set intersections at all.  Be careful not to set mincount
  greater than the number of docs in the base set though - solr will
  short-circuit that too and skip enumeration altogether.
 
  The work on the bulkpostings branch should definitely speed up your
  case even more - but I have no idea when it will land on trunk.
 
 
  -Yonik
  http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
  25-26, San Francisco

Re: Highest frequency terms for a subset of documents

On Thu, Apr 21, 2011 at 10:41 AM, Ofer Fort o...@tra.cx wrote:
 I see, thanks.
 So if I would want to implement something that would fit my needs, would
 going through the subset of documents and counting all the terms in each
 one, would be faster? and easier to implement?

That's not just your needs, that's everyone's needs (it's the
definition of field faceting).
There's no way to do what you're asking with a term enumerator (i.e.
facet.method=enum).

Going through documents and counting all the terms in each is what
facet.method=fc does.
But it's also not great when the number of unique terms per document is high.
If you can think of a better way, go for it!


-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco

Re: Highest frequency terms for a subset of documents

So if i want to use the facet.method=fc, is there a way to speed it up? and
remove the bucket size limitation?

On Thu, Apr 21, 2011 at 5:58 PM, Yonik Seeley yo...@lucidimagination.comwrote:

 On Thu, Apr 21, 2011 at 10:41 AM, Ofer Fort o...@tra.cx wrote:
  I see, thanks.
  So if I would want to implement something that would fit my needs, would
  going through the subset of documents and counting all the terms in each
  one, would be faster? and easier to implement?

 That's not just your needs, that's everyone's needs (it's the
 definition of field faceting).
 There's no way to do what you're asking with a term enumerator (i.e.
 facet.method=enum).

 Going through documents and counting all the terms in each is what
 facet.method=fc does.
 But it's also not great when the number of unique terms per document is
 high.
 If you can think of a better way, go for it!


 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco

Re: Apache Spam Filter Blocking Messages

2011-04-21 Thread Trey Grainger

Good to know; I'll go change those settings, then.  Thanks for the feedback.

-Trey


On Thu, Apr 21, 2011 at 4:42 AM, Em mailformailingli...@yahoo.de wrote:

 This really helps at the mailinglists.
 If you send your mails with Thunderbird, be sure to check that you enforce
 plain-text-emails. If not, it will often send HTML-mails.

 Regards,
 Em


 Marvin Humphrey wrote:
 
  On Thu, Apr 21, 2011 at 12:30:29AM -0400, Trey Grainger wrote:
  (FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL
                              
  Note the HTML_MESSAGE in the list of things SpamAssassin didn't like.
 
  Apparently I sound like spam when I write perfectly good English and
  include
  some xml and a link to a jira ticket in my e-mail (I tried a couple
  different variations).  Anyone know a way around this filter, or should I
  just respond to those involved in the e-mail chain directly and avoid the
  mailing list?
 
  Send plain text email instead of HTML.  That solves the problem 99% of the
  time.
 
  Marvin Humphrey
 


 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Apache-Spam-Filter-Blocking-Messages-tp2845854p2846304.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: old searchers not closing after optimize or replication

2011-04-21 Thread Trey Grainger

Hey Bernd,

Checkout https://issues.apache.org/jira/browse/SOLR-2469. There is a
pretty bad bug in Solr 3.1 which occurs if you have str
name=replicateAfterstartup/str set in your replication
configuration in solrconfig.xml. See the thread between Yonik and
myself from a few days ago titled Solr 3.1: Old Index Files Not
Removed on Optimize.

You can disable startup replication and perform an optimize to see if
this fixes your problem of old index files being left behind (though
you may have some old index files left behind from before this change
that you still need to clean-up). Yonik has already pushed up a patch
into the 3x branch and trunk for this issue. I can confirm that
applying the patch (or just removing startup replication) resolved the
issue for us.

Do you think this is your issue?

Thanks,

-Trey

On Thu, Apr 21, 2011 at 2:27 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:
Hi Erik,

deletionPolicy class=solr.SolrDeletionPolicy
str name=maxCommitsToKeep1/str
str name=maxOptimizedCommitsToKeep0/str
/deletionPolicy

Due to 44 minutes optimization time we do an optimization once a day
during the night.

I will try with an smaler index on my development system.

Best regards,
Bernd

Am 20.04.2011 17:50, schrieb Erick Erickson:

It looks OK, but still doesn't explain keeping the old files around. What
is
yourdeletionPolicy in your solrconfig.xml look like? It's
possible that you're seeing Solr attempt to keep around several
optimized copies of the index, but that still doesn't explain why
restarting Solr removes them unless the deletionPolicy gets invoked
on sometime and you're index files are aging out (I don't know the
internals of deletion well enough to say).

About optimization. It's become less important with recent code. Once
upon a time, it made a substantial difference in search speed. More
recently, it has very little impact on search speed, and is used
much more sparingly. Its greatest benefit is reclaiming unused resources
left over from deleted documents. So you might want to avoid the pain
of optimizing (44 minutes!) and only optimize rarely of if you have
deleted a lot of documents.

It might be worthwhile to try (with a smaller index !) a bunch of optimize
cycles and see if thedeletionPolicy idea has any merit. I'd expect
your index to reach a maximum and stay there after the saved
copies of the index was reached...

But otherwise I'm puzzled...

Erick

On Wed, Apr 20, 2011 at 10:30 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:

Hi Erik,

Am 20.04.2011 15:42, schrieb Erick Erickson:

H, this isn't right. You've pretty much eliminated the obvious
things. What does lsof show? I'm assuming it shows the files are
being held open by your Solr instance, but it's worth checking.

Just commited new content 3 times and finally optimized.
Again having old index files left.

Then checked on my master, only the newest version of index files are
listed with lsof. No file handles to the old index files but the
old index files remain in data/index/.
Thats strange.

This time replication worked fine and cleaned up old index on slaves.

I'm not getting the same behavior, admittedly on a Windows box.
The only other thing I can think of is that you have a query that's
somehow never ending, but that's grasping at straws.

Do your log files show anything interesting?

Lets see:
- it has the old generation (generation=12) and its files
- and recognizes that there have been several commits (generation=18)

20.04.2011 14:05:26 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start

commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2

commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_c,version=1302159868435,generation=12,filenames=[_3xm.nrm,
_3xm.fdx, segment
s_c, _3xm.fnm, _3xm.fdt, _3xm.tis, _3xm.tii, _3xm.prx, _3xm.frq]

commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_i,version=1302159868447,generation=18,filenames=[_3xm.nrm,
_3xo.tis, _3xp.pr
x, _3xo.fnm, _3xp.fdx, _3xs.frq, _3xo.tii, _3xp.fdt, _3xn.tii, _3xm.fdx,
_3xn.nrm, _3xm.fdt, _3xs.prx, _3xn.tis, _3xn.fdx, _3xr.nrm, _3xm.prx,
_3xn.fdt, _3x
p.tii, _3xs.nrm, _3xp.tis, _3xo.prx, segments_i, _3xm.tii, _3xq.tii,
_3xs.fdx, _3xs.fdt, _3xo.frq, _3xn.prx, _3xm.tis, _3xr.prx, _3xq.tis,
_3xo.fdt, _3xp.fr
q, _3xq.fnm, _3xo.fdx, _3xp.fnm, _3xr.tis, _3xr.fnm, _3xq.frq, _3xr.tii,
_3xr.frq, _3xo.nrm, _3xs.tii, _3xq.fdx, _3xq.fdt, _3xp.nrm, _3xq.prx,
_3xs.tis, _3x
m.frq, _3xr.fdx, _3xm.fnm, _3xn.frq, _3xq.nrm, _3xs.fnm, _3xn.fnm,
_3xr.fdt]
20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1302159868447

- after 44 minutes of optimizing (over 140GB and 27.8 mio docs) it gets
the SolrDeletionPolicy onCommit and

RE: stemming filter analyzers, any favorites?

2011-04-21 Thread Robert Petersen

Adding another field with another stemmer and searching both???  Wow never 
thought of doing that.  I guess that doesn't really double the size of your 
index tho because all the terms are almost the same right?  Let me look into 
that.  I'll raise the other issue in a separate thread and thanks.

-Original Message-
From: Em [mailto:mailformailingli...@yahoo.de] 
Sent: Thursday, April 21, 2011 1:55 AM
To: solr-user@lucene.apache.org
Subject: RE: stemming filter analyzers, any favorites?

Hi Robert,

we often ran into the same issue with stemmers. This is why we created more
than one field, each field with different stemmers. It adds some overhead
but worked quite well.

Regarding your off-topic-question:
Look at the debugging-output of your searches. Sometimes you configured your
tools, especially the WDF, wrong and the queryParser creates an unexpected
result which leads to unmatched but still relevant documents.

Please, show us your debugging-output and the field-definition so that we
can provide you some help!

Regards,
Em

Robert Petersen-3 wrote:

 I have been doing that, and for Bags example the trailing 's' is not being
 removed by the Kstemmer so if indexing the word bags and searching on bag
 you get no matches.  Why wouldn't the trailing 's' get stemmed off? 
 Kstemmer is dictionary based so bags isn't in the dictionary?   That
 trailing 's' should always be dropped no?  That seems like it would be
 better, we don't want to make synonyms for basic use cases like this.  I
 fear I will have to return to the Porter stemmer.  Are there other better
 ones is my main question.

 Off topic secondary question: sometimes I am puzzled by the output of the
 analysis page.  It seems like there should be a match, but I don't get the
 results during a search that I'd expect...  

 Like in the case if the WordDelimiterFilterFactory splits up a term into a
 bunch of terms before the K-stemmer is applied, sometimes if the matching
 term is in position two of the final analysis but the searcher had the
 partial term just alone and so thereby in position 1 in the analysis stack
 then when searching there wasn't a match.  Am I reading this correctly? 
 Is that right or should that match and I am misreading my analysis output?  

 Thanks!

 Robi

 PS  I have a category named Bags and am catching flack for it not coming
 up in a search for bag.  hah
 PPS the term is not in protwords.txt

 com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory
 {protected=protwords.txt}
 term position 1
 term text bags
 term type word
 source start,end  0,4
 payload   

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com] 
 Sent: Wednesday, April 20, 2011 10:55 AM
 To: solr-user@lucene.apache.org
 Subject: Re: stemming filter analyzers, any favorites?

 You can get a better sense of exactly what tranformations occur when
 if you look at the analysis page (be sure to check the verbose
 checkbox).

 I'm surprised that bags doesn't match bag, what does the analysis
 page say?

 Best
 Erick

 On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen lt;rober...@buy.comgt;
 wrote:
 Stemming filter analyzers... anyone have any favorites for particular
 search domains?  Just wondering what people are using.  I'm using Lucid
 K Stemmer and having issues.   Seems like it misses a lot of common
 stems.  We went to that because of excessively loose matches on the
 solr.PorterStemFilterFactory

 I understand K Stemmer is a dictionary based stemmer.  Seems to me like
 it is missing a lot of common stem reductions.  Ie   Bags does not match
 Bag in our searches.

 Here is my analyzer stack:

                fieldType name=text class=solr.TextField
 positionIncrementGap=100
                        analyzer type=index
                                tokenizer
 class=solr.WhitespaceTokenizerFactory/
                                filter
 class=solr.SynonymFilterFactory synonyms=index_synonyms.txt
 ignoreCase=true expand=true/
                                filter class=solr.StopFilterFactory
 ignoreCase=true words=stopwords.txt/
          filter class=solr.WordDelimiterFilterFactory
                generateWordParts=1
                generateNumberParts=1
                catenateWords=1
                catenateNumbers=1
                catenateAll=1
                preserveOriginal=1
                /                              filter
 class=solr.LowerCaseFilterFactory/

                                filter
 class=com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory
 protected=protwords.txt/
                                filter
 class=solr.RemoveDuplicatesTokenFilterFactory/
                        /analyzer
                        analyzer type=query
                                tokenizer
 class=solr.WhitespaceTokenizerFactory/
                                filter
 class=solr.SynonymFilterFactory

Re: Multiple Tags and Facets

Are there no ideas of how to use multiple tags per filter or to combine some
tags for excluding more than one filter per facet?

Regards,
Em

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-Tags-and-Facets-tp2843130p2847569.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: stemming filter analyzers, any favorites?

As far as I know Lucene does not store an inverted index per field, so no, it
would not double the size of the index.

However, it could influence the score a little bit.

For example: If both stemmers reduce schools to school and you are
searching for all schools in america the term school has more weight to
the resulting score, since it definitly occurs in two fields which consist
of nearly the same value.

To reduce this effect you could write your own queryParser which creates a
disjunctionMaxQuery consisting of two boolean queries and a tie-break of 0 -
so only the better scoring stemmed-field contributes to the total score of
your document.

Regards,
Em


Robert Petersen-3 wrote:
 
 Adding another field with another stemmer and searching both???  Wow never
 thought of doing that.  I guess that doesn't really double the size of
 your index tho because all the terms are almost the same right?  Let me
 look into that.  I'll raise the other issue in a separate thread and
 thanks.
 
 -Original Message-
 From: Em [mailto:mailformailingli...@yahoo.de] 
 Sent: Thursday, April 21, 2011 1:55 AM
 To: solr-user@lucene.apache.org
 Subject: RE: stemming filter analyzers, any favorites?
 
 Hi Robert,
 
 we often ran into the same issue with stemmers. This is why we created
 more
 than one field, each field with different stemmers. It adds some overhead
 but worked quite well.
 
 Regarding your off-topic-question:
 Look at the debugging-output of your searches. Sometimes you configured
 your
 tools, especially the WDF, wrong and the queryParser creates an unexpected
 result which leads to unmatched but still relevant documents.
 
 Please, show us your debugging-output and the field-definition so that we
 can provide you some help!
 
 Regards,
 Em
 
 
 Robert Petersen-3 wrote:
 
 I have been doing that, and for Bags example the trailing 's' is not
 being
 removed by the Kstemmer so if indexing the word bags and searching on bag
 you get no matches.  Why wouldn't the trailing 's' get stemmed off? 
 Kstemmer is dictionary based so bags isn't in the dictionary?   That
 trailing 's' should always be dropped no?  That seems like it would be
 better, we don't want to make synonyms for basic use cases like this.  I
 fear I will have to return to the Porter stemmer.  Are there other better
 ones is my main question.
 
 Off topic secondary question: sometimes I am puzzled by the output of the
 analysis page.  It seems like there should be a match, but I don't get
 the
 results during a search that I'd expect...  
 
 Like in the case if the WordDelimiterFilterFactory splits up a term into
 a
 bunch of terms before the K-stemmer is applied, sometimes if the matching
 term is in position two of the final analysis but the searcher had the
 partial term just alone and so thereby in position 1 in the analysis
 stack
 then when searching there wasn't a match.  Am I reading this correctly? 
 Is that right or should that match and I am misreading my analysis
 output?  
 
 Thanks!
 
 Robi
 
 PS  I have a category named Bags and am catching flack for it not coming
 up in a search for bag.  hah
 PPS the term is not in protwords.txt
 
 
 com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory
 {protected=protwords.txt}
 term position1
 term textbags
 term typeword
 source start,end 0,4
 payload  
 
 
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com] 
 Sent: Wednesday, April 20, 2011 10:55 AM
 To: solr-user@lucene.apache.org
 Subject: Re: stemming filter analyzers, any favorites?
 
 You can get a better sense of exactly what tranformations occur when
 if you look at the analysis page (be sure to check the verbose
 checkbox).
 
 I'm surprised that bags doesn't match bag, what does the analysis
 page say?
 
 Best
 Erick
 
 On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen lt;rober...@buy.comgt;
 wrote:
 Stemming filter analyzers... anyone have any favorites for particular
 search domains?  Just wondering what people are using.  I'm using Lucid
 K Stemmer and having issues.   Seems like it misses a lot of common
 stems.  We went to that because of excessively loose matches on the
 solr.PorterStemFilterFactory


 I understand K Stemmer is a dictionary based stemmer.  Seems to me like
 it is missing a lot of common stem reductions.  Ie   Bags does not match
 Bag in our searches.

 Here is my analyzer stack:

                fieldType name=text class=solr.TextField
 positionIncrementGap=100
                        analyzer type=index
                                tokenizer
 class=solr.WhitespaceTokenizerFactory/
                                filter
 class=solr.SynonymFilterFactory synonyms=index_synonyms.txt
 ignoreCase=true expand=true/
                                filter class=solr.StopFilterFactory
 ignoreCase=true words=stopwords.txt/
          filter class=solr.WordDelimiterFilterFactory
                generateWordParts=1
                generateNumberParts=1

Re: Highest frequency terms for a subset of documents

On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort o...@tra.cx wrote:
 So if i want to use the facet.method=fc, is there a way to speed it up? and
 remove the bucket size limitation?

Not really - else we would have done it already ;-)
We don't really have great methods for faceting on full-text fields
(as opposed to shorter meta-data fields) today.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco

RE: stemming filter analyzers, any favorites?

2011-04-21 Thread Robert Petersen

Nice!  Thanks!

-Original Message-
From: Em [mailto:mailformailingli...@yahoo.de] 
Sent: Thursday, April 21, 2011 9:23 AM
To: solr-user@lucene.apache.org
Subject: RE: stemming filter analyzers, any favorites?

As far as I know Lucene does not store an inverted index per field, so no, it
would not double the size of the index.

However, it could influence the score a little bit.

For example: If both stemmers reduce schools to school and you are
searching for all schools in america the term school has more weight to
the resulting score, since it definitly occurs in two fields which consist
of nearly the same value.

To reduce this effect you could write your own queryParser which creates a
disjunctionMaxQuery consisting of two boolean queries and a tie-break of 0 -
so only the better scoring stemmed-field contributes to the total score of
your document.

Regards,
Em

Robert Petersen-3 wrote:

 Adding another field with another stemmer and searching both???  Wow never
 thought of doing that.  I guess that doesn't really double the size of
 your index tho because all the terms are almost the same right?  Let me
 look into that.  I'll raise the other issue in a separate thread and
 thanks.

 -Original Message-
 From: Em [mailto:mailformailingli...@yahoo.de] 
 Sent: Thursday, April 21, 2011 1:55 AM
 To: solr-user@lucene.apache.org
 Subject: RE: stemming filter analyzers, any favorites?

 Hi Robert,

 we often ran into the same issue with stemmers. This is why we created
 more
 than one field, each field with different stemmers. It adds some overhead
 but worked quite well.

 Regarding your off-topic-question:
 Look at the debugging-output of your searches. Sometimes you configured
 your
 tools, especially the WDF, wrong and the queryParser creates an unexpected
 result which leads to unmatched but still relevant documents.

 Please, show us your debugging-output and the field-definition so that we
 can provide you some help!

 Regards,
 Em

 Robert Petersen-3 wrote:

 I have been doing that, and for Bags example the trailing 's' is not
 being
 removed by the Kstemmer so if indexing the word bags and searching on bag
 you get no matches.  Why wouldn't the trailing 's' get stemmed off? 
 Kstemmer is dictionary based so bags isn't in the dictionary?   That
 trailing 's' should always be dropped no?  That seems like it would be
 better, we don't want to make synonyms for basic use cases like this.  I
 fear I will have to return to the Porter stemmer.  Are there other better
 ones is my main question.

 Off topic secondary question: sometimes I am puzzled by the output of the
 analysis page.  It seems like there should be a match, but I don't get
 the
 results during a search that I'd expect...  

 Like in the case if the WordDelimiterFilterFactory splits up a term into
 a
 bunch of terms before the K-stemmer is applied, sometimes if the matching
 term is in position two of the final analysis but the searcher had the
 partial term just alone and so thereby in position 1 in the analysis
 stack
 then when searching there wasn't a match.  Am I reading this correctly? 
 Is that right or should that match and I am misreading my analysis
 output?  

 Thanks!

 Robi

 PS  I have a category named Bags and am catching flack for it not coming
 up in a search for bag.  hah
 PPS the term is not in protwords.txt

 com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory
 {protected=protwords.txt}
 term position1
 term textbags
 term typeword
 source start,end 0,4
 payload  

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com] 
 Sent: Wednesday, April 20, 2011 10:55 AM
 To: solr-user@lucene.apache.org
 Subject: Re: stemming filter analyzers, any favorites?

 You can get a better sense of exactly what tranformations occur when
 if you look at the analysis page (be sure to check the verbose
 checkbox).

 I'm surprised that bags doesn't match bag, what does the analysis
 page say?

 Best
 Erick

 On Wed, Apr 20, 2011 at 1:44 PM, Robert Petersen lt;rober...@buy.comgt;
 wrote:
 Stemming filter analyzers... anyone have any favorites for particular
 search domains?  Just wondering what people are using.  I'm using Lucid
 K Stemmer and having issues.   Seems like it misses a lot of common
 stems.  We went to that because of excessively loose matches on the
 solr.PorterStemFilterFactory

 I understand K Stemmer is a dictionary based stemmer.  Seems to me like
 it is missing a lot of common stem reductions.  Ie   Bags does not match
 Bag in our searches.

 Here is my analyzer stack:

                fieldType name=text class=solr.TextField
 positionIncrementGap=100
                        analyzer type=index
                                tokenizer
 class=solr.WhitespaceTokenizerFactory/
                                filter
 class=solr.SynonymFilterFactory synonyms=index_synonyms.txt
 ignoreCase=true expand=true/

Solr search based on list of terms. Order by max(score) for each term.

2011-04-21 Thread Bogdan STOICA

Hello,

I am trying to query a solr server in order to obtain the most relevant
results for a list of terms.

For example i have the list of words nokia, iphone, charger

My schema contains the following data:
nokia
iphone
nokia iphone otherwords
nokia white
iphone white

If I run a simple query like q=nokia OR iphone OR charger i get nokia
iphone otherwords as the most relevant result (because it contains more
query terms)

I would like to get nokia or iphone or iphone white as first results,
because for each individual term they would be the most relevant.

In order to obtain the correct list i would do a query for each term, then
aggregate the results and order them based on the maximum score.

Can I make this query in one request?

This question has also been asked on

http://stackoverflow.com/questions/5743264/solr-search-based-on-list-of-terms-order-by-maxscore-for-each-term

Thank you.

Re: Multiple Tags and Facets

2011-04-21 Thread Jay Hill

I don't think I understand what you're trying to do. Are you trying to
preserve all facets after a user clicks on a facet, and thereby triggers a
filter query, which excludes the other facets? If that's the case, you can
use local parameters to tag the filter queries so they are not used for the
facets:

Let's say I have the following facets:
- Solr
- Lucene
- Nutch
- Mahout

And I do a search for solr.

All of these links will have a filter query:
- Solr [ ?q=solrfq=project:solr ]
- Lucene [ ?q=solrfq=project:lucene ]
- Nutch [ ?q=solrfq=project:nutch ]
- Mahout [ ?q=solrfq=project:mahout ]

But if a user clicks on the Solr facet, the resulting query will exclude
the other facets, so you only see this facet:
- Solr

By using local parameters like this:

?q=solrfq={!tag=myTag}project:solr facet=onfacet.field{!ex=myTag}=project

I can preserve all my facets, so that my query is filtered but all facets
still remain:
- Solr
- Lucene
- Nutch
- Mahout

Hope this helps, but I'm not sure that's what you were after.

-Jay



On Wed, Apr 20, 2011 at 8:03 AM, Em mailformailingli...@yahoo.de wrote:

 Hello,

 I watched an online video with Chris Hostsetter from Lucidimagination. He
 showed the possibility of having some Facets that exclude *all* filter
 while
 also having some Facets that take care of some of the set filters while
 ignoring other filters.

 Unfortunately the Webinar did not explain how they made this and I wasn't
 able to give a filter/facet more than one tag.

 Here is an example:

 Facets and Filters: DocType, Author

 Facet:
 - Author
 -- George (10)
 -- Brian (12)
 -- Christian (78)
 -- Julia (2)

 -Doctype
 -- PDF (70)
 -- ODT (10)
 -- Word (20)
 -- JPEG (1)
 -- PNG (1)

 When clicking on Julia I would like to achieve the following:
 Facet:
 - Author
 -- George (10)
 -- Brian (12)
 -- Christian (78)
 -- Julia (2)
  Julia's Doctypes:
 -- JPEG (1)
 -- PNG (1)

 -Doctype
 -- PDF (70)
 -- ODT (10)
 -- Word (20)
 -- JPEG (1)
 -- PNG (1)

 Another example which adds special options to your GUI could be as
 following:
 Imagine a fashion store.
 If you search for shirt you get a color-facet:

 colors:
 - red (19)
 - green (12)
 - blue (4)
 - black (2)

 As well as a brand-facet:

 brands:
 - puma (18)
 - nike (19)

 When I click on the red color-facet, I would like to get the following
 back:
 colors:
 - red (19)
 - green (12)*
 - blue (4)*
 - black (2)*

 brands:
 - puma (18)*
 - nike (19)

 All those filters marked by an * could be displayed half-transparent or
 so
 - they just show the user that those filter-options exist for his/her
 search
 but aren't included in the result-set, since he/she excluded them by
 clicking the red filter.

 This case is more interesting, if not all red shirts were from nike.
 This way you can show the user that i.e. 8 of 19 red - shirts are from the
 brand you selected/you see 8 of 19 red shirts.

 I hope I explained what I want to achive.

 Thank you!

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Multiple-Tags-and-Facets-tp2843130p2843130.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Indexing 20M documents from MySQL with DIH

2011-04-21 Thread Scott Bigelow

I've been using Solr for a while now, indexing 2-4 million records
using the DIH to pull data from MySQL, which has been working great.
For a new project, I need to index about 20M records (30 fields) and I
have been running into issues with MySQL disconnects, right around
15M. I've tried several remedies I've found on blogs, changing
autoCommit, batchSize etc., and none of them have seem to majorly
resolved the issue. It got me wondering: Is this the way everyone does
it? What about 100M records up to 1B; are those all pulled using DIH
and a single query?

I've used sphinx in the past, which uses multiple queries to pull out
a subset of records ranged based on PrimaryKey, does Solr offer
functionality similar to this? It seems that once a Solr index gets to
a certain size, the indexing of a batch takes longer than MySQL's
net_write_timeout, so it kills the connection.

Thanks for your help, I really enjoy using Solr and I look forward to
indexing even more data!

Re: Multiple Tags and Facets

Hi Jay,

thank you for your reply.

We most enhance your example to reproduce what I mean:

You got the following facets:

project:
- Solr
- Lucene
- Nutch
- Mahout

source:
- Documentation
- Mailinglist
- Wiki
- Commercial Websites

What I want now is: When I click on Solr + Documentation
(fq={tag=p}project:Solrfq={tag=s}source:Documentation), I want to get back
a result-set where I on the one hand see that there are no matches for
Mahout given the filter queries.
On the other hand I also want to see that there are results available for my
search but not for the current filters.

This information is usefull for creating a powerfull UI:
You can show the user that there is possibly valuable information available
on commercial websites but they are excluded from the current search.
Another point is that you can fix your UI: You always show all facets
relevant to the current search, no matter which of them are active.
Those who do not apply anymore to the given result-set (like Mahout in our
example) still remain in the list of available projects but are marked as
unuasable (displayed in smooth gray or something like that to show that they
are inactive).

My problem is that I do not know how to create such a user-experience,
because, if I add another dimension (like the source-facet) things are
getting complicated.

Since Hoss showed in the Mastering Facets Webinar that such cross-taggings
are possible, I thought that this is an already built-in option for Solr. 

Regards,
Em

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-Tags-and-Facets-tp2843130p2848085.html
Sent from the Solr - User mailing list archive at Nabble.com.

MoreLikeThis

2011-04-21 Thread Brian Lamb

Hi all,

I have an mlt search set up on my site with over 2 million records in the
index. Normally, my results look like:

response
  lst name=responseHeader
int name=status0/int
int name=QTime204/int
  /lst
  result name=match numFound=41750 start=0
doc
  str name=titleSome result./str
/doc
  /result
  result name=response numFound=130872 start=0
doc
  str name=titleA similar result/str
/doc
...
  /result
/response

And there are 100 results under response. However, in some cases, there are
no results under response. Why is this the case and is there anything I
can do about it?

Here is my mlt configuration:

requestHandler name=/mlt class=solr.MoreLikeThisHandler
  lst name=defaults
str name=mlt.fltitle,score/str
int name=mlt.mindf1/int
int name=rows100/int
str name=fl*,score/str
   /lst
/requestHandler

And here is the URL I use to get results:
http://localhost:8983/solr/mlt/?q=title:Some random title

Any help on this matter would be greatly appreciated. Thanks!

Brian Lamb

Re: Multiple Tags and Facets

2011-04-21 Thread Chris Hostetter

: I watched an online video with Chris Hostsetter from Lucidimagination. He
: showed the possibility of having some Facets that exclude *all* filter while
: also having some Facets that take care of some of the set filters while
: ignoring other filters.

FWIW: That webinar is nearly identical to the apachecon talk i gave on the 
same topic, slides of which can be found here...

http://people.apache.org/~hossman/apachecon2010/facets/

This is the example i used on Slide #29...

  Same Facet, Different Exclusions

* A key can be specified for a facet to change the name used to
  identify it in the response.
* This allows you to have multiple instances of a facet, with
   differnet exclusions.

q = Hot Rod 
   fq = {!df=colors tag=cx}purple green 
  facet.field = {!key=all_colors ex=cx}colors 
  facet.field = {!key=overlap_colors}colors

...the point in that example is to treat a field (color) as two 
differnt facets: one with exclusions and one without.

it sounds like what you want is differnet -- i *think* what you 
are asking for is multiple exclusions for a single facet.  I didn't 
mention that in my slides, but you can do that using a comma seperated 
list of exclusions...

q = Hot Rod 
   fq = {!df=body tag=bc}purple
   fq = {!df=interior tag=ic}green
  facet.field = {!ex=bc,ic}model

-Hoss

Re: Indexing 20M documents from MySQL with DIH

2011-04-21 Thread Robert Gründler


we're indexing around 10M records from a mysql database into
a single solr core.

The DataImportHandler needs to join 3 sub-entities to denormalize
the data.

We've run into some troubles for the first 2 attempts, but setting
batchSize=-1 for the dataSource resolved the issues.

Do you need a lot of complex joins to import the data from mysql?



-robert




On 4/21/11 8:08 PM, Scott Bigelow wrote:

I've been using Solr for a while now, indexing 2-4 million records
using the DIH to pull data from MySQL, which has been working great.
For a new project, I need to index about 20M records (30 fields) and I
have been running into issues with MySQL disconnects, right around
15M. I've tried several remedies I've found on blogs, changing
autoCommit, batchSize etc., and none of them have seem to majorly
resolved the issue. It got me wondering: Is this the way everyone does
it? What about 100M records up to 1B; are those all pulled using DIH
and a single query?

I've used sphinx in the past, which uses multiple queries to pull out
a subset of records ranged based on PrimaryKey, does Solr offer
functionality similar to this? It seems that once a Solr index gets to
a certain size, the indexing of a batch takes longer than MySQL's
net_write_timeout, so it kills the connection.

Thanks for your help, I really enjoy using Solr and I look forward to
indexing even more data!

Re: HTMLStripCharFilterFactory, highlighting and InvalidTokenOffsetsException

2011-04-21 Thread Erick Erickson

Perhaps a better place to start is here:
http://wiki.apache.org/solr/HowToContribute#Contributing_Code_.28Features.2C_Big_Fixes.2C_Tests.2C_etc29

That page also has information about setting up Eclipse or IntelliJ
environments. But the place to start is to get the source and get to
the point where you can issue ant clean test from the command line.
That should compile all the source and run the junit tests.

ant example will build you a full deployment in the example
directory that you can run the usual way java -jar start.jar.

The IDEs also have a wizardly way to apply patches if you don't want
to apply them the command-line way.

Best
Erick

2011/4/21 Robert Gründler rob...@dubture.com:
 On 20.04.11 18:51, Robert Muir wrote:

 Hi, there is a proposed patch uploaded to the issue. Maybe you can
 help by reviewing/testing it?

 if i succeed in compiling solr, i can test the patch. Is this the right
 starting point
 for such an endeavour ? http://wiki.apache.org/solr/HackingSolr



 -robert

 2011/4/20 Robert Gründlerrob...@dubture.com:

 Hi all,

 i'm getting the following exception when using highlighting for a field
 containing HTMLStripCharFilterFactory:

 org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token
 ...
 exceeds length of provided text sized 21

 It seems this is a know issue:

 https://issues.apache.org/jira/browse/LUCENE-2208

 Does anyone know if there's a fix implemented yet in solr?


 thanks!


 -robert

Re: Highest frequency terms for a subset of documents

Well, it was worth the try;-)
But will using the facet.method=fc, will reducing the subset size
reduce the time and memory? Meaning is it an O( ndocs of the set)?
Thanks
On Thursday, April 21, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
 On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort o...@tra.cx wrote:
 So if i want to use the facet.method=fc, is there a way to speed it up? and
 remove the bucket size limitation?

 Not really - else we would have done it already ;-)
 We don't really have great methods for faceting on full-text fields
 (as opposed to shorter meta-data fields) today.

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco

Re: Highest frequency terms for a subset of documents

On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort o...@tra.cx wrote:
 Well, it was worth the try;-)
 But will using the facet.method=fc, will reducing the subset size
 reduce the time and memory? Meaning is it an O( ndocs of the set)?

facet.method=fc builds a multi-valued fieldcache like structure
(UnInvertedField) the first time, that
is used for counting facets for all subsequent requests.  So the
faceting time (after the first time) is O(ndocs of the set),
but the UnInvertedField singleton uses a large amout of memory
unrelated to any particular base docset.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco


 Thanks
 On Thursday, April 21, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
 On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort o...@tra.cx wrote:
 So if i want to use the facet.method=fc, is there a way to speed it up? and
 remove the bucket size limitation?

 Not really - else we would have done it already ;-)
 We don't really have great methods for faceting on full-text fields
 (as opposed to shorter meta-data fields) today.

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco

Index upgrade from 1.4.1 to 3.1 and 4.0

Hi all,
While doing some tests, I realized that an index that was created with
solr 1.4.1 is readable by solr 3.1, but nt readable by solr 4.0.
If I plan to migrate my index to 4.0, and I prefer not to reindex it
all, what would be my best course of action?
Will it be possible to continue to write to the index with 3.1? Will
that make it readable from 4.0 or only the newly created segments?
If I optimize it using 3.1, will that make it readable also from 4.0?
Thanks
Ofer

Re: Indexing 20M documents from MySQL with DIH

2011-04-21 Thread Scott Bigelow

Thanks for your response!

I think the issue is that the records are being returned TOO fast from
MySQL. I can dump them to CSV in about 30 minutes, but building the
solr index takes hours on the system I'm using. I may just need to use
a more powerful Solr instance so it doesn't leave MySQL hanging for
too long?

What about autoCommit, does that factor in to your import strategy?

2011/4/21 Robert Gründler rob...@dubture.com:
 we're indexing around 10M records from a mysql database into
 a single solr core.

 The DataImportHandler needs to join 3 sub-entities to denormalize
 the data.

 We've run into some troubles for the first 2 attempts, but setting
 batchSize=-1 for the dataSource resolved the issues.

 Do you need a lot of complex joins to import the data from mysql?



 -robert




 On 4/21/11 8:08 PM, Scott Bigelow wrote:

 I've been using Solr for a while now, indexing 2-4 million records
 using the DIH to pull data from MySQL, which has been working great.
 For a new project, I need to index about 20M records (30 fields) and I
 have been running into issues with MySQL disconnects, right around
 15M. I've tried several remedies I've found on blogs, changing
 autoCommit, batchSize etc., and none of them have seem to majorly
 resolved the issue. It got me wondering: Is this the way everyone does
 it? What about 100M records up to 1B; are those all pulled using DIH
 and a single query?

 I've used sphinx in the past, which uses multiple queries to pull out
 a subset of records ranged based on PrimaryKey, does Solr offer
 functionality similar to this? It seems that once a Solr index gets to
 a certain size, the indexing of a batch takes longer than MySQL's
 net_write_timeout, so it kills the connection.

 Thanks for your help, I really enjoy using Solr and I look forward to
 indexing even more data!

Re: Highest frequency terms for a subset of documents

So I'm guessing my best approach now would be to test trunk, and hope
that as 3.1 cut the performance in half, trunk will do the same
Thanks for the info
Ofer

On Friday, April 22, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
 On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort o...@tra.cx wrote:
 Well, it was worth the try;-)
 But will using the facet.method=fc, will reducing the subset size
 reduce the time and memory? Meaning is it an O( ndocs of the set)?

 facet.method=fc builds a multi-valued fieldcache like structure
 (UnInvertedField) the first time, that
 is used for counting facets for all subsequent requests.  So the
 faceting time (after the first time) is O(ndocs of the set),
 but the UnInvertedField singleton uses a large amout of memory
 unrelated to any particular base docset.

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco


 Thanks
 On Thursday, April 21, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
 On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort o...@tra.cx wrote:
 So if i want to use the facet.method=fc, is there a way to speed it up? and
 remove the bucket size limitation?

 Not really - else we would have done it already ;-)
 We don't really have great methods for faceting on full-text fields
 (as opposed to shorter meta-data fields) today.

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco

Re: Highest frequency terms for a subset of documents

On Thu, Apr 21, 2011 at 6:34 PM, Ofer Fort o...@tra.cx wrote:
 So I'm guessing my best approach now would be to test trunk, and hope
 that as 3.1 cut the performance in half, trunk will do the same

Trunk prob won't be much better... but the bulkpostings branch
possibly could be.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco

 Thanks for the info
 Ofer

 On Friday, April 22, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
 On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort o...@tra.cx wrote:
 Well, it was worth the try;-)
 But will using the facet.method=fc, will reducing the subset size
 reduce the time and memory? Meaning is it an O( ndocs of the set)?

 facet.method=fc builds a multi-valued fieldcache like structure
 (UnInvertedField) the first time, that
 is used for counting facets for all subsequent requests.  So the
 faceting time (after the first time) is O(ndocs of the set),
 but the UnInvertedField singleton uses a large amout of memory
 unrelated to any particular base docset.

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco


 Thanks
 On Thursday, April 21, 2011, Yonik Seeley yo...@lucidimagination.com 
 wrote:
 On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort o...@tra.cx wrote:
 So if i want to use the facet.method=fc, is there a way to speed it up? 
 and
 remove the bucket size limitation?

 Not really - else we would have done it already ;-)
 We don't really have great methods for faceting on full-text fields
 (as opposed to shorter meta-data fields) today.

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco

Re: Multiple Tags and Facets

Thank you Hoss.
I will try the comma-separated thing out. It seems to be what I searched
for. :)

Regards,
Em


Chris Hostetter-3 wrote:
 
 : I watched an online video with Chris Hostsetter from Lucidimagination.
 He
 : showed the possibility of having some Facets that exclude *all* filter
 while
 : also having some Facets that take care of some of the set filters while
 : ignoring other filters.
 
 FWIW: That webinar is nearly identical to the apachecon talk i gave on the 
 same topic, slides of which can be found here...
 
 http://people.apache.org/~hossman/apachecon2010/facets/
 
 This is the example i used on Slide #29...
 
   Same Facet, Different Exclusions
 
 * A key can be specified for a facet to change the name used to
   identify it in the response.
 * This allows you to have multiple instances of a facet, with
differnet exclusions.
 
 q = Hot Rod 
fq = {!df=colors tag=cx}purple green 
   facet.field = {!key=all_colors ex=cx}colors 
   facet.field = {!key=overlap_colors}colors
 
 ...the point in that example is to treat a field (color) as two 
 differnt facets: one with exclusions and one without.
 
 it sounds like what you want is differnet -- i *think* what you 
 are asking for is multiple exclusions for a single facet.  I didn't 
 mention that in my slides, but you can do that using a comma seperated 
 list of exclusions...
 
 q = Hot Rod 
fq = {!df=body tag=bc}purple
fq = {!df=interior tag=ic}green
   facet.field = {!ex=bc,ic}model
 
 -Hoss
 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-Tags-and-Facets-tp2843130p2849115.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Highest frequency terms for a subset of documents

Ok, I'll give it a try, as this is a server I am willing to risk.
How is the competability between solrj of bulkpostings, trunk, 3.1 and 1.4.1?

On Friday, April 22, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
 On Thu, Apr 21, 2011 at 6:34 PM, Ofer Fort o...@tra.cx wrote:
 So I'm guessing my best approach now would be to test trunk, and hope
 that as 3.1 cut the performance in half, trunk will do the same

 Trunk prob won't be much better... but the bulkpostings branch
 possibly could be.

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco

 Thanks for the info
 Ofer

 On Friday, April 22, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
 On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort o...@tra.cx wrote:
 Well, it was worth the try;-)
 But will using the facet.method=fc, will reducing the subset size
 reduce the time and memory? Meaning is it an O( ndocs of the set)?

 facet.method=fc builds a multi-valued fieldcache like structure
 (UnInvertedField) the first time, that
 is used for counting facets for all subsequent requests.  So the
 faceting time (after the first time) is O(ndocs of the set),
 but the UnInvertedField singleton uses a large amout of memory
 unrelated to any particular base docset.

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco


 Thanks
 On Thursday, April 21, 2011, Yonik Seeley yo...@lucidimagination.com 
 wrote:
 On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort o...@tra.cx wrote:
 So if i want to use the facet.method=fc, is there a way to speed it up? 
 and
 remove the bucket size limitation?

 Not really - else we would have done it already ;-)
 We don't really have great methods for faceting on full-text fields
 (as opposed to shorter meta-data fields) today.

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco

Re: Highest frequency terms for a subset of documents

On Thu, Apr 21, 2011 at 6:50 PM, Ofer Fort o...@tra.cx wrote:
 Ok, I'll give it a try, as this is a server I am willing to risk.
 How is the competability between solrj of bulkpostings, trunk, 3.1 and 1.4.1?

bulkpostings, trunk, and 3.1 should all be relatively solrj
compatible.  But the SolrJ javabin format (used by default for
queries) changed for strings between 1.4.1 and 3.1 (SOLR-2034).

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco


 On Friday, April 22, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
 On Thu, Apr 21, 2011 at 6:34 PM, Ofer Fort o...@tra.cx wrote:
 So I'm guessing my best approach now would be to test trunk, and hope
 that as 3.1 cut the performance in half, trunk will do the same

 Trunk prob won't be much better... but the bulkpostings branch
 possibly could be.

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco

 Thanks for the info
 Ofer

 On Friday, April 22, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
 On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort o...@tra.cx wrote:
 Well, it was worth the try;-)
 But will using the facet.method=fc, will reducing the subset size
 reduce the time and memory? Meaning is it an O( ndocs of the set)?

 facet.method=fc builds a multi-valued fieldcache like structure
 (UnInvertedField) the first time, that
 is used for counting facets for all subsequent requests.  So the
 faceting time (after the first time) is O(ndocs of the set),
 but the UnInvertedField singleton uses a large amout of memory
 unrelated to any particular base docset.

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco


 Thanks
 On Thursday, April 21, 2011, Yonik Seeley yo...@lucidimagination.com 
 wrote:
 On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort o...@tra.cx wrote:
 So if i want to use the facet.method=fc, is there a way to speed it up? 
 and
 remove the bucket size limitation?

 Not really - else we would have done it already ;-)
 We don't really have great methods for faceting on full-text fields
 (as opposed to shorter meta-data fields) today.

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco

Re: Highest frequency terms for a subset of documents

Ok, thanks

On Friday, April 22, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
 On Thu, Apr 21, 2011 at 6:50 PM, Ofer Fort o...@tra.cx wrote:
 Ok, I'll give it a try, as this is a server I am willing to risk.
 How is the competability between solrj of bulkpostings, trunk, 3.1 and 1.4.1?

 bulkpostings, trunk, and 3.1 should all be relatively solrj
 compatible.  But the SolrJ javabin format (used by default for
 queries) changed for strings between 1.4.1 and 3.1 (SOLR-2034).

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco


 On Friday, April 22, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
 On Thu, Apr 21, 2011 at 6:34 PM, Ofer Fort o...@tra.cx wrote:
 So I'm guessing my best approach now would be to test trunk, and hope
 that as 3.1 cut the performance in half, trunk will do the same

 Trunk prob won't be much better... but the bulkpostings branch
 possibly could be.

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco

 Thanks for the info
 Ofer

 On Friday, April 22, 2011, Yonik Seeley yo...@lucidimagination.com wrote:
 On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort o...@tra.cx wrote:
 Well, it was worth the try;-)
 But will using the facet.method=fc, will reducing the subset size
 reduce the time and memory? Meaning is it an O( ndocs of the set)?

 facet.method=fc builds a multi-valued fieldcache like structure
 (UnInvertedField) the first time, that
 is used for counting facets for all subsequent requests.  So the
 faceting time (after the first time) is O(ndocs of the set),
 but the UnInvertedField singleton uses a large amout of memory
 unrelated to any particular base docset.

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco


 Thanks
 On Thursday, April 21, 2011, Yonik Seeley yo...@lucidimagination.com 
 wrote:
 On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort o...@tra.cx wrote:
 So if i want to use the facet.method=fc, is there a way to speed it 
 up? and
 remove the bucket size limitation?

 Not really - else we would have done it already ;-)
 We don't really have great methods for faceting on full-text fields
 (as opposed to shorter meta-data fields) today.

 -Yonik
 http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
 25-26, San Francisco

Re: Indexing 20M documents from MySQL with DIH

2011-04-21 Thread Chris Hostetter


: For a new project, I need to index about 20M records (30 fields) and I
: have been running into issues with MySQL disconnects, right around
: 15M. I've tried several remedies I've found on blogs, changing

if you can provide some concrete error/log messages and the details of how 
you are configuring your datasource that might help folks provide better 
suggestions -- youv'e said you run into a problem but you havne't provided 
any details for people to go on in giving you feedback.

: resolved the issue. It got me wondering: Is this the way everyone does
: it? What about 100M records up to 1B; are those all pulled using DIH
: and a single query?

I've only recently started using DIH, and while it definitely has a lot 
of quirks/anoyances, it seems like a pretty good 80/20 solution for 
indexing with Solr -- but that doens't mean it's perfect for all 
situations.

Writing custom indexer code can certianly make sense in a lot of cases -- 
particularly where you already have a data pblishing system that you wnat 
to tie into directly -- the trick is to ensure you have a decent strategy 
for rebuilding the entire index should the need arrise (but this is relaly 
only an issue if your primary indexing solution is incremental -- many use 
cases can be satisifed just fine with a brute force full rebuild 
periodically impelmentation.


-Hoss

term position question from analyzer stack for WordDelimiterFilterFactory

2011-04-21 Thread Robert Petersen

So if I don't put preserveOriginal=1 in my WordDelimiterFilterFactory settings 
I cannot get a match between AppleTV on the indexing side and appletv on the 
search side.  Without that setting the all lowercase version of AppleTV is in 
term position two due to the catenateWords=1 or the catenateAll=1 settings.  I 
am surprised.  How does term position affect searching?  Here is my analysis 
with preserveOriginal=1 to make the lower case occur in both term position 1 
and 2:

Index Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position   1
term text   AppleTV
term type   word
source start,end0,7
payload 
org.apache.solr.analysis.SynonymFilterFactory {synonyms=index_synonyms.txt, 
expand=true, ignoreCase=true}
term position   1
term text   AppleTV
term type   word
source start,end0,7
payload 
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, 
ignoreCase=true}
term position   1
term text   AppleTV
term type   word
source start,end0,7
payload 
org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1, 
generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=1, 
catenateNumbers=1}
term position   1   2
term text   AppleTV TV
Apple   AppleTV
term type   wordword
wordword
source start,end0,7 5,7
0,5 0,7
payload 

org.apache.solr.analysis.LowerCaseFilterFactory {}
term position   1   2
term text   appletv tv
apple   appletv
term type   wordword
wordword
source start,end0,7 5,7
0,5 0,7
payload 

com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory 
{protected=protwords.txt}
term position   1   2
term text   appletv tv
apple   appletv
term type   wordword
wordword
source start,end0,7 5,7
0,5 0,7
payload 

org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
term position   1   2
term text   appletv tv
apple   appletv
term type   wordword
wordword
source start,end0,7 5,7
0,5 0,7
payload 

Query Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position   1
term text   appletv
term type   word
source start,end0,7
payload 
org.apache.solr.analysis.SynonymFilterFactory {synonyms=query_synonyms.txt, 
expand=true, ignoreCase=true}
term position   1
term text   appletv
term type   word
source start,end0,7
payload 
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, 
ignoreCase=true}
term position   1
term text   appletv
term type   word
source start,end0,7
payload 
org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=1, 
generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=1, 
catenateNumbers=1}
term position   1
term text   appletv
term type   word
source start,end0,7
payload 
org.apache.solr.analysis.LowerCaseFilterFactory {}
term position   1
term text   appletv
term type   word
source start,end0,7
payload 
com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory 
{protected=protwords.txt}
term position   1
term text   appletv
term type   word
source start,end0,7
payload 
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
term position   1
term text   appletv
term type   word
source start,end0,7
payload

Re: Indexing 20M documents from MySQL with DIH

2011-04-21 Thread Scott Bigelow

Thanks for the e-mail. I probably should have provided more details,
but I was more interested in making sure I was approaching the problem
correctly (using DIH, with one big SELECT statement for millions of
rows) instead of solving this specific problem. Here's a partial
stacktrace from this specific problem:

...
Caused by: java.io.EOFException: Can not read response from server.
Expected to read 4 bytes, read 0 bytes before connection was
unexpectedly lost.
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
... 22 more
Apr 21, 2011 3:53:28 AM
org.apache.solr.handler.dataimport.EntityProcessorBase getNext
SEVERE: getNext() failed for query 'REDACTED'
org.apache.solr.handler.dataimport.DataImportHandlerException:
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
Communications link failure

The last packet successfully received from the server was 128
milliseconds ago.  The last packet sent successfully to the server was
25,273,484 milliseconds ago.
...


A custom indexer, so that's a fairly common practice? So when you are
dealing with these large indexes, do you try not to fully rebuild them
when you can? It's not a nightly thing, but something to do in case of
a disaster? Is there a difference in the performance of an index that
was built all at once vs. one that has had delta inserts and updates
applied over a period of months?

Thank you for your insight.


On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : For a new project, I need to index about 20M records (30 fields) and I
 : have been running into issues with MySQL disconnects, right around
 : 15M. I've tried several remedies I've found on blogs, changing

 if you can provide some concrete error/log messages and the details of how
 you are configuring your datasource that might help folks provide better
 suggestions -- youv'e said you run into a problem but you havne't provided
 any details for people to go on in giving you feedback.

 : resolved the issue. It got me wondering: Is this the way everyone does
 : it? What about 100M records up to 1B; are those all pulled using DIH
 : and a single query?

 I've only recently started using DIH, and while it definitely has a lot
 of quirks/anoyances, it seems like a pretty good 80/20 solution for
 indexing with Solr -- but that doens't mean it's perfect for all
 situations.

 Writing custom indexer code can certianly make sense in a lot of cases --
 particularly where you already have a data pblishing system that you wnat
 to tie into directly -- the trick is to ensure you have a decent strategy
 for rebuilding the entire index should the need arrise (but this is relaly
 only an issue if your primary indexing solution is incremental -- many use
 cases can be satisifed just fine with a brute force full rebuild
 periodically impelmentation.


 -Hoss

Re: Indexing 20M documents from MySQL with DIH

2011-04-21 Thread Li

Can you post the dataconfig.XML? Probably you didn't use batch size

Sent from my iPhone

On Apr 21, 2011, at 5:09 PM, Scott Bigelow eph...@gmail.com wrote:

 Thanks for the e-mail. I probably should have provided more details,
 but I was more interested in making sure I was approaching the problem
 correctly (using DIH, with one big SELECT statement for millions of
 rows) instead of solving this specific problem. Here's a partial
 stacktrace from this specific problem:
 
 ...
 Caused by: java.io.EOFException: Can not read response from server.
 Expected to read 4 bytes, read 0 bytes before connection was
 unexpectedly lost.
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2539)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2989)
... 22 more
 Apr 21, 2011 3:53:28 AM
 org.apache.solr.handler.dataimport.EntityProcessorBase getNext
 SEVERE: getNext() failed for query 'REDACTED'
 org.apache.solr.handler.dataimport.DataImportHandlerException:
 com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
 Communications link failure
 
 The last packet successfully received from the server was 128
 milliseconds ago.  The last packet sent successfully to the server was
 25,273,484 milliseconds ago.
 ...
 
 
 A custom indexer, so that's a fairly common practice? So when you are
 dealing with these large indexes, do you try not to fully rebuild them
 when you can? It's not a nightly thing, but something to do in case of
 a disaster? Is there a difference in the performance of an index that
 was built all at once vs. one that has had delta inserts and updates
 applied over a period of months?
 
 Thank you for your insight.
 
 
 On Thu, Apr 21, 2011 at 4:31 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:
 
 : For a new project, I need to index about 20M records (30 fields) and I
 : have been running into issues with MySQL disconnects, right around
 : 15M. I've tried several remedies I've found on blogs, changing
 
 if you can provide some concrete error/log messages and the details of how
 you are configuring your datasource that might help folks provide better
 suggestions -- youv'e said you run into a problem but you havne't provided
 any details for people to go on in giving you feedback.
 
 : resolved the issue. It got me wondering: Is this the way everyone does
 : it? What about 100M records up to 1B; are those all pulled using DIH
 : and a single query?
 
 I've only recently started using DIH, and while it definitely has a lot
 of quirks/anoyances, it seems like a pretty good 80/20 solution for
 indexing with Solr -- but that doens't mean it's perfect for all
 situations.
 
 Writing custom indexer code can certianly make sense in a lot of cases --
 particularly where you already have a data pblishing system that you wnat
 to tie into directly -- the trick is to ensure you have a decent strategy
 for rebuilding the entire index should the need arrise (but this is relaly
 only an issue if your primary indexing solution is incremental -- many use
 cases can be satisifed just fine with a brute force full rebuild
 periodically impelmentation.
 
 
 -Hoss

Re: term position question from analyzer stack for WordDelimiterFilterFactory