RE: Facet sorting seems weird

2013-07-15 Thread David Quarterman
Hi Henrik,

Try setting up a copyField in your schema and give the copied field a type 
(something like 'text_ws') whose analyzer includes LowerCaseFilterFactory. Then 
sort on the copied field.

Regards,

DQ
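
The ordering described in this thread, and the effect of sorting on a lowercased copy, can be sketched in a few lines of Python. This only illustrates the comparison rules and is not Solr code. The `display_by_key` lookup is a hypothetical way of getting the original casing back for display.

```python
# Illustration only: mimics index-order vs. case-insensitive ordering.
brands = ["bObles", "LEGO"]

# Index order compares the raw term values, so every uppercase
# ASCII letter sorts before every lowercase one.
index_order = sorted(brands)

# Sorting on a lowercased copy ignores case differences.
caseless_order = sorted(brands, key=str.lower)

# The lowercased copy loses the original casing, so keep a
# lookup from the sort key back to the value to display.
display_by_key = {b.lower(): b for b in brands}

print(index_order)     # LEGO first
print(caseless_order)  # bObles first
```

A facet on a lowercased field returns the lowercased terms, so the exact casing has to come from somewhere else, as in the lookup above.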

-Original Message-
From: Henrik Ossipoff Hansen [mailto:h...@entertainment-trading.com] 
Sent: 15 July 2013 15:08
To: solr-user@lucene.apache.org
Subject: Facet sorting seems weird

Hello, first time writing to the list. I am a developer for a company where we 
recently switched all of our search core from Sphinx to Solr with great 
results. In general we've been very happy with the switch, and everything seems 
to work just as we want it to.

Today however we've run into a bit of an issue regarding faceted sort.

For example we have a field called brand in our core, defined as the text_en 
datatype from the example Solr core. This field is copied into facet_brand with 
the datatype string (since we don't really need to do much with it except show 
it for faceted navigation).

Now, given these two entries into the field on different documents, LEGO and 
bObles, and given facet.sort=index, it appears that LEGO is sorted as being 
before bObles. I assume this is because of casing differences.

My question then is, how do we define a decent datatype in our schema, where 
the casing is exact, but we are able to sort it without casing mattering?

Thank you :)

Best regards,
Henrik Ossipoff


RE: Commit different database rows to solr with same id value?

2013-07-10 Thread David Quarterman
Hi Jason,

Assuming you're using DIH, why not build a new, unique id within the query to 
use as the 'doc_id' for SOLR? We do something like this in one of our 
collections. In MySQL, try this (I don't know what it would be for any other db 
but there must be equivalents):

select @rownum:=@rownum+1 rowid, t.* from (main select query) t, (select @rownum:=0) s

Regards,

DQ
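
The running-number idea in the query above can be sketched outside the database as well. This is a minimal illustration, assuming the rows arrive as dictionaries. The `with_row_ids` helper and the `source` field are made up for the example.

```python
from itertools import count

def with_row_ids(rows, start=1):
    """Attach a running 'doc_id' to each row, like @rownum in the query."""
    counter = count(start)
    return [dict(row, doc_id=next(counter)) for row in rows]

# Hypothetical rows from two tables that share primary key 3.
rows = [
    {"source": "table1", "id": 3},
    {"source": "table2", "id": 3},
]
docs = with_row_ids(rows)
doc_ids = [d["doc_id"] for d in docs]
print(doc_ids)
```

A running number is only stable if the rows come back in the same order on every import, so it suits full rebuilds better than incremental updates.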

-Original Message-
From: Jason Huang [mailto:jason.hu...@icare.com] 
Sent: 10 July 2013 15:50
To: solr-user@lucene.apache.org
Subject: Commit different database rows to solr with same id value?

Hello,

I am trying to use Solr to store fields from two different database tables, 
where the primary keys are in the format of 1, 2, 3, ...

In Java, we build different POJO classes for these two database tables:

table1.java

@SolrIndex(name="id")
private String idTable1;

table2.java

@SolrIndex(name="id")
private String idTable2;

And later we add the fields defined in these two classes and commit them to 
solrServer.


Here is the scenario where I am having issues:

(1) commit a row from table1 with primary key = 3, this generates a document 
in Solr

(2) commit another row from table2 with the same value of primary key = 3, 
this overwrites the document generated in step (1).
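
The overwrite in steps (1) and (2) follows from the index being keyed on the unique id field. A toy model of that behaviour, using a plain dictionary as a stand-in for the index (an assumption for illustration, not how Solr stores documents):

```python
# Toy stand-in for an index keyed on the unique "id" field.
index = {}

def add_document(doc):
    """Adding a document whose id already exists replaces the old one."""
    index[doc["id"]] = doc

add_document({"id": "3", "table": "table1"})  # step (1)
add_document({"id": "3", "table": "table2"})  # step (2) replaces step (1)

print(len(index), index["3"]["table"])
```

Only one document survives because both rows map to the same key, so whatever fix is chosen has to make the two keys differ.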


What we really want to achieve is to keep both rows in (1) and (2) because they 
are from different tables. I've read something from google search and it 
appears that we might be able to do it via keeping multiple cores in solr? 
Could anyone point at how to implement multiple core to achieve this?
To be more specific, when I commit the row as a document, I don't have a place 
to pick a certain core and I am not sure if it makes any sense for me to 
specify a core when I commit the document since the layer I am working on 
should abstract it away from me.



The second question is - if we don't want to go multicore (since we can't 
easily search for related data across multiple cores), how can we resolve this 
issue so that rows from different database tables which share the same primary 
key both still exist? We don't want to have to change the primary key format 
just to ensure uniqueness of the primary key across all the different database 
tables.


thanks!


Jason


SOLR 4.0 frequent admin problem

2013-07-04 Thread David Quarterman
Hi,

About once a week the admin system comes up with SolrCore Initialization 
Failures. There's nothing in the logs and SOLR continues to work in the 
application it's supporting and in the 'direct access' mode (i.e. 
http://123.465.789.100:8080/solr/collection1/select?q=bingo:*).

The cure is to restart Jetty (8.1.7) and then we can use the admin system again 
via pc's. However, a colleague can get into admin on an iPad with no trouble 
when no browser on a pc can!

Anyone any ideas? It's really frustrating!

Best regards,

DQ



RE: SOLR 4.0 frequent admin problem

2013-07-04 Thread David Quarterman
Cheers, Roman! It was a default Jetty setup, so I've now added a 'work' 
directory and that's in use.

-Original Message-
From: Roman Chyla [mailto:roman.ch...@gmail.com] 
Sent: 04 July 2013 15:00
To: solr-user@lucene.apache.org
Subject: Re: SOLR 4.0 frequent admin problem

Yes :-)  see SOLR-118, seems an old issue...




RE: Newbie SolR - Need advice

2013-07-03 Thread David Quarterman
Hi Fabio,

Sandeep is right - it'll take time. SOLR isn't straightforward when you first 
start out but the tutorial is the best first step. You can then adapt the 
various config files in the tutorial to suit your situation. I'd recommend 
a simple approach to get the hang of it: just index one table, specifying 
some fields to be searched in the schema.xml.

There are some good books around too (and Sandeep's recommendation on Lucidworks 
is a good one): Apache Solr 3.1 Cookbook by Rafal Kuc (still valid for 4.x.x), 
Jack Krupansky's Solr 4.x Deep Dive - Early Access Release, and Solr In Action by 
Trey Grainger & Tim Potter.

If you need help, shout! It's a great community.

Cheers, DQ

-Original Message-
From: fabio1605 [mailto:fabio.to...@btinternet.com] 
Sent: 03 July 2013 09:55
To: solr-user@lucene.apache.org
Subject: Re: Newbie SolR - Need advice

Hi Sandeep

Thank you for your reply 

I'll have a read through the tutorials now that I understand the principle of 
all this.

I would ideally like to keep MSSQL and bolt Solr on top of it, as we have a 
200GB database.

Cheers



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Newbie-SolR-Need-advice-tp4074746p4075026.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Newbie SolR - Need advice

2013-07-02 Thread David Quarterman
Hi Fabio,

Like Jack says, try the tutorial. But to answer your question, SOLR isn't a 
bolt on to SQLServer or any other DB. It's a fantastically fast 
indexing/searching tool. You'll need to use the DataImportHandler (see the 
tutorial) to import your data from the DB into the indices that SOLR uses. Once 
in there, you'll have more power & flexibility than SQLServer would ever give 
you!

Haven't tried SOLR on Windows (I guess your environment) but I'm sure it'll 
work using Jetty or Tomcat as web container.

Stick with it. The ride can be bumpy but the experience is sensational!

DQ

-Original Message-
From: fabio1605 [mailto:fabio.to...@btinternet.com] 
Sent: 02 July 2013 16:16
To: solr-user@lucene.apache.org
Subject: Newbie SolR - Need advice

Hi

We have an MSSQL Server which is just getting far too large now and performance 
is dying! The majority of our webservers are mainly doing search functions, so I 
thought it may be best to move to SolR, but I know very little about it!

My questions are:

Does SolR run as a bolt-on to MSSQL - as in the data is still in MSSQL and SolR 
is just the search bit in between?

I'm really struggling to understand the point of SOLR etc., so if someone could 
point me to a Dummies website I'd appreciate it! Google is throwing too much 
confusion at me!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Newbie-SolR-Need-advice-tp4074746.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Newbie SolR - Need advice

2013-07-02 Thread David Quarterman
Don’t worry Fabio - nobody knows everything (apart from Hossman). Following on 
from Sandeep, to use SOLR, you extract the data from your MSSQL DB using the 
DataImportHandler and you can then query it, facet it, pivot it to your heart's 
content. And fast!

You can use almost anything to build the SOLR queries - Java & PHP being 
probably the most popular. There is a library for Perl, I think, but I've never 
tried it.
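
Since a query is just an HTTP request, any language that can build a URL will do. A minimal sketch in Python, assuming a local install on port 8983 with a core called collection1 and a field called `name` (all placeholders to adjust):

```python
from urllib.parse import urlencode

# Assumed values: adjust host, port and core name to your install.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"

def build_query_url(query, rows=10):
    """Return a select URL asking for JSON results."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return SOLR_SELECT + "?" + urlencode(params)

url = build_query_url("name:boots")
print(url)
```

The script would then fetch that URL and parse the JSON response. `urlencode` takes care of escaping characters such as the colon.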

So, you keep your mssql database, you just don't use it for searches - that'll 
relieve some of the load. Searches then all go through SOLR & its Lucene 
indexes. If your various tables need SQL joins, you specify those in the 
DataImportHandler (DIH) config. That way, when SOLR indexes everything, it 
indexes the data the way you want to see it.

DIH handles the data export from mssql -> SOLR and it's not too difficult to 
set up.

You imply you're adding (inserting) data. How much, how often? DIH has a delta 
import feature so you can add data on the fly to SOLR's indexes.

Much of it comes down to the data model you have. My advice would be to try it 
and see. You will be pleasantly surprised!



-Original Message-
From: fabio1605 [mailto:fabio.to...@btinternet.com] 
Sent: 02 July 2013 17:10
To: solr-user@lucene.apache.org
Subject: RE: Newbie SolR - Need advice

Thanks guys

So SolR is actually a database replacement for MSSQL... Am I right?


We have a lot of Perl scripts that contain lots of SQL insert queries etc.


How do we query the SolR database from scripts? I know I have a lot to 
learn still, so excuse my ignorance.

Also... what is Mongo and how does it compare?

I just don't understand how in 10 years of web development I have never heard of 
SolR till last week.







--
View this message in context: 
http://lucene.472066.n3.nabble.com/Newbie-SolR-Need-advice-tp4074746p4074782.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Building a central index with Lucene + Solr

2013-03-05 Thread David Quarterman
Hi Alvaro,

I agree with Otis & Alexandre (esp. Windows + PHP!). However, there are plenty 
of people using Solr & PHP out there very successfully. There's another good 
package at http://code.google.com/p/solr-php-client/ which is easy to implement 
and has some example usage.

Regards,

DQ

 

From: Álvaro Vargas Quezada [mailto:al...@outlook.com] 
Sent: 05 March 2013 14:53
To: solr-user@lucene.apache.org
Subject: Building a central index with Lucene + Solr

 

Hi everyone!

 

I'm trying to develop a central index. I installed Solr and reached the screen 
that I attach, but I don't know how to continue from this point. I want to 
develop an app in PHP which uses Solr, but I don't know how. Can anyone help 
me, maybe with a tutorial or something like that?

 

Thanks and greetz from Chile!

 



RE: Edismax odd results

2013-02-22 Thread David Quarterman
Hi Erick,

Funnily enough, I cracked it about 5 minutes before your email arrived! The 
problem was using WhitespaceTokenizer instead of Standard AND having the 
LowerCaseFilter after the PorterStemmingFilter. Getting them in the right order 
has solved all the problems and we get all our engineer boots and ankle boots at 
the top of the set!

Many thanks to all who took part.

Regards,

DQ
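
Why the filter order matters can be shown with a toy example. The `toy_stem` function below is a made-up stand-in that, like a stemmer expecting lowercase input, only recognises lowercase suffixes. It is not the Porter algorithm.

```python
def toy_stem(token):
    """Toy stemmer: strips a few lowercase suffixes only.
    It stands in for a real stemmer that expects lowercase input."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def lowercase_then_stem(token):
    return toy_stem(token.lower())

def stem_then_lowercase(token):
    return toy_stem(token).lower()

samples = ("Boots", "BOOTS", "boots")
right = [lowercase_then_stem(t) for t in samples]
wrong = [stem_then_lowercase(t) for t in samples]
print(right)
print(wrong)
```

With lowercasing first, every variant ends up as the same term. With stemming first, the all-caps variant slips past the stemmer and is indexed as a different term.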

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 22 February 2013 12:59
To: solr-user@lucene.apache.org
Subject: Re: Edismax odd results

OK, let's see the debug data for explainOther.

One thing, though. Your analysis chain is apt to be surprising. The fact that 
you have 222 terms with the ':' says that you're probably not getting what I'd 
guess you want. That ':' is part of your token, and will not match 
"engineering". Consider changing some of your filters to remove stuff like 
that.

Best
Erick
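
The stray ':' is what splitting on whitespace alone produces. A rough illustration in Python, where the regular expression is only an approximation of a tokenizer that discards punctuation and the sample text is invented:

```python
import re

text = "Engineering: boots"

# Splitting on whitespace keeps punctuation attached to the token.
whitespace_tokens = text.lower().split()

# Splitting on word characters drops it (a rough stand-in for a
# tokenizer that discards punctuation).
word_tokens = re.findall(r"\w+", text.lower())

print(whitespace_tokens)
print(word_tokens)
```

A query for the bare word can only match the second form, which is why the token with the colon attached behaves like a different term.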


RE: If we Open Source our platform, would it be interesting to you?

2013-02-21 Thread David Quarterman
Hi Marcelo,

Looked through your site and the framework looks very powerful as an 
aggregator. We do a lot of data aggregation from many different sources in many 
different formats (XML, JSON, text, CSV, etc) using RDBMS as the main 
repository for eventual SOLR indexing. A 'one-stop-shop' for all this would be 
very appealing.

Have you looked at products like Talend & Jitterbit? These offer transformation 
from almost anything to almost anything using graphical interfaces (Jitterbit 
is better) and a PHP-like coding format for trickier work. If you (or somebody) 
could add a graphical interface, the world would beat a path to your door!

Regards,

DQ

-Original Message-
From: Marcelo Elias Del Valle [mailto:marc...@s1mbi0se.com.br] 
Sent: 20 February 2013 18:18
To: solr-user@lucene.apache.org
Subject: If we Open Source our platform, would it be interesting to you?

Hello All,

I’m sending this email because I think it may be interesting for Solr users, as 
this project makes heavy use of the Solr platform.

We are strongly considering opening the source of our DMP (Data Management 
Platform), if it proves to be technically interesting to other developers / 
companies.

More details: http://www.s1mbi0se.com/s1mbi0se_DMP.html

All comments, questions and criticism are happening at HN:
http://news.ycombinator.com/item?id=5251780

Please feel free to send questions, comments and criticism... We will try to 
reply to them all.

Regards,
Marcelo


RE: Edismax odd results

2013-02-20 Thread David Quarterman
Hi Shawn,

Schema's at http://justpaste.it/davidqhog. It's the basic SOLR 4.0 with 
additions!

Regards,

DQ


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: 19 February 2013 18:32
To: solr-user@lucene.apache.org
Subject: Re: Edismax odd results

On 2/19/2013 11:16 AM, David Quarterman wrote:
> This is definitely driving us mad now! Changed to PorterStemming and there's 
> very little difference.
>
> If we add fq=engineer, we get 0 results. Add fq=engineer* and we get the 90 
> in the system. Try with fq=ankle* and we get 2. Correct. Try with fq=harness* 
> and we get 0!
>
> The stemming reduces 'engineer' to 'engin' so I'd have expected a lot more 
> results.
>
> Anyone got any ideas?

Did you completely reindex when you changed your schema?  You must reindex.

Does the index analysis match the query analysis?  Some specific differences 
are allowed (and sometimes encouraged), but stemming must be done to both.  Can 
you share your schema?  Use a paste website like pastie.org for that.

Thanks,
Shawn



RE: Edismax odd results

2013-02-20 Thread David Quarterman
Hi Erick,

Debug=all posted on http://justpaste.it/davidqhogdebug. Can't see anything 
obvious myself, but then I'm not an expert!

Regards,

DQ

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 20 February 2013 02:02
To: solr-user@lucene.apache.org
Subject: Re: Edismax odd results

When you get back to this tomorrow, also try and paste the parsed query bits 
you get back when you append debug=all. Sometimes it's surprising what the 
parsed query _really_ looks like

Best
Erick


On Tue, Feb 19, 2013 at 3:13 PM, David Quarterman da...@corexe.com wrote:

> Hi Shawn,
>
> Now finished for the day but will post the schema tomorrow. Thanks for 
> the help (and Jack too).
>
> Regards,
>
> DQ
>
> P.S. did reindex after changing schema and the analyzer/query stuff 
> matches precisely!!





RE: Edismax odd results

2013-02-20 Thread David Quarterman
Hi Erick,

I understand the wildcard issue -  that was more desperation on our part than 
logic!

TermsComponent showed 
<lst name="prodnameplurals">
<int name="engineering:">222</int>
<int name="engineer">197</int>
</lst>
so the term is in the index.
Using the explainOther, I can see that the relevance of documents with 
'engineer boots' in the name is low compared to the others and they appear 
randomly distributed through the resultset (I know it's not random). We've 
tried all sorts of things to boost them but to no avail. Trying 'logger boots' 
or 'harness boots' gives good results with the required terms at the top of the 
set.

I'm mystified.

Regards,

DQ

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 20 February 2013 12:49
To: solr-user@lucene.apache.org
Subject: Re: Edismax odd results

OK, first:
wildcarding and stemming don't get along well together. Since you've stemmed 
the field, enginee* would not match the stemmed term engin. This is actually 
pretty tricky to try to implement. For instance, how would enginee stem? So the 
fqs you posted are going to mislead you in that regard.

If you want to examine the actual values in your index, consider using 
TermsComponent or Luke. Either will show you exactly what's being searched 
against.

I suspect that your fq entries (as typed) are going against the default field 
of text as defined in your schema, which doesn't stem, so that's leading you 
astray possibly.

Finally, you may be getting bitten by scoring, field norms and all that. If you 
have a doc ID that you _know_ contains engineers boots, try using debug with 
explainOther (
http://wiki.apache.org/solr/CommonQueryParameters#explainOther) which might 
help you understand what's happening with the doc you care about

Best
Erick
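
The first point, about wildcards and stemming, can be illustrated with a small sketch. It assumes the field holds the stemmed forms mentioned in this thread and treats a wildcard query as a plain prefix test. The `prefix_matches` helper is made up for the example.

```python
# Assumed index contents: the stemmed forms reported in this thread.
indexed_terms = ["engin", "boot"]

def prefix_matches(prefix, terms):
    """Wildcard queries are compared with the indexed terms as-is;
    the prefix itself is not stemmed."""
    return [t for t in terms if t.startswith(prefix)]

print(prefix_matches("enginee", indexed_terms))  # longer than the stem
print(prefix_matches("engin", indexed_terms))
```

The prefix 'enginee' is longer than the stored stem, so nothing in the stemmed field can start with it.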





Edismax odd results

2013-02-19 Thread David Quarterman
Hi all,

We have an index of boots which contains harness boots, engineer boots, ankle 
boots, etc. An edismax search on the index for 'harness boots' brings back 
2,175 boots with 'harness' results at the top. Searching 'engineer boots' 
brings back everything but 'engineer boots', same for 'ankle boots' - in fact, 
the same result set of 1,873, mostly boots but with a few other products mixed 
in.

We're on SOLR 4.0 and the field we're querying is stemmed (snowball), 
lowercased on WhiteSpaceTokenizer. Any ideas?

Regards,

 

David Q



RE: Edismax odd results

2013-02-19 Thread David Quarterman
Hi Jack,

Here's a test query we've been using:

select?q=+engineer+boots&defType=edismax&fl=prodname&qf=prodnameplurals&pf2=prodnameplurals^2.0

This still produces a result set where the first 'engineer boot' is way down 
the list and subsequent ones are interspersed with other boots. They're all in 
there, just not at the top. Below is the debug on the first item that is an 
engineer boot.

<str name="ITEM_">
0.23492618 = (MATCH) sum of:
  0.23492618 = (MATCH) product of:
    0.46985236 = (MATCH) sum of:
      0.46985236 = (MATCH) weight(prodnameplurals:boot in 48270) [DefaultSimilarity], result of:
        0.46985236 = score(doc=48270,freq=1.0 = termFreq=1.0), product of:
          0.22236869 = queryWeight, product of:
            4.8295836 = idf(docFreq=1867, maxDocs=86009)
            0.046043035 = queryNorm
          2.112943 = fieldWeight in 48270, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            4.8295836 = idf(docFreq=1867, maxDocs=86009)
            0.4375 = fieldNorm(doc=48270)
    0.5 = coord(1/2)
</str>

Regards,

DQ
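
The numbers in the explain output above can be reproduced by hand. The sketch below assumes the classic Lucene TF-IDF formulas (idf = 1 + ln(maxDocs / (docFreq + 1)), tf = sqrt(freq)) and takes queryNorm and fieldNorm straight from the output rather than deriving them.

```python
import math

doc_freq, max_docs = 1867, 86009
query_norm = 0.046043035   # taken from the explain output
field_norm = 0.4375        # taken from the explain output
tf = math.sqrt(1.0)        # the term occurs once in the field

idf = 1.0 + math.log(max_docs / (doc_freq + 1))
query_weight = idf * query_norm
field_weight = tf * idf * field_norm
coord = 1 / 2              # one of the two query terms matched

score = query_weight * field_weight * coord
print(round(idf, 4), round(score, 4))
```

The coord(1/2) factor is the telling part: only 'boot' contributed to the score, so the other query term did not match this document at all.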

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: 19 February 2013 15:31
To: solr-user@lucene.apache.org
Subject: Re: Edismax odd results

Show us your qf and pf params. Do you have PF2 set? That's the key for getting 
the phrase "engineer boots" boosted higher than just "boots". You may also 
simply have to give a higher PF2 boost since "boots" probably has a much higher 
term frequency than "engineer" or even the natural Lucene score for 
"engineer boot".

Also check the debugQuery=true explain scoring to see how "engineer", "boot", 
and "engineer boot" are being scored - you may have to add some specific query 
phrases to force "engineer boot" into the top results to compare the scoring.

-- Jack Krupansky
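
As a rough picture of what pf2 boosts: as far as I understand it, edismax turns each pair of adjacent query words into a phrase and boosts documents containing that phrase. The helper below only generates those word pairs. It is an illustration, not the parser's actual code.

```python
def word_bigrams(query):
    """Return each pair of adjacent words as a phrase."""
    words = query.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

print(word_bigrams("engineer boots"))
print(word_bigrams("black engineer boots"))
```

For a two-word query there is a single pair, so the pf2 boost only helps if the field actually contains both analysed words next to each other.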




RE: Edismax odd results

2013-02-19 Thread David Quarterman
Hi Shawn,

I checked the admin analysis earlier. Stemming is taking 'engineer' down to 
'engin', but then I'd have thought that a search on 'engin boots' would work 
but it doesn't.

I'll try turning the wick back up on the logging - we set it to 'warning'.

Regards,

DQ

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: 19 February 2013 16:25
To: solr-user@lucene.apache.org
Subject: Re: Edismax odd results

I do not see the word engineer (or any other similar word) in the score 
calculation, only boots.  A test on my own index shows both words in the 
calculations.  I would use the analysis admin page on the prodnameplurals field 
to see what happens to the input of engineer boots on both index and query - 
see what part of your analysis chain removes it.

If you don't see any problem there, then the Solr log (assuming you haven't 
changed the default log level of INFO) should have a record of what parameters 
were actually received when the query was made.

Thanks,
Shawn





RE: Edismax odd results

2013-02-19 Thread David Quarterman
Hi Shawn/Jack,

The log shows the query going in okay, nothing gets stripped out so we're still 
at a loss to understand this. Could it be that Snowball stemming is too 
invasive?

Regards,

DQ

-Original Message-
From: David Quarterman [mailto:da...@corexe.com] 
Sent: 19 February 2013 16:38
To: solr-user@lucene.apache.org
Subject: RE: Edismax odd results

Hi Shawn,

I checked the admin analysis earlier. Stemming is taking 'engineer' down to 
'engin', but then I'd have thought that a search on 'engin boots' would work 
but it doesn't.

I'll try turning the wick back up on the logging - we set it to 'warning'.

Regards,

DQ

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: 19 February 2013 16:25
To: solr-user@lucene.apache.org
Subject: Re: Edismax odd results

I do not see the word engineer (or any other similar word) in the score 
calculation, only boots.  A test on my own index shows both words in the 
calculations.  I would use the analysis admin page on the prodnameplurals field 
to see what happens to the input of engineer boots on both index and query - 
see what part of your analysis chain removes it.

If you don't see any problem there, then the Solr log (assuming you haven't 
changed the default log level of INFO) should have a record of what parameters 
were actually received when the query was made.

Thanks,
Shawn


On 2/19/2013 9:14 AM, David Quarterman wrote:
 Hi Jack,

 Here's q test query we've been using:

 select?q=+engineer+bootsdefType=edismaxfl=prodnameqf=prodnameplural
 spf2=prodnameplurals^2.0

 This still produces a result set where the first 'engineer boot' is way down 
 the list and subsequent ones are interspersed with other boots. They're all 
 in there, just not at the top. Below is the debug on the first item that is 
 an engineer boot.

 str name=ITEM_
 0.23492618 = (MATCH) sum of:
0.23492618 = (MATCH) product of:
  0.46985236 = (MATCH) sum of:
0.46985236 = (MATCH) weight(prodnameplurals:boot in 48270) 
 [DefaultSimilarity], result of:
  0.46985236 = score(doc=48270,freq=1.0 = termFreq=1.0 ), 
 product of:
0.22236869 = queryWeight, product of:
  4.8295836 = idf(docFreq=1867, maxDocs=86009)
  0.046043035 = queryNorm
2.112943 = fieldWeight in 48270, product of:
  1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
  4.8295836 = idf(docFreq=1867, maxDocs=86009)
  0.4375 = fieldNorm(doc=48270)
  0.5 = coord(1/2)
 /str

 Regards,

 DQ

 -Original Message-
 From: Jack Krupansky [mailto:j...@basetechnology.com]
 Sent: 19 February 2013 15:31
 To: solr-user@lucene.apache.org
 Subject: Re: Edismax odd results

 Show us your qf and pf params. Do you have PF2 set? That's the key for 
 getting the phrase engineer boots boosted higher than just boots. You may 
 also simply have to give a higher PF2 boost since boots probably has a much 
 higher term frequency than engineer or even the natural Lucene score for 
 engineer boot.

 Also check the debugQuery=true explain scoring to see how engineer, boot, 
 and engineer boot are being scored - you may have to add some specific 
 query phrases to force engineer boot into the top results to comparing the 
 scoring.

 -- Jack Krupansky

 -Original Message-
 From: David Quarterman
 Sent: Tuesday, February 19, 2013 6:21 AM
 To: solr-user@lucene.apache.org
 Subject: Edismax odd results

 Hi all,

 We have an index of boots which contains harness boots, engineer boots,
 ankle boots, etc. An edismax search on the index for 'harness boots' brings
 back 2,175 boots with 'harness' results at the top. Searching 'engineer
 boots' brings back everything but 'engineer boots', same for 'ankle boots' -
 in fact, same result set of 1,873 mostly boots but a few other products mixed
 in.

 We're on SOLR 4.0 and the field we're querying is stemmed (snowball), 
 lowercased on WhiteSpaceTokenizer. Any ideas?



RE: Edismax odd results

2013-02-19 Thread David Quarterman
Hi,

This is definitely driving us mad now! Changed to PorterStemming and there's 
very little difference. 

If we add fq=engineer, we get 0 results. Add fq=engineer* and we get the 90 in 
the system. Try with fq=ankle* and we get 2. Correct. Try with fq=harness* and 
we get 0!

The stemming reduces 'engineer' to 'engin' so I'd have expected a lot more 
results.

Anyone got any ideas?

Regards,

DQ



-Original Message-
From: David Quarterman [mailto:da...@corexe.com] 
Sent: 19 February 2013 17:09
To: solr-user@lucene.apache.org
Subject: RE: Edismax odd results

Hi Shawn/Jack,

The log shows the query going in okay, nothing gets stripped out so we're still
at a loss to understand this. Could it be that Snowball stemming is too
invasive?

Regards,

DQ

-Original Message-
From: David Quarterman [mailto:da...@corexe.com]
Sent: 19 February 2013 16:38
To: solr-user@lucene.apache.org
Subject: RE: Edismax odd results

Hi Shawn,

I checked the admin analysis earlier. Stemming is taking 'engineer' down to 
'engin', but then I'd have thought that a search on 'engin boots' would work 
but it doesn't.

I'll try turning the wick back up on the logging - we set it to 'warning'.

Regards,

DQ

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: 19 February 2013 16:25
To: solr-user@lucene.apache.org
Subject: Re: Edismax odd results

I do not see the word engineer (or any other similar word) in the score 
calculation, only boots.  A test on my own index shows both words in the 
calculations.  I would use the analysis admin page on the prodnameplurals field 
to see what happens to the input of engineer boots on both index and query - 
see what part of your analysis chain removes it.

If you don't see any problem there, then the Solr log (assuming you haven't 
changed the default log level of INFO) should have a record of what parameters 
were actually received when the query was made.

Thanks,
Shawn


On 2/19/2013 9:14 AM, David Quarterman wrote:
 Hi Jack,

 Here's a test query we've been using:

 select?q=+engineer+boots&defType=edismax&fl=prodname&qf=prodnameplurals&pf2=prodnameplurals^2.0

 This still produces a result set where the first 'engineer boot' is way down 
 the list and subsequent ones are interspersed with other boots. They're all 
 in there, just not at the top. Below is the debug on the first item that is 
 an engineer boot.

 <str name="ITEM_">
 0.23492618 = (MATCH) sum of:
   0.23492618 = (MATCH) product of:
     0.46985236 = (MATCH) sum of:
       0.46985236 = (MATCH) weight(prodnameplurals:boot in 48270) [DefaultSimilarity], result of:
         0.46985236 = score(doc=48270,freq=1.0 = termFreq=1.0), product of:
           0.22236869 = queryWeight, product of:
             4.8295836 = idf(docFreq=1867, maxDocs=86009)
             0.046043035 = queryNorm
           2.112943 = fieldWeight in 48270, product of:
             1.0 = tf(freq=1.0), with freq of:
               1.0 = termFreq=1.0
             4.8295836 = idf(docFreq=1867, maxDocs=86009)
             0.4375 = fieldNorm(doc=48270)
     0.5 = coord(1/2)
 </str>
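
 As a sanity check on the explain output, the figures multiply out as
 shown in this small Python sketch. It uses only the numbers quoted in the
 explain; the variable names are illustrative, not Lucene identifiers.

```python
# Values copied from the explain output for doc 48270.
idf = 4.8295836
query_norm = 0.046043035
tf = 1.0
field_norm = 0.4375
coord = 0.5  # coord(1/2): only one of the two query terms matched

query_weight = idf * query_norm            # explain shows 0.22236869
field_weight = tf * idf * field_norm       # explain shows 2.112943
term_score = query_weight * field_weight   # explain shows 0.46985236
final_score = term_score * coord           # explain shows 0.23492618

print(round(query_weight, 6), round(field_weight, 6), round(final_score, 6))
```

 The coord(1/2) factor halves the score because only 'boot' contributed;
 no clause for 'engineer' appears anywhere in the explain.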

 Regards,

 DQ

 -Original Message-
 From: Jack Krupansky [mailto:j...@basetechnology.com]
 Sent: 19 February 2013 15:31
 To: solr-user@lucene.apache.org
 Subject: Re: Edismax odd results

 Show us your qf and pf params. Do you have PF2 set? That's the key for 
 getting the phrase engineer boots boosted higher than just boots. You may 
 also simply have to give a higher PF2 boost since boots probably has a much 
 higher term frequency than engineer or even the natural Lucene score for 
 engineer boot.

 Also check the debugQuery=true explain scoring to see how engineer, boot,
 and engineer boot are being scored - you may have to add some specific
 query phrases to force engineer boot into the top results to compare the
 scoring.
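
 A minimal sketch of assembling such a request with the parameter
 separators and URL encoding made explicit. The host, port and handler
 path are assumptions; the field names are the ones used elsewhere in
 this thread, and the pf2 boost value is only a starting point.

```python
from urllib.parse import urlencode

# Hypothetical base URL; substitute the real host and core.
BASE_URL = "http://localhost:8983/solr/select"

params = {
    "q": "engineer boots",
    "defType": "edismax",
    "fl": "prodname",
    "qf": "prodnameplurals",
    "pf2": "prodnameplurals^2.0",  # raise the boost if word pairs still rank low
    "debugQuery": "true",          # include the explain section in the response
}

# urlencode joins the parameters with '&' and escapes characters such as '^'.
url = BASE_URL + "?" + urlencode(params)
print(url)
```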

 -- Jack Krupansky

 -Original Message-
 From: David Quarterman
 Sent: Tuesday, February 19, 2013 6:21 AM
 To: solr-user@lucene.apache.org
 Subject: Edismax odd results

 Hi all,

 We have an index of boots which contains harness boots, engineer boots,
 ankle boots, etc. An edismax search on the index for 'harness boots' brings
 back 2,175 boots with 'harness' results at the top. Searching 'engineer
 boots' brings back everything but 'engineer boots', same for 'ankle boots' -
 in fact, same result set of 1,873 mostly boots but a few other products mixed
 in.

 We're on SOLR 4.0 and the field we're querying is stemmed (snowball), 
 lowercased on WhiteSpaceTokenizer. Any ideas?



Re: Edismax odd results

2013-02-19 Thread David Quarterman
Hi Shawn,

Now finished for the day but will post the schema tomorrow. Thanks for the help 
(and Jack too).

Regards,

DQ

P.S. did reindex after changing schema and the analyzer/query stuff matches 
precisely!!

Shawn Heisey s...@elyograg.org wrote:

On 2/19/2013 11:16 AM, David Quarterman wrote:
 This is definitely driving us mad now! Changed to PorterStemming and there's 
 very little difference.

 If we add fq=engineer, we get 0 results. Add fq=engineer* and we get the 90 
 in the system. Try with fq=ankle* and we get 2. Correct. Try with fq=harness* 
 and we get 0!

 The stemming reduces 'engineer' to 'engin' so I'd have expected a lot more 
 results.

 Anyone got any ideas?

Did you completely reindex when you changed your schema?  You must reindex.

Does the index analysis match the query analysis?  Some specific 
differences are allowed (and sometimes encouraged), but stemming must be 
done to both.  Can you share your schema?  Use a paste website like 
pastie.org for that.
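
To illustrate why stemming has to happen on both sides, here is a toy
Python sketch. The toy_stem function is a crude stand-in invented for this
example (it is not Snowball or Porter); it only mimics the 'engineer' to
'engin' reduction mentioned above.

```python
def toy_stem(token):
    """Crude suffix stripper for illustration only; NOT a real stemmer."""
    for suffix in ("eers", "eer", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text, stem):
    """Lowercase, split on whitespace, and optionally stem each token."""
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens] if stem else tokens


# Index side: terms are stored in stemmed form.
indexed_terms = set(analyze("Engineer Boots", stem=True))

# A query analyzed without stemming never meets the stemmed index terms.
unstemmed_hits = [t for t in analyze("engineer boots", stem=False) if t in indexed_terms]

# With the same analysis on both sides, both terms are found.
stemmed_hits = [t for t in analyze("engineer boots", stem=True) if t in indexed_terms]

print(unstemmed_hits, stemmed_hits)
```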

Thanks,
Shawn



RE: Feature design question: use autocomplete to search on 2 different fields, and return 2 different data groups

2012-11-01 Thread David Quarterman
We had a similar requirement and found the best solution (unfortunately)
was to spend a small amount of money. Have a look at Sematext's site
(www.sematext.com). Their Autocomplete is awesome and we have a
fantastic looking AC now on our development site, grouped by category,
product & brand with product pictures to boot!

It's very, very quick in operation too.

Best,

DQ

-Original Message-
From: fernando.beck [mailto:fernando.b...@gmail.com] 
Sent: 01 November 2012 13:40
To: solr-user@lucene.apache.org
Subject: Feature & design question: use autocomplete to search on 2
different fields, and return 2 different data groups

Hello,

We're facing a new feature request, and we can't find the right way to
come up with a working solution.

Context: we have a list of businesses. For each business we have: name,
category, address, city. One business may have 1 or more categories.

Example:

Name: Outback SteakHouse
Category: Restaurants, American
Address: xx
City: Rio de Janeiro

Name: Starbucks
Category: Bar, Coffee
Address: y
City: Rio de Janeiro

Name: Pizza Hut
Category: Restaurant, Pizza
Address:
City: New York

and so on.

What we need to do: create an autocomplete feature; whenever someone
starts to type, we need to search for the term on BOTH CompanyName AND
Category.

Example: I type pizz

and the result should come back in 2 groups.

Group 1: Categories (displaying Pizza)
Group 2: all those businesses featuring pizza in their name, e.g. Pizza
Hut.

Right now we cannot find a way to get this done.
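
To pin down the expected behaviour, here is a plain-Python sketch (no Solr
involved) that produces the two groups from the sample businesses above.
The function and key names are illustrative only; it shows the desired
output shape, not how to obtain it from a single Solr query.

```python
businesses = [
    {"name": "Outback SteakHouse", "categories": ["Restaurants", "American"]},
    {"name": "Starbucks", "categories": ["Bar", "Coffee"]},
    {"name": "Pizza Hut", "categories": ["Restaurant", "Pizza"]},
]


def autocomplete(prefix, docs):
    """Return category and business-name suggestions for a typed prefix."""
    p = prefix.lower()
    # Group 1: distinct categories starting with the prefix.
    categories = sorted(
        {c for d in docs for c in d["categories"] if c.lower().startswith(p)}
    )
    # Group 2: businesses with any word in the name starting with the prefix.
    names = sorted(
        d["name"]
        for d in docs
        if any(word.lower().startswith(p) for word in d["name"].split())
    )
    return {"categories": categories, "businesses": names}


result = autocomplete("pizz", businesses)
print(result)
```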

 

Schema (since we're running a Portuguese-based application, there are 2
fieldTypes added for it):

 


<?xml version="1.0" encoding="UTF-8" ?>
<schema name="Guia-DEV" version="1.5">
  <types>

    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />

    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>

    <fieldtype name="binary" class="solr.BinaryField"/>

    <fieldType name="int" class="solr.TrieIntField" precisionStep="0"
               positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0"
               positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
               positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0"
               positionIncrementGap="0"/>

    <fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
               positionIncrementGap="0"/>
    <fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8"
               positionIncrementGap="0"/>
    <fieldType name="tlong" class="solr.TrieLongField" precisionStep="8"
               positionIncrementGap="0"/>
    <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8"
               positionIncrementGap="0"/>

    <fieldType name="date" class="solr.TrieDateField" precisionStep="0"
               positionIncrementGap="0"/>

    <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6"
               positionIncrementGap="0"/>

    <fieldType name="pint" class="solr.IntField"/>
    <fieldType name="plong" class="solr.LongField"/>
    <fieldType name="pfloat" class="solr.FloatField"/>
    <fieldType name="pdouble" class="solr.DoubleField"/>
    <fieldType name="pdate" class="solr.DateField" sortMissingLast="true"/>

    <fieldType name="sint" class="solr.SortableIntField"
               sortMissingLast="true" omitNorms="true"/>
    <fieldType name="slong" class="solr.SortableLongField"
               sortMissingLast="true" omitNorms="true"/>
    <fieldType name="sfloat" class="solr.SortableFloatField"
               sortMissingLast="true" omitNorms="true"/>
    <fieldType name="sdouble" class="solr.SortableDoubleField"
               sortMissingLast="true" omitNorms="true"/>

    <fieldType name="random" class="solr.RandomSortField" indexed="true" />

    <fieldType name="text_general" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="textCategoryName" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory"
                language="Portuguese"/>
      </analyzer>
      <analyzer type="query">

RE: Feature design question: use autocomplete to search on 2 different fields, and return 2 different data groups

2012-11-01 Thread David Quarterman
Fernando,

Pretty much the problem we came up against. We had a basic AC running
using SpellChecker a while ago but it was the grouping that floored us
and sent us elsewhere. Again, multiple queries seemed like the only
possible answer but in an AC scenario, even with SOLR's speed, probably
too slow under load.

Best,

DQ

-Original Message-
From: fernando.beck [mailto:fernando.b...@gmail.com] 
Sent: 01 November 2012 13:55
To: solr-user@lucene.apache.org
Subject: RE: Feature & design question: use autocomplete to search on 2
different fields, and return 2 different data groups

David,

Appreciate the suggestion. Our current autocomplete feature is
actually working pretty well.
No performance issues; functionally it is providing 100% of the results
expected.
I checked Sematext and also http://www.cominvent.com; they are great,
but our budget to go get them is 0.

At this time, and given the presented schema, my question would be: is
it even possible to get it done somehow, with 1 query, grouping those
results while autocompleting on 2 different search fields?



--
View this message in context:
http://lucene.472066.n3.nabble.com/Feature-design-question-use-autocomple-te-to-search-on-2-different-fields-and-return-2-different-dats-tp4017528p4017534.html
Sent from the Solr - User mailing list archive at Nabble.com.


SOLR 4.0 Beta documents being duplicated

2012-10-05 Thread David Quarterman
Hi,

We've been using V4.x of SOLR since last November without too much
trouble. Our MySQL database is refreshed daily and a full import is run
automatically after the refresh and generally produces around 86,000
products, obviously on unique doc_id's.

So, we upgraded to 4.0 Beta a few days ago, with only mild difficulty,
reindexed and all was fine. Except after the next data refresh and
full-import, we had duplicate products appearing on different unique
doc_ids. Not all products are being duplicated, just random ones. We've
just deleted the data directory and reindexed and the product count has
dropped from 116,711 to 86,543. There'll be another refresh/import early
tomorrow morning and I fear we'll have more duplicates.

The call to the import now contains clean=true, commit=true and
optimize=true but it seems to make no difference.
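
For clarity, the import call described here would look roughly like the
following sketch. The host, port and handler path are assumptions (the
DataImportHandler path is whatever solrconfig.xml registers, commonly
/dataimport); the command and flags are the ones named above.

```python
from urllib.parse import urlencode

# Assumed DataImportHandler location; adjust to the real host and core.
DIH_URL = "http://localhost:8983/solr/dataimport"

params = {
    "command": "full-import",
    "clean": "true",     # delete existing documents before importing
    "commit": "true",    # commit once the import finishes
    "optimize": "true",  # optimize the index after the commit
}

import_url = DIH_URL + "?" + urlencode(params)
print(import_url)
```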

 

Anyone have any ideas?

Regards,

David Q

 



RE: SOLR 4.0 Beta documents being duplicated

2012-10-05 Thread David Quarterman
Thanks Erick.
We've added the '_version_' and we'll see if that makes a difference
tomorrow. Also, have downloaded the RC1 and will try that next week.

Regards,

David Q

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 05 October 2012 15:40
To: solr-user@lucene.apache.org
Subject: Re: SOLR 4.0 Beta documents being duplicated

How are you indexing? There was a problem with indexing from SolrJ if
you indexed documents in batches, server.add(doclist); that's fixed in
4.0 RC#. The work-around is to add docs singly, server.add(doc).

Second thing. Bad Things Happen if you don't have a _version_ field in
your schema.xml. Solr 4.0 RC# isn't happy on startup if this field is
missing...

Personally, I think you'd be better off using one of the release
candidates.
Robert cut one here:
http://people.apache.org/~rmuir/staging_area/lucene-solr-4.0RC1-rev1391144/solr/

There will be an RC2 sometime, a couple of problems have been found, but
using RC1 should minimize any update to the official 4.0 plus have a lot
of improvements over BETA...

Best
Erick

On Fri, Oct 5, 2012 at 10:25 AM, David Quarterman da...@corexe.com
wrote:
 Hi,

 We've been using V4.x of SOLR since last November without too much
 trouble. Our MySQL database is refreshed daily and a full import is
 run automatically after the refresh and generally produces around
 86,000 products, obviously on unique doc_id's.

 So, we upgraded to 4.0 Beta a few days ago, with only mild difficulty,
 reindexed and all was fine. Except after the next data refresh and
 full-import, we had duplicate products appearing on different unique
 doc_ids. Not all products are being duplicated, just random ones.
 We've just deleted the data directory and reindexed and the product
 count has dropped from 116,711 to 86,543. There'll be another
 refresh/import early tomorrow morning and I fear we'll have more
 duplicates.

 The call to the import now contains clean=true, commit=true and
 optimize=true but it seems to make no difference.

 Anyone have any ideas?

 Regards,

 David Q