is group.query supported in solrcloud (4.8) ?

2014-11-10 Thread Giovanni Bricconi
hello

I have a collection 0_2014_10_11 made of three shards

When I try a group.query, even specifying a single shard, I get this
error: shard 0 did not set sort field values (FieldDoc.fields is null); you
must pass fillFields=true to IndexSearcher.search on each shard

This is the request; it asks the collection to find groups on IDCat3=922:

http://src-dev-1:8080/solr/0_2014_10_11/select?q=*:*&group=true&group.query=IDCat3%3A922&shards=src-dev-1:8080/solr/0_2014_10_11_shard1_replica2
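
The same request as a SolrJ sketch, for reference (SolrJ 4.x; the URL,
collection, and field are taken from the request above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.GroupParams;

HttpSolrServer server =
    new HttpSolrServer("http://src-dev-1:8080/solr/0_2014_10_11");
SolrQuery query = new SolrQuery("*:*");
query.set(GroupParams.GROUP, true);                 // group=true
query.set(GroupParams.GROUP_QUERY, "IDCat3:922");   // group.query=IDCat3:922
query.set("shards", "src-dev-1:8080/solr/0_2014_10_11_shard1_replica2");
QueryResponse rsp = server.query(query);            // fails with the error above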

According to this page [
https://cwiki.apache.org/confluence/display/solr/Result+Grouping ],
group.query is supported. Am I missing some key parameter?

Should the shards parameter really be mandatory? With group.field it does
not seem to be required.

Thanks

Giovanni


grouping finds <result name="doclist" numFound="0"/>

2014-11-06 Thread Giovanni Bricconi
Sorry for the basic question

q=*:*&fq=-sku:2471834&fq=FiltroDispo:1&fq=has_image:1&rows=100&fl=descCat3,IDCat3,ranking2&group=true&group.field=IDCat3&group.sort=ranking2+desc&group.ngroups=true

returns some groups with no results. I'm using Solr 4.8.0; the collection
has 3 shards.

Am I missing some parameters?

<lst name="grouped">
  <lst name="IDCat3">
    <int name="matches">297254</int>
    <int name="ngroups">49</int>
    <arr name="groups">
      <lst>
        <int name="groupValue">0</int>
        <result name="doclist" numFound="0" start="0"/>
      </lst>
      ...
      <lst>
        <int name="groupValue">12043</int>
        <result name="doclist" numFound="2" start="0">
          <doc>
            <int name="IDCat3">12043</int>
            <str name="descCat3">SSD</str>
            <int name="ranking2">498</int>
          </doc>
        </result>
      </lst>


Re: unstable results on refresh

2014-10-23 Thread Giovanni Bricconi
My user interface shows some boxes to describe result categories. After
half a day of small updates and deletes I noticed, with various queries, that
the boxes started swapping while browsing.
For sure I relied too much on getting the same results on each call; now
I'm keeping the category order in request parameters to avoid the blink
effect while browsing.

The optimize process is really slow, and I can't use it. Since I have many
other parameters that should be carried along with the request to make sure
that the navigation is consistent, I would like to understand if there is a
setup that can limit the idf change and keep it low enough.

I tried with

<indexConfig>
  <mergeFactor>5</mergeFactor>
</indexConfig>

in solrconfig.xml, but this morning /solr/admin/cores?action=STATUS still
reports a number of segments above ten for all cores of the shard. (I'm
sure I have reloaded each core after changing the value.)
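
For what it's worth, in Solr 4.x mergeFactor is mapped onto the default
TieredMergePolicy; a hedged sketch that sets the tier parameters directly
instead (values illustrative):

<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">5</int>
    <int name="segmentsPerTier">5</int>
  </mergePolicy>
</indexConfig>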

Now I'm trying with expungeDeletes called from SolrJ, but I still don't see
the segment count decrease:

UpdateRequest commitRequest = new UpdateRequest();
// (action, waitFlush, waitSearcher, maxSegments, softCommit, expungeDeletes)
commitRequest.setAction(ACTION.COMMIT, true, true, 10, false, true);
commitRequest.process(solrServer);



2014-10-22 15:48 GMT+02:00 Erick Erickson erickerick...@gmail.com:

 I would rather ask whether such small differences matter enough to
 do this. Is this something users will _ever_ notice? Optimization
 is quite a heavyweight operation, and is generally not recommended
 on indexes that change often, and 5 minutes is certainly below
 the recommendation for optimizing.

 There is/has been work done on distributed IDF that should address
 this (I think), but I don't quite know the current status.

 But other than in a test setup, is it worth the effort?

 Best,
 Erick

 On Wed, Oct 22, 2014 at 3:54 AM, Giovanni Bricconi
 giovanni.bricc...@banzai.it wrote:
  I have made some small patch to the application to make this problem less
  visible, and I'm trying to perform the optimize once per hour, yesterday
 it
  took 5 minutes to perform it, this morning 15 minutes. Today I will
 collect
  some statistics but the publication process sends documents every 5
  minutes, and I think the optimize is taking too much time.
 
  I have no default mergeFactor configured for this collection, do you
 think
  that setting it to a small value could improve the situation? If I have
  understood well having to merge segments will keep similar stats on all
  nodes. It's ok to have the indexing process a little bit slower.
 
 
  2014-10-21 18:44 GMT+02:00 Erick Erickson erickerick...@gmail.com:
 
  Giovanni:
 
  To see how this happens, consider a shard with a leader and two
  followers. Assume your autocommit interval is 60 seconds on each.
 
  This interval can expire at slightly different wall clock times.
  Even if the servers started perfectly in synch, they can get slightly
  out of sync. So, you index a bunch of docs and these replicas close
  the current segment and re-open a new segment with slightly different
  contents.
 
  Now docs come in that replace older docs. The tf/idf statistics
  _include_ deleted document data (which is purged on optimize). Given
  that doc X can be in different segments (or, more accurately, segments
  that get merged at different times on different machines), replica 1
  may have slightly different stats than replica 2, thus computing
  slightly different scores.
 
  Optimizing purges all data related to deleted documents, so it all
  regularizes itself on optimize.
 
  Best,
  Erick
 
  On Tue, Oct 21, 2014 at 11:08 AM, Giovanni Bricconi
  giovanni.bricc...@banzai.it wrote:
   I noticed the problem again, and now I was able to collect some data. In
   my paste http://pastebin.com/nVwf327c you can see the result of the same
   query issued twice; the 2nd and 3rd groups are swapped.
  
   I pasted also the clusterstate and the core state for each core.
  
   The logs didn't show any problem related to indexing, only some
   malformed queries.
  
   After doing an optimize the problem disappeared.
  
   So, is the problem related to documents that were deleted from the
   index?
  
   The optimization took 5 minutes to complete
  
   2014-10-21 11:41 GMT+02:00 Giovanni Bricconi 
  giovanni.bricc...@banzai.it:
  
   Nice!
   I will monitor the index and try this if the problem comes back.
   Actually the problem was due to small differences in score, so I
 think
  the
   problem has the same origin
  
   2014-10-21 8:10 GMT+02:00 lboutros boutr...@gmail.com:
  
   Hi Giovanni,
  
   we had this problem as well.
   The cause was that the different nodes have slightly different idf
  values.
  
   We solved this problem by doing an optimize operation, which really
   removes suppressed data.
  
   Ludovic.
  
  
  
   -
   Jouve
   France.
   --
   View this message in context:
  
 
 http://lucene.472066.n3.nabble.com/unstable-results-on-refresh

Re: unstable results on refresh

2014-10-22 Thread Giovanni Bricconi
I have made some small patch to the application to make this problem less
visible, and I'm trying to perform the optimize once per hour, yesterday it
took 5 minutes to perform it, this morning 15 minutes. Today I will collect
some statistics but the publication process sends documents every 5
minutes, and I think the optimize is taking too much time.

I have no default mergeFactor configured for this collection, do you think
that setting it to a small value could improve the situation? If I have
understood well having to merge segments will keep similar stats on all
nodes. It's ok to have the indexing process a little bit slower.


2014-10-21 18:44 GMT+02:00 Erick Erickson erickerick...@gmail.com:

 Giovanni:

 To see how this happens, consider a shard with a leader and two
 followers. Assume your autocommit interval is 60 seconds on each.

 This interval can expire at slightly different wall clock times.
 Even if the servers started perfectly in synch, they can get slightly
 out of sync. So, you index a bunch of docs and these replicas close
 the current segment and re-open a new segment with slightly different
 contents.

 Now docs come in that replace older docs. The tf/idf statistics
 _include_ deleted document data (which is purged on optimize). Given
 that doc X can be in different segments (or, more accurately, segments
 that get merged at different times on different machines), replica 1
 may have slightly different stats than replica 2, thus computing
 slightly different scores.

 Optimizing purges all data related to deleted documents, so it all
 regularizes itself on optimize.

 Best,
 Erick

 On Tue, Oct 21, 2014 at 11:08 AM, Giovanni Bricconi
 giovanni.bricc...@banzai.it wrote:
  I noticed the problem again, and now I was able to collect some data. In my
  paste http://pastebin.com/nVwf327c you can see the result of the same
  query issued twice; the 2nd and 3rd groups are swapped.
 
  I pasted also the clusterstate and the core state for each core.
 
  The logs didn't show any problem related to indexing, only some malformed
  queries.
 
  After doing an optimize the problem disappeared.
 
  So, is the problem related to documents that were deleted from the
  index?
 
  The optimization took 5 minutes to complete
 
  2014-10-21 11:41 GMT+02:00 Giovanni Bricconi 
 giovanni.bricc...@banzai.it:
 
  Nice!
  I will monitor the index and try this if the problem comes back.
  Actually the problem was due to small differences in score, so I think
 the
  problem has the same origin
 
  2014-10-21 8:10 GMT+02:00 lboutros boutr...@gmail.com:
 
  Hi Giovanni,
 
  we had this problem as well.
  The cause was that the different nodes have slightly different idf
 values.
 
  We solved this problem by doing an optimize operation, which really
  removes suppressed data.
 
  Ludovic.
 
 
 
  -
  Jouve
  France.
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/unstable-results-on-refresh-tp4164913p4165086.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 



Re: unstable results on refresh

2014-10-21 Thread Giovanni Bricconi
I noticed the problem looking at a group query: the groups returned were
sorted on the score field of their first result, and then shown to the user.
Repeating the same query, I noticed that the order of two groups started
switching.

Thank you, I will look for the thread you mentioned.

2014-10-20 22:07 GMT+02:00 Alexandre Rafalovitch arafa...@gmail.com:

 What are the differences in? The document count, or things like facets?
 This could be important.

 Also, I think there was a similar thread on the mailing list a week or
 two ago, might be worth looking for it.

 Regards,
Alex.
 Personal: http://www.outerthoughts.com/ and @arafalov
 Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
 Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


 On 20 October 2014 04:49, Giovanni Bricconi giovanni.bricc...@banzai.it
 wrote:
  Hello
 
  I have a procedure that sends small data changes during the day to a
  solrcloud cluster, version 4.8
 
  The cluster is made of three nodes, and three shards, each node contains
  two shards
 
  The procedure has been running for days; I don't know when, but at some
  point one of the cores went out of sync, and repeating the same query
  began to show small differences.
 
  The core graph was not useful, everything seemed active.
 
  I have solved the problem by reindexing everything, because the collection
  is quite small, but is there a way to fix this problem? Suppose I can
  figure out which core returns different results: is there a command to
  force that core to refetch the whole index from its master?
 
  Thanks
 
  Giovanni



Re: unstable results on refresh

2014-10-21 Thread Giovanni Bricconi
Nice!
I will monitor the index and try this if the problem comes back.
Actually the problem was due to small differences in score, so I think the
problem has the same origin

2014-10-21 8:10 GMT+02:00 lboutros boutr...@gmail.com:

 Hi Giovanni,

 we had this problem as well.
 The cause was that the different nodes have slightly different idf values.

 We solved this problem by doing an optimize operation, which really removes
 suppressed data.

 Ludovic.



 -
 Jouve
 France.
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/unstable-results-on-refresh-tp4164913p4165086.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: unstable results on refresh

2014-10-21 Thread Giovanni Bricconi
I noticed the problem again, and now I was able to collect some data. In my
paste http://pastebin.com/nVwf327c you can see the result of the same query
issued twice; the 2nd and 3rd groups are swapped.

I pasted also the clusterstate and the core state for each core.

The logs didn't show any problem related to indexing, only some malformed
queries.

After doing an optimize the problem disappeared.

So, is the problem related to documents that were deleted from the index?

The optimization took 5 minutes to complete

2014-10-21 11:41 GMT+02:00 Giovanni Bricconi giovanni.bricc...@banzai.it:

 Nice!
 I will monitor the index and try this if the problem comes back.
 Actually the problem was due to small differences in score, so I think the
 problem has the same origin

 2014-10-21 8:10 GMT+02:00 lboutros boutr...@gmail.com:

 Hi Giovanni,

 we had this problem as well.
 The cause was that the different nodes have slightly different idf values.

 We solved this problem by doing an optimize operation, which really removes
 suppressed data.

 Ludovic.



 -
 Jouve
 France.
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/unstable-results-on-refresh-tp4164913p4165086.html
 Sent from the Solr - User mailing list archive at Nabble.com.





unstable results on refresh

2014-10-20 Thread Giovanni Bricconi
Hello

I have a procedure that sends small data changes during the day to a
solrcloud cluster, version 4.8

The cluster is made of three nodes, and three shards, each node contains
two shards

The procedure has been running for days; I don't know when, but at some
point one of the cores went out of sync, and repeating the same query began
to show small differences.

The core graph was not useful, everything seemed active.

I have solved the problem by reindexing everything, because the collection is
quite small, but is there a way to fix this problem? Suppose I can figure out
which core returns different results: is there a command to force that core
to refetch the whole index from its master?
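
For what it's worth, Solr 4.x exposes a CoreAdmin REQUESTRECOVERY action
that asks a core to resync from its shard leader; a hedged sketch, with host
and core name purely illustrative:

curl "http://myhost:8080/solr/admin/cores?action=REQUESTRECOVERY&core=collection1_shard1_replica2"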

Thanks

Giovanni


Re: solrcloud indexing completed event

2014-07-01 Thread Giovanni Bricconi
Thank you Erick,


Fortunately I can modify the data feeding process to start my post-indexing
tasks.




2014-06-30 22:13 GMT+02:00 Erick Erickson erickerick...@gmail.com:

 The paradigm is different. In SolrCloud, when a client sends an indexing
 request to any node in the system, by the time the response comes back all
 the nodes (leaders, followers, etc.) have _all_ received the update and
 processed it. So you don't have to care in the same way.

 As far as different segments, versions, and all that: this is entirely
 expected. Consider the above: packet -> leader, leader -> follower. Each of
 them is independently indexing the documents; there is no replication. So,
 since the two servers started at different times, things like the autocommit
 interval can kick in at different times and the indexes diverge in terms of
 segment counts, version numbers, whatever. They'll return the same
 _documents_, but

 FWIW,
 Erick

 On Mon, Jun 30, 2014 at 7:55 AM, Giovanni Bricconi
 giovanni.bricc...@banzai.it wrote:
  Hello
 
  I have one application that queries solr; when the index version changes
  this application has to redo some tasks.
 
  Since I have more than one solr server, I would like to start these tasks
  when all solr nodes are synchronized.
 
  With a master/slave configuration the application simply watched
  http://myhost:8080/solr/admin/cores?action=STATUS&core=0bis
  on each Solr node and checked that the commit time msec was equal. When the
  time changes and becomes equal on all the nodes, the replication is complete
  and it is safe to restart the tasks.
 
  Now I would like to switch to a SolrCloud configuration, splitting the
  core 0bis into 3 shards, with 2 replicas for each shard.
 
  After refeeding the collection I tried the same approach calling
 
 
 http://myhost:8080/solr/admin/cores?action=STATUS&core=0bis_shard3_replica2
 
  for each core of the collection, but to my surprise I found that within the
  same shard the version of the index, the number of segments, and even the
  commit time msec were different!!

  I was thinking that it was possible to check some parameter on each
  shard's cores to verify that everything was up to date, but this does not
  seem to be true.
 
  Is it somehow possible to capture the event of a commit being done on
  every core of the collection?
 
  Thank you
 
  Giovanni



solrcloud indexing completed event

2014-06-30 Thread Giovanni Bricconi
Hello

I have one application that queries solr; when the index version changes
this application has to redo some tasks.

Since I have more than one solr server, I would like to start these tasks
when all solr nodes are synchronized.

With a master/slave configuration the application simply watched
http://myhost:8080/solr/admin/cores?action=STATUS&core=0bis
on each Solr node and checked that the commit time msec was equal. When the
time changes and becomes equal on all the nodes, the replication is complete
and it is safe to restart the tasks.
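
For reference, the equivalent check as a SolrJ sketch (assumptions: SolrJ
4.x; node URLs and core name are illustrative; the STATUS response is read
through its "index"/"lastModified" entry):

import java.util.Date;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.common.util.NamedList;

public class CommitTimeCheck {
  // Returns true when every node reports the same index lastModified time.
  public static boolean allNodesAligned(String[] nodeUrls, String core)
      throws Exception {
    Date first = null;
    for (String url : nodeUrls) {                  // e.g. "http://myhost:8080/solr"
      HttpSolrServer server = new HttpSolrServer(url);
      try {
        NamedList<Object> status =
            CoreAdminRequest.getStatus(core, server).getCoreStatus(core);
        NamedList<Object> index = (NamedList<Object>) status.get("index");
        Date lastModified = (Date) index.get("lastModified");
        if (first == null) {
          first = lastModified;
        } else if (!first.equals(lastModified)) {
          return false;                            // this node lags behind
        }
      } finally {
        server.shutdown();
      }
    }
    return true;
  }
}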

Now I would like to switch to a SolrCloud configuration, splitting the core
0bis into 3 shards, with 2 replicas for each shard.

After refeeding the collection I tried the same approach calling

http://myhost:8080/solr/admin/cores?action=STATUS&core=0bis_shard3_replica2

for each core of the collection, but to my surprise I found that within the
same shard the version of the index, the number of segments, and even the
commit time msec were different!!

I was thinking that it was possible to check some parameter on each shard's
cores to verify that everything was up to date, but this does not seem to
be true.

Is it somehow possible to capture the event of a commit being done on every
core of the collection?

Thank you

Giovanni


Re: solr cloud 4.8, synonymfilterfactory and big dictionaries

2014-05-14 Thread Giovanni Bricconi
Thank you Elaine,

split files worked for me too.
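
In schema.xml the comma-delimited workaround looks roughly like this (a
hedged sketch; file names are illustrative):

<filter class="solr.SynonymFilterFactory"
        synonyms="synonyms-part1.txt,synonyms-part2.txt,synonyms-part3.txt"
        ignoreCase="true" expand="true"/>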



2014-05-06 19:15 GMT+02:00 Cario, Elaine elaine.ca...@wolterskluwer.com:

 Hi Giovanni,

 I had the same issue just last week!  I worked around it temporarily by
 segmenting the file into < 1 MB files, and then using a comma-delimited
 list of files in the filter specification in the schema.

 There is a known issue around this:

 https://issues.apache.org/jira/browse/SOLR-4793

 ...and presumably there is a param you can set in ZooKeeper and Solr
 (jute.maxbuffer) to override the 1 MB limit.  I didn't have enough time
 to test that out (and it's not clear to me what form the value should
 take); at the time it was easier for me to brute force the files.

 -Original Message-
 From: Giovanni Bricconi [mailto:giovanni.bricc...@banzai.it]
 Sent: Tuesday, May 06, 2014 12:11 PM
 To: solr-user
 Subject: solr cloud 4.8, synonymfilterfactory and big dictionaries

 Hello

 I am migrating an application to SolrCloud and I have to deal with a big
 dictionary, about 10 MB.

 It seems that I can't upload it to ZooKeeper; is there a way of specifying
 an external file for the synonyms parameter?

 Can I compress the file, or split it into many small files?

 I have the same problem for SnowballPorterFilterFactory.

 Thanks



solr cloud 4.8, synonymfilterfactory and big dictionaries

2014-05-06 Thread Giovanni Bricconi
Hello

I am migrating an application to SolrCloud and I have to deal with a big
dictionary, about 10 MB.

It seems that I can't upload it to ZooKeeper; is there a way of specifying
an external file for the synonyms parameter?

Can I compress the file, or split it into many small files?

I have the same problem for SnowballPorterFilterFactory.

Thanks


Re: Solr relevancy tuning

2014-04-11 Thread Giovanni Bricconi
Hello Doug

I have just watched the Quepid demonstration video, and I strongly agree
with your introduction: it is very hard to involve marketing/business
people in repeated testing sessions, and spreadsheets or other kinds of
files are not the right tools to use.
Currently I'm quite alone in my tuning task, and having a visual approach
could be beneficial for me; you are giving me many good inputs!

I see that kelvin (my scripted tool) and Quepid follow the same path. In
Quepid someone quickly watches the results and applies colours to them; in
kelvin you enter one or more queries ("network cable", "ethernet cable") and
state that the results must contain "ethernet" in the title, or must come
from a list of product categories.

I also do diffs of results, before and after changes, to check what is
going on; but I have to do that in a very unix-scripted way.

Have you considered placing a counter of total red/bad results in
Quepid? I use this index to get a quick overview of a change's impact across
all queries. Actually I repeat tests in production from time to time, and
if I see the kelvin temperature rising (the number of errors going up) I
know I have to check what's going on, because new products may be having
a bad impact on the index.

I also keep counters of products with low-quality images, no images at all,
or too-short listings; these are sometimes useful to better understand what
will happen if you change some bq/fq in the application.

I also see that after changes in Quepid someone has to check gray results
and assign them a colour; in kelvin's case the conditions can sometimes do
a bit of magic (new product names still contain SM-G900F) but can sometimes
introduce false errors (the new product name contains only "Galaxy 5" and
not the product code SM-G900F). So some checks are needed, but with Quepid
everybody can do the check, while with kelvin you have to change some lines
of a script, and not everybody is able/willing to do that.

The idea of a static index is a good suggestion; I will try it in the next
round of search engine improvements.

Thank you Doug!




2014-04-09 17:48 GMT+02:00 Doug Turnbull 
dturnb...@opensourceconnections.com:

 Hey Giovanni, nice to meet you.

 I'm the person that did the Test Driven Relevancy talk. We've got a product
 Quepid (http://quepid.com) that lets you gather good/bad results for
 queries and do a sort of test driven development against search relevancy.
 Sounds similar to your existing scripted approach. Have you considered
 keeping a static catalog for testing purposes? We had a project with a lot
 of updates and date-dependent relevancy. This lets you create some test
 scenarios against a static data set. However, one downside is you can't
 recreate problems in production in your test setup exactly; you have to
 find a similar issue that reflects what you're seeing.

 Cheers,
 -Doug


 On Wed, Apr 9, 2014 at 10:42 AM, Giovanni Bricconi 
 giovanni.bricc...@banzai.it wrote:

  Thank you for the links.
 
  The book is really useful; I will definitely have to spend some time
  reformatting the logs to access the number of results found, session id,
  and much more.
 
  I'm also quite happy that my test cases produce results similar to the
  precision reports shown at the beginning of the book.
 
  Giovanni
 
 
  2014-04-09 12:59 GMT+02:00 Ahmet Arslan iori...@yahoo.com:
 
   Hi Giovanni,
  
   Here are some relevant pointers :
  
  
  
 
 http://www.lucenerevolution.org/2013/Test-Driven-Relevancy-How-to-Work-with-Content-Experts-to-Optimize-and-Maintain-Search-Relevancy
  
  
   http://rosenfeldmedia.com/books/search-analytics/
  
   http://www.sematext.com/search-analytics/index.html
  
  
   Ahmet
  
  
   On Wednesday, April 9, 2014 12:17 PM, Giovanni Bricconi 
   giovanni.bricc...@banzai.it wrote:
   It is about one year that I have been working on an e-commerce site, and
   unfortunately I have no information retrieval background, so I am
   probably missing some important practices about relevance tuning and
   search engines.
   During this period I had to fix many bugs about bad search results,
   which I solved sometimes by tuning edismax weights, sometimes by
   creating ad hoc query filters or query boostings; but I am still not
   able to figure out what the correct process to improve search result
   relevance should be.

   These are the practices I am following; I would really appreciate any
   comments about them and any hints about what practices you follow in
   your projects:

   - In order to have a measure of search quality I have written many test
   cases such as: if the user searches for "nike sport watch" the search
   result should display at least four tom tom products with the words
   "nike" and "sportwatch" in the title. I have written a tool that reads
   such tests from json files and applies them to my applications, and
   then counts the number of results that do not match the criteria stated
   in the test cases. (for those interested

Solr relevancy tuning

2014-04-09 Thread Giovanni Bricconi
It is about one year that I have been working on an e-commerce site, and
unfortunately I have no information retrieval background, so I am probably
missing some important practices about relevance tuning and search engines.
During this period I had to fix many bugs about bad search results, which
I solved sometimes by tuning edismax weights, sometimes by creating ad hoc
query filters or query boostings; but I am still not able to figure out what
the correct process to improve search result relevance should be.

These are the practices I am following; I would really appreciate any
comments about them and any hints about what practices you follow in your
projects:

- In order to have a measure of search quality I have written many test
cases such as: if the user searches for "nike sport watch" the search
result should display at least four tom tom products with the words
"nike" and "sportwatch" in the title. I have written a tool that reads
such tests from json files and applies them to my applications, and then
counts the number of results that do not match the criteria stated in the
test cases; a purely illustrative sketch of one such test appears after
this list. (for those interested this tool is available at
https://github.com/gibri/kelvin but it is still quite a prototype)

- I use this count as a quality index; I have tried various times to change
the edismax weights to lower the overall number of errors, or to add new
filters/boostings to the application to try to decrease the error count.

- The pro of this is that at least you have a number to look at, and that
you have a quick way of checking the impact of a modification.

- The bad side is that you have to maintain the test cases: I now have
about 800 tests and my product catalogue changes often, which implies that
some products exit the catalog and some test cases can't pass anymore.

- I am populating the test cases using errors reported by users, and I
feel that this is driving the test cases too much toward pathological
cases. Moreover, I don't have many tests for cases that are working well
now.
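
A purely hypothetical sketch of what one of these json test cases could
look like (this is not kelvin's actual format; field names and structure
are illustrative only):

{
  "query": "nike sport watch",
  "conditions": [
    { "type": "minResults", "value": 4 },
    { "type": "titleContains", "values": ["nike", "sportwatch"] }
  ]
}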

I would like to use search logs as drivers to generate tests, but I feel I
haven't picked the right path. Using top queries, manually reviewing
results, and then writing tests is a slow process; moreover many top
queries are ambiguous or are driven by site ads.

Many, many queries are unique per user. How should I deal with these cases?

How are you using your logs to find test cases to fix? Are you looking
for queries where the user does not open any returned result? Which KPI
have you chosen to find queries that are not providing good results? And
what are you using as a KPI for the whole search, besides the conversion
rate?

Can you suggest any other practices you are using on your projects?

Thank you very much in advance

Giovanni


Re: Solr relevancy tuning

2014-04-09 Thread Giovanni Bricconi
Thank you for the links.

The book is really useful; I will definitely have to spend some time
reformatting the logs to access the number of results found, session id, and
much more.

I'm also quite happy that my test cases produce results similar to the
precision reports shown at the beginning of the book.

Giovanni


2014-04-09 12:59 GMT+02:00 Ahmet Arslan iori...@yahoo.com:

 Hi Giovanni,

 Here are some relevant pointers :


 http://www.lucenerevolution.org/2013/Test-Driven-Relevancy-How-to-Work-with-Content-Experts-to-Optimize-and-Maintain-Search-Relevancy


 http://rosenfeldmedia.com/books/search-analytics/

 http://www.sematext.com/search-analytics/index.html


 Ahmet


 On Wednesday, April 9, 2014 12:17 PM, Giovanni Bricconi 
 giovanni.bricc...@banzai.it wrote:
 It is about one year that I have been working on an e-commerce site, and
 unfortunately I have no information retrieval background, so I am probably
 missing some important practices about relevance tuning and search engines.
 During this period I had to fix many bugs about bad search results, which
 I solved sometimes by tuning edismax weights, sometimes by creating ad hoc
 query filters or query boostings; but I am still not able to figure out what
 the correct process to improve search result relevance should be.

 These are the practices I am following; I would really appreciate any
 comments about them and any hints about what practices you follow in your
 projects:

 - In order to have a measure of search quality I have written many test
 cases such as: if the user searches for "nike sport watch" the search
 result should display at least four tom tom products with the words
 "nike" and "sportwatch" in the title. I have written a tool that reads
 such tests from json files and applies them to my applications, and then
 counts the number of results that do not match the criteria stated in
 the test cases. (for those interested this tool is available at
 https://github.com/gibri/kelvin but it is still quite a prototype)

 - I use this count as a quality index; I have tried various times to change
 the edismax weights to lower the overall number of errors, or to add new
 filters/boostings to the application to try to decrease the error count.

 - The pro of this is that at least you have a number to look at, and that
 you have a quick way of checking the impact of a modification.

 - The bad side is that you have to maintain the test cases: I now have
 about 800 tests and my product catalogue changes often, which implies that
 some products exit the catalog and some test cases can't pass anymore.

 - I am populating the test cases using errors reported by users, and I
 feel that this is driving the test cases too much toward pathological
 cases. Moreover, I don't have many tests for cases that are working well
 now.

 I would like to use search logs as drivers to generate tests, but I feel I
 haven't picked the right path. Using top queries, manually reviewing
 results, and then writing tests is a slow process; moreover many top
 queries are ambiguous or are driven by site ads.

 Many, many queries are unique per user. How should I deal with these cases?

 How are you using your logs to find test cases to fix? Are you looking
 for queries where the user does not open any returned result? Which KPI
 have you chosen to find queries that are not providing good results? And
 what are you using as a KPI for the whole search, besides the conversion
 rate?

 Can you suggest any other practices you are using on your projects?

 Thank you very much in advance

 Giovanni




question about synonymfilter

2013-12-23 Thread Giovanni Bricconi
hello

suppose I have this synonym:
abxpower => abx power

and suppose you are indexing abxpower pipp

From the analyzer I see that abxpower is split into two words, but the
second word "power" overlaps the next one:

text     raw_bytes                 keyword position start end type positionLength
abxpower [61 62 78 70 6f 77 65 72] false   1        0     8   word 1
pipp     [70 69 70 70]             false   2        9     14  word 1

SynonymFilter (SF):

text     raw_bytes                 positionLength type    start end position keyword
abx      [61 62 78]                1              SYNONYM 0     8   1        false
pipp     [70 69 70 70]             1              word    9     14  2        false
power    [70 6f 77 65 72]          1              SYNONYM 9     14  2        false


Is this correct? I noticed that WordDelimiterFilter instead changes start,
end, and position. This is what happens for "abx-power pippo":

WordDelimiterFilter (WDF):

text     raw_bytes                 start end type position positionLength
abx      [61 62 78]                0     3   word 1        1
power    [70 6f 77 65 72]          4     9   word 2        1
pippo    [70 69 70 70 6f]          10    15  word 3        1


Re: Data import handler with multi tables

2013-10-29 Thread Giovanni Bricconi
maybe

<entity name="bothTables"
        query="select concat('A.',id) id, id originalId, nameA from tbl_tableA
               union all
               select concat('B.',id) id, id originalId, nameA from tbl_tableB">
  <field name="id" column="id"/>
  <field name="originalId" column="originalId"/> <!-- need a new field -->
  <field name="nameA" column="nameA"/>
</entity>

So you can keep the original id; maybe also add an originalTable field if
you don't like parsing the id column to discover the table from which the
data was read.


2013/10/29 Stefan Matheis matheis.ste...@gmail.com

 I've never looked for another way; what's the problem with using a compound key?


 On Monday, October 28, 2013 at 1:38 PM, dtphat wrote:

  Hi,
  is there no other way to import all the data in this case, other than
  using a compound key?
  Thanks.
 
 
 
  -
  Phat T. Dong
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/Re-Data-import-handler-with-multi-tables-tp4098048p4098056.html
  Sent from the Solr - User mailing list archive at Nabble.com (
 http://Nabble.com).
 
 





howto increase indexing speed?

2013-10-16 Thread Giovanni Bricconi
I have a small Solr setup, not even on a physical machine but a VMware
virtual machine with a single CPU, that reads data using DIH from a
database. The machine has no physical disks attached but stores data on a
NetApp NAS.

Currently this machine indexes 320 documents/sec; not bad, but we plan to
double the index and would like to keep nearly the same rate.

Doing some basic checks during indexing, I have found with iostat that
disk usage is nearly 8% and the source database is running fine, while the
virtual CPU is 95% busy running Solr.

Now I can quite easily add another virtual CPU to the Solr box, but as far
as I know this won't help because DIH doesn't work in parallel. Am I wrong?

What would you do? Rewrite the feeding process, dropping DIH and using SolrJ
to feed data in parallel? Or would you instead keep DIH and switch to a
sharded configuration?
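
If going the SolrJ route, a hedged sketch of parallel feeding with
ConcurrentUpdateSolrServer (SolrJ 4.x; the URL, queue size, thread count,
and field names are illustrative):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelFeeder {
  public static void main(String[] args) throws Exception {
    // Buffers documents and sends them with 4 background threads.
    ConcurrentUpdateSolrServer server =
        new ConcurrentUpdateSolrServer("http://localhost:8080/solr/core0", 1000, 4);
    for (int i = 0; i < 100000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", Integer.toString(i));   // illustrative fields
      doc.addField("name", "document " + i);
      server.add(doc);
    }
    server.blockUntilFinished();                 // wait for the queues to drain
    server.commit();
    server.shutdown();
  }
}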

Thank you for any hints

Giovanni


Re: ClassNotFoundException regarding SolrInfoMBean under Tomcat 7

2013-07-05 Thread Giovanni Bricconi
I saw something similar when I placed some jars in tomcat/lib (data import
handler); the right place was instead WEB-INF/lib.
I would try placing all needed jars there.


2013/7/5 Michael Bakonyi kont...@mb-neuemedien.de

 Hm, can't anybody help me out? I still can't get my installation run
 correctly ...

 What I've found out recently – if I understand it aright:

 SolrInfoMBean somehow has to do with JMX. So I manually activated JMX by
 inserting <jmx /> within my solrconfig.xml, as described here:
 http://wiki.apache.org/solr/SolrJmx.

 But nevertheless the same Exception still appears ...

 Cheers,
 Michael


 Am 04.07.2013 um 13:02 schrieb Michael Bakonyi:

  Hi everyone,
 
  I'm trying to get the CMS TYPO3 connected with Solr 3.6.2.
 
  So far I have followed the installation instructions at
 http://wiki.apache.org/solr/SolrTomcat, except that I didn't copy the
 .war file into $SOLR_HOME but referenced it at a different location
 via a Tomcat context fragment file.
 
  Up to that point the Solr server works: I can reach the GUI via its URL.
 
  To get Solr connected with the CMS I then created a new core folder
 (btw, can anybody give me kind of a live example of when to use different
 cores? Until now I still don't really understand the concept of cores ..)
 by duplicating the example folder, in which I overwrote some files
 (especially solrconfig.xml) with files offered by the TYPO3 community. I
 also moved the file solr.xml one level up and edited it (added a
 core fragment and especially adjusted instanceDir) to get a correct
 multicore setup like the example multicore setup within the downloaded
 solr-tgz package.
 
  But now I get the Java exception

  java.lang.NoClassDefFoundError: org/apache/solr/core/SolrInfoMBean at
 java.lang.ClassLoader.defineClass1(Native Method)

  The Tomcat log file additionally says: Caused by:
 java.lang.ClassNotFoundException: org.apache.solr.core.SolrInfoMBean.
 
  My guess is that within the new solrconfig.xml there are calls to
 classes which aren't included correctly. There are some libs which are
 included at the top of this file, but the paths of the references should be
 ok, as I checked them via Bash: at
 http://wiki.apache.org/solr/SolrConfigXml it is said that the <lib dir=.../>
 directory is relative to the instanceDir, so this is what I've checked. I
 also inserted absolute paths, but this wasn't successful either.
 
  Can anybody give me a hint how to solve this problem? Would be great :)
 
  Cheers,
  Michael




Re: Is it possible to find a leader from a list of cores in solr via java code

2013-07-03 Thread Giovanni Bricconi
I have the same question. My purpose is to start the DIH full-import process
on the leader and not on a replica.
I tried a full import on a replica, but watching the logs it seemed to me
that the replica was loading data to send to the leader, which in turn has
to update all the replicas.
At least this is what I saw with Solr 4.2.1.

Giovanni


2013/7/3 Erick Erickson erickerick...@gmail.com

 You can always query ZooKeeper and find that information out.
 Take a look at CloudSolrServer, maybe ZkCoreNodeProps etc.,
 for examples; since CloudSolrServer is leader-aware, it
 should have some clues...

 Or maybe ZkStateReader? I haven't been in that code much,
 so I can't be more specific...

 But why do you have this requirement? What do you hope to
 accomplish? Because this is often the kind of thing that seems
 more useful than it is...

 Best
 Erick


 On Wed, Jul 3, 2013 at 3:05 AM, vicky desai vicky.de...@germinait.com
 wrote:

  Hi,

  I have a setup of 1 leader and 1 replica, and I have a requirement wherein
  I need to find the leader core of the collection. Is there an API in
  SolrJ by means of which this can be achieved?
 
 
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Is-it-possible-to-find-a-leader-from-a-list-of-cores-in-solr-via-java-code-tp4074994.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 



Replicas and soft commit

2013-06-14 Thread Giovanni Bricconi
I have recently upgraded our application from Solr 3.6 to Solr 4.2.1, and I
have just started learning about soft commits and partial updates.

Currently I have one indexing node and 3 replicas of the same core, and
every modification goes through a DIH delta import. This is usually OK, but I
have some special cases where updates should be made visible very quickly.

As I have seen in my first tests, it is possible to send partial updates
and soft commits to each replica and to the indexer, and when the indexer
gets a hard commit every replica is realigned.

Is this the right approach or am I misunderstanding how to use this
feature?
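
For reference, a partial update followed by a soft commit looks roughly like
this in SolrJ 4.x (a hedged sketch; the URL and field names are illustrative):

import java.util.Collections;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

HttpSolrServer replica = new HttpSolrServer("http://replica1:8080/solr/core0");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "42");
// Atomic update: "set" replaces the field value on the existing document.
doc.addField("availability", Collections.singletonMap("set", "in stock"));
replica.add(doc);
// waitFlush=true, waitSearcher=true, softCommit=true
replica.commit(true, true, true);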

I don't see soft commit propagation to replicas when sending updates to the
indexer only: is this true, or maybe I haven't changed some configuration
files when porting the application to Solr 4?

Giovanni


custom facet.sort

2013-05-07 Thread Giovanni Bricconi
I have a string field containing values such as 1khz 1ghz 1mhz etc.

I use this field to show a facet; currently I'm showing results in
facet.sort=count order. Now I'm asked to reorder the facet according to the
unit of measure (khz/mhz/ghz).

I also have 3 or 4 other custom sorts to implement.

Is it possible to plug in a custom Java class to provide custom facet.sort
modes?
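
In the meantime, one client-side workaround is to re-sort the returned facet
counts after the query; a hedged SolrJ sketch (the field name "frequency" and
the unit order are illustrative):

import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

// rsp is the QueryResponse of a query with facet.field=frequency
void sortByUnit(QueryResponse rsp) {
  final List<String> unitOrder = Arrays.asList("khz", "mhz", "ghz");
  List<FacetField.Count> counts = rsp.getFacetField("frequency").getValues();
  Collections.sort(counts, new Comparator<FacetField.Count>() {
    public int compare(FacetField.Count a, FacetField.Count b) {
      return rank(a.getName()) - rank(b.getName());
    }
    private int rank(String value) {
      for (int i = 0; i < unitOrder.size(); i++) {
        if (value.endsWith(unitOrder.get(i))) return i;
      }
      return unitOrder.size();   // unknown units sort last
    }
  });
}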

Thank you

Giovanni


Re: Solr and OpenPipe

2013-03-28 Thread Giovanni Bricconi
Nice one!
I see we're having fun.
 On 28 Mar 2013 at 17:11, Fabio Curti fabio.cu...@gmail.com
wrote:

 git clone https://github.com/kolstae/openpipe
 cd openpipe
 mvn install

 regards



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-and-OpenPipe-tp484777p4052079.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr 4 plugins

2012-12-23 Thread Giovanni Bricconi
This is really interesting!
Do you know if these added fields can be used in sorting or faceting?
Thanks
On 23 Dec 2012 at 14:08, Otis Gospodnetic otis.gospodne...@gmail.com
wrote:

 Hi,

 Look into writing a custom SearchComponent.

 Otis
 Solr & ElasticSearch Support
 http://sematext.com/
 On Dec 23, 2012 2:07 AM, Eyal Ben-Meir eya...@gmail.com wrote:

  Hi all,
  I want to use Solr 4 as a full-text search engine, but I need one of the
  query fields to get its answer not from the Lucene engine but from my own
  engine. The rest should continue as normal.
  Any ideas how to do it? Thanks.