Performing DIH on a predefined list of IDs
Relatively frequently (about once a month) we need to reindex the data by using DIH to copy the data from one index to another. Because we have a large index, it can take from 12 to 24 hours to complete. At the same time the old index is being queried by users. Sometimes DIH is interrupted in the middle because of some unexpected exception caused by OutOfMemory or something else (many times it has failed when more than 90% was completed). More than this, almost every time some items are missing in the new index, and it is very complicated to find them. At this stage I can't be sure exactly which documents were missed, and I have to do it all again and wait for many hours. At the same time the old index constantly receives new items.

I want to suggest the following way to solve the problem:
• Get a list of all item IDs (call the Lucene API, like CLUE does, for example).
• Start DIH, which will iterate over those IDs and each time make a query for n items. Of course the original DIH class would have to be changed to support this.

This would give the following advantages:
1. I will know exactly which items failed.
2. I can restart the process from any point, and in case of a DIH failure restart it from the point of failure.

So the main difference is that DIH currently runs on a *:* query, and I suggest running it on a list of IDs. For example, if I have 1000 docs and want this new DIH to take 100 docs each time, it will do 10 queries, each one with 100 IDs (like id:(1 2 3 ... 100), then id:(101 102 ... 200), etc.).

The question is: what do you think about it? Or could all of this be done another way, and am I trying to reinvent the wheel?
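(For comparison, the same restartable batching can be sketched outside DIH with a small SolrJ client. This is only a sketch: the hostnames, collection names, and the plain "id" field are assumptions, and it presumes all fields are stored so documents can be copied as-is.)

import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class IdBatchReindexer {

    // Copies documents from a source index to a target index, one ID batch at
    // a time, so a failed batch can be logged and the run resumed from there.
    public static void reindex(List<String> allIds, int batchSize) throws Exception {
        HttpSolrServer source = new HttpSolrServer("http://oldhost:8983/solr/oldcollection");
        HttpSolrServer target = new HttpSolrServer("http://newhost:8983/solr/newcollection");

        for (int from = 0; from < allIds.size(); from += batchSize) {
            List<String> batch = allIds.subList(from, Math.min(from + batchSize, allIds.size()));
            // Build an id:(1 2 3 ... 100) style query for this batch.
            SolrQuery query = new SolrQuery("id:(" + String.join(" ", batch) + ")");
            query.setRows(batchSize);
            try {
                for (SolrDocument found : source.query(query).getResults()) {
                    SolrInputDocument in = ClientUtils.toSolrInputDocument(found);
                    in.removeField("_version_"); // avoid optimistic-locking clashes on the target
                    target.add(in);
                }
            } catch (Exception e) {
                // Record the failed offset; a later run can resume exactly here.
                System.err.println("batch at offset " + from + " failed: " + e);
            }
        }
        target.commit();
        source.shutdown();
        target.shutdown();
    }
}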
Remove all parent docs having specific child doc
Hi, I want to remove all the parent docs having a specific child doc. E.g.:

<doc>Employee1
  <doc><field>Dept1</field></doc>
  <doc><field>Dept2</field></doc>
</doc>
<doc>Employee2
  <doc><field>Dept2</field></doc>
  <doc><field>Dept3</field></doc>
</doc>

Query: remove all employees which lie in Dept1. The response should be Employee2 *only*.

Problem: the *NOT operator is not supported* in the block join query parser.

q = {!parent which=employee:*}*:* -department:Dept1 - returns Employee1, Employee2
q = -{!parent which=employee:*}department:Dept1 - does not work with the block join query parser.

Please suggest how to filter those employees which lie in Dept1 using block join or any other query parser?

Thanks, Lokesh
Use multiple collections having different configurations
Hello, I have a scenario where I want to create/use 2 collections in the same Solr, named collection1 and collection2. I want to use distributed servers. Each collection has multiple shards, and each collection contains different configurations (solrconfig.xml and schema.xml). How can I do that? And if I then want to re-configure any one collection, how do I do that?

As I know, if we use a single collection with multiple shards, then we need to use this upconfig command:

example/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig -confdir example/solr/collection1/conf -confname default

and restart all the nodes. For 2 collections in the same Solr, how can I re-configure?
Advantage of using Java programming with Solr over Solr API
Hi, What are the advantages of Java programming with Solr over the Solr API?
Re: ignoring bad documents during index
I want to experiment with this issue; where exactly should I take a look? I want to try to fix this missing aggregation. What class is responsible for that?
RE: Committed before 500
Hi Shawn, I do not want to increase the timeout as these errors are very few, and the current timeout of 90 seconds is good enough. Is there a way to find out why Solr is timing out (at times)? Could it be that Solr is busy doing other activities like re-indexing, commits etc.? Additionally, I found that some non-leader nodes move to "recovering" or "recovery failed" after these timeout errors. I am just wondering if these are related to a performance issue and whether Solr commits need to be controlled.

Regards, Naresh Jakher

From: Shawn Heisey-2 [via Lucene] [mailto:ml-node+s472066n4187382...@n3.nabble.com] Sent: Thursday, February 19, 2015 8:12 PM To: Jakher, Naresh Subject: Re: Committed before 500

On 2/19/2015 6:30 AM, NareshJakher wrote:

I am using Solr cloud with 3 nodes; at times the following error is observed in logs during delete operations. Is it a performance issue? What can be done to resolve this issue? Committed before 500 {msg=Software caused connection abort: socket write error,trace=org.eclipse.jetty.io.EofException I did search on old topics but couldn't find anything concrete related to Solr cloud. Would appreciate any help on the issue as I am relatively new to Solr.

A jetty EofException indicates that one specific thing is happening: the TCP connection from the client was severed before Solr responded to the request. Usually this happens because the client has been configured with an absolute timeout or an inactivity timeout, and the timeout was reached. Configuring timeouts so that you can be sure clients don't get stuck is a reasonable idea, but any configured timeouts should be VERY long. You'd want to use a value like five minutes, rather than 10, 30, or 60 seconds. The timeouts MIGHT be in the HttpShardHandler config that Solr and SolrCloud use for distributed searches, and they also might be in operating-system-level config.

https://wiki.apache.org/solr/SolrConfigXml?highlight=%28HttpShardHandler%29#Configuration_of_Shard_Handlers_for_Distributed_searches

Thanks, Shawn
Re: ignoring bad documents during index
On 20 February 2015 at 15:31, SolrUser1543 osta...@gmail.com wrote: I want to experiment with this issue; where exactly should I take a look? I want to try to fix this missing aggregation. What class is responsible for that?

Are you indexing through SolrJ, DIH, or what?

Regards,
Re: Advantage of using Java programming with Solr over Solr API
On 2/20/2015 6:38 AM, Nitin Solanki wrote: I mean embedded Solr. On Fri, Feb 20, 2015 at 7:05 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: This question makes no sense. Do you mean embedded Solr vs. standalone? Regards, Alex On 20 Feb 2015 3:30 am, Nitin Solanki nitinml...@gmail.com wrote: Hi, What are the advantages of Java programming with Solr over the Solr API?

Standalone Solr offers the admin UI and the ability to do some of your testing with hand-typed URLs in a browser. The embedded server is completely unreachable from anywhere but the Java program that embeds it, and has no options for redundancy and high availability.

The Java client implementations offer objects and methods that are very easy for a Java developer to understand and write with a very small amount of code, and do not require any user code for building URLs or communicating over HTTP.

If you were thinking about using EmbeddedSolrServer, you can use one of the other SolrServer (SolrClient in 5.0) implementations instead with a standalone Solr installation. The resulting client code will be nearly identical to what you'd use with EmbeddedSolrServer, because EmbeddedSolrServer is simply another implementation of the same abstract class and interfaces that are used by objects like HttpSolrServer and CloudSolrServer ("Server" is replaced by "Client" in 5.0).

Thanks, Shawn
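(To illustrate Shawn's point, a minimal sketch; the URL and core name are placeholders. The client logic is written against the abstract SolrServer type, so an EmbeddedSolrServer could be passed in without changing the method.)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchClient {

    // Works identically whether "solr" is an HttpSolrServer, a CloudSolrServer,
    // or an EmbeddedSolrServer, because all of them extend SolrServer.
    static long countMatches(SolrServer solr, String queryString) throws Exception {
        QueryResponse response = solr.query(new SolrQuery(queryString));
        return response.getResults().getNumFound();
    }

    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        System.out.println(countMatches(solr, "*:*"));
        solr.shutdown();
    }
}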
Re: Advantage of using Java programming with Solr over Solr API
This question makes no sense. Do you mean embedded Solr vs. standalone?

Regards, Alex

On 20 Feb 2015 3:30 am, Nitin Solanki nitinml...@gmail.com wrote: Hi, What are the advantages of Java programming with Solr over the Solr API?
Re: Advantage of using Java programming with Solr over Solr API
I mean embedded Solr.

On Fri, Feb 20, 2015 at 7:05 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: This question makes no sense. Do you mean embedded Solr vs. standalone? Regards, Alex On 20 Feb 2015 3:30 am, Nitin Solanki nitinml...@gmail.com wrote: Hi, What are the advantages of Java programming with Solr over the Solr API?
Re: Collations are not working fine.
How do I get only the best collations (those with the most hits), and how do I sort them?

On Wed, Feb 18, 2015 at 3:53 AM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote:

Hi Nitin, I was trying many different options for a couple of different queries. In fact, I have collations working ok now with the Suggester and WFSTLookup. The problem may have been due to a different dictionary and/or lookup implementation and the specific options I was sending.

In general, we're using spellcheck for search suggestions. The Suggester component (vs. the Suggester spellcheck implementation) doesn't handle all of our cases, but we can get things working using the spellcheck interface. What gives us particular trouble are the cases where a term may be valid by itself, but also be the start of longer words. The specific terms are acronyms specific to our business, but I'll attempt to show generic examples. E.g. a partial term like "fo" can expand to fox, fog, etc., and a full term like "brown" can also expand to something like brownstone. And, yes, the collation "brownstone fox" is nonsense. But assume, for the sake of argument, it appears in our documents somewhere.

For a multiple-term query with a spelling error (or a partially typed term), "brown fo", we get collations in order of hits, descending, like: brown fox, brown fog, brownstone fox. So far, so good.

For a single-term query, "brown", we get a single suggestion, "brownstone", and no collations. So we don't know to keep the term "brown"! At this point, we need spellcheck.extendedResults=true and to look at the origFreq value in the suggested corrections. Unfortunately, the Suggester (spellcheck dictionary) does not populate the original frequency information, and without this information, the SpellCheckComponent cannot format the extended results. However, with a simple change to Suggester.java, it was easy to get the needed frequency information and use it to make a sound decision to keep or drop the input term. But I'd be much obliged if there is a better way to go about it.

Configs below.

Thanks, Charlie

<!-- SpellCheck component -->
<searchComponent class="solr.SpellCheckComponent" name="suggestSC">
  <lst name="spellchecker">
    <str name="name">suggestDictionary</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.WFSTLookupFactory</str>
    <str name="field">text_all</str>
    <float name="threshold">0.0001</float>
    <str name="exactMatchFirst">true</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<!-- Request Handler -->
<requestHandler name="/tcSuggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="title">Search Suggestions (spellcheck)</str>
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="rows">0</str>
    <str name="defType">edismax</str>
    <str name="df">text_all</str>
    <str name="fl">id,name,ticker,entityType,transactionType,accountType</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.dictionary">suggestDictionary</str>
    <str name="spellcheck.alternativeTermCount">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">5</str>
  </lst>
  <arr name="last-components">
    <str>suggestSC</str>
  </arr>
</requestHandler>

-----Original Message-----
From: Nitin Solanki [mailto:nitinml...@gmail.com] Sent: Tuesday, February 17, 2015 3:17 AM To: solr-user@lucene.apache.org Subject: Re: Collations are not working fine.

Hi Charles, will you please send the configuration which you tried? It will help to solve my problem.
Have you sorted the collations on hits or frequencies of suggestions? If you did, then please assist me.

On Mon, Feb 16, 2015 at 7:59 PM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote:

I have been working with collations the last couple of days, and I kept adding the collation-related parameters until it started working for me. It seems I needed <str name="spellcheck.collateMaxCollectDocs">50</str>. But I am using the Suggester with the WFSTLookupFactory. Also, I needed to patch the suggester to get frequency information in the spellcheck response.

-----Original Message-----
From: Rajesh Hazari [mailto:rajeshhaz...@gmail.com] Sent: Friday, February 13, 2015 3:48 PM To: solr-user@lucene.apache.org Subject: Re: Collations are not working fine.

Hi Nitin, can you try the below config? We have this config and it seems to be working for us.

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_general</str>
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">textSpell</str>
    <str name="combineWords">true</str>
    <str name="breakWords">false</str>
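(Back to Nitin's question at the top of the thread: once collations come back, they can be sorted by hits on the client. A rough SolrJ sketch follows; the host and the /tcSuggest handler come from Charles' config above, and note that hits are only gathered when spellcheck.maxCollationTries is greater than 0, with the expanded collation format enabled via spellcheck.collateExtendedResults=true.)

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.SpellCheckResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse.Collation;

public class BestCollations {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery query = new SolrQuery("brown fo");
        query.setRequestHandler("/tcSuggest");

        SpellCheckResponse spell = solr.query(query).getSpellCheckResponse();
        if (spell != null && spell.getCollatedResults() != null) {
            List<Collation> collations = new ArrayList<Collation>(spell.getCollatedResults());
            // Sort by hits, descending, so the best collations come first.
            Collections.sort(collations, new Comparator<Collation>() {
                public int compare(Collation a, Collation b) {
                    return Long.compare(b.getNumberOfHits(), a.getNumberOfHits());
                }
            });
            for (Collation c : collations) {
                System.out.println(c.getCollationQueryString() + " -> " + c.getNumberOfHits() + " hits");
            }
        }
        solr.shutdown();
    }
}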
Re: Use multiple collections having different configurations
On 2/20/2015 4:06 AM, Nitin Solanki wrote: I have a scenario where I want to create/use 2 collections in the same Solr, named collection1 and collection2. I want to use distributed servers. Each collection has multiple shards, and each collection contains different configurations (solrconfig.xml and schema.xml). How can I do that? And if I then want to re-configure any one collection, how do I do that? As I know, if we use a single collection with multiple shards, then we need to use this upconfig command: example/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd upconfig -confdir example/solr/collection1/conf -confname default and restart all the nodes. For 2 collections in the same Solr, how can I re-configure?

First, upload your two different configurations with zkcli upconfig using two different names. Create your collections with the Collections API, and tell each one to use a different collection.configName. If the collection already exists, use the zkcli linkconfig command, and reload the collection.

If you need to change a config, edit the config on disk and re-do the zkcli upconfig, then reload the collection with the Collections API. Alternately, you could upload a whole new config and then link it to the existing collection.

The Collections API is not yet exposed in the admin interface; you will need to do those calls yourself. If you're doing this with SolrJ, there are some objects inside CollectionAdminRequest that let you do all the API actions.

Thanks, Shawn
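(If you end up scripting this with SolrJ 5.0, a sketch of those calls might look like the following. The ZooKeeper address, collection names, and config names are placeholders, and this assumes the setter-style Create/Reload request objects in 5.0's CollectionAdminRequest.)

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class TwoCollections {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("localhost:9983");

        // Each collection points at a different config set uploaded with zkcli upconfig.
        CollectionAdminRequest.Create create1 = new CollectionAdminRequest.Create();
        create1.setCollectionName("collection1");
        create1.setNumShards(2);
        create1.setConfigName("conf1");
        client.request(create1);

        CollectionAdminRequest.Create create2 = new CollectionAdminRequest.Create();
        create2.setCollectionName("collection2");
        create2.setNumShards(3);
        create2.setConfigName("conf2");
        client.request(create2);

        // After re-uploading an edited config, reload the collection that uses it.
        CollectionAdminRequest.Reload reload = new CollectionAdminRequest.Reload();
        reload.setCollectionName("collection2");
        client.request(reload);

        client.close();
    }
}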
Re: Performing DIH on a predefined list of IDs
My index has about 110 million documents, split over several shards. Maybe the number is not so big, but each document is relatively large.

The reason to perform the reindex is something like adding new fields, or adding an update processor which can extract something from one field and put it in another, etc. Each time I need to reindex the data, I create a new collection and start to import the data from the old one. That gives the update processors the opportunity to act. The DIH runs with a *:* query and takes some number of items each time. In case of an exception, the process stops in the middle and I can't restart it from that point.

That's the reason I want to run on a predefined list of IDs. In this case I will be able to restart from any point and to know about failed IDs.
Re: Performing DIH on a predefined list of IDs
On 2/20/2015 3:46 PM, Shawn Heisey wrote: If the URL parameter is idlist then you can use ${dih.request.idlist} in your SELECT statement. I realized after I sent this that you are not using a database ... the list would simply go in the query you send to the other server. I don't know whether the request that the SolrEntityProcessor sends is a GET or a POST, so for a really large list of IDs, you might need to edit the container config on both servers. Thanks, Shawn
Re: Performing DIH on a predefined list of IDs
On 2/20/2015 2:57 PM, SolrUser1543 wrote: That's the reason I want to run on a predefined list of IDs. In this case I will be able to restart from any point and to know about failed IDs.

You can include information in a URL parameter and then use that URL parameter inside your dih config. If the URL parameter is idlist, then you can use ${dih.request.idlist} in your SELECT statement.

Be aware that most servlet containers have a default header length limit of about 8192 characters, affecting the length of the URL that can be sent successfully. If the list of IDs is going to get huge, you will either need to switch from a GET to a POST request where the parameter is in the post body, or increase the header length limit in the servlet container that is running Solr.

Thanks, Shawn
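(For illustration, triggering such an import from SolrJ could look roughly like this. The /dataimport path and idlist parameter mirror the example above; clean=false just avoids wiping the index on every batch, and POST keeps the long ID list out of the URL, per the header-length concern.)

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class DihIdListTrigger {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("command", "full-import");
        params.set("clean", "false");
        // Available inside the DIH configuration as ${dih.request.idlist}.
        params.set("idlist", "101 102 103 104 105");

        // POST puts the parameters in the request body instead of the URL.
        QueryRequest request = new QueryRequest(params, SolrRequest.METHOD.POST);
        request.setPath("/dataimport");
        System.out.println(solr.request(request));
        solr.shutdown();
    }
}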
Re: Clarification of locktype=single and implications of use
Thanks Hoss,

Protection from misconfiguration and/or from starting separate Solr instances pointing to the same index dir I can understand. The current documentation on the wiki and in the ref guide (along with just enough understanding of Solr/Lucene indexing to be dangerous) left me wondering if maybe somehow a correctly configured Solr might have multiple processes writing to the same file. I'm wondering if your explanation above might be added to the documentation.

Tom

On Fri, Feb 20, 2015 at 1:25 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: We are using Solr. We would not configure two different Solr instances to
: write to the same index. So why would a normal Solr set-up possibly end
: up having more than one process writing to the same index?

The risk here is that if you configure lockType=single, and then have some unintended user error such that two distinct java processes both attempt to use the same index dir, the lockType will not protect you in that situation. For example: you normally run Solr on port 8983, but someone accidentally starts a second instance of Solr on port 7574 using the exact same configs with the exact same index dir -- lockType=single won't help you spot this error. lockType=native will (assuming your FileSystem can handle it).

lockType=single should protect you, however, if, for example, multiple SolrCores within the same Solr java process attempted to refer to the same index dir because you accidentally put an absolute path in a solrconfig.xml that gets shared by multiple cores.

-Hoss
http://www.lucidworks.com/
[ANNOUNCE] Apache Solr 5.0.0 and Reference Guide for Solr 5.0 released
20 February 2015, Apache Solr™ 5.0.0 and Reference Guide for Solr 5.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 5.0.0.

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites.

Solr 5.0 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the CHANGES.txt file included with the release for a full list of details.

Solr 5.0 Release Highlights:

* Usability improvements that include improved bin scripts and new and restructured examples.
* Scripts to support installing and running Solr as a service on Linux.
* Distributed IDF is now supported and can be enabled via the config. Currently, there are four supported implementations for the same:
  * LocalStatsCache: Local document stats.
  * ExactStatsCache: One-time-use aggregation.
  * ExactSharedStatsCache: Stats shared across requests.
  * LRUStatsCache: Stats shared in an LRU cache across requests.
* Solr will no longer ship a war file and instead be a downloadable application.
* SolrJ now has first-class support for the Collections API.
* Implicit registration of replication, get, and admin handlers.
* Config API that supports paramsets for easily configuring Solr parameters and configuring fields. This API also supports managing of pre-existing request handlers and editing common solrconfig.xml via overlay.
* API for managing blobs allows uploading request handler jars and registering them via the config API.
* BALANCESHARDUNIQUE Collection API that allows for even distribution of custom replica properties.
* There's now an option to not shuffle the nodeSet provided during collection creation.
* Option to configure bandwidth usage by the Replication handler to prevent it from using up all the bandwidth.
* Splitting of clusterstate to per-collection enables scalability improvement in SolrCloud. This is also the default format for new Collections that would be created going forward.
* timeAllowed is now used to prematurely terminate requests during query expansion and SolrClient request retry.
* pivot.facet results can now include nested stats.field results constrained by those pivots.
* stats.field can be used to generate stats over the results of arbitrary numeric functions. It also allows for requesting statistics for pivot facets using tags.
* A new DateRangeField has been added for indexing date ranges, especially multi-valued ones.
* Spatial fields that used to require units=degrees now take distanceUnits=degrees/kilometers/miles instead.
* MoreLikeThis query parser allows requesting documents similar to an existing document and also works in SolrCloud mode.
* Logging improvements:
  * Transaction log replay status is now logged.
  * Optional logging of slow requests.

Solr 5.0 also includes many other new features as well as numerous optimizations and bugfixes of the corresponding Apache Lucene release.

Detailed change log: http://lucene.apache.org/solr/5_0_0/changes/Changes.html

Also available is the *Solr Reference Guide for Solr 5.0*. This 535-page PDF serves as the definitive user's manual for Solr 5.0.
It can be downloaded from the Apache mirror network: https://s.apache.org/Solr-Ref-Guide-PDF Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. -- Anshum Gupta http://about.me/anshumgupta
Re: Performing DIH on a predefined list of IDs
It's a little bit hard to get the overall context, e.g. why you live with OOME as usual, what the reasoning is for pulling from one index to another, and what's added during this process.

Make sure that you are aware of http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor which queries another Solr, and http://wiki.apache.org/solr/DataImportHandler#LogTransformer which you can use to log recently imported ids, to be able to restart indexing from that point.

You can drop me more details in your native language if you wish.

On Fri, Feb 20, 2015 at 1:32 PM, SolrUser1543 osta...@gmail.com wrote: Relatively frequently (about once a month) we need to reindex the data by using DIH to copy the data from one index to another. Because we have a large index, it can take from 12 to 24 hours to complete. At the same time the old index is being queried by users. Sometimes DIH is interrupted in the middle because of some unexpected exception caused by OutOfMemory or something else (many times it has failed when more than 90% was completed). More than this, almost every time some items are missing in the new index, and it is very complicated to find them. At this stage I can't be sure exactly which documents were missed, and I have to do it all again and wait for many hours. At the same time the old index constantly receives new items. I want to suggest the following way to solve the problem: • Get a list of all item IDs (call the Lucene API, like CLUE does, for example). • Start DIH, which will iterate over those IDs and each time make a query for n items. Of course the original DIH class would have to be changed to support this. This would give the following advantages: 1. I will know exactly which items failed. 2. I can restart the process from any point, and in case of a DIH failure restart it from the point of failure. So the main difference is that DIH currently runs on a *:* query, and I suggest running it on a list of IDs. For example, if I have 1000 docs and want this new DIH to take 100 docs each time, it will do 10 queries, each one with 100 IDs (like id:(1 2 3 ... 100), then id:(101 102 ... 200), etc.). The question is: what do you think about it? Or could all of this be done another way, and am I trying to reinvent the wheel?

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: Performing DIH on a predefined list of IDs
Personally, I much prefer indexing from an independent SolrJ client to using DIH when I have to take explicit control of errors etc. Here's an example: https://lucidworks.com/blog/indexing-with-solrj/

In your example, you seem to be assuming that the Lucene IDs (and here I'm assuming you're not talking about the internal Lucene ID) correspond to some kind of primary key in your database table. But the correspondence isn't necessarily straightforward; how would it handle composite keys?

I'll leave actual comments on DIH's internals to people who, you know, actually understand the code ;)...

Erick

On Fri, Feb 20, 2015 at 2:32 AM, SolrUser1543 osta...@gmail.com wrote: Relatively frequently (about once a month) we need to reindex the data by using DIH to copy the data from one index to another. Because we have a large index, it can take from 12 to 24 hours to complete. At the same time the old index is being queried by users. Sometimes DIH is interrupted in the middle because of some unexpected exception caused by OutOfMemory or something else (many times it has failed when more than 90% was completed). More than this, almost every time some items are missing in the new index, and it is very complicated to find them. At this stage I can't be sure exactly which documents were missed, and I have to do it all again and wait for many hours. At the same time the old index constantly receives new items. I want to suggest the following way to solve the problem: • Get a list of all item IDs (call the Lucene API, like CLUE does, for example). • Start DIH, which will iterate over those IDs and each time make a query for n items. Of course the original DIH class would have to be changed to support this. This would give the following advantages: 1. I will know exactly which items failed. 2. I can restart the process from any point, and in case of a DIH failure restart it from the point of failure. So the main difference is that DIH currently runs on a *:* query, and I suggest running it on a list of IDs. For example, if I have 1000 docs and want this new DIH to take 100 docs each time, it will do 10 queries, each one with 100 IDs (like id:(1 2 3 ... 100), then id:(101 102 ... 200), etc.). The question is: what do you think about it? Or could all of this be done another way, and am I trying to reinvent the wheel?
Re: Strange search behaviour when upgrading to 4.10.3
Hi Shawn,

Also, the tokenizer we use is very similar to the following:
ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalTokenizer.java
ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalLexer.jflex

From the looks of it, the text is being indexed as a single token and not broken across whitespace.

Thanks, Rishi.

-----Original Message-----
From: Shawn Heisey apa...@elyograg.org To: solr-user solr-user@lucene.apache.org Sent: Fri, Feb 20, 2015 11:52 am Subject: Re: Strange search behaviour when upgrading to 4.10.3

The links you provided lead to zero-byte files when I try them, so I could not look deeper.

Have you recompiled your custom analysis components against the newer versions of the Solr/Lucene libraries? Anytime you're dealing with custom components, you cannot assume that a component compiled to work with one version of Solr will work with another version. The internal API does change, and there is less emphasis on avoiding API breaks in minor Solr releases than there is with Lucene, because the vast majority of Solr users are not writing their own code that uses the Solr API. Recompiling against the newer libraries may cause compiler errors that reveal places in your code that require changes.

Thanks, Shawn
Re: rankquery usage bug?
Ryan, This looks like a good jira ticket to me. Joel Bernstein Search Engineer at Heliosearch On Fri, Feb 20, 2015 at 6:40 PM, Ryan Josal rjo...@gmail.com wrote: Hey guys, I put a rq in defaults but I can't figure out how to override it with no rankquery. Looks like one option might be checking for empty string before trying to use it in QueryComponent? I can work around it in the prep method of an earlier searchcomponent for now. Ryan
Re: Use multiple collections having different configurations
Thanks Shawn.

On Fri, Feb 20, 2015 at 7:53 PM, Shawn Heisey apa...@elyograg.org wrote:

First, upload your two different configurations with zkcli upconfig using two different names. Create your collections with the Collections API, and tell each one to use a different collection.configName. If the collection already exists, use the zkcli linkconfig command, and reload the collection. If you need to change a config, edit the config on disk and re-do the zkcli upconfig, then reload the collection with the Collections API. Alternately, you could upload a whole new config and then link it to the existing collection. The Collections API is not yet exposed in the admin interface; you will need to do those calls yourself. If you're doing this with SolrJ, there are some objects inside CollectionAdminRequest that let you do all the API actions.

Thanks, Shawn
Re: Strange search behaviour when upgrading to 4.10.3
On 2/20/2015 4:24 PM, Rishi Easwaran wrote: Also, the tokenizer we use is very similar to the following. ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalTokenizer.java ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalLexer.jflex From the looks of it the text is being indexed as a single token and not broken across whitespace. I can't claim to know how analyzer code works. I did manage to see the code, but it doesn't mean much to me. I would suggest using the analysis tab in the Solr admin interface. On that page, select the field or fieldType, set the verbose flag and type the actual field contents into the index side of the page. When you click the Analyze Values button, it will show you what Solr does with the input at index time. Do you still have access to any machines (dev or otherwise) running the old version with the custom component? If so, do the same things on the analysis page for that version that you did on the new version, and see whether it does something different. If it does do something different, then you will need to track down the problem in the code for your custom analyzer. Thanks, Shawn
Re: ignoring bad documents during index
At the layer right before you send that XML out, add a fallback option on error: if the batch fails, re-send each document one at a time.

Michael Della Bitta
Senior Software Engineer
o: +1 646 532 3062

appinions inc.
"The Science of Influence Marketing"
18 East 41st Street New York, NY 10017
t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions
w: appinions.com http://www.appinions.com/

On Fri, Feb 20, 2015 at 10:26 AM, SolrUser1543 osta...@gmail.com wrote: I am sending a bulk of XML documents via HTTP request, the same way as indexing documents via the Solr admin interface.
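(In SolrJ terms, that fallback might be sketched like this; a rough sketch only, with the client and batch plumbing assumed rather than taken from the poster's setup.)

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class FallbackIndexer {

    // Tries the whole batch first; if it fails, retries one document at a time
    // so a single bad document no longer sinks the rest of the batch.
    static List<SolrInputDocument> addWithFallback(HttpSolrServer solr,
                                                   List<SolrInputDocument> batch) {
        List<SolrInputDocument> rejected = new ArrayList<SolrInputDocument>();
        try {
            solr.add(batch);
        } catch (Exception batchFailure) {
            for (SolrInputDocument doc : batch) {
                try {
                    solr.add(doc);
                } catch (Exception docFailure) {
                    rejected.add(doc); // collect the bad documents for later inspection
                }
            }
        }
        return rejected;
    }
}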
Clarification of locktype=single and implications of use
Hello,

We don't want to use locktype=native (we are using NFS) or locktype=simple (we mount a read-only snapshot of the index on our search servers, and with locktype=simple, Solr refuses to start up because it sees the lock file). However, we don't quite understand the warnings about using locktype=single in the context of normal Solr operation. The ref guide and the wiki (http://wiki.apache.org/lucene-java/AvailableLockFactories) seem to indicate there is some danger in using locktype=single.

The wiki says:

locktype=single: Uses an object instance to represent the lock, so this is useful when you are certain that all modifications to a given index are running against a single shared in-process Directory instance. This is currently the default locking for RAMDirectory, but it could also make sense on an FSDirectory provided the other processes use the index in read-only.

We are using Solr. We would not configure two different Solr instances to write to the same index. So why would a normal Solr set-up possibly end up having more than one process writing to the same index? At the Lucene level there are multiple indexing threads, but they each write their own segments, and (I think) all the threads are in the same Solr process. Are we safe using locktype=single?

Tom
Re: Getting unique key of a document inside of a Similarity class.
from all the examples of what you've described, i'm fairly certain all you really need is a TFIDF based Similarity where coord(), idf(), tf() and queryNorm() return 1 always, and you omitNorms from all fields.

Yeah, that's what I did in the very first iteration. It works only for cases #1 and #2. If you try queries 3 and 4 with such a Similarity, you'll get:

3. place:(34\ High\ Street)^3 = doc1(score=9), doc2(score=9)
4. name:DocumentOne^7 OR place:(34\ High\ Street)^3 = doc1(score=16), doc2(score=9)

That is not what I need. As I described above, in the case of a multiple-token match on a field, the method SimScorer.score is called X times, where X is the number of matched tokens (in cases #3 and #4 there are 3 tokens), and therefore the scores sum up. I need to score only once in this case, regardless of the number of tokens. How do I do that?

The first idea was a HashSet based on fieldName, so that after scoring a field once, it doesn't score anymore. But in this case only the first document was scored (since the second and subsequent documents have the same field name). So I understood that I also need the docID for that. And it worked fine until I found out (thank you for that) that docID is segment-specific. So now I need a segment ID as well (or something similar).

(You didn't give any examples of what you expect to happen with exclusion clauses in your BooleanQueries.) For my needs I won't need exclusion clauses, but in that case the same would happen: it would score depending on the weight, because the condition is true:

5. (NOT name:DocumentOne)^7 = doc2(score=7)
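(For reference, the "first iteration" flat Similarity described above looks roughly like this in Lucene 4.x terms; a sketch, not the poster's actual class. As the examples show, a BooleanQuery still sums these contributions per matched token, which is why place:(34\ High\ Street)^3 scores 9 rather than 3.)

import org.apache.lucene.search.similarities.DefaultSimilarity;

// Flattens TF-IDF so each matching term contributes exactly its query boost.
// Field norms must also be disabled (omitNorms=true on every field).
public class FlatSimilarity extends DefaultSimilarity {
    @Override public float tf(float freq) { return 1f; }
    @Override public float idf(long docFreq, long numDocs) { return 1f; }
    @Override public float coord(int overlap, int maxOverlap) { return 1f; }
    @Override public float queryNorm(float sumOfSquaredWeights) { return 1f; }
}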
Re: Committed before 500
Since you are getting these failures, the 90-second timeout is not "good enough". Try increasing it.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On Feb 20, 2015, at 5:22 AM, NareshJakher naresh.jak...@capgemini.com wrote:

Hi Shawn, I do not want to increase the timeout as these errors are very few, and the current timeout of 90 seconds is good enough. Is there a way to find out why Solr is timing out (at times)? Could it be that Solr is busy doing other activities like re-indexing, commits etc.? Additionally, I found that some non-leader nodes move to "recovering" or "recovery failed" after these timeout errors. I am just wondering if these are related to a performance issue and whether Solr commits need to be controlled.

Regards, Naresh Jakher
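(For what it's worth, if the 90-second limit lives in a SolrJ client, raising it along the lines Shawn suggested is a two-liner; the URL is a placeholder, and the timeout may equally be set in the servlet container or the shard-handler config, which is worth ruling out first.)

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class TimeoutConfig {
    public static void main(String[] args) {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        solr.setConnectionTimeout(15000); // 15 seconds to establish the TCP connection
        solr.setSoTimeout(300000);        // 5 minutes of socket inactivity, per Shawn's advice
    }
}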
Re: Remove all parent docs having specific child doc
On Fri, Feb 20, 2015 at 2:10 PM, Lokesh Chhaparwal xyzlu...@gmail.com wrote:

Hi, I want to remove all the parent docs having a specific child doc. E.g.:

<doc>Employee1
  <doc><field>Dept1</field></doc>
  <doc><field>Dept2</field></doc>
</doc>
<doc>Employee2
  <doc><field>Dept2</field></doc>
  <doc><field>Dept3</field></doc>
</doc>

Query: remove all employees which lie in Dept1. The response should be Employee2 *only*.

Problem: the *NOT operator is not supported* in the block join query parser.

q = {!parent which=employee:*}*:* -department:Dept1 - returns Employee1, Employee2

AFAIK the space in "}*:* -department:Dept1" breaks the parsing after the first space following q=; btw, you can confirm this by looking at the debugQuery=true output. Hence try either:

q={!parent which=employee:*}*:* -department:Dept1
q={!parent which=employee:*}*:*\ -department:Dept1
q={!parent which=employee:* v=$cq}&cq=*:* -department:Dept1

q = -{!parent which=employee:*}department:Dept1 - it does not work with the block join query parser.

Please suggest how to filter those employees which lie in Dept1 using block join or any other query parser?

Thanks, Lokesh

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
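(A sketch of Mikhail's third variant issued from SolrJ, which keeps the child query in its own parameter and sidesteps the space-parsing problem entirely; the field names are the ones from this thread, and the URL is a placeholder.)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class BlockJoinFilter {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // v=$cq pulls the child query from a separate parameter, so the space
        // before -department:Dept1 never touches the {!parent} local params.
        SolrQuery query = new SolrQuery();
        query.setQuery("{!parent which=employee:* v=$cq}");
        query.set("cq", "*:* -department:Dept1");
        query.set("debugQuery", "true"); // confirm how the child query was parsed

        System.out.println(solr.query(query).getResults());
        solr.shutdown();
    }
}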
Re: ignoring bad documents during index
I am sending a bulk of XML documents via HTTP request, the same way as indexing documents via the Solr admin interface.
Solr synonyms logic
Hi all,

I'm querying a recipe database in Solr. By using synonyms, I'm trying to make my search a little smarter. What I'm trying to do here is that a search for pastry returns all lasagne, penne and cannelloni recipes. However, a search for lasagne should only return lasagne recipes.

In my synonyms.txt, I have these lines:

lasagne,pastry
penne,pastry
cannelloni,pastry

The filter in my schema.xml looks like this:

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.WhitespaceTokenizerFactory" />

Only in the index analyzer, not in the query. When using the Solr analysis tool, I can see that my index for lasagne has a synonym pastry and my query only queries lasagne. Same for penne and cannelloni: they both have the synonym pastry. Currently my Solr query for lasagne also returns all penne and cannelloni recipes. I cannot understand why this is the case. Can someone explain this behaviour to me please?
Re: Strange search behaviour when upgrading to 4.10.3
Yes, the analyzers and tokenizers were recompiled with the new version of Solr/Lucene and there were some errors; most of them were related to using BytesRefBuilder, and I made those changes.

Can you try these links?
ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/ZimbraAnalyzer.java
ftp://zimbra.imladris.sk/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalAnalyzer.java

-----Original Message-----
From: Shawn Heisey apa...@elyograg.org To: solr-user solr-user@lucene.apache.org Sent: Fri, Feb 20, 2015 11:52 am Subject: Re: Strange search behaviour when upgrading to 4.10.3

The links you provided lead to zero-byte files when I try them, so I could not look deeper.

Have you recompiled your custom analysis components against the newer versions of the Solr/Lucene libraries? Anytime you're dealing with custom components, you cannot assume that a component compiled to work with one version of Solr will work with another version. The internal API does change, and there is less emphasis on avoiding API breaks in minor Solr releases than there is with Lucene, because the vast majority of Solr users are not writing their own code that uses the Solr API. Recompiling against the newer libraries may cause compiler errors that reveal places in your code that require changes.

Thanks, Shawn
Strange search behaviour when upgrading to 4.10.3
Hi,

We are trying to upgrade from Solr 4.6 to 4.10.3. When testing search, 4.10.3 search results are not being returned; actually it looks like only the first word in a sentence is getting indexed. Ex: inserting "This is a test message" only returns results when searching for content:this*. Searching for content:test* or content:message* does not work with 4.10. Only searching for content:*message* works. This leads me to believe there is something wrong with the behaviour of our analyzer and tokenizers.

A little bit of background. We have had our own analyzer and tokenizer since pre Solr 1.4, and it's been regularly updated. The analyzer works with Solr 4.6, which we have running in production (I also tested that search works with Solr 4.9.1). It is very similar to the tokenizers and analyzers located here:

ftp://193.87.16.77/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/ZimbraAnalyzer.java
ftp://193.87.16.77/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/UniversalAnalyzer.java
ftp://193.87.16.77/src/HELIX-720.fbsd/ZimbraServer/src/java/com/zimbra/cs/index/analysis/

but with modifications to work with the latest Solr/Lucene code, e.g. overriding createComponents.

The schema of the field being analyzed is as follows:

<fields>
  <field name="content" type="ourType" stored="false" indexed="true" required="false" multiValued="true" />
</fields>
<fieldType name="ourType" indexed="true" class="solr.TextField">
  <analyzer class="com.zimbra.cs.index.ZimbraAnalyzer" />
</fieldType>

Looking at the release notes from Solr and Lucene:
http://lucene.apache.org/solr/4_10_1/changes/Changes.html
http://lucene.apache.org/core/4_10_1/changes/Changes.html
Nothing really sticks out, at least to me. Any help to get it working with 4.10 would be great.

Thanks, Rishi.
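(For reference, the createComponents override mentioned above has roughly this shape in Lucene 4.x; UniversalTokenizer stands in for the custom JFlex tokenizer from the links, and any token filters would be chained around it.)

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;

public final class UniversalAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // UniversalTokenizer is the custom JFlex-generated tokenizer referenced above.
        Tokenizer source = new UniversalTokenizer(reader);
        return new TokenStreamComponents(source);
    }
}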
Re: Strange search behaviour when upgrading to 4.10.3
On 2/20/2015 9:37 AM, Rishi Easwaran wrote:

We are trying to upgrade from Solr 4.6 to 4.10.3. When testing search, 4.10.3 search results are not being returned; actually it looks like only the first word in a sentence is getting indexed. Ex: inserting "This is a test message" only returns results when searching for content:this*. Searching for content:test* or content:message* does not work with 4.10. Only searching for content:*message* works. This leads me to believe there is something wrong with the behaviour of our analyzer and tokenizers.

<snip>

<fields>
  <field name="content" type="ourType" stored="false" indexed="true" required="false" multiValued="true" />
</fields>
<fieldType name="ourType" indexed="true" class="solr.TextField">
  <analyzer class="com.zimbra.cs.index.ZimbraAnalyzer" />
</fieldType>

Looking at the release notes from Solr and Lucene:
http://lucene.apache.org/solr/4_10_1/changes/Changes.html
http://lucene.apache.org/core/4_10_1/changes/Changes.html
Nothing really sticks out, at least to me. Any help to get it working with 4.10 would be great.

The links you provided lead to zero-byte files when I try them, so I could not look deeper.

Have you recompiled your custom analysis components against the newer versions of the Solr/Lucene libraries? Anytime you're dealing with custom components, you cannot assume that a component compiled to work with one version of Solr will work with another version. The internal API does change, and there is less emphasis on avoiding API breaks in minor Solr releases than there is with Lucene, because the vast majority of Solr users are not writing their own code that uses the Solr API. Recompiling against the newer libraries may cause compiler errors that reveal places in your code that require changes.

Thanks, Shawn
rankquery usage bug?
Hey guys, I put a rq in defaults but I can't figure out how to override it with no rankquery. Looks like one option might be checking for empty string before trying to use it in QueryComponent? I can work around it in the prep method of an earlier searchcomponent for now. Ryan
Re: Clarification of locktype=single and implications of use
: We are using Solr. We would not configure two different Solr instances to
: write to the same index. So why would a normal Solr set-up possibly end
: up having more than one process writing to the same index?

The risk here is that if you configure lockType=single, and then have some unintended user error such that two distinct java processes both attempt to use the same index dir, the lockType will not protect you in that situation.

For example: you normally run Solr on port 8983, but someone accidentally starts a second instance of Solr on port 7574 using the exact same configs with the exact same index dir -- lockType=single won't help you spot this error. lockType=native will (assuming your FileSystem can handle it).

lockType=single should protect you, however, if, for example, multiple SolrCores within the same Solr java process attempted to refer to the same index dir because you accidentally put an absolute path in a solrconfig.xml that gets shared by multiple cores.

-Hoss
http://www.lucidworks.com/
Re: Remove all parent docs having specific child doc
*q= -{!parent which=employee:*}department:Dept1* - it does not work with the block join query parser.

What do you mean? What does this query (no spaces, with brackets) return in your case?

q=-({!parent which=employee:*}department:Dept1)

20.02.2015, 18:02, Mikhail Khludnev mkhlud...@griddynamics.com:

On Fri, Feb 20, 2015 at 2:10 PM, Lokesh Chhaparwal xyzlu...@gmail.com wrote: Hi, I want to remove all the parent docs having a specific child doc. Query: remove all employees which lie in Dept1. The response should be Employee2 *only*. Problem: the *NOT operator is not supported* in the block join query parser. q = {!parent which=employee:*}*:* -department:Dept1 - returns Employee1, Employee2

AFAIK the space in "}*:* -department:Dept1" breaks the parsing after the first space following q=; btw, you can confirm this by looking at the debugQuery=true output. Hence try either:

q={!parent which=employee:*}*:* -department:Dept1
q={!parent which=employee:*}*:*\ -department:Dept1
q={!parent which=employee:* v=$cq}&cq=*:* -department:Dept1

q = -{!parent which=employee:*}department:Dept1 - it does not work with the block join query parser. Please suggest how to filter those employees which lie in Dept1 using block join or any other query parser? Thanks, Lokesh

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com