replication problems with solr4.1

2013-02-11 Thread Bernd Fehling
Hi list,

after upgrading from solr4.0 to solr4.1 and running it for two weeks now
it turns out that replication has problems and produces unpredictable results.
My installation is a single index, 41 million docs / 115 GB index size /
1 master / 3 slaves.
- the master builds a new index from scratch once a week
- a replication is started manually with Solr admin GUI

What I see is one of these cases:
- after a replication a new searcher is opened on an index.xxx directory,
  the old data/index/ directory is never deleted, and besides the file
  replication.properties there is also a file index.properties
OR
- the replication takes place and everything looks fine, but when opening the
  admin GUI the statistics report:
Last Modified: a day ago
Num Docs: 42262349
Max Doc:  42262349
Deleted Docs:  0
Version:  45174
Segment Count: 1

        Version        Gen  Size
Master: 1360483635404  112  116.5 GB
Slave:  1360483806741  113  116.5 GB


In the first case, why is the replication doing that???
It is an offline slave, no search activity, just there for backup!


In the second case, why is the version and generation different right after
full replication?
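
(For comparing master and slave outside the admin GUI, the replication handler
can be queried directly; below is a minimal sketch using Python's requests
library, with hypothetical host and core names:)

import requests

# command=details on the /replication handler reports index version,
# generation and replication state for the queried core.
for host in ('master:8983', 'slave1:8983'):
    r = requests.get('http://%s/solr/core1/replication' % host,
                     params={'command': 'details', 'wt': 'json'})
    d = r.json()['details']
    print host, d['indexVersion'], d['generation']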


Any thoughts on this?


- Bernd


Faceting on tree structure in SOLR4

2013-02-11 Thread Alok Bhandari

Hello,

I have a tree data structure like 

 t1
   |-t2
   |-t3
 t4
   |-t5

and so on.

And there is no limit on tree depth as well as number of children to each
node.

What I want is that when I do the faceting for the parent node t1 it should also
include the counts of all of its children (t2 and t3 in this case). So let's say
the count corresponding to t1 is 5, and for t2 and t3 it is also 5 each; then the
total displayed against t1 should be 15.

Please let me know how I can achieve this. I am using SOLR4 and the tree
structure is dynamic and subject to addition, deletion and editing.

Thanks




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Faceting-on-tree-structure-in-SOLR4-tp4039650.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Faceting on tree structure in SOLR4

2013-02-11 Thread Mikhail Khludnev
Hello,

is http://wiki.apache.org/solr/HierarchicalFaceting what you are talking
about?
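
(One common approach from that page, sketched below with pysolr: index every
ancestor path of a node on each document, so the count for a parent token
already includes all of its descendants. The field name category_path and the
depth-prefix scheme are assumptions, not something from the original post.)

from pysolr import Solr

solr = Solr('http://localhost:8983/solr/collection1')  # hypothetical URL

# A document attached to t2 carries the tokens "0/t1" and "1/t1/t2", so
# faceting on the "0/" prefix rolls child counts up under t1.
results = solr.search('*:*', **{
    'rows': 0,
    'facet': 'true',
    'facet.field': 'category_path',
    'facet.prefix': '0/',
})
print results.facets['facet_fields']['category_path']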



On Mon, Feb 11, 2013 at 12:42 PM, Alok Bhandari 
alokomprakashbhand...@gmail.com wrote:


 Hello,

 I have a tree data structure like

  t1
    |-t2
    |-t3
  t4
    |-t5

 and so on.

 And there is no limit on tree depth as well as number of children to each
 node.

 What I want is that when I do the faceting for the parent node t1 it should
 also include the counts of all of its children (t2 and t3 in this case). So
 let's say the count corresponding to t1 is 5, and for t2 and t3 it is also 5
 each; then the total displayed against t1 should be 15.

 Please let me know how I can achieve this. I am using SOLR4 and the tree
 structure is dynamic and subject to addition, deletion and editing.

 Thanks




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Faceting-on-tree-structure-in-SOLR4-tp4039650.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Crawl Anywhere -

2013-02-11 Thread Jan Høydahl
Have a look at Nutch2; it is decoupled from HDFS and can store docs in e.g. 
HBase or another NoSQL store.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

11. feb. 2013 kl. 06:16 skrev SivaKarthik sivakarthik.kpa...@gmail.com:

 Dear Erick,
   Thanks for your reply.
   Yes, Nutch can meet my requirement,
  but the problem is that I want to store the crawled documents in HTML or XML
  format instead of the MapReduce format.
  Not sure whether Nutch plugins are available to convert into XML files.
  Please share if you have any ideas.
 
  Thank you
 
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607831p4039619.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Crawl Anywhere -

2013-02-11 Thread O. Klein
Yes, you can run CA on different machines.

In Manage you have to set target and engine for this to work.

I've never done this, so you have to contact the developer for more details.



SivaKarthik wrote
 Hi All,
  in our project we need to download millions of pages...
  so is there any support for doing the crawling in a distributed environment
  using the Crawl-Anywhere apps?
    or what could be the alternatives?
 
  Thanks in advance.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607831p4039674.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: [Solrj 4.0] How use JOIN

2013-02-11 Thread Roman Slavik
Hi,

thanks for the advice. But I need to use parent_condition and child_condition at
the same time.
Parent condition is: (name:Thomas AND age:40)
Child condition: (name:John AND age:17) 
join from=parent to=id

So something like:
(name:Thomas AND age:40) AND {!join from=parent to=id}(name:John AND age:17)

This all within Solrj 4.0 (or 4.1)

I think there is a solution using a nested query like this:
(name:Thomas AND age:40) AND _query_:"{!join from=parent to=id}(name:John
AND age:17)"

but I don't like this syntax, so I am looking for something else.

Any idea?
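
(For what it's worth, the nested-query form does work if the {!join} clause is
quoted so that the magic _query_ field value survives parsing; a small sketch
in pysolr style, URL hypothetical:)

from pysolr import Solr

solr = Solr('http://localhost:8983/solr/collection1')  # hypothetical URL

# The join clause must be quoted, otherwise the default parser trips over
# the embedded spaces and colons.
q = ('(name:Thomas AND age:40) AND '
     '_query_:"{!join from=parent to=id}(name:John AND age:17)"')
results = solr.search(q)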



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solrj-4-0-How-use-JOIN-tp4024262p4039675.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Help! How to remove shards from SolrCloud. They keep come back...

2013-02-11 Thread Rene Nederhand
Hi Mark,

Thanks for your response.

I did delete the data directory, but that didn't help. However, upon
checking my zookeeper installation I found a clusterstate.json item that
contained references to core data directories that didn't exist anymore. I
wiped this item and it seems to work fine now.
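
(For anyone hitting the same thing: the stale clusterstate.json can also be
removed programmatically. A minimal sketch using the kazoo ZooKeeper client;
kazoo and the host list are my assumptions, any ZooKeeper client or the zkcli
script would do:)

from kazoo.client import KazooClient

# Drop the stale cluster state so Solr rebuilds it from live nodes.
zk = KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181')  # hypothetical ensemble
zk.start()
if zk.exists('/clusterstate.json'):
    zk.delete('/clusterstate.json')
zk.stop()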

Thanks for your help!

Rene



On Sat, Feb 9, 2013 at 8:47 PM, Mark Miller markrmil...@gmail.com wrote:

 Did you clear the data dir for all 3 zk's? If not, you will find ghosts
 coming back to haunt you :)

 It's often easier to clear zk programmatically - for example it's one call
 from the cmd line zkcli script.

 http://wiki.apache.org/solr/SolrCloud#Command_Line_Util

 - Mark

 On Feb 9, 2013, at 1:19 PM, Rene Nederhand r...@nederhand.net wrote:

  Hi,
 
  I am experimenting with SolrCloud (v. 4.1) and everything seems to work
  fine.
 
  Now I would like to restart with a clean environment, but I cannot get
 rid
  of all the collections, shards and cores I have created.
 
  What I did:
  - Closed down all Zookeeper servers (I have an ensemble of 3) and Solr
  servers (also 3)
  - I have deleted the collections and configs from zookeeper;
  - I deleted the data directory (version-2) from zookeeper
  - I deleted my solr home (with all data files)
  - Edited Solr.xml so there is no reference to instances anymore.
 
  When I restart, I get an error about no existing SolrCores, but after
  adding a new config, collection and one SolrCore I see a graph of all
  previously existing shards/cores.
 
  How can I go back to a clean state? How to remove these
 collections/shards?
 
  Thanks for helping.
 
  Rene




Re: Maximum Number of Records In Index

2013-02-11 Thread Mikhail Khludnev
Otis,
Do you run a 4bn-doc SolrCloud or ElasticSearch, or are you aware of somebody who does?
On 10.02.2013 4:54, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

 Exceeding 2B is no problem. But it won't happen in a single Lucene index
 any time soon,  so...

 Otis
 Solr & ElasticSearch Support
 http://sematext.com/
 On Feb 7, 2013 10:08 AM, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:

  Actually, I have a dream to exceed those two billion. It seems possible
  to move to VInt in the file format and change int docnums to longs in the
  Lucene API. Does anyone know whether it's possible?
  And this question is not so esoteric if we are talking about SolrCloud,
  which can hold more than 2bn docs in a few smaller shards. Any experience?
 
 
  On Thu, Feb 7, 2013 at 5:46 PM, Rafał Kuć r@solr.pl wrote:
 
   Hello!
  
    Right, my bad - ids are still using int32. However, that still
    gives us 2,147,483,648 possible identifiers per single index,
    which is nowhere near the 13.5 million mentioned in the first mail.
  
   --
   Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
  ElasticSearch
  
Rafal,
  
 What about docnums, aren't they limited by int32?
 On 07.02.2013 15:33, Rafał Kuć r@solr.pl wrote:
  
Hello!
   
 Practically there is no limit on how many documents can be stored in a
 single index. In your case, as you are using Solr from 2011, there is
 a limitation regarding the number of unique terms per Lucene segment
 (http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/fileformats.html#Limitations).
 However I don't think you've hit that. Solr by itself doesn't remove
 documents unless told to do so.

 It's hard to guess what the reason can be and, as you said, you see
 updates coming to your handler. Maybe new documents have the same
 identifiers as the ones that are already indexed? As I said, this
 is only a guess and we would need to have more information. Are there
 any exceptions in the logs? Do you run a delete command? Are your
 index files changed? How do you run commit?
   
--
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
   ElasticSearch
   
 I have searched this forum but not yet found a definitive answer; I think
 the answer is "There is No Limit, depends on server specification". But
 nevertheless I will say what I have seen and then ask the questions.

 From scratch (November 2011) I have set up our SOLR which contains data
 from various sources; since March 2012, the number of indexed records
 (unique ID's) reached 13.5 million, which was to be expected. However for
 the last 8 months the number of records in the index has not gone above
 13.5 million, yet looking at the request handler outputs I can safely say
 at least anywhere from 50 thousand to 100 thousand records are being
 indexed daily. So I am assuming that earlier records are being removed,
 and I do not want that.

 Question: If there is a limit to the number of records the index can
 store, where do I find this and change it?
 Question: If there is no limit, does anyone have any idea why for the
 last months the number has not gone beyond 13.5 million? I can safely say
 that at least 90% are new records.

 thanks

 macroman
   
   
   
 --
 View this message in context:

   
  
 
 http://lucene.472066.n3.nabble.com/Maximum-Number-of-Records-In-Index-tp4038961.html
 Sent from the Solr - User mailing list archive at Nabble.com.
   
   
  
  
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
  http://www.griddynamics.com
   mkhlud...@griddynamics.com
 



Problems using distributed More Like This

2013-02-11 Thread Shawn Heisey
SOLR-788 added Distributed MLT to Solr 4.1, but I have not been able to 
get it to work.  I don't know if it's user error, which of course is 
very possible.  If it is user error, I'd like to know what I'm doing 
wrong so I can fix it.  I am actually using a recent checkout of Solr 
4.2, not the released 4.1.


I put some extensive information on SOLR-4414, an issue filed by another 
user having a similar problem.  If you look for the last comment from me 
on Feb 7 that has a code block, you'll see Solr's response when I use 
MoreLikeThisComponent.


https://issues.apache.org/jira/browse/SOLR-4414

Only the last seven of the query parameters were included on the URL - 
the rest of them are in solrconfig.xml.  Due to echoParams=all, the only 
part of the request handler definition that you can't see in the 
response is the fact that last-components contains spellcheck.


I redacted the company domain name from the shards and the one document 
matching the query from the result tag, but there are no other changes 
to the response.


If I send an identical query to the shard core that actually contains 
the document rather than the core with the shards parameter, I get MLT 
results.
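
(For anyone trying to reproduce: the failing request is essentially a plain
query with the MLT component switched on plus a shards list; a stripped-down
sketch of the parameters involved, hosts and field names hypothetical:)

import requests

params = {
    'q': 'tag_id:somedoc',       # hypothetical document query
    'mlt': 'true',               # enable MoreLikeThisComponent
    'mlt.fl': 'text',            # field(s) to mine for similar terms
    'shards': 'host1:8983/solr/s0,host2:8983/solr/s1',
    'wt': 'json',
}
r = requests.get('http://host1:8983/solr/broker/select', params=params)
print r.json().get('moreLikeThis')  # empty in the failing case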


I have heard recently that Solr 4.x has hardcoded the unique field name 
for SolrCloud sharding as id ... but my uniqueKey field name is tag_id. 
 Could this be my problem?  It would be a monumental development effort 
to change that field name in our application.  I am not using SolrCloud 
for this index.


Thanks,
Shawn


RE: Solr query parser, needs to call setAutoGeneratePhraseQueries(true)

2013-02-11 Thread Zhang, Lisheng
Thanks very much, it worked perfectly !!

Best regards, Lisheng

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Friday, February 08, 2013 1:04 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr query parser, needs to call
setAutoGeneratePhraseQueries(true)


(Sorry for my split message)...

See the text_en_splitting field type for an example:

<fieldType name="text_en_splitting" class="solr.TextField"
    positionIncrementGap="100" autoGeneratePhraseQueries="true">
...

-- Jack Krupansky

-Original Message- 
From: Zhang, Lisheng
Sent: Friday, February 08, 2013 3:20 PM
To: solr-user@lucene.apache.org
Subject: Solr query parser, needs to call setAutoGeneratePhraseQueries(true)


Hi,

In our application we need to call method

setAutoGeneratePhraseQueries(true)

on the Lucene QueryParser; this is the way it used to work in earlier versions,
and it seems to me that is the more natural way.

But in current solr 3.6.1, the only way to do so is to set

<luceneMatchVersion>LUCENE_30</luceneMatchVersion>

in solrconfig.xml (if I read the source code correctly), but I do not want to
do so because this will change the whole behavior of Lucene, and I only
want to change this query parser behavior, not other Lucene features.

Please guide me if there is a better way other than changing the Solr source
code.

Thanks very much for helps, Lisheng 



Re: Can Solr analyze content and find dates and places

2013-02-11 Thread jazz
Hi Sujit and others who answered my question,

I have been working on the UIMA path which seems great with the available 
Eclipse tooling and this:

http://sujitpal.blogspot.nl/2011/03/smart-query-parsing-with-uima.html

Now I worked through the UIMA tutorial of the RoomNumberAnnotator: 
http://uima.apache.org/doc-uima-annotator.html
And I am able to test it using the UIMA CAS Visual Debugger. So far so good.

But, now I want to use the new RoomNumberAnnotator with Solr, but it cannot 
find the xml file and the Java class (they are in the correct lib directories, 
because the WhitespaceTokenizer works fine).

<updateRequestProcessorChain name="uima">
  <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
    <lst name="uimaConfig">
      <lst name="runtimeParameters">
      </lst>
      <str name="analysisEngine">/RoomNumberAnnotator.xml</str>
      <bool name="ignoreErrors">false</bool>
      <lst name="analyzeFields">
        <bool name="merge">false</bool>
        <arr name="fields">
          <str>content</str>
        </arr>
      </lst>
      <lst name="fieldMappings">
        <lst name="type">
          <str name="name">org.apache.uima.tutorial.RoomNumber</str>
          <lst name="mapping">
            <str name="feature">building</str>
            <str name="field">UIMAname</str>
          </lst>
        </lst>
      </lst>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
On the Wiki (http://wiki.apache.org/solr/SolrUIMA) this is mentioned but it 
fails:
Deploy new jars inside one of the lib directories

Run 'ant clean dist' (or 'mvn clean package') from the solr/contrib/uima path.

Is it necessary to deploy the new jar (RoomAnnotator.jar)? If yes, which branch 
can I check out? This is the stable release I am running:

Solr 4.1.0 1434440 - sarowe - 2013-01-16 17:21:36

Regards, Bart


On 8 Feb 2013, at 22:11, SUJIT PAL wrote:

 Hi Bart,
 
 I did some work with UIMA but this was to annotate the data before it goes to 
 Lucene/Solr, i.e. not built as an UpdateRequestProcessor. I just looked through 
 the SolrUima wiki page [http://wiki.apache.org/solr/SolrUIMA] and I believe 
 you will have to set up your own aggregate analysis chain in place of the one 
 currently configured.
 
 Writing UIMA annotators is very simple (there is a tutorial here:  
 [http://uima.apache.org/downloads/releaseDocs/2.1.0-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html]).
  You provide the XML description for the annotation and let UIMA generate the 
 annotation bean. You write Java code for the annotator and also the annotator 
 XML descriptor. UIMA uses the annotator XML descriptor to instantiate and run 
 your annotator. Overall, it sounds really complicated but it's actually quite 
 simple.
 
 The tutorial has quite a few examples that you will find useful, but in case 
 you need more, I have some on this github repository:
 [https://github.com/sujitpal/tgni/tree/master/src/main/java/com/mycompany/tgni/analysis/uima]
 
 The dictionary and pattern annotators may be similar to what you are looking 
 for (date and city annotators).
 
 Best regards,
 Sujit
 
 On Feb 8, 2013, at 8:50 AM, Bart Rijpers wrote:
 
 Hi Alex,
 
 Indeed that is exactly what I am trying to achieve using worldcities. Date 
 will be simple: 16-Jan becomes 16-Jan-2013 in a new dynamic field. But how 
 do I integrate the Java library as UIMA? The documentation about changing 
 schema.xml and solr.xml is not very detailed. 
 
 Regards, Bart
 
 On 8 Feb 2013, at 16:57, Alexandre Rafalovitch arafa...@gmail.com wrote:
 
 Hi Bart,
 
 I haven't done any UIMA work (I used other stuff for my NLP phase), so not
 sure I can help much further. But in general, you are venturing into pure
 research territory here.
 
 Even for dates, what do you actually mean? Just fixed expression? Relative
 dates (e.g. last tuesday?). What about times (7pm?).
 
 Same with cities. If you want it offline, you need the gazetteer and
 disambiguation modules. Gazetteer for cities (worldwide) is huge and has a
 lot of duplicate names (Paris, Ontario is apparently a short drive from
 London, Ontario eh?). Something like
 http://www.maxmind.com/en/worldcities? And disambiguation usually
 requires training corpus that is similar to
 what your text will look like.
 
 Online services like OpenCalais are backed by gigantic databases and some
 serious corpus-training Machine Language disambiguation algorithms.
 
 So, no plug-and-play solution here. If you really need to get this done, I
 would recommend narrowing down the specification of exactly what you will
 settle for and looking for software that can do it. Once you have that,
 integration with Solr is your next - and smaller - concern.
 
 Regards,
 Alex.
 
 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature 

Re: Do I have to reindex when upgrading from solr 4.0 to 4.1?

2013-02-11 Thread Michael Della Bitta
Arkadi,

That's the answer I received at Solr Bootcamp, yes.

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Mon, Feb 11, 2013 at 2:23 AM, Arkadi Colson ark...@smartbit.be wrote:
 Does it mean that if you redo indexing after the upgrade to 4.1, shard
 splitting will work in 4.2?

 Met vriendelijke groeten

 Arkadi Colson

 Smartbit bvba • Hoogstraat 13 • 3670 Meeuwen
 T +32 11 64 08 80 • F +32 11 64 08 81

 On 02/10/2013 05:21 PM, Michael Della Bitta wrote:

 No. You can just update Solr in place. But...

 If you're using Solr Cloud, your documents won't be hashed in a way
 that lets you do shard splitting in 4.2. That seemed to be the
 consensus during Solr Boot Camp.

 Michael Della Bitta

 
 Appinions
 18 East 41st Street, 2nd Floor
 New York, NY 10017-6271

 www.appinions.com

 Where Influence Isn’t a Game


 On Sun, Feb 10, 2013 at 10:46 AM, adfel70 adfe...@gmail.com wrote:

 Do I have to recreate the collections/cores?
 Do I have to reindex?

 thanks.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Do-I-have-to-reindex-when-upgrading-from-solr-4-0-to-4-1-tp4039560.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Maximum Number of Records In Index

2013-02-11 Thread Otis Gospodnetic
We don't run one ourselves at Sematext, but know of people who do have
large ES clusters, one with > 10B docs.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Mon, Feb 11, 2013 at 8:41 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Otis,
 Do you run a 4bn-doc SolrCloud or ElasticSearch, or are you aware of somebody who does?
 On 10.02.2013 4:54, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

  Exceeding 2B is no problem. But it won't happen in a single Lucene index
  any time soon,  so...
 
  Otis
  Solr & ElasticSearch Support
  http://sematext.com/
  On Feb 7, 2013 10:08 AM, Mikhail Khludnev mkhlud...@griddynamics.com
  wrote:
 
   Actually, I have a dream to exceed those two billion. It seems possible
   to move to VInt in the file format and change int docnums to longs in the
   Lucene API. Does anyone know whether it's possible?
   And this question is not so esoteric if we are talking about SolrCloud,
   which can hold more than 2bn docs in a few smaller shards. Any experience?
  
  
   On Thu, Feb 7, 2013 at 5:46 PM, Rafał Kuć r@solr.pl wrote:
  
Hello!
   
 Right, my bad - ids are still using int32. However, that still
 gives us 2,147,483,648 possible identifiers per single index,
 which is nowhere near the 13.5 million mentioned in the first mail.
   
--
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
   ElasticSearch
   
 Rafal,
   
  What about docnums, aren't they limited by int32?
  On 07.02.2013 15:33, Rafał Kuć r@solr.pl wrote:
   
 Hello!

  Practically there is no limit on how many documents can be stored in a
  single index. In your case, as you are using Solr from 2011, there is
  a limitation regarding the number of unique terms per Lucene segment
  (http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/fileformats.html#Limitations).
  However I don't think you've hit that. Solr by itself doesn't remove
  documents unless told to do so.

  It's hard to guess what the reason can be and, as you said, you see
  updates coming to your handler. Maybe new documents have the same
  identifiers as the ones that are already indexed? As I said, this
  is only a guess and we would need to have more information. Are there
  any exceptions in the logs? Do you run a delete command? Are your
  index files changed? How do you run commit?

 --
 Regards,
  Rafał Kuć
  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
ElasticSearch

  I have searched this forum but not yet found a definitive answer; I think
  the answer is "There is No Limit, depends on server specification". But
  nevertheless I will say what I have seen and then ask the questions.

  From scratch (November 2011) I have set up our SOLR which contains data
  from various sources; since March 2012, the number of indexed records
  (unique ID's) reached 13.5 million, which was to be expected. However for
  the last 8 months the number of records in the index has not gone above
  13.5 million, yet looking at the request handler outputs I can safely say
  at least anywhere from 50 thousand to 100 thousand records are being
  indexed daily. So I am assuming that earlier records are being removed,
  and I do not want that.

  Question: If there is a limit to the number of records the index can
  store, where do I find this and change it?
  Question: If there is no limit, does anyone have any idea why for the
  last months the number has not gone beyond 13.5 million? I can safely say
  that at least 90% are new records.

  thanks

  macroman



  --
  View this message in context:
 

   
  
 
 http://lucene.472066.n3.nabble.com/Maximum-Number-of-Records-In-Index-tp4038961.html
  Sent from the Solr - User mailing list archive at Nabble.com.


   
   
  
  
   --
   Sincerely yours
   Mikhail Khludnev
   Principal Engineer,
   Grid Dynamics
  
   http://www.griddynamics.com
mkhlud...@griddynamics.com
  
 



SolrCloud upgrade from 4.0 to 4.1

2013-02-11 Thread Shawn Heisey

I'm trying to help someone in #solr on IRC.

Early in the 4.1 release vote process over on the dev@l.a.o mailing 
list, Mark Miller mentioned that upgrading SolrCloud from 4.0 to 4.1 may 
not be as straightforward as the usual Solr upgrade process.  Providing 
some detailed instructions was mentioned, but I cannot find any such thing.


Is there a documented procedure somewhere, or is it as simple as 
dropping in the new .war, massaging the config, and restarting?


Thanks,
Shawn


Re: SolrCloud upgrade from 4.0 to 4.1

2013-02-11 Thread Mark Miller
Yonik looked into it and said the process was actually fine in his testing. 
After the release, we did find one issue - if you don't explicitly set the 
host, the host 'guess' feature has changed and may guess a different address.

- Mark

On Feb 11, 2013, at 1:16 PM, Shawn Heisey s...@elyograg.org wrote:

 I'm trying to help someone in #solr on IRC.
 
 Early in the 4.1 release vote process over on the dev@l.a.o mailing list, 
 Mark Miller mentioned that upgrading SolrCloud from 4.0 to 4.1 may not be as 
 straightforward as the usual Solr upgrade process.  Providing some detailed 
 instructions was mentioned, but I cannot find any such thing.
 
 Is there a documented procedure somewhere, or is it as simple as dropping in 
 the new .war, massaging the config, and restarting?
 
 Thanks,
 Shawn



Re: SolrCloud upgrade from 4.0 to 4.1

2013-02-11 Thread o.mares
Hey, does an upgrade guide exist? Or do you simply copy all the files over?
If so, how do you verify that everything is in place?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-upgrade-from-4-0-to-4-1-tp4039757p4039775.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can Solr analyze content and find dates and places

2013-02-11 Thread SUJIT PAL
Hi Bart,

Like I said, I didn't actually hook my UIMA stuff into Solr, content and 
queries are annotated before they reach Solr. What you describe sounds like a 
classpath problem (but of course you already knew that :-)). Since I haven't 
actually done what you are trying to do, here are some suggestions, they may or 
may not work...

1) package up the XML files into your custom JAR at the top level, that way you 
don't need to specify it as /RoomNumberAnnotator.xml.
2) if you are using solr4, then you should drop your custom JAR into 
$SOLR_HOME/collection1/lib, not $SOLR_HOME/lib.

-sujit

On Feb 11, 2013, at 9:40 AM, jazz wrote:

 Hi Sujit and others who answered my question,
 
 I have been working on the UIMA path which seems great with the available 
 Eclipse tooling and this:
 
 http://sujitpal.blogspot.nl/2011/03/smart-query-parsing-with-uima.html
 
 Now I worked through the UIMA tutorial of the RoomNumberAnnotator: 
 http://uima.apache.org/doc-uima-annotator.html
 And I am able to test it using the UIMA CAS Visual Debugger. So far so good.
 
 But, now I want to use the new RoomNumberAnnotator with Solr, but it cannot 
 find the xml file and the Java class (they are in the correct lib 
 directories, because the WhitespaceTokenizer works fine).
 
 <updateRequestProcessorChain name="uima">
   <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
     <lst name="uimaConfig">
       <lst name="runtimeParameters">
       </lst>
       <str name="analysisEngine">/RoomNumberAnnotator.xml</str>
       <bool name="ignoreErrors">false</bool>
       <lst name="analyzeFields">
         <bool name="merge">false</bool>
         <arr name="fields">
           <str>content</str>
         </arr>
       </lst>
       <lst name="fieldMappings">
         <lst name="type">
           <str name="name">org.apache.uima.tutorial.RoomNumber</str>
           <lst name="mapping">
             <str name="feature">building</str>
             <str name="field">UIMAname</str>
           </lst>
         </lst>
       </lst>
     </lst>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>
 
 On the Wiki (http://wiki.apache.org/solr/SolrUIMA) this is mentioned but it 
 fails:
 Deploy new jars inside one of the lib directories
 
 Run 'ant clean dist' (or 'mvn clean package') from the solr/contrib/uima path.
 
  Is it necessary to deploy the new jar (RoomAnnotator.jar)? If yes, which branch 
  can I check out? This is the stable release I am running:
 
 Solr 4.1.0 1434440 - sarowe - 2013-01-16 17:21:36
 
 Regards, Bart
 
 
 On 8 Feb 2013, at 22:11, SUJIT PAL wrote:
 
 Hi Bart,
 
 I did some work with UIMA but this was to annotate the data before it goes 
  to Lucene/Solr, i.e. not built as an UpdateRequestProcessor. I just looked 
 through the SolrUima wiki page [http://wiki.apache.org/solr/SolrUIMA] and I 
 believe you will have to set up your own aggregate analysis chain in place 
 of the one currently configured.
 
 Writing UIMA annotators is very simple (there is a tutorial here:  
 [http://uima.apache.org/downloads/releaseDocs/2.1.0-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html]).
  You provide the XML description for the annotation and let UIMA generate 
 the annotation bean. You write Java code for the annotator and also the 
 annotator XML descriptor. UIMA uses the annotator XML descriptor to 
  instantiate and run your annotator. Overall, it sounds really complicated but 
  it's actually quite simple.
 
 The tutorial has quite a few examples that you will find useful, but in case 
 you need more, I have some on this github repository:
 [https://github.com/sujitpal/tgni/tree/master/src/main/java/com/mycompany/tgni/analysis/uima]
 
 The dictionary and pattern annotators may be similar to what you are looking 
 for (date and city annotators).
 
 Best regards,
 Sujit
 
 On Feb 8, 2013, at 8:50 AM, Bart Rijpers wrote:
 
 Hi Alex,
 
  Indeed that is exactly what I am trying to achieve using worldcities. Date 
 will be simple: 16-Jan becomes 16-Jan-2013 in a new dynamic field. But how 
 do I integrate the Java library as UIMA? The documentation about changing 
 schema.xml and solr.xml is not very detailed. 
 
 Regards, Bart
 
 On 8 Feb 2013, at 16:57, Alexandre Rafalovitch arafa...@gmail.com wrote:
 
 Hi Bart,
 
 I haven't done any UIMA work (I used other stuff for my NLP phase), so not
 sure I can help much further. But in general, you are venturing into pure
 research territory here.
 
 Even for dates, what do you actually mean? Just fixed expression? Relative
 dates (e.g. last tuesday?). What about times (7pm?).
 
 Same with cities. If you want it offline, you need the gazetteer and
 disambiguation modules. Gazetteer for cities (worldwide) is huge and has a
 lot of duplicate names (Paris, Ontario is apparently a short drive from
 London, Ontario eh?). Something like
 http://www.maxmind.com/en/worldcities? And disambiguation 

How to limit queries to specific IDs

2013-02-11 Thread Isaac Hebsh
Hi everyone.

I have queries that should be bounded to a set of IDs (the uniqueKey field
of my schema).
My client front-end sends two Solr requests:
In the first one, it wants to get the top X IDs. This result should return
very fast. No time to waste on highlighting; this is a very standard
query.
In the second one, it wants to get the highlighting info (corresponding to
the queried fields and terms, of course) on those documents (maybe as some
sequential requests, on small chunks of the full list).

These two requests are implemented as almost identical calls, to different
requestHandlers.

I thought to append a filter query to the second request, id:(1 2 3 4 5).
Is this idea good for Solr?
If it does, my problem is that I don't want these filters to flood my
filterCache... Is there any way (even if it involves some coding...) to add
a filter query which won't be added to the filterCache (at least, not in
place of standard filters)?


Notes:
1. It can't be assured that the first query will remain in the
queryResultsCache...
2. consider index size of 50M documents...
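
(A sketch of the kind of bounded request described above. If it helps: the
cache local param, available since Solr 3.4, marks a filter as non-cached so
ad-hoc ID lists don't churn the filterCache; URL and field names here are
assumptions:)

from pysolr import Solr

solr = Solr('http://localhost:8983/solr/collection1')  # hypothetical URL

# {!cache=false} keeps this one-off filter out of the filterCache,
# leaving the cache to reusable filters.
results = solr.search(
    q='original query terms',
    fq='{!cache=false}id:(1 2 3 4 5)',
    hl='true',
    **{'hl.fl': 'content'}  # hypothetical highlight field
)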


Re: SolrCloud new zookeper node on different ip/ replicate between two clasters

2013-02-11 Thread mizayah
This is a good solution.

One thing here is really annoying: the double indexing.
Is there a way to replicate to another dc? It seems SolrCloud can't use its
earlier replication.

Would be nice if I could replicate somehow between two SolrClouds.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-new-zookeper-node-on-different-ip-replicate-between-two-clasters-tp4039101p4039791.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud new zookeper node on different ip/ replicate between two clasters

2013-02-11 Thread Mark Miller
The replication handler can be set up to replicate to another dc. You can also 
put nodes in both dcs. Both have pluses and minuses vs just sending the same data 
to both dc's with separate clusters. Where it immediately gets difficult is 
that you need a quorum of zk nodes to survive if you want to continue handling 
updates. I have not yet found the multi dc zk solution. I know other systems 
use something like having a tie breaker node in Europe or something, but I 
don't know that zk yet supports something like this.

In most situations, i think the current best solution is to send data to both 
dcs.

- Mark

On Feb 11, 2013, at 2:43 PM, mizayah miza...@gmail.com wrote:

 This is a good solution.
 
 One thing here is really annoying: the double indexing.
 Is there a way to replicate to another dc? It seems SolrCloud can't use its
 earlier replication.
 
 Would be nice if I could replicate somehow between two SolrClouds.
 
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/SolrCloud-new-zookeper-node-on-different-ip-replicate-between-two-clasters-tp4039101p4039791.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Problems using distributed More Like This

2013-02-11 Thread Mark Miller
Eventually, I'll get around to trying some more real world testing. Up till 
now, no dev seems to have a real interest in this. I have 0 need for it 
currently, so it's fairly low on my itch scale, but it's on my list anyhow.

- Mark

On Feb 11, 2013, at 12:26 PM, Shawn Heisey s...@elyograg.org wrote:

 SOLR-788 added Distributed MLT to Solr 4.1, but I have not been able to get 
 it to work.  I don't know if it's user error, which of course is very 
 possible.  If it is user error, I'd like to know what I'm doing wrong so I 
 can fix it.  I am actually using a recent checkout of Solr 4.2, not the 
 released 4.1.
 
 I put some extensive information on SOLR-4414, an issue filed by another user 
 having a similar problem.  If you look for the last comment from me on Feb 7 
 that has a code block, you'll see Solr's response when I use 
 MoreLikeThisComponent.
 
 https://issues.apache.org/jira/browse/SOLR-4414
 
 Only the last seven of the query parameters were included on the URL - the 
 rest of them are in solrconfig.xml.  Due to echoParams=all, the only part of 
 the request handler definition that you can't see in the response is the fact 
 that last-components contains spellcheck.
 
 I redacted the company domain name from the shards and the one document 
 matching the query from the result tag, but there are no other changes to 
 the response.
 
 If I send an identical query to the shard core that actually contains the 
 document rather than the core with the shards parameter, I get MLT results.
 
 I have heard recently that Solr 4.x has hardcoded the unique field name for 
 SolrCloud sharding as id ... but my uniqueKey field name is tag_id.  Could 
 this be my problem?  It would be a monumental development effort to change 
 that field name in our application.  I am not using SolrCloud for this index.
 
 Thanks,
 Shawn



Re: Can Solr analyze content and find dates and places

2013-02-11 Thread jazz
Hi Sujit,

Thanks for your help! I moved the RoomNumberAnnotator.xml to the top level of 
the jar and used the same solrconfig.xml (with the /). Now it works perfect.

Best regards, Bart


On 11 Feb 2013, at 20:13, SUJIT PAL wrote:

 Hi Bart,
 
 Like I said, I didn't actually hook my UIMA stuff into Solr, content and 
 queries are annotated before they reach Solr. What you describe sounds like a 
 classpath problem (but of course you already knew that :-)). Since I haven't 
 actually done what you are trying to do, here are some suggestions, they may 
 or may not work...
 
 1) package up the XML files into your custom JAR at the top level, that way 
 you don't need to specify it as /RoomNumberAnnotator.xml.
 2) if you are using solr4, then you should drop your custom JAR into 
 $SOLR_HOME/collection1/lib, not $SOLR_HOME/lib.
 
 -sujit
 
 On Feb 11, 2013, at 9:40 AM, jazz wrote:
 
 Hi Sujit and others who answered my question,
 
 I have been working on the UIMA path which seems great with the available 
 Eclipse tooling and this:
 
 http://sujitpal.blogspot.nl/2011/03/smart-query-parsing-with-uima.html
 
 Now I worked through the UIMA tutorial of the RoomNumberAnnotator: 
 http://uima.apache.org/doc-uima-annotator.html
  And I am able to test it using the UIMA CAS Visual Debugger. So far so 
 good.
 
 But, now I want to use the new RoomNumberAnnotator with Solr, but it cannot 
 find the xml file and the Java class (they are in the correct lib 
 directories, because the WhitespaceTokenizer works fine).
 
  <updateRequestProcessorChain name="uima">
    <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
      <lst name="uimaConfig">
        <lst name="runtimeParameters">
        </lst>
        <str name="analysisEngine">/RoomNumberAnnotator.xml</str>
        <bool name="ignoreErrors">false</bool>
        <lst name="analyzeFields">
          <bool name="merge">false</bool>
          <arr name="fields">
            <str>content</str>
          </arr>
        </lst>
        <lst name="fieldMappings">
          <lst name="type">
            <str name="name">org.apache.uima.tutorial.RoomNumber</str>
            <lst name="mapping">
              <str name="feature">building</str>
              <str name="field">UIMAname</str>
            </lst>
          </lst>
        </lst>
      </lst>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
 
 On the Wiki (http://wiki.apache.org/solr/SolrUIMA) this is mentioned but it 
 fails:
 Deploy new jars inside one of the lib directories
 
 Run 'ant clean dist' (or 'mvn clean package') from the solr/contrib/uima 
 path.
 
  Is it necessary to deploy the new jar (RoomAnnotator.jar)? If yes, which branch 
  can I check out? This is the stable release I am running:
 
 Solr 4.1.0 1434440 - sarowe - 2013-01-16 17:21:36
 
 Regards, Bart
 
 
 On 8 Feb 2013, at 22:11, SUJIT PAL wrote:
 
 Hi Bart,
 
 I did some work with UIMA but this was to annotate the data before it goes 
  to Lucene/Solr, i.e. not built as an UpdateRequestProcessor. I just looked 
 through the SolrUima wiki page [http://wiki.apache.org/solr/SolrUIMA] and I 
 believe you will have to set up your own aggregate analysis chain in place 
 of the one currently configured.
 
 Writing UIMA annotators is very simple (there is a tutorial here:  
 [http://uima.apache.org/downloads/releaseDocs/2.1.0-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html]).
  You provide the XML description for the annotation and let UIMA generate 
 the annotation bean. You write Java code for the annotator and also the 
 annotator XML descriptor. UIMA uses the annotator XML descriptor to 
  instantiate and run your annotator. Overall, it sounds really complicated but 
  it's actually quite simple.
 
 The tutorial has quite a few examples that you will find useful, but in 
 case you need more, I have some on this github repository:
 [https://github.com/sujitpal/tgni/tree/master/src/main/java/com/mycompany/tgni/analysis/uima]
 
 The dictionary and pattern annotators may be similar to what you are 
 looking for (date and city annotators).
 
 Best regards,
 Sujit
 
 On Feb 8, 2013, at 8:50 AM, Bart Rijpers wrote:
 
 Hi Alex,
 
  Indeed that is exactly what I am trying to achieve using worldcities. Date 
 will be simple: 16-Jan becomes 16-Jan-2013 in a new dynamic field. But how 
 do I integrate the Java library as UIMA? The documentation about changing 
 schema.xml and solr.xml is not very detailed. 
 
 Regards, Bart
 
 On 8 Feb 2013, at 16:57, Alexandre Rafalovitch arafa...@gmail.com wrote:
 
 Hi Bart,
 
 I haven't done any UIMA work (I used other stuff for my NLP phase), so not
 sure I can help much further. But in general, you are venturing into pure
 research territory here.
 
 Even for dates, what do you actually mean? Just fixed expression? Relative
 dates (e.g. last tuesday?). What about times (7pm?).
 
 Same with cities. If you want it offline, you need the gazetteer and
 disambiguation 

Re: SolrCloud new zookeper node on different ip/ replicate between two clasters

2013-02-11 Thread mizayah
Thx Mark

"The replication handler can be setup to replicate to another dc."
Erm, I don't get it. Can I set up replication between two SolrClouds this way,
or just SolrCloud-to-Solr?

"You can also put nodes in both dcs"
Indexing will slow down a lot if I understand the SolrCloud replica and
leader model correctly (replication is real-time).
Worse is when by accident ZooKeeper elects the leader in the other dc. ZooKeeper
could use observers here, but it would only make things more complicated too.

"I have not yet found the multi dc zk solution."
Only something called observers helps a bit in my case. ZooKeeper observers
don't vote; they are just read-only points. It would be good here, but
after one dc goes down I need a fully working ZooKeeper, and observers don't
support changing to followers.
The ZooKeeper config is of course a big problem, but the configuration doesn't
change much, so two ZooKeeper quorums, one in each dc, are OK imo.

"I know other systems use something like having a tie breaker node in Europe"
Yeah, I want to run my own cloud and have failover in Amazon. I'm from
Europe :)



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-new-zookeper-node-on-different-ip-replicate-between-two-clasters-tp4039101p4039808.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr 4.0 is stripping XML format from RSS content field

2013-02-11 Thread eShard
Hi,
I'm running solr 4.0 final with manifoldcf 1.1 and I verified via fiddler
that Manifold is indeed sending the content field from an RSS feed that
contains XML data.
However, when I query the index the content field is there with just the
text data; the XML structure is gone.
Does anyone know how to stop Solr from doing this?
I'm using tika but I don't see it in the update/extract handler.
Can anyone point me in the right direction?

Thanks,




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-0-is-stripping-XML-format-from-RSS-content-field-tp4039809.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr Cloud: Duplicate records while retrieving documents

2013-02-11 Thread devb
We are running a six node SOLR cloud with 3 shards and 3 replicas. The
version of solr cloud is 4.0.0.2012.08.06.22.50.47. We use the Python PySolr
client to interact with Solr. Documents that we add to solr have a unique id,
so the index can never have duplicates.
Our use case is to query the index for a given search term and pull all
documents that match the query. Usually our query hits over 40K documents.
While we iterate through all 40K+ documents, after a few iterations we see the
same document ids repeated over and over, and at the end we see some 20-33%
of the records are duplicates.
In the below code snippet after some iterations, we see a difference in the
length of idslist and idsset. Any insight into how to troubleshoot this
issue is greatly appreciated.

from pysolr import Solr

solr = Solr('http://solrhost/solr/#/collection1')

if __name__ == '__main__':
    idslist = list()
    idsset = set()
    query = 'snow'
    skip = 0
    limit = 500
    i = 0
    while True:
        response = solr.search(q=query, rows=limit, start=skip,
            shards='host1:7575/solr,host2:7575/solr,host3:7575/solr',
            fl='id,source')
        if skip == 0:
            hits = response.hits
            line = "Solr Hits Count: (%s)\n" % (hits)
            print line
        if len(response.docs) == 0:
            break
        for result in response:
            idslist.append(result['id'])
            idsset.add(result['id'])
            if i % 500 == 0:
                print len(idslist), len(idsset)
            i += 1
        skip += limit




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-records-while-retrieving-documents-tp4039776.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can Solr analyze content and find dates and places

2013-02-11 Thread SUJIT PAL
Cool! Thanks for the update, this will help if I ever go all the way with UIMA 
and Solr.

-sujit

On Feb 11, 2013, at 12:13 PM, jazz wrote:

 Hi Sujit,
 
 Thanks for your help! I moved the RoomNumberAnnotator.xml to the top level of 
 the jar and used the same solrconfig.xml (with the /). Now it works perfect.
 
 Best regards, Bart
 
 
 On 11 Feb 2013, at 20:13, SUJIT PAL wrote:
 
 Hi Bart,
 
 Like I said, I didn't actually hook my UIMA stuff into Solr, content and 
 queries are annotated before they reach Solr. What you describe sounds like 
 a classpath problem (but of course you already knew that :-)). Since I 
 haven't actually done what you are trying to do, here are some suggestions, 
 they may or may not work...
 
 1) package up the XML files into your custom JAR at the top level, that way 
 you don't need to specify it as /RoomNumberAnnotator.xml.
 2) if you are using solr4, then you should drop your custom JAR into 
 $SOLR_HOME/collection1/lib, not $SOLR_HOME/lib.
 
 -sujit
 
 On Feb 11, 2013, at 9:40 AM, jazz wrote:
 
 Hi Sujit and others who answered my question,
 
 I have been working on the UIMA path which seems great with the available 
 Eclipse tooling and this:
 
 http://sujitpal.blogspot.nl/2011/03/smart-query-parsing-with-uima.html
 
 Now I worked through the UIMA tutorial of the RoomNumberAnnotator: 
 http://uima.apache.org/doc-uima-annotator.html
  And I am able to test it using the UIMA CAS Visual Debugger. So far so 
 good.
 
 But, now I want to use the new RoomNumberAnnotator with Solr, but it cannot 
 find the xml file and the Java class (they are in the correct lib 
 directories, because the WhitespaceTokenizer works fine).
 
  <updateRequestProcessorChain name="uima">
    <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
      <lst name="uimaConfig">
        <lst name="runtimeParameters">
        </lst>
        <str name="analysisEngine">/RoomNumberAnnotator.xml</str>
        <bool name="ignoreErrors">false</bool>
        <lst name="analyzeFields">
          <bool name="merge">false</bool>
          <arr name="fields">
            <str>content</str>
          </arr>
        </lst>
        <lst name="fieldMappings">
          <lst name="type">
            <str name="name">org.apache.uima.tutorial.RoomNumber</str>
            <lst name="mapping">
              <str name="feature">building</str>
              <str name="field">UIMAname</str>
            </lst>
          </lst>
        </lst>
      </lst>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
 
 On the Wiki (http://wiki.apache.org/solr/SolrUIMA) this is mentioned but it 
 fails:
 Deploy new jars inside one of the lib directories
 
 Run 'ant clean dist' (or 'mvn clean package') from the solr/contrib/uima 
 path.
 
  Is it necessary to deploy the new jar (RoomAnnotator.jar)? If yes, which
  branch can I check out? This is the stable release I am running:
 
 Solr 4.1.0 1434440 - sarowe - 2013-01-16 17:21:36
 
 Regards, Bart
 
 
 On 8 Feb 2013, at 22:11, SUJIT PAL wrote:
 
 Hi Bart,
 
 I did some work with UIMA but this was to annotate the data before it goes 
  to Lucene/Solr, i.e. not built as an UpdateRequestProcessor. I just looked 
 through the SolrUima wiki page [http://wiki.apache.org/solr/SolrUIMA] and 
 I believe you will have to set up your own aggregate analysis chain in 
 place of the one currently configured.
 
 Writing UIMA annotators is very simple (there is a tutorial here:  
 [http://uima.apache.org/downloads/releaseDocs/2.1.0-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html]).
  You provide the XML description for the annotation and let UIMA generate 
 the annotation bean. You write Java code for the annotator and also the 
 annotator XML descriptor. UIMA uses the annotator XML descriptor to 
  instantiate and run your annotator. Overall, it sounds really complicated but 
  it's actually quite simple.
 
 The tutorial has quite a few examples that you will find useful, but in 
 case you need more, I have some on this github repository:
 [https://github.com/sujitpal/tgni/tree/master/src/main/java/com/mycompany/tgni/analysis/uima]
 
 The dictionary and pattern annotators may be similar to what you are 
 looking for (date and city annotators).
 
 Best regards,
 Sujit
 
 On Feb 8, 2013, at 8:50 AM, Bart Rijpers wrote:
 
 Hi Alex,
 
  Indeed that is exactly what I am trying to achieve using worldcities. Date 
 will be simple: 16-Jan becomes 16-Jan-2013 in a new dynamic field. But 
 how do I integrate the Java library as UIMA? The documentation about 
 changing schema.xml and solr.xml is not very detailed. 
 
 Regards, Bart
 
 On 8 Feb 2013, at 16:57, Alexandre Rafalovitch arafa...@gmail.com wrote:
 
 Hi Bart,
 
 I haven't done any UIMA work (I used other stuff for my NLP phase), so 
 not
 sure I can help much further. But in general, you are venturing into pure
 research territory here.
 
 Even for dates, what do you actually mean? Just fixed expression? 
 Relative
 dates (e.g. 

Re: Solr Cloud: Duplicate records while retrieving documents

2013-02-11 Thread Shawn Heisey

On 2/11/2013 12:09 PM, devb wrote:

We are running a six node SOLR cloud with 3 shards and 3 replicas. The
version of solr cloud is 4.0.0.2012.08.06.22.50.47. We use the Python PySolr
client to interact with Solr. Documents that we add to solr have a unique id,
so the index can never have duplicates.
Our use case is to query the index for a given search term and pull all
documents that match the query. Usually our query hits over 40K documents.
While we iterate through all 40K+ documents, after a few iterations we see the
same document ids repeated over and over, and at the end we see some 20-33%
of the records are duplicates.
In the below code snippet after some iterations, we see a difference in the
length of idslist and idsset. Any insight into how to troubleshoot this
issue is greatly appreciated.


For discussion purposes, let's first assume that there are no bugs in 
Solr.  I don't think we can make that assumption, of course.


General note 1: Your Solr URL in your code has a # in it.  The URLs with 
# in them are Admin UI URLs.  If that's working, I'm amazed... I would 
take that part of the URL out so that you are pointing at:


http://host:port/solr/collection1

General note 2: Paging through that many results with a distributed 
query (known as deep paging) is SLOW.


http://solr.pl/en/2011/07/18/deep-paging-problem/


The first thing I'd do is ask Solr to sort your results.  I can see from 
some google searches that pysolr has sort capability.  Once you pick the 
sort field, I'd probably do the sort ascending, not descending.  The 
default sort is relevance.


The next thing to check is whether or not you are updating your index 
during the time that you are attempting to pull 40,000 documents.  If 
you are, that could completely explain what you are seeing.  If you are 
only adding documents when you update, then you may be able to set a 
sort parameter that will cause new documents to be at the end of the 
results, so pagination won't get messed up.  If you are deleting 
documents, then you won't be able to make this work, you'll have to stop 
your index updates while you pull that many results.
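
(Along the lines of the sorting suggestion, a minimal pysolr sketch; sorting
ascending on the unique key keeps page boundaries stable as long as the index
is not changing underneath:)

from pysolr import Solr

solr = Solr('http://solrhost:8983/solr/collection1')  # hypothetical URL

skip, limit = 0, 500
seen = set()
while True:
    # A stable ascending sort keeps pages consistent across requests.
    response = solr.search(q='snow', rows=limit, start=skip, sort='id asc')
    if not response.docs:
        break
    seen.update(doc['id'] for doc in response.docs)
    skip += limit
print len(seen)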


After all that, if the problem persists and you are absolutely sure that 
you don't have duplicate document X on two different shards, then you 
might be running into a bug.


Thanks,
Shawn



SolrCloud and hardcoded 'id' field

2013-02-11 Thread Shawn Heisey
I have heard that SolrCloud may require the presence of a uniqueKey field 
specifically named 'id' for sharding.


Is this true?  Is it still true as of Solr 4.2-SNAPSHOT?  If not, what 
svn commit fixed it?  If so, should I file a jira?  I am not actually 
using SolrCloud for one index, but my worry is that once a precedent for 
putting specific names in the code is set, it may bleed over into other 
features.  Also, I have another set of servers for a different purpose 
that ARE using SolrCloud.  Currently that system uses numShards=1, but 
one day we might want to do a distributed search there.


Both my systems have a uniqueKey field other than 'id' and it would be 
quite a task to change it.  The 'id' field doesn't exist at all in 
either system.  Here's relevant info for one of the systems:


   <field name="tag_id" type="lowercase" indexed="true" stored="true"
     omitTermFreqAndPositions="true"/>

   <!-- lowercases the entire field value -->
   <fieldType name="lowercase" class="solr.TextField"
       sortMissingLast="true" positionIncrementGap="0" omitNorms="true">
     <analyzer>
       <tokenizer class="solr.KeywordTokenizerFactory"/>
       <filter class="solr.ICUFoldingFilterFactory"/>
       <filter class="solr.TrimFilterFactory"/>
     </analyzer>
   </fieldType>

   <uniqueKey>tag_id</uniqueKey>

Thanks,
Shawn


Re: Fwd: advice about develop AbstractSolrEventListener.

2013-02-11 Thread Chris Hostetter

: I found a solution. I am going to use Configured Update Request Processors,
: that I have seen in: http://wiki.apache.org/solr/UpdateRequestProcessor

Sorry for the late reply, but yes -- an UpdateProcessor seems like the 
best place to hook in custom functionality if you need to know about 
individual document adds and commits.


-Hoss


Re: Term Frequencies for Query Result

2013-02-11 Thread Chris Hostetter

: I am looking for a way to get the top terms for a query result.

you have to elaborate on exactly what you mean ... how are you defining 
top terms for a query result?  Are you talking about the most common 
terms in the entire result set of documents that match your query?  or the 
terms from the query that most contributed to the query? or something 
else?

: Faceting does not work since counts are measured as documents containing 
: a term and not as the overall count of a term in all found documents:
...
: Using http://wiki.apache.org/solr/TermVectorComponent an counting all 
: frequencies manually seems to be the only solution by now:

i *think* you are saying that you want the sum of term frequencies for all 
terms in all matching documents -- but i'm not sure, because i don't see 
how TermVectorComponent is helping you unless you are iterating over every 
doc in the result set (ie: deep paging) to get the TermVectors for every 
doc ... it would help if you could explain what you mean by counting all 
frequencies manually



-Hoss


Re: addSortField throws field not found

2013-02-11 Thread Chris Hostetter
: Subject: addSortField throws field not found
: 
: same field name is accepted for addFacetField but throws a field not found
: exception for the addSortField method.

As a general rule, if you are going to ask a question about an error that 
you got -- you need to cut/paste the exception (verbatim) into your email 
... with the full stack trace.

if the error was logged by solr in response to a query, cut/paste the 
query (verbatim) into your email as well.

if the error was thrown in your client code, cut/paste your client 
code (verbatim) into your email as well.

https://wiki.apache.org/solr/UsingMailingLists

As things stand, you have provided almost no information that anyone can 
use to help you here ... my best guess is maybe you are having a jar 
mismatch .. but that assumes you mean you got a NoSuchMethodError about 
the addSortField in the SolrJ API ... maybe you mean you got an error 
from solr about a field not existing in your schema? ... i honestly have 
no idea.



-Hoss


Re: memory leak - multiple cores

2013-02-11 Thread Marcos Mendez
Hi Michael,

Yes, we do intend to reload Solr when deploying new cores. So we deploy it, 
update solr.xml and then restart Solr only. So this will happen sometimes in 
production, but mostly testing. Which means it will be a real pain. Any way to 
fix this?

Also, I'm running geronimo with -Xmx1024m -XX:MaxPermSize=256m. 

Regards,
Marcos

On Feb 6, 2013, at 10:54 AM, Michael Della Bitta wrote:

 Marcos,
 
 The latter 3 errors are common and won't pose a problem unless you
 intend to reload the Solr application without restarting Geronimo
 often.
 
 The first error, however, shouldn't happen. Have you changed the size
 of PermGen at all? I noticed this error while testing Solr 4.0 in
 Tomcat, but haven't seen it with Solr 4.1 (yet), so if you're on 4.0,
 you might want to try upgrading.
 
 
 Michael Della Bitta
 
 
 Appinions
 18 East 41st Street, 2nd Floor
 New York, NY 10017-6271
 
 www.appinions.com
 
 Where Influence Isn’t a Game
 
 
 On Wed, Feb 6, 2013 at 6:09 AM, Marcos Mendez mar...@jitisoft.com wrote:
 Hi,
 
 I'm deploying the SOLR war in Geronimo, with multiple cores. I'm seeing the
 following issue and it eats up a lot of memory when shutting down. Has
 anyone seen this and have an idea how to solve it?
 
  Exception in thread "DefaultThreadPool 196" java.lang.OutOfMemoryError:
 PermGen space
 2013-02-05 20:13:34,747 ERROR [ConcurrentLRUCache] ConcurrentLRUCache was
 not destroyed prior to finalize(), indicates a bug -- POSSIBLE RESOURCE
 LEAK!!!
 2013-02-05 20:13:34,747 ERROR [ConcurrentLRUCache] ConcurrentLRUCache was
 not destroyed prior to finalize(), indicates a bug -- POSSIBLE RESOURCE
 LEAK!!!
 2013-02-05 20:13:34,747 ERROR [CoreContainer] CoreContainer was not
 shutdown prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
 instance=2080324477
 
 Regards,
 Marcos



Re: SolrCloud and hardcoded 'id' field

2013-02-11 Thread Mark Miller
Doesn't sound right to me. I'd guess you heard wrong. 

- mark

Sent from my iPhone

On Feb 11, 2013, at 7:15 PM, Shawn Heisey s...@elyograg.org wrote:

 I have heard that SolrCloud may require the presence of a uniqueKey field 
 specifically named 'id' for sharding.
 
 Is this true?  Is it still true as of Solr 4.2-SNAPSHOT?  If not, what svn 
 commit fixed it?  If so, should I file a jira?  I am not actually using 
 SolrCloud for one index, but my worry is that once a precedent for putting 
 specific names in the code is set, it may bleed over into other features.  
 Also, I have another set of servers for a different purpose that ARE using 
 SolrCloud.  Currently that system uses numShards=1, but one day we might want 
 to do a distributed search there.
 
 Both my systems have a uniqueKey field other than 'id' and it would be quite 
 a task to change it.  The 'id' field doesn't exist at all in either system.  
 Here's relevant info for one of the systems:
 
    <field name="tag_id" type="lowercase" indexed="true" stored="true"
      omitTermFreqAndPositions="true"/>
 
    <!-- lowercases the entire field value -->
    <fieldType name="lowercase" class="solr.TextField" sortMissingLast="true"
      positionIncrementGap="0" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
      </analyzer>
    </fieldType>
 
   <uniqueKey>tag_id</uniqueKey>
 
 Thanks,
 Shawn


Re: Reverse range query

2013-02-11 Thread ballusethuraman
Hi,
I have created a new attribute (Year) in the attribute dictionary and associated
it with different catentries with different values, say
2000, 2001, 2002, 2003, ... 2012.
Now I want to search on the Year attribute with a min and max range. When
2000 to 2005 is given as the search condition it should fetch the catentries
which are between these two values.
This is the URL I used to hit the solr server.
ads_f11001 is the logical name of the attribute Year that I have created
in Management Center. This value will be in the srchattrprop table. 2000 and
2005 are the min and max of the range.
http://localhost/solr/MC_10701_CatalogEntry_en_US/select?q=ads_f11001:{2000 TO 2005}

When I try to hit this URL I am getting 0 records found.
http://localhost/solr/MC_10701_CatalogEntry_en_US/select?q=ads_f11001:{2000 TO *}

and

http://localhost/solr/MC_10701_CatalogEntry_en_US/select?q=ads_f11001:{* TO 2005}

These above two URLs fetch me some results, but they are not the expected
results. Please help me solve this issue.
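
(For reference: curly braces make a range exclusive at both endpoints, so
{2000 TO 2005} drops 2000 and 2005 themselves; square brackets are inclusive.
A sketch of the inclusive form, assuming the field is indexed with a type
whose range semantics match, e.g. a trie numeric type for numeric ranges:)

import requests

params = {'q': 'ads_f11001:[2000 TO 2005]', 'wt': 'json'}
r = requests.get(
    'http://localhost/solr/MC_10701_CatalogEntry_en_US/select', params=params)
print r.json()['response']['numFound']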



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Reverse-range-query-tp1789135p4039860.html
Sent from the Solr - User mailing list archive at Nabble.com.


Searching with min and max range in solr

2013-02-11 Thread ballusethuraman
Hi,
I have created a new attribute (Year) in the attribute dictionary and associated
it with different catentries with different values, say
2000, 2001, 2002, 2003, ... 2012.
Now I want to search on the Year attribute with a min and max range. When
2000 to 2005 is given as the search condition it should fetch the catentries
which are between these two values.
This is the URL I used to hit the solr server.
ads_f11001 is the logical name of the attribute Year that I have created
in Management Center. This value will be in the srchattrprop table. 2000 and
2005 are the min and max of the range.
http://localhost/solr/MC_10701_CatalogEntry_en_US/select?q=ads_f11001:{2000 TO 2005}

When I try to hit this URL I am getting 0 records found.
http://localhost/solr/MC_10701_CatalogEntry_en_US/select?q=ads_f11001:{2000 TO *}

and

http://localhost/solr/MC_10701_CatalogEntry_en_US/select?q=ads_f11001:{* TO 2005}

These above two URLs fetch me some results, but they are not the expected
results. Please help me solve this issue.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searching-with-min-and-max-range-in-solr-tp4039861.html
Sent from the Solr - User mailing list archive at Nabble.com.


Custom search component executed several times when using Zookeeper

2013-02-11 Thread jens.fosh...@evita.no
We have implemented a custom search component for SOLR which handles
security. It simply adds a filter query in the prepare method. This search
component is added to our search handler as the last component.
The custom function retrieves from a database a list of ACLs attached to the
user. 

When we are running on one instance (a single master), our search component
is executed once per request. This is what is expected, too. But when we are
using Zookeeper (two nodes), the same custom component is executed four
times per request. This gives a huge overhead and poor performance.
Is this normal behavior when using Zookeeper, or is there any configuration
we have overlooked?

Best regards

Jens Foshaug, e-vita as



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Custom-search-component-executed-several-times-when-using-Zookeeper-tp4039872.html
Sent from the Solr - User mailing list archive at Nabble.com.