Re: How partial are partial updates

2015-03-26 Thread Mikhail Khludnev
On Thu, Mar 26, 2015 at 12:23 PM, kennyk ke...@ontoforce.com wrote:

 Does solr have to reindex the whole document and not just the modified
 fields?

yep. you are right.


 If so, can you give me an idea of the amount (factor) of speed
 gained by partial re-indexing?

It's exactly the same cost as what you have at indexing time, and a little bit
worse, because you need to read the stored fields first.

There is some notion of true in-place field updates in Lucene, but it doesn't
update the inverted index, nor is it available in Solr.
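
(For reference, an atomic update request looks something like this - a
minimal sketch, assuming a uniqueKey field "id" and a stored field
"popularity", both of which are placeholders here:)

curl "http://localhost:8983/solr/update?commit=true" \
  -H 'Content-Type: application/json' \
  -d '[{"id": "doc1", "popularity": {"set": 42}}]'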



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


How partial are partial updates

2015-03-26 Thread kennyk
Hi all,

I have a question.  Here
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents  
I read that
"Solr supports several modifiers that atomically update values of a
document. This allows updating only specific fields,"
and that
"All original source fields must be stored for field modifiers to work
correctly."

And here https://wiki.apache.org/solr/Atomic_Updates even more
explicitly:
"Internally Solr re-adds the document to the index with the updated fields."

Does Solr have to reindex the whole document and not just the modified
fields? If so, can you give me an idea of the amount (factor) of speed
gained by partial re-indexing?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-partial-are-partial-updates-tp4195441.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: German Compound Splitter words.fst causing problems.

2015-03-26 Thread Christopher Morley
Thanks for the tip Markus.  We are using this filter to decompound German 
words.  Update: I am on the path to victory.  The words.fst file is actually 
built by the plugin; however, there is a basic input/output file format mismatch 
(at the byte level) that doesn't occur with 4.0.  As soon as you try to use 
lucene core 4.1 with this particular plugin, it breaks with the same error I 
was getting.  The FST code in lucene says clearly that there is no guaranteed 
backward compatibility, so there you have it.  I'm probably going to need to 
incorporate some older code from lucene and/or figure out how to make the 
plugin work with the new lucene code.

-Chris.

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Wednesday, March 25, 2015 6:15 PM
To: solr-user@lucene.apache.org
Subject: RE: German Compound Splitter words.fst causing problems.

Hello Chris - I don't know the token filter you mention, but I would like to 
recommend Lucene's HyphenationCompoundWordTokenFilter. It works reasonably well 
if you provide the hyphenation rules and a dictionary. It has some flaws such 
as decompounding to irrelevant subwords, overlapping subwords or to subwords 
that do not form the whole compound word (minus genitives),  but these can be 
fixed.
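
(For reference, a minimal sketch of wiring that filter into a field type -
the hyphenation rules file and dictionary below are placeholders you would
have to supply:)

<fieldType name="text_de_decompound" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.HyphenationCompoundWordTokenFilterFactory"
            hyphenator="de_hyphenation.xml" dictionary="dictionary-de.txt"
            minWordSize="5" minSubwordSize="4" maxSubwordSize="15"
            onlyLongestMatch="false"/>
  </analyzer>
</fieldType>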

Markus
 
-Original message-
 From:Chris Morley ch...@depahelix.com
 Sent: Wednesday 25th March 2015 17:59
 To: solr-user@lucene.apache.org
 Subject: German Compound Splitter words.fst causing problems.
 
 Hello, Chris Morley here, of Wayfair.com. I am working on the German 
 compound-splitter by Dawid Weiss. 
   
   I tried to upgrade the words.fst file that comes with the German 
 compound-splitter using Solr 3.5, but it doesn't work. Below is the 
 IndexNotFoundException that I get.
   
  cmorley@Caracal01:~/Work/oss/git/apache-solr-3.5.0$ java -cp 
 lucene/build/lucene-core-3.5-SNAPSHOT.jar 
 org.apache.lucene.index.IndexUpgrader wordsFst
 Exception in thread "main"
 org.apache.lucene.index.IndexNotFoundException: 
 org.apache.lucene.store.MMapDirectory@/home/cmorley/Work/oss/git/apache-solr-3.5.0/wordsFst
  lockFactory=org.apache.lucene.store.NativeFSLockFactory@201a755e
  at 
 org.apache.lucene.index.IndexUpgrader.upgrade(IndexUpgrader.java:118)
  at 
 org.apache.lucene.index.IndexUpgrader.main(IndexUpgrader.java:85)
   
  The reason I'm attempting this at all is due to the answer here, 
 http://stackoverflow.com/questions/25450865/migrate-solr-1-4-index-files-to-4-7,
  which says to do the upgrade in a two step process, first using Solr 3.5, 
 and then the latest Solr version (4.10.3).  When I try this running the unit 
 tests for my modified German compound-splitter I'm getting this same type of 
 error.  The thing is, this is an FST, not an index, which is a little 
 confusing.  The reason why I'm following this answer though, is because I'm 
 getting that exact same message when trying to build the (modified) project 
 with maven, at the point at which it tries to load in words.fst. Below.
   
  [main] ERROR com.wayfair.lucene.analysis.de.compound.GermanCompoundSplitter 
 - Format version is not supported (resource: 
 com.wayfair.lucene.analysis.de.compound.InputStreamDataInput@79a66240): 0 
 (needs to be between 3 and 4). This version of Lucene only supports indexes 
 created with release 3.0 and later.  Failed to initialize static data 
 structures for German compound splitter.
   
  Thanks,
  -Chris.
 
 
 





Installing the auto-phrase-tokenfilter

2015-03-26 Thread luismart
hello,

I am looking to install the auto-phrase-tokenfilter from
https://github.com/LucidWorks/auto-phrase-tokenfilter. 

Can anyone point me to some documentation on how to do this?

Thanks

Luis Martinez




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Installing-the-auto-phrase-tokenfilter-tp4195466.html
Sent from the Solr - User mailing list archive at Nabble.com.


Running test cases with ant

2015-03-26 Thread Mrinali Agarwal
Hello ,

I am trying to run my test cases in Solr using ant.
I am using the command below:

ant test -Dtestcase=Test -Dtests.leaveTemporary=true

Now, here I have my own custom schema & solrConfig. On running the above
command in the Solr directory, it builds the project again, which overrides my
schema.xml & solrConfig.xml.

Due to this my test case fails, because it is not able to find the customized
schema & config.

Let me know any suggestions

Thanks
Mrinali


Re: Running test cases with ant

2015-03-26 Thread Shawn Heisey
On 3/26/2015 6:40 AM, Mrinali Agarwal wrote:
 I am trying to run my test cases in Solr using ant.
 I am using the command below:
 
 ant test -Dtestcase=Test -Dtests.leaveTemporary=true
 
 Now, here I have my own custom schema & solrConfig. On running the above
 command in the Solr directory, it builds the project again, which overrides my
 schema.xml & solrConfig.xml.
 
 Due to this my test case fails, because it is not able to find the customized
 schema & config.
 
 Let me know any suggestions

Take a look at org.apache.solr.search.TestLFUCache for an example of a
test that loads a custom solrconfig.

The custom config is here:

solr/core/src/test-files/solr/collection1/conf/solrconfig-caching.xml

The code in TestLFUCache.java that uses that config is:

  @BeforeClass
  public static void beforeClass() throws Exception {
    initCore("solrconfig-caching.xml", "schema.xml");
  }
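
To run just that test yourself, something like this should work (the exact
test-selection flag can vary between branches):

ant test -Dtestcase=TestLFUCache -Dtests.leaveTemporary=true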

Thanks,
Shawn



Different methods of sending documents to Solr

2015-03-26 Thread zhangxin0804
Hi All,

I am trying to post data into Solr using the curl command. Could anybody
tell me the difference between the following two methods?

Method1:
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true"
-F myfile=@tutorial.html

The -F flag instructs curl to POST data using the Content-Type
multipart/form-data and supports the uploading of binary files. 
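
(With -F, Tika's content-type auto-detection means other formats can be
posted the same way - a sketch, with hypothetical file names:)

curl "http://localhost:8983/solr/update/extract?literal.id=doc2&commit=true" -F "myfile=@report.pdf"
curl "http://localhost:8983/solr/update/extract?literal.id=doc3&commit=true" -F "myfile=@notes.docx"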

Method2:
curl
"http://localhost:8983/solr/update/extract?literal.id=doc1&defaultField=text&commit=true"
--data-binary @tutorial.html -H 'Content-type:text/html'


   Consider my situation:
   I want to post many different content-types of files into Solr. Which
method should I choose?

  Thank you so much.

Sincerely,
Xiaoha





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Different-methods-of-sending-documents-to-Solr-tp4195725.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: i'm a newb: questions about schema.xml

2015-03-26 Thread Zheng Lin Edwin Yeo
Yes, this is the correct page which will tell you more about this
managed-schema thing in Solr 5.0.0. I got stuck in this for quite a while
previously too.

Regards,
Edwin

On 27 March 2015 at 08:20, Mark Bramer mbra...@esri.com wrote:

 Pretty sure I found what I am looking for:
 https://cwiki.apache.org/confluence/display/solr/Managed+Schema+Definition+in+SolrConfig

 I noticed the managed-schema file and a couple Google searches with that
 finally landed me at that link.

 Interesting that the file is hidden from the Files list in the Admin UI.

 Thanks!


 -Original Message-
 From: Mark Bramer
 Sent: Thursday, March 26, 2015 7:42 PM
 To: 'solr-user@lucene.apache.org'
 Subject: RE: i'm a newb: questions about schema.xml

 Hi Shawn,

 Definitely helpful to know about the instance and files stuff in Admin.
 I'm not running cloud, so I looked in the /conf directory but there's no
 schema.xml:

 Here's what's in my core's Files:
   currency.xml
   elevate.xml
   lang
   params.json
   protwords.txt
   solrconfig.xml
   stopwords.txt
   synonyms.txt

 and echoed by ls -l:

 -rw-r--r-- 1 root root  3974 Feb 15 11:38 currency.xml
 -rw-r--r-- 1 root root  1348 Feb 15 11:38 elevate.xml
 drwxr-xr-x 2 root root  4096 Mar 23 10:46 lang
 -rw-r--r-- 1 root root 29733 Mar 23 18:04 managed-schema
 -rw-r--r-- 1 root root   308 Feb 15 11:38 params.json
 -rw-r--r-- 1 root root   873 Feb 15 11:38 protwords.txt
 -rw-r--r-- 1 root root 60591 Feb 15 11:38 solrconfig.xml
 -rw-r--r-- 1 root root   781 Feb 15 11:38 stopwords.txt
 -rw-r--r-- 1 root root  1119 Feb 15 11:38 synonyms.txt

 -Original Message-
 From: Shawn Heisey [mailto:apa...@elyograg.org]
 Sent: Thursday, March 26, 2015 7:28 PM
 To: solr-user@lucene.apache.org
 Subject: Re: i'm a newb: questions about schema.xml

 On 3/26/2015 4:57 PM, Mark Bramer wrote:
  I'm a Solr newb.  I've been poking around for several days on my own
 test instance, and also online at the info available.  But one thing just
 isn't jiving and I can't put my finger on why.  I've searched many many
 times but I don't see what I'm looking for, so I'm thinking perhaps I have
 a fundamental semantic misunderstanding of something somewhere.  Everywhere
 I read, everyone talks about schema.xml and how important it is.  I fully get
 what it's for but I don't get where it is, how it's used (by me), how I
 edit it, and how I create new indexes once I've edited it.
 
  I've installed, and am successfully running, solr 5.0.0 on Linux.  I've
 followed the widely recommended-by-all quick start at:
 http://lucene.apache.org/solr/quickstart.html.  I get through it fine, I
 post a bunch of stuff, I use the web UI to query for, and see, data I would
 expect to see.  Should I now have a schema.xml file somewhere that is
 somehow connected to my new index?  If so, where is it?  Was it present
 from install or did it get created when I made my first core (bin/solr
 create -c ati_docs)?
 
  [root@machine solr-5.0.0]# find -name schema.xml
  ./example/example-DIH/solr/tika/conf/schema.xml
  ./example/example-DIH/solr/rss/conf/schema.xml
  ./example/example-DIH/solr/solr/conf/schema.xml
  ./example/example-DIH/solr/db/conf/schema.xml
  ./example/example-DIH/solr/mail/conf/schema.xml
  ./server/solr/configsets/basic_configs/conf/schema.xml
  ./server/solr/configsets/sample_techproducts_configs/conf/schema.xml
  [root@machine solr-5.0.0]#
 
  Is it the one in /configsets/basic_configs/conf?  Is that the default
 one?
 
  If I want to 'modify' schema.xml to do some different
 indexing/analyzing, how do I start?  Make a copy of that schema.xml, move
 it somewhere else and modify it?  If so, how do I create a new index using
 this schema.xml?
 
  Or am I running in schemaless mode?  I don't think I am because it
  appears that I would have to specifically state this as a command line
  parameter, i.e. bin/solr start -e schemaless
 
  What fundamentals am I missing?  I'm coming to Solr from Elasticsearch,
 and I've already recognized some differences.  Is my ES background clouding
 my grasp of Solr fundamentals?

 Hopefully you know what core you are using, so you can go to the admin UI
 and find it in the Core Selector dropdown list.  Assuming you can do
 that, you will find yourself looking at the Overview tab for that core.


 https://cwiki.apache.org/confluence/display/solr/Using+the+Solr+Administration+User+Interface

 Once you are looking at the core overview, in the upper right corner of
 your browser window is a section called Instance ... which has an entry
 that is ALSO called Instance.  Inside the directory indicated by that
 field, you should have a conf directory.  The config and schema for that
 index are found in that conf directory.

 If you're running SolrCloud, then you can forget everything I just said
 ... the active configs will be found within the zookeeper database, and you
 can use the Cloud-Tree tab in the admin UI to find your collections and
 see which configName is linked to each 

solr server datetime

2015-03-26 Thread fjq
Is it possible to retrieve the server datetime?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-server-datetime-tp4195728.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to create a core by API?

2015-03-26 Thread Mark E. Haase
On Thu, Mar 26, 2015 at 1:31 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Hmmm, looks like I stand corrected. I haven't kept complete track
 there, looks like this one didn't stick in my head.


I'm not saying you're wrong. The configSet parameter doesn't work at all in
my setup, so you might be right... I'm just wondering where that's
documented.

I thought Solr documentation was rough back in the 1.6 days, but wow...
it's gotten shockingly bad in Solr 5.


 As far as the docs are concerned, all patches welcome!


What kind of patch do you mean? Isn't all the documentation maintained on
confluence?

-- 
Mark E. Haase
202-815-0201


Re: How to create a core by API?

2015-03-26 Thread Mark E. Haase
Okay, thanks for the feedback. I'll admit that I do find the cloud vs
non-cloud deployment options a constant source of confusion, not the least
of which is due to the name. If I run a single Solr instance on EC2, that's
not cloud, but if I run a few instances with ZK on my local LAN, that is
cloud. Mmmkay.

I can't imagine why the API documentation wouldn't mention that the API
can't actually do the thing it's supposed to do (create a core). What's the
purpose of having an HTTP API if I'm expected to already have write access
to the host's file system to use it? Maybe it's intended as a private API that
should only be used by Solr itself? E.g. `solr create -c foo` uses the
Cores Admin API to do some (but not all) of its work. But if that's the
case, then the API docs should say that.

From an API consumer's point of view, I'm not really interested in being
forced to learn the history of the project to use the API. The whole point
of creating APIs is to abstract out details that the caller doesn't need to
know, and yet this API requires an understanding of Solr's internal file
structure and history of the project?

Yikes.


On Thu, Mar 26, 2015 at 12:56 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Ok, you're being confused by cloud, non cloud and all that kinda stuff

 Configsets are SolrCloud only, so forget them since you specified it's
 not SolrCloud.

 bq: surely the HTTP API doesn't require the caller to create a
 directory and copy files first, does it

 In fact, yes. The thing to remember here is that you're using a much
 older approach that had its roots in the pre-cloud days. The problem
 is how do you insure that the configurations are on the node you're
 creating the core on? The whole configsets discussion is an attempt to
 solve that in SolrCloud by putting the configs in a place any Solr
 instance can find them, namely Zookeeper.

 But in non-cloud, there's no central repository. You could be firing
 the query from node X and creating the core on node Y. So Solr expects
 the config files to already be in place; you have to manually copy
 them to node Y anyway, why not copy them to the place they'll be
 needed?

 The scripts make an assumption that you're running on the same node
 you're running the scripts for quick-start purposes.

 Best,
 Erick

 On Thu, Mar 26, 2015 at 9:24 AM, Mark E. Haase meha...@gmail.com wrote:
  I can't get the Core Admin API to work. I have a brand new installation
 of
  Solr 5.0.0 (in non-cloud mode). I installed using the installation script
  (a nice addition!) with default options, so I have Solr in /opt/solr and
  its data in /var/solr.
 
  Here's what I'm trying:
 
  curl '
  http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core
  '
 
  But I get this error:
 
  Error CREATEing SolrCore 'new_core': Unable to create core [new_core]
  Caused by: Can't find resource 'solrconfig.xml' in classpath or
  '/var/solr/data/new_core/conf'
 
  Solr isn't even creating /var/solr/data/new_core, which I guess is the
 root
  of the problem. But /var/solr is owned by the solr user and I can do
 `sudo
  -u solr mkdir /var/solr/data/new_core` just fine. So why isn't Solr
 making
  this directory?
 
  I see that 'instanceDir' is required, but I don't get an error message
 if I
  *don't* use it, so I'm not sure how required it actually is. I'm also not
  sure if its supposed to be a full path or a relative path or what, so
 here
  are a couple of other guesses at the correct incantation:
 
  curl '
 
  http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core&instanceDir=new_core
  '
  curl '
 
  http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core&instanceDir=/var/solr/data/new_core
  '
 
  These both return the same error message as my first try, so no dice...
 
  FWIW, I get the same error message even if I try doing this with the Solr
  Admin GUI so I'm really puzzled. Is the GUI supposed to work?
 
  I found a thread on Stack Overflow about this same problem (
  http://stackoverflow.com/a/28945428/122763) that suggests using
 configSet.
  Okay, the installer put some config sets in
  /opt/solr/server/solr/configsets, and the 'basic_configs'
  config set has a solrconfig.xml in it, so maybe that would solve my
  solrconfig.xml error?
 
  If I compare the HTTP API to the `solr create -c foo` script, it appears
  that the script creates the instance directory and copies in conf files
  *before* it calls the HTTP API... surely the HTTP API doesn't require the caller
  to
  create a directory and copy files first, does it?
 
  --
  Mark E. Haase




-- 
Mark E. Haase
202-815-0201


Re: How to create a core by API?

2015-03-26 Thread Mark E. Haase
Erick, are you sure that configSets don't apply to single-node Solr
instances?

https://cwiki.apache.org/confluence/display/solr/Config+Sets

I don't see anything about Solr cloud there. Also, configSet is a
documented argument to the Core Admin API:

https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API#CoreAdminAPI-CREATE

And one of the few things [I thought] I knew about cloud vs non-cloud
setups was that the Collections API is for cloud and the Cores API is for
non-cloud, right? So why would the non-cloud API take a cloud-only argument?

On Thu, Mar 26, 2015 at 1:16 PM, Mark E. Haase meha...@gmail.com wrote:

 Okay, thanks for the feedback. I'll admit that I do find the cloud vs
 non-cloud deployment options a constant source of confusion, not the least
 of which is due to the name. If I run a single Solr instance on EC2, that's
 not cloud, but if I run a few instances with ZK on my local LAN, that is
 cloud. Mmmkay.

 I can't imagine why the API documentation wouldn't mention that the API
 can't actually do the thing it's supposed to do (create a core). What's the
 purpose of having an HTTP API if I'm expected to already have write access
 to the host's file system to use it? Maybe it's intended as a private API that
 should only be used by Solr itself? E.g. `solr create -c foo` uses the
 Cores Admin API to do some (but not all) of its work. But if that's the
 case, then the API docs should say that.

 From an API consumer's point of view, I'm not really interested in being
 forced to learn the history of the project to use the API. The whole point
 of creating APIs is to abstract out details that the caller doesn't need to
 know, and yet this API requires an understanding of Solr's internal file
 structure and history of the project?

 Yikes.


 On Thu, Mar 26, 2015 at 12:56 PM, Erick Erickson erickerick...@gmail.com
 wrote:

 Ok, you're being confused by cloud, non cloud and all that kinda stuff

 Configsets are SolrCloud only, so forget them since you specified it's
 not SolrCloud.

 bq: surely the HTTP API doesn't require the caller to create a
 directory and copy files first, does it

 In fact, yes. The thing to remember here is that you're using a much
 older approach that had its roots in the pre-cloud days. The problem
 is how do you insure that the configurations are on the node you're
 creating the core on? The whole configsets discussion is an attempt to
 solve that in SolrCloud by putting the configs in a place any Solr
 instance can find them, namely Zookeeper.

 But in non-cloud, there's no central repository. You could be firing
 the query from node X and creating the core on node Y. So Solr expects
 the config files to already be in place; you have to manually copy
 them to node Y anyway, why not copy them to the place they'll be
 needed?

 The scripts make an assumption that you're running on the same node
 you're running the scripts for quick-start purposes.

 Best,
 Erick

 On Thu, Mar 26, 2015 at 9:24 AM, Mark E. Haase meha...@gmail.com wrote:
  I can't get the Core Admin API to work. I have a brand new installation
 of
  Solr 5.0.0 (in non-cloud mode). I installed using the installation
 script
  (a nice addition!) with default options, so I have Solr in /opt/solr and
  its data in /var/solr.
 
  Here's what I'm trying:
 
  curl '
  http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core
  '
 
  But I get this error:
 
  Error CREATEing SolrCore 'new_core': Unable to create core
 [new_core]
  Caused by: Can't find resource 'solrconfig.xml' in classpath or
  '/var/solr/data/new_core/conf'
 
  Solr isn't even creating /var/solr/data/new_core, which I guess is the
 root
  of the problem. But /var/solr is owned by the solr user and I can do
 `sudo
  -u solr mkdir /var/solr/data/new_core` just fine. So why isn't Solr
 making
  this directory?
 
  I see that 'instanceDir' is required, but I don't get an error message
 if I
  *don't* use it, so I'm not sure how required it actually is. I'm also
 not
  sure if its supposed to be a full path or a relative path or what, so
 here
  are a couple of other guesses at the correct incantation:
 
  curl '
 
  http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core&instanceDir=new_core
  '
  curl '
 
  http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core&instanceDir=/var/solr/data/new_core
  '
 
  These both return the same error message as my first try, so no dice...
 
  FWIW, I get the same error message even if I try doing this with the
 Solr
  Admin GUI so I'm really puzzled. Is the GUI supposed to work?
 
  I found a thread on Stack Overflow about this same problem (
  http://stackoverflow.com/a/28945428/122763) that suggests using
 configSet.
   Okay, the installer put some config sets in
   /opt/solr/server/solr/configsets, and the
  'basic_configs'
  config set has a solrconfig.xml in it, so maybe that would solve my
  solrconfig.xml error?
 
  If I compare the HTTP API to the 

Re: Solr Monitoring - Stored Stats?

2015-03-26 Thread Upayavira
Have a look at the admin UI, plugins/stats.

I’ve just spent the time to re-implement it in AngularJS, so I know the
functionality is there - twice :-)

You can “watch for changes” - it pulls in a reference XML, and posts
that back to the server, which only reports back changes.

Dunno if that gives you what you are after?
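
(For reference, the same numbers are also exposed per core over plain HTTP -
a sketch, assuming a core named collection1:)

curl "http://localhost:8983/solr/collection1/admin/mbeans?stats=true&wt=json"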

Upayavira

On Thu, Mar 26, 2015, at 03:15 PM, Matt Kuiper wrote:
 Erick, Shawn,
 
 Thanks for your responses.  I figured this was the case, just wanted to
 check to be sure.
 
 I have used Zabbix to configure JMX points to monitor over time, but it
 was a bit of work to get configured.  We are looking to create a simple
 dashboard of a few stats over time.  Looks like the easiest approach will
 be to build an app that calls for these stats at a regular interval and
 then indexes the results to Solr, and then we will be able to query over desired
 time frames...
 
 Thanks,
 Matt
 
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com] 
 Sent: Wednesday, March 25, 2015 10:30 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr Monitoring - Stored Stats?
 
 Matt:
 
 Not really. There's a bunch of third-party log analysis tools that give
 much of this information (not everything exposed by JMX of course is in
 the log files though).
 
 Not quite sure whether things like Nagios, Zabbix and the like have this
 kind of stuff built in... seems like a natural extension of those kinds of
 tools, though...
 
 Not much help here...
 Erick
 
 On Wed, Mar 25, 2015 at 8:26 AM, Matt Kuiper matt.kui...@issinc.com
 wrote:
  Hello,
 
  I am familiar with the JMX points that Solr exposes to allow for monitoring 
  of statistics like QPS, numdocs, Average Query Time...
 
  I am wondering if there is a way to configure Solr to automatically store 
  the value of these stats over time (for a given time interval), and then 
  allow a user to query a stat over a time range.  So for the QPS stat,  the 
  query might return a set that includes the QPS value for each hour in the 
  time range specified.
 
  Thanks,
  Matt
 
 


Uneven index distribution using composite router

2015-03-26 Thread Shamik Bandopadhyay
Hi,

   I'm using a three level composite router in a solr cloud environment,
primarily for multi-tenancy and field collapsing. The format is as follows.

*language!topic!url*.

An example would be :

ENU!12345!www.testurl.com/enu/doc1
GER!12345!www.testurl.com/ger/doc2
CHS!67890!www.testurl.com/chs/doc3

The Solr Cloud cluster contains 2 shards, each having 3 replicas. After
indexing around 10 million documents, I'm observing that the index size in
shard 1 is around 60gb while shard 2 is 15gb. So the bulk of the data is
getting indexed in shard 1. Since 60% of the documents are English, I expect
the index size to be higher on one shard, but the difference seems a little
too high.

The idea is to make sure that all ENU!12345 documents are routed to one
shard so that distributed field collapsing works. Is there something I can
do differently here to make a better distribution?

Any pointers will be appreciated.

Regards,
Shamik


How to create a core by API?

2015-03-26 Thread Mark E. Haase
I can't get the Core Admin API to work. I have a brand new installation of
Solr 5.0.0 (in non-cloud mode). I installed using the installation script
(a nice addition!) with default options, so I have Solr in /opt/solr and
its data in /var/solr.

Here's what I'm trying:

curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core
'

But I get this error:

Error CREATEing SolrCore 'new_core': Unable to create core [new_core]
Caused by: Can't find resource 'solrconfig.xml' in classpath or
'/var/solr/data/new_core/conf'

Solr isn't even creating /var/solr/data/new_core, which I guess is the root
of the problem. But /var/solr is owned by the solr user and I can do `sudo
-u solr mkdir /var/solr/data/new_core` just fine. So why isn't Solr making
this directory?

I see that 'instanceDir' is required, but I don't get an error message if I
*don't* use it, so I'm not sure how required it actually is. I'm also not
sure if its supposed to be a full path or a relative path or what, so here
are a couple of other guesses at the correct incantation:

curl '
http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core&instanceDir=new_core
'
curl '
http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core&instanceDir=/var/solr/data/new_core
'

These both return the same error message as my first try, so no dice...

FWIW, I get the same error message even if I try doing this with the Solr
Admin GUI so I'm really puzzled. Is the GUI supposed to work?

I found a thread on Stack Overflow about this same problem (
http://stackoverflow.com/a/28945428/122763) that suggests using configSet.
Okay, the installer put some config sets in
/opt/solr/server/solr/configsets, and the 'basic_configs'
config set has a solrconfig.xml in it, so maybe that would solve my
solrconfig.xml error?

If I compare the HTTP API to the `solr create -c foo` script, it appears
that the script creates the instance directory and copies in conf files *before*
it calls the HTTP API... surely the HTTP API doesn't require the caller to
create a directory and copy files first, does it?

-- 
Mark E. Haase


Re: Applying Tokenizers and Filters to CopyFields

2015-03-26 Thread Erick Erickson
Glad it worked out...

Looking back, I can't believe I didn't mention adding debug=query to
the URL. That would have shown you exactly what the parsed query
looked like and you'd have seen right off that it wasn't searching
against the field you thought it was. It's one of the first things I
do when queries don't return what I expect.

Glad it's working for you!
Erick

On Thu, Mar 26, 2015 at 8:24 AM, Michael Della Bitta
michael.della.bi...@appinions.com wrote:
 Glad you are sorted out!

 Michael Della Bitta

 Senior Software Engineer

 o: +1 646 532 3062

 appinions inc.

 “The Science of Influence Marketing”

 18 East 41st Street

 New York, NY 10017

 t: @appinions https://twitter.com/Appinions | g+:
 plus.google.com/appinions
 https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
 w: appinions.com http://www.appinions.com/

 On Thu, Mar 26, 2015 at 10:09 AM, Martin Wunderlich martin...@gmx.net
 wrote:

 Thanks so much, Erick and Michael, for all the additional explanation.
 The crucial information in the end turned out to be the one about the
 Default Search Field („df“). In solrconfig.xml this parameter was set to point
 to the original text field, which is why the expanded queries didn't work. When I
 set the df parameter to one of the fields with the expanded text, the
 search works fine. I have also removed the copyField declarations.

 It’s all working as expected now. Thanks again for the help.

 Cheers,

 Martin




  Am 25.03.2015 um 23:43 schrieb Erick Erickson erickerick...@gmail.com:
 
  Martin:
  Perhaps this would help
 
  indexed=true, stored=true
  field can be searched. The raw input (not analyzed in any way) can be
  shown to the user in the results list.
 
  indexed=true, stored=false
  field can be searched. However, the field can't be returned in the
  results list with the document.
 
  indexed=false, stored=true
  The field cannot be searched, but the contents can be returned in the
  results list with the document. There are some use-cases where this is
  desirable behavior.
 
  indexed=false, stored=false
  The entire field is thrown out, it's just as if you didn't send the
  field to be indexed at all.
 
  And one other thing, the copyField gets the _raw_ data not the
  analyzed data. Let's say you have two fields, src and dst.
  copying from src to dest in schema.xml is identical to
  <add>
    <doc>
      <field name="src">original text</field>
      <field name="dst">original text</field>
    </doc>
  </add>
 
  that is, copyfield directives are not chained.
 
  Also, watch out for your query syntax. Michael's comments are spot-on,
  I'd just add this:
 
 
  http://localhost:8983/solr/windex/select?q=Sprache&fq=original&wt=json&indent=true
 
  is kind of odd. Let's assume you mean qf rather than fq. That
  _only_ matters if your query parser is edismax, it'll be ignored in
  this case I believe.
 
  You'd want something like
  q=src:Sprache
  or
  q=dst:Sprache
  or even
   http://localhost:8983/solr/windex/select?q=Sprache&df=src
   http://localhost:8983/solr/windex/select?q=Sprache&df=dst
 
  where df is default field and the search is applied against that
  field in the absence of a field qualification like my first two
  examples.
 
  Best,
  Erick
 
  On Wed, Mar 25, 2015 at 2:52 PM, Michael Della Bitta
  michael.della.bi...@appinions.com wrote:
  I agree the terminology is possibly a little confusing.
 
  Stored refers to values that are stored verbatim. You can retrieve them
  verbatim. Analysis does not affect stored values.
  Indexed values are tokenized/transformed and stored inverted. You can't
  recover the literal analyzed version (at least, not easily).
 
  If what you really want is to store and retrieve case folded versions of
  your data as well as the original, you need to use something like a
  UpdateRequestProcessor, which I personally am less familiar with.
 
 
  On Wed, Mar 25, 2015 at 5:28 PM, Martin Wunderlich martin...@gmx.net
  wrote:
 
   So, the pre-processing steps are applied under <analyzer type="index">.
  And this point is not quite clear to me: Assuming that I have a simple
  case-folding step applied to the target of the copyField: How or where
 are
  the lower-case tokens stored, if the text isn’t added to the index?
 How is
  the query supposed to retrieve the lower-case version?
  (sorry, if this sounds like a naive question, but I have a feeling
 that I
  am missing something really basic here).
 
 
 
  Michael Della Bitta
 
  Senior Software Engineer
 
  o: +1 646 532 3062
 
  appinions inc.
 
  “The Science of Influence Marketing”
 
  18 East 41st Street
 
  New York, NY 10017
 
  t: @appinions https://twitter.com/Appinions | g+:
  plus.google.com/appinions
  
 https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
 
  w: appinions.com http://www.appinions.com/




Re: How to create a core by API?

2015-03-26 Thread Erick Erickson
Hmmm, looks like I stand corrected. I haven't kept complete track
there, looks like this one didn't stick in my head.

As far as the docs are concerned, all patches welcome!

Best,
Erick

On Thu, Mar 26, 2015 at 10:26 AM, Mark E. Haase meha...@gmail.com wrote:
 Erick, are you sure that configSets don't apply to single-node Solr
 instances?

 https://cwiki.apache.org/confluence/display/solr/Config+Sets

 I don't see anything about Solr cloud there. Also, configSet is a
 documented argument to the Core Admin API:

 https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API#CoreAdminAPI-CREATE

 And one of the few things [I thought] I knew about cloud vs non cloud
 setups was the Collections API is for cloud and Cores API is for non cloud,
 right? So why would the non-cloud API take a cloud-only argument?

 On Thu, Mar 26, 2015 at 1:16 PM, Mark E. Haase meha...@gmail.com wrote:

 Okay, thanks for the feedback. I'll admit that I do find the cloud vs
 non-cloud deployment options a constant source of confusion, not the least
 of which is due to the name. If I run a single Solr instance on EC2, that's
 not cloud, but if I run a few instances with ZK on my local LAN, that is
 cloud. Mmmkay.

 I can't imagine why the API documentation wouldn't mention that the API
 can't actually do the thing it's supposed to do (create a core). What's the
 purpose of having an HTTP API if I'm expected to already have write access
  to the host's file system to use it? Maybe it's intended as a private API that
  should only be used by Solr itself? E.g. `solr create -c foo` uses the
  Cores Admin API to do some (but not all) of its work. But if that's the
  case, then the API docs should say that.

 From an API consumer's point of view, I'm not really interested in being
 forced to learn the history of the project to use the API. The whole point
 of creating APIs is to abstract out details that the caller doesn't need to
 know, and yet this API requires an understanding of Solr's internal file
 structure and history of the project?

 Yikes.


 On Thu, Mar 26, 2015 at 12:56 PM, Erick Erickson erickerick...@gmail.com
 wrote:

 Ok, you're being confused by cloud, non cloud and all that kinda stuff

 Configsets are SolrCloud only, so forget them since you specified it's
 not SolrCloud.

 bq: surely the HTTP API doesn't require the caller to create a
 directory and copy files first, does it

 In fact, yes. The thing to remember here is that you're using a much
 older approach that had its roots in the pre-cloud days. The problem
 is how do you insure that the configurations are on the node you're
 creating the core on? The whole configsets discussion is an attempt to
 solve that in SolrCloud by putting the configs in a place any Solr
 instance can find them, namely Zookeeper.

 But in non-cloud, there's no central repository. You could be firing
 the query from node X and creating the core on node Y. So Solr expects
 the config files to already be in place; you have to manually copy
 them to node Y anyway, why not copy them to the place they'll be
 needed?

 The scripts make an assumption that you're running on the same node
 you're running the scripts for quick-start purposes.

 Best,
 Erick

 On Thu, Mar 26, 2015 at 9:24 AM, Mark E. Haase meha...@gmail.com wrote:
  I can't get the Core Admin API to work. I have a brand new installation
 of
  Solr 5.0.0 (in non-cloud mode). I installed using the installation
 script
  (a nice addition!) with default options, so I have Solr in /opt/solr and
  its data in /var/solr.
 
  Here's what I'm trying:
 
  curl '
  http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core
  '
 
  But I get this error:
 
  Error CREATEing SolrCore 'new_core': Unable to create core
 [new_core]
  Caused by: Can't find resource 'solrconfig.xml' in classpath or
  '/var/solr/data/new_core/conf'
 
  Solr isn't even creating /var/solr/data/new_core, which I guess is the
 root
  of the problem. But /var/solr is owned by the solr user and I can do
 `sudo
  -u solr mkdir /var/solr/data/new_core` just fine. So why isn't Solr
 making
  this directory?
 
  I see that 'instanceDir' is required, but I don't get an error message
 if I
  *don't* use it, so I'm not sure how required it actually is. I'm also
 not
  sure if its supposed to be a full path or a relative path or what, so
 here
  are a couple of other guesses at the correct incantation:
 
  curl '
 
  http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core&instanceDir=new_core
  '
  curl '
 
  http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core&instanceDir=/var/solr/data/new_core
  '
 
  These both return the same error message as my first try, so no dice...
 
  FWIW, I get the same error message even if I try doing this with the
 Solr
  Admin GUI so I'm really puzzled. Is the GUI supposed to work?
 
  I found a thread on Stack Overflow about this same problem (
  http://stackoverflow.com/a/28945428/122763) that suggests using
 

Re: How to create a core by API?

2015-03-26 Thread Erick Erickson
Ok, you're being confused by cloud, non cloud and all that kinda stuff

Configsets are SolrCloud only, so forget them since you specified it's
not SolrCloud.

bq: surely the HTTP API doesn't require the caller to create a
directory and copy files first, does it

In fact, yes. The thing to remember here is that you're using a much
older approach that had its roots in the pre-cloud days. The problem
is how do you insure that the configurations are on the node you're
creating the core on? The whole configsets discussion is an attempt to
solve that in SolrCloud by putting the configs in a place any Solr
instance can find them, namely Zookeeper.

But in non-cloud, there's no central repository. You could be firing
the query from node X and creating the core on node Y. So Solr expects
the config files to already be in place; you have to manually copy
them to node Y anyway, why not copy them to the place they'll be
needed?

The scripts make an assumption that you're running on the same node
you're running the scripts for quick-start purposes.

Best,
Erick

On Thu, Mar 26, 2015 at 9:24 AM, Mark E. Haase meha...@gmail.com wrote:
 I can't get the Core Admin API to work. I have a brand new installation of
 Solr 5.0.0 (in non-cloud mode). I installed using the installation script
 (a nice addition!) with default options, so I have Solr in /opt/solr and
 its data in /var/solr.

 Here's what I'm trying:

 curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core
 '

 But I get this error:

 Error CREATEing SolrCore 'new_core': Unable to create core [new_core]
 Caused by: Can't find resource 'solrconfig.xml' in classpath or
 '/var/solr/data/new_core/conf'

 Solr isn't even creating /var/solr/data/new_core, which I guess is the root
 of the problem. But /var/solr is owned by the solr user and I can do `sudo
 -u solr mkdir /var/solr/data/new_core` just fine. So why isn't Solr making
 this directory?

 I see that 'instanceDir' is required, but I don't get an error message if I
 *don't* use it, so I'm not sure how required it actually is. I'm also not
 sure if its supposed to be a full path or a relative path or what, so here
 are a couple of other guesses at the correct incantation:

 curl '
 http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core&instanceDir=new_core
 '
 curl '
 http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core&instanceDir=/var/solr/data/new_core
 '

 These both return the same error message as my first try, so no dice...

 FWIW, I get the same error message even if I try doing this with the Solr
 Admin GUI so I'm really puzzled. Is the GUI supposed to work?

 I found a thread on Stack Overflow about this same problem (
 http://stackoverflow.com/a/28945428/122763) that suggests using configSet.
 Okay, the installer put some config sets in
 /opt/solr/server/solr/configsets, and the 'basic_configs'
 config set has a solrconfig.xml in it, so maybe that would solve my
 solrconfig.xml error?

 If I compare the HTTP API to the `solr create -c foo` script, it appears
 that the script creates the instance directory and copies in conf files
 *before* it calls the HTTP API... surely the HTTP API doesn't require the caller to
 create a directory and copy files first, does it?

 --
 Mark E. Haase


Re: Uneven index distribution using composite router

2015-03-26 Thread Erick Erickson
right, when you take over routing, making sure the distribution is
even is now your responsibility.

Your assumption is that the amount of _text_ in each doc is roughly
the same between your three languages, have you verified this? And are
you doing anything like copyFields that are kicking in on one shard
but not the others (e.g. if you have text_en fields you might be
copying that to text_en_all but not doing so with text_ger to
text_ger_all). that's totally a shot in the dark though.

Best,
Erick

On Thu, Mar 26, 2015 at 10:26 AM, Shamik Bandopadhyay sham...@gmail.com wrote:
 Hi,

I'm using a three level composite router in a solr cloud environment,
 primarily for multi-tenancy and field collapsing. The format is as follows.

 *language!topic!url*.

 An example would be :

 ENU!12345!www.testurl.com/enu/doc1
 GER!12345!www.testurl.com/ger/doc2
 CHS!67890!www.testurl.com/chs/doc3

 The Solr Cloud cluster contains 2 shards, each having 3 replicas. After
 indexing around 10 million documents, I'm observing that the index size in
 shard 1 is around 60gb while shard 2 is 15gb. So the bulk of the data is
 getting indexed in shard 1. Since 60% of the documents are English, I expect
 the index size to be higher on one shard, but the difference seems a little
 too high.

 The idea is to make sure that all ENU!12345 documents are routed to one
 shard so that distributed field collapsing works. Is there something I can
 do differently here to make a better distribution?

 Any pointers will be appreciated.

 Regards,
 Shamik


Re: How to create a core by API?

2015-03-26 Thread Erick Erickson
Go to the comments section and add any corrections you'd like,
that'll get bubbled up.

Best,
Erick

On Thu, Mar 26, 2015 at 10:45 AM, Mark E. Haase meha...@gmail.com wrote:
 On Thu, Mar 26, 2015 at 1:31 PM, Erick Erickson erickerick...@gmail.com
 wrote:

 Hmmm, looks like I stand corrected. I haven't kept complete track
 there, looks like this one didn't stick in my head.


 I'm not saying you're wrong. The configSet parameter doesn't work at all in
 my set up, so you might be right... I'm just wondering where that's
 documented.

 I thought Solr documentation was rough back in the 1.6 days, but wow...
 it's gotten shockingly bad in Solr 5.


 As far as the docs are concerned, all patches welcome!


 What kind of patch do you mean? Isn't all the documentation maintained on
 confluence?

 --
 Mark E. Haase
 202-815-0201


Re: How to create a core by API?

2015-03-26 Thread Shawn Heisey
On 3/26/2015 10:24 AM, Mark E. Haase wrote:
 I can't get the Core Admin API to work. I have a brand new installation of
 Solr 5.0.0 (in non-cloud mode). I installed using the installation script
 (a nice addition!) with default options, so I have Solr in /opt/solr and
 its data in /var/solr.

 Here's what I'm trying:

 curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core
 '

 But I get this error:

 Error CREATEing SolrCore 'new_core': Unable to create core [new_core]
 Caused by: Can't find resource 'solrconfig.xml' in classpath or

The error message tells you what is wrong.

The CoreAdmin API requires that the instanceDir already exist, with a
conf directory inside it that contains solrconfig.xml, schema.xml, and
any other necessary config files.

If you want completely from-scratch creation without any existing
filesystem layout, you will need to run SolrCloud, which keeps config
files in the zookeeper database.  At that point you would be using the
Collections API.

If you go to Core Admin in the admin UI and click the Add Core button,
you will see the following note:

instanceDir and dataDir need to exist before you can create the core

This message is not quite accurate -- the dataDir (defaulting to
${instanceDir}/data) will be created if it does not already exist, and
the user running Solr has the required permissions to create it.  The
message also doesn't say anything about the conf directory or the two
required XML files.
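
A minimal sketch of the required sequence, using the paths from this thread
(basic_configs is one of the stock config sets shipped with Solr 5):

sudo -u solr mkdir -p /var/solr/data/new_core
sudo -u solr cp -r /opt/solr/server/solr/configsets/basic_configs/conf /var/solr/data/new_core/conf
curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core&instanceDir=/var/solr/data/new_core'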

Thanks,
Shawn



Performance json vs javabin

2015-03-26 Thread Tech MOnkey
Has anyone done performance tests between json and javabin?  The scale tipped 
towards javabin when compared to 
XML (https://issues.apache.org/jira/browse/SOLR-486).  I am curious to know if 
it is the same with json when the load is 600 requests per minute, for example.
Thanks,
  

Re: Replacing a group of documents (Delete/Insert) without a query on the index ever showing an empty list (Docs)

2015-03-26 Thread Shawn Heisey
On 3/26/2015 9:53 AM, Russell Taylor wrote:
 I have an index which is made up of groups of documents, each group is 
 defined by a field called keyField (keyField:A).
 I need to delete all the keyField:A documents and replace them with a brand 
 new set without the index ever returning
 zero documents on a query.

 At the moment I deleteByQuery:keyField:A and then insert a SolrInputDocument 
 list via
 SolrJ into my index. I have a small time period where somebody doing a 
 q=keyField:A
 can be returned an empty list.

 FYI: The keyField group might be just 100 documents or up to 10 million.

As long as you don't have any commits with openSearcher=true happening
between the delete and the insert, that would work ... but why go
through the manual delete if you don't have to?

If you define a suitable uniqueKey field in your schema, simply indexing
a new document with the same value in the uniqueKey field as an existing
document will delete the old document.

https://wiki.apache.org/solr/UniqueKey
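
(If you do want the explicit delete, both commands can go in one update
request so that no searcher-opening commit can land in between - a sketch,
if your Solr version accepts multiple commands per XML message; the doc
fields are placeholders:)

curl "http://localhost:8983/solr/update?commit=true" -H 'Content-Type: text/xml' --data-binary \
'<update>
  <delete><query>keyField:A</query></delete>
  <add>
    <doc>
      <field name="id">A-1</field>
      <field name="keyField">A</field>
    </doc>
  </add>
</update>'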

Thanks,
Shawn



delta import on changes in entity within a document

2015-03-26 Thread PeterKerk
I have the following data-config:

<document name="locations">
  <entity pk="id" name="location"
          query="select * from locations WHERE isapproved='true'"
          deltaImportQuery="select * from locations WHERE updatedate &lt; getdate()
                            AND isapproved='true' AND id='${dataimporter.delta.id}'"
          deltaQuery="select id from locations where isapproved='true' AND
                      updatedate &gt; '${dataimporter.last_index_time}'">

    <entity name="offerdetails" query="SELECT title as offer_title,
            ISNULL(img,'') as offer_thumb, id as offer_id,
            startdate as offer_startdate,
            enddate as offer_enddate,
            description as offer_description,
            updatedate as offer_updatedate
            FROM offers WHERE objectid=${location.id}">
    </entity>
  </entity>
</document>


Now, when an object in the [locations] table is updated, my delta import
(/dataimport?command=delta-import) query works perfectly.
But when an offer is updated in the [offers] table, this is not seen by the
delta-import command. Is there a way to delta-import only the updated offers
for the respective location if an offer is updated? And then without:
a. having to fully import ALL locations
or
b. having to update this single location and then do a regular delta-import?
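
(For what it's worth, DIH's parentDeltaQuery on the child entity is aimed at
exactly this case - an untested sketch against the tables above:)

<entity name="offerdetails"
        query="SELECT ... FROM offers WHERE objectid=${location.id}"
        deltaQuery="select objectid from offers where
                    updatedate &gt; '${dataimporter.last_index_time}'"
        parentDeltaQuery="select id from locations where id=${offerdetails.objectid}"/>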



--
View this message in context: 
http://lucene.472066.n3.nabble.com/delta-import-on-changes-in-entity-within-a-document-tp4195615.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Uneven index distribution using composite router

2015-03-26 Thread shamik
Thanks for your reply, Erick.

In my case, I've 14 languages, out of which 50% of the documents belong to
English. German and CHS will probably constitute another 25%. I'm not using
copyfield, rather, each language has it's dedicated field such as title_enu,
text_enu, title_ger,text_ger, etc. Since I know the language prior to index
time, this works for me.

I've added one more sample key in the example. 

ENU!12345!www.testurl.com/enu/doc1 
ENU!12345!www.testurl.com/enu/doc10 
GER!12345!www.testurl.com/ger/doc2 
CHS!67890!www.testurl.com/chs/doc3 

As you can see, there are 2 documents in English having the same topic id
(12345). I added topicid as part of the key to make sure that they are
residing in the same shard in order to make field collapsing work on topic
id. I can perhaps remove the composite key and only have language and url,
something like, 

ENU!www.testurl.com/enu/doc1

But that'll probably not solve the distribution issue. You mentioned "when
you take over routing, making sure the distribution is even is now your
responsibility". I'm wondering, what's the best practice to make that happen?
I can get away from the composite router and manually assign a bunch of languages
to a dedicated shard, both during index and query time. But I'm not sure
keeping a map is an efficient way of dealing with it.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Uneven-index-distribution-using-composite-router-tp4195569p4195591.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to create a core by API?

2015-03-26 Thread Yonik Seeley
On Thu, Mar 26, 2015 at 1:45 PM, Mark E. Haase meha...@gmail.com wrote:
 I'm not saying you're wrong. The configSet parameter doesn't work at all in
 my setup, so you might be right... I'm just wondering where that's
 documented.

Trying on current trunk, I got it to work:

/opt/code/lusolr_trunk/solr$ curl -XPOST
"http://localhost:8983/solr/admin/cores?action=CREATE&name=demo3&instanceDir=demo3&configSet=basic_configs"

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader"><int name="status">0</int><int name="QTime">769</int></lst>
  <str name="core">demo3</str>
</response>

Although I'm not thrilled with a different parameter name for cloud
vs non-cloud.  I come from the camp that believes that overloading is
both natural and easily understood (e.g. I don't find "foo" + "bar"
and 1.5 + 2.5 both using "+" confusing).

-Yonik


Re: Build index from Oracle, adding fields

2015-03-26 Thread Julian Perry

On 27/03/2015 12:42, Shawn Heisey wrote:

If that's not practical, then the only real option you have is to drop
back to one entity, and build a single SELECT statement (using JOIN and
some form of CONCAT) that will gather all the information from all the
tables at the same time, and combine multiple values together into one
SQL result field with some kind of delimiter.  Then you can use the
RegexTransformer's splitBy functionality to turn the concatenated data
back into multiple values for your multi-valued field.  Database servers
tend to be REALLY good at JOIN operations, so the database would be
doing the heavy lifting.
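
(The splitBy approach described above looks roughly like this in a DIH
entity - a sketch with hypothetical table and column names, using Oracle's
LISTAGG for the concatenation:)

<entity name="item" transformer="RegexTransformer"
        query="SELECT i.id,
                      LISTAGG(t.name, '|') WITHIN GROUP (ORDER BY t.name) AS tags
               FROM items i JOIN tags t ON t.item_id = i.id
               GROUP BY i.id">
  <field column="tags" splitBy="\|"/>
</entity>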


I did try that in fact (and do it with one of my other indexes).

However, with this index the sub-select can return 200 rows of
200 characters - and that blows up in Oracle, as the concatenated field is
over 4000 characters long (and the work-around for that is to
use CLOBs - but those have their own performance problems).

Currently I am doing this by exporting a CSV file and
processing it with a C program - and then reading the CSV with
SOLR :(

--
Cheers
Jules.



Re: i'm a newb: questions about schema.xml

2015-03-26 Thread Erick Erickson
This is key: managed-schema

You've managed to get things started with the managed schema.
Therefore, you need to use the REST API to
add/subtract/multiply/divide. This is different than schemaless,
although it _is_ related. And they're both different than having a
schema.xml to edit.

Or start over _without_ a managed schema, not quite sure how you
started that way in the first place ;). You may have used bin/solr
start -e schemaless when you started and maybe forgot?

Here's a place to start:
https://cwiki.apache.org/confluence/display/solr/Managed+Schema+Definition+in+SolrConfig
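
For example, adding a field through that API looks roughly like this (a
sketch; "ati_docs" is the core from your earlier mail and the field name is
a placeholder):

curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/ati_docs/schema' \
  --data-binary '{"add-field": {"name": "my_field", "type": "string", "stored": true, "indexed": true}}'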

Best,
Erick

On Thu, Mar 26, 2015 at 4:41 PM, Mark Bramer mbra...@esri.com wrote:
 Hi Shawn,

 Definitely helpful to know about the instance and files stuff in Admin.  I'm 
 not running cloud, so I looked in the /conf directory but there's no 
 schema.xml:

 Here's what's in my core's Files:
   currency.xml
   elevate.xml
   lang
   params.json
   protwords.txt
   solrconfig.xml
   stopwords.txt
   synonyms.txt

 and echoed by ls -l:

 -rw-r--r-- 1 root root  3974 Feb 15 11:38 currency.xml
 -rw-r--r-- 1 root root  1348 Feb 15 11:38 elevate.xml
 drwxr-xr-x 2 root root  4096 Mar 23 10:46 lang
 -rw-r--r-- 1 root root 29733 Mar 23 18:04 managed-schema
 -rw-r--r-- 1 root root   308 Feb 15 11:38 params.json
 -rw-r--r-- 1 root root   873 Feb 15 11:38 protwords.txt
 -rw-r--r-- 1 root root 60591 Feb 15 11:38 solrconfig.xml
 -rw-r--r-- 1 root root   781 Feb 15 11:38 stopwords.txt
 -rw-r--r-- 1 root root  1119 Feb 15 11:38 synonyms.txt

 -Original Message-
 From: Shawn Heisey [mailto:apa...@elyograg.org]
 Sent: Thursday, March 26, 2015 7:28 PM
 To: solr-user@lucene.apache.org
 Subject: Re: i'm a newb: questions about schema.xml

 On 3/26/2015 4:57 PM, Mark Bramer wrote:
 I'm a Solr newb.  I've been poking around for several days on my own test 
 instance, and also online at the info available.  But one thing just isn't 
 jiving and I can't put my finger on why.  I've searched many many times but 
 I don't see what I'm looking for, so I'm thinking perhaps I have a 
 fundamental semantic misunderstanding of something somewhere.  Everywhere I 
 read, everyone talks about schema.xml and how important it is.  I fully get 
 what it's for but I don't get where it is, how it's used (by me), how I edit 
 it, and how I create new indexes once I've edited it.

 I've installed, and am successfully running, solr 5.0.0 on Linux.  I've 
 followed the widely recommended-by-all quick start at: 
 http://lucene.apache.org/solr/quickstart.html.  I get through it fine, I 
 post a bunch of stuff, I use the web UI to query for, and see, data I would 
 expect to see.  Should I now have a schema.xml file somewhere that is 
 somehow connected to my new index?  If so, where is it?  Was it present from 
 install or did it get created when I made my first core (bin/solr create -c 
 ati_docs)?

 [root@machine solr-5.0.0]# find -name schema.xml
 ./example/example-DIH/solr/tika/conf/schema.xml
 ./example/example-DIH/solr/rss/conf/schema.xml
 ./example/example-DIH/solr/solr/conf/schema.xml
 ./example/example-DIH/solr/db/conf/schema.xml
 ./example/example-DIH/solr/mail/conf/schema.xml
 ./server/solr/configsets/basic_configs/conf/schema.xml
 ./server/solr/configsets/sample_techproducts_configs/conf/schema.xml
 [root@machine solr-5.0.0]#

 Is it the one in /configsets/basic_configs/conf?  Is that the default one?

 If I want to 'modify' schema.xml to do some different indexing/analyzing, 
 how do I start?  Make a copy of that schema.xml, move it somewhere else and 
 modify it?  If so, how do I create a new index using this schema.xml?

 Or am I running in schemaless mode?  I don't think I am because it
 appears that I would have to specifically state this as a command line
 parameter, i.e. bin/solr start -e schemaless

 What fundamentals am I missing?  I'm coming to Solr from Elasticsearch, and 
 I've already recognized some differences.  Is my ES background clouding my 
 grasp of Solr fundamentals?

 Hopefully you know what core you are using, so you can go to the admin UI and 
 find it in the Core Selector dropdown list.  Assuming you can do that, you 
 will find yourself looking at the Overview tab for that core.

 https://cwiki.apache.org/confluence/display/solr/Using+the+Solr+Administration+User+Interface

 Once you are looking at the core overview, in the upper right corner of your 
 browser window is a section called Instance ... which has an entry that is 
 ALSO called Instance.  Inside the directory indicated by that field, you 
 should have a conf directory.  The config and schema for that index are found 
 in that conf directory.

 If you're running SolrCloud, then you can forget everything I just said ... 
 the active configs will be found within the zookeeper database, and you can 
 use the Cloud-Tree tab in the admin UI to find your collections and see 
 which configName is linked to each one.  You'll want to become familiar with 
 the zkcli script 

RE: i'm a newb: questions about schema.xml

2015-03-26 Thread Mark Bramer
Pretty sure I found what I am looking for: 
https://cwiki.apache.org/confluence/display/solr/Managed+Schema+Definition+in+SolrConfig

I noticed the managed-schema file, and a couple of Google searches on that 
name finally landed me at that link.

Interesting that the file is hidden from the Files list in the Admin UI.

Thanks!


-Original Message-
From: Mark Bramer 
Sent: Thursday, March 26, 2015 7:42 PM
To: 'solr-user@lucene.apache.org'
Subject: RE: i'm a newb: questions about schema.xml

Hi Shawn,

Definitely helpful to know about the instance and files stuff in Admin.  I'm 
not running cloud, so I looked in the /conf directory but there's no schema.xml:

Here's what's in my core's Files: 
  currency.xml
  elevate.xml
  lang
  params.json
  protwords.txt
  solrconfig.xml
  stopwords.txt
  synonyms.txt

and echoed by ls -l: 

-rw-r--r-- 1 root root  3974 Feb 15 11:38 currency.xml
-rw-r--r-- 1 root root  1348 Feb 15 11:38 elevate.xml
drwxr-xr-x 2 root root  4096 Mar 23 10:46 lang
-rw-r--r-- 1 root root 29733 Mar 23 18:04 managed-schema
-rw-r--r-- 1 root root   308 Feb 15 11:38 params.json
-rw-r--r-- 1 root root   873 Feb 15 11:38 protwords.txt
-rw-r--r-- 1 root root 60591 Feb 15 11:38 solrconfig.xml
-rw-r--r-- 1 root root   781 Feb 15 11:38 stopwords.txt
-rw-r--r-- 1 root root  1119 Feb 15 11:38 synonyms.txt

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Thursday, March 26, 2015 7:28 PM
To: solr-user@lucene.apache.org
Subject: Re: i'm a newb: questions about schema.xml

On 3/26/2015 4:57 PM, Mark Bramer wrote:
 I'm a Solr newb.  I've been poking around for several days on my own test 
 instance, and also online at the info available.  But one thing just isn't 
 jiving and I can't put my finger on why.  I've searched many many times but I 
 don't see what I'm looking for, so I'm thinking perhaps I have a fundamental 
 semantic misunderstanding of something somewhere.  Everywhere I read, 
 everyone talks about schema.xml and how important it is.  I fully get what it's 
 for but I don't get where it is, how it's used (by me), how I edit it, and 
 how I create new indexes once I've edited it.

 I've installed, and am successfully running, solr 5.0.0 on Linux.  I've 
 followed the widely recommended-by-all quick start at: 
 http://lucene.apache.org/solr/quickstart.html.  I get through it fine, I post 
 a bunch of stuff, I use the web UI to query for, and see, data I would expect 
 to see.  Should I now have a schema.xml file somewhere that is somehow 
 connected to my new index?  If so, where is it?  Was it present from install 
 or did it get created when I made my first core (bin/solr create -c ati_docs)?

 [root@machine solr-5.0.0]# find -name schema.xml 
 ./example/example-DIH/solr/tika/conf/schema.xml
 ./example/example-DIH/solr/rss/conf/schema.xml
 ./example/example-DIH/solr/solr/conf/schema.xml
 ./example/example-DIH/solr/db/conf/schema.xml
 ./example/example-DIH/solr/mail/conf/schema.xml
 ./server/solr/configsets/basic_configs/conf/schema.xml
 ./server/solr/configsets/sample_techproducts_configs/conf/schema.xml
 [root@machine solr-5.0.0]#

 Is it the one in /configsets/basic_configs/conf?  Is that the default one?

 If I want to 'modify' schema.xml to do some different indexing/analyzing, how 
 do I start?  Make a copy of that schema.xml, move it somewhere else and 
 modify it?  If so, how do I create a new index using this schema.xml?

 Or am I running in schemaless mode?  I don't think I am because it 
 appears that I would have to specifically state this as a command line 
 parameter, i.e. bin/solr start -e schemaless

 What fundamentals am I missing?  I'm coming to Solr from Elasticsearch, and 
 I've already recognized some differences.  Is my ES background clouding my 
 grasp of Solr fundamentals?

Hopefully you know what core you are using, so you can go to the admin UI and 
find it in the Core Selector dropdown list.  Assuming you can do that, you 
will find yourself looking at the Overview tab for that core.

https://cwiki.apache.org/confluence/display/solr/Using+the+Solr+Administration+User+Interface

Once you are looking at the core overview, in the upper right corner of your 
browser window is a section called Instance ... which has an entry that is 
ALSO called Instance.  Inside the directory indicated by that field, you 
should have a conf directory.  The config and schema for that index are found 
in that conf directory.

If you're running SolrCloud, then you can forget everything I just said ... the 
active configs will be found within the zookeeper database, and you can use the 
Cloud-Tree tab in the admin UI to find your collections and see which 
configName is linked to each one.  You'll want to become familiar with the 
zkcli script in server/scripts/cloud-scripts.

https://cwiki.apache.org/confluence/display/solr/Command+Line+Utilities

Whether it is SolrCloud or not, you can always LOOK at your configs right in 
the admin UI -- click on the 

Re: Custom TokenFilter

2015-03-26 Thread Test Test
Hi Erick, 
For me, this ClassCastException was caused by the wrong use of TokenFilter. 
In the fieldType declaration (schema.xml) I had put:

<tokenizer class="com.tamingtext.texttamer.solr.SentenceTokenizerFactory"/>

but instead of extending TokenizerFactory in my class, I was extending 
TokenFilterFactory:

public class SentenceTokenizerFactory extends TokenFilterFactory

So when Solr tried to load my class, it expected a TokenizerFactory but 
found a TokenFilterFactory.
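
For anyone who hits the same thing, a minimal factory skeleton for 
Lucene/Solr 4.10 looks roughly like this (a sketch only; SentenceTokenizer 
is my own Tokenizer subclass, not a stock class):

import java.io.Reader;
import java.util.Map;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
// on 4.10 AttributeFactory is (I believe) still nested in AttributeSource;
// in later versions it is the top-level org.apache.lucene.util.AttributeFactory
import org.apache.lucene.util.AttributeSource.AttributeFactory;

public class SentenceTokenizerFactory extends TokenizerFactory {

  public SentenceTokenizerFactory(Map<String,String> args) {
    super(args);  // consumes common args such as luceneMatchVersion
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public Tokenizer create(AttributeFactory factory, Reader input) {
    return new SentenceTokenizer(factory, input);  // your own Tokenizer subclass
  }
}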
Regards,
Andry


 On Thursday, March 26, 2015 at 4:13 AM, Erick Erickson erickerick...@gmail.com wrote:

 Thanks for letting us know the resolution, the problem was bugging me

Erick

On Wed, Mar 25, 2015 at 4:21 PM, Test Test andymish...@yahoo.fr wrote:
 Re,
 Finally, I think I found where this problem comes from. I didn't extend the 
  right class: instead of extending Tokenizer, I was extending TokenFilter.
  Erick, thanks for your replies. Regards.


      On Wednesday, March 25, 2015 at 11:55 PM, Test Test andymish...@yahoo.fr wrote:


  Re,
 I have tried to remove all the redundant jar files. Then I relaunched it, 
  but it fails straight away on the same issue.
  It's very strange.
 Regards,


    On Wednesday, March 25, 2015 at 11:31 PM, Erick Erickson erickerick...@gmail.com wrote:


  Wait, you didn't put, say, lucene-core-4.10.2.jar into your
 contrib/tamingtext/dependency directory did you? That means you have
 Lucene (and solr and solrj and ...) in your class path twice since
 they're _already_ in your classpath by default since you're running
 Solr.

 All your jars should be in your aggregate classpath exactly once.
  Having them in twice would explain the cast exception. You do not need these
  in the tamingtext/dependency subdirectory, just the things that are
  _not_ in Solr already.

 Best,
 Erick

 On Wed, Mar 25, 2015 at 12:21 PM, Test Test andymish...@yahoo.fr wrote:
 Re,
 Sorry about the image.So, there are all my dependencies jar in listing below 
 :
    - commons-cli-2.0-mahout.jar

    - commons-compress-1.9.jar

    - commons-io-2.4.jar

    - commons-logging-1.2.jar

    - httpclient-4.4.jar

    - httpcore-4.4.jar

    - httpmime-4.4.jar

    - junit-4.10.jar

    - log4j-1.2.17.jar

    - lucene-analyzers-common-4.10.2.jar

    - lucene-benchmark-4.10.2.jar

    - lucene-core-4.10.2.jar

    - mahout-core-0.9.jar

    - noggit-0.5.jar

    - opennlp-maxent-3.0.3.jar

    - opennlp-tools-1.5.3.jar

    - slf4j-api-1.7.9.jar

    - slf4j-simple-1.7.10.jar

    - solr-solrj-4.10.2.jar


  I have put them into a specific directory 
  (contrib/tamingtext/dependency), and my jar containing my class into another 
  directory (contrib/tamingtext/lib). I added these paths in solrconfig.xml:

    - <lib dir="../../../contrib/tamingtext/lib" regex=".*\.jar" />

    - <lib dir="../../../contrib/tamingtext/dependency" regex=".*\.jar" />


 Thanks in advance.
  Regards.


    On Wednesday, March 25, 2015 at 5:12 PM, Erick Erickson erickerick...@gmail.com wrote:


  Images don't come through the mailing list, can't see your image.

 Whether or not all the jars in the directory you're working on are
 consistent is the least of your problems. Are the libs to be found in any
 _other_ place specified on your classpath?

 Best,
 Erick

 On Wed, Mar 25, 2015 at 12:36 AM, Test Test andymish...@yahoo.fr wrote:

 Thanks Eric,

  I'm working on Solr 4.10.2 and all my dependency jars seem to be
  compatible with this version.

  [image: inline image]

  I can't figure out which one causes this issue.

 Thanks
 Regards,




  On Tuesday, March 24, 2015 at 11:45 PM, Erick Erickson erickerick...@gmail.com wrote:


 bq: 13 more Caused by: java.lang.ClassCastException: class
  com.tamingtext.texttamer.solr.

 This usually means you have jar files from different versions of Solr
 in your classpath.

 Best,
 Erick

 On Tue, Mar 24, 2015 at 2:38 PM, Test Test andymish...@yahoo.fr wrote:
  Hi there,
  I'm trying to create my own TokenizerFactory (from the Taming Text
 book). After setting up schema.xml and adding the path in solrconfig.xml, I
 start Solr. I have 

SolrCloud -- Blocking access to administration commands while keeping the solr internal communication

2015-03-26 Thread Oded Sofer
Hello there, 

There are many blogs discussing this issue, but it is hard to tell whether 
anyone has managed to resolve it. 
We have many nodes in the SolrCloud; implementing the iptables restriction 
would fill the iptables rule set with many rules, which will affect performance. 
We are using 4.3.10, on Tomcat 5. 




Re: i'm a newb: questions about schema.xml

2015-03-26 Thread Shawn Heisey
On 3/26/2015 4:57 PM, Mark Bramer wrote:
 I'm a Solr newb.  I've been poking around for several days on my own test 
 instance, and also online at the info available.  But one thing just isn't 
 jiving and I can't put my finger on why.  I've searched many many times but I 
 don't see what I'm looking for, so I'm thinking perhaps I have a fundamental 
 semantic misunderstanding of something somewhere.  Everywhere I read, 
 everyone talks about schema.xml and how important it is.  I fully get what it's 
 for but I don't get where it is, how it's used (by me), how I edit it, and 
 how I create new indexes once I've edited it.

 I've installed, and am successfully running, solr 5.0.0 on Linux.  I've 
 followed the widely recommended-by-all quick start at: 
 http://lucene.apache.org/solr/quickstart.html.  I get through it fine, I post 
 a bunch of stuff, I use the web UI to query for, and see, data I would expect 
 to see.  Should I now have a schema.xml file somewhere that is somehow 
 connected to my new index?  If so, where is it?  Was it present from install 
 or did it get created when I made my first core (bin/solr create -c ati_docs)?

 [root@machine solr-5.0.0]# find -name schema.xml
 ./example/example-DIH/solr/tika/conf/schema.xml
 ./example/example-DIH/solr/rss/conf/schema.xml
 ./example/example-DIH/solr/solr/conf/schema.xml
 ./example/example-DIH/solr/db/conf/schema.xml
 ./example/example-DIH/solr/mail/conf/schema.xml
 ./server/solr/configsets/basic_configs/conf/schema.xml
 ./server/solr/configsets/sample_techproducts_configs/conf/schema.xml
 [root@machine solr-5.0.0]#

 Is it the one in /configsets/basic_configs/conf?  Is that the default one?

 If I want to 'modify' schema.xml to do some different indexing/analyzing, how 
 do I start?  Make a copy of that schema.xml, move it somewhere else and 
 modify it?  If so, how do I create a new index using this schema.xml?

 Or am I running in schemaless mode?  I don't think I am because it appears 
 that I would have to specifically state this as a command line parameter, 
 i.e. bin/solr start -e schemaless

 What fundamentals am I missing?  I'm coming to Solr from Elasticsearch, and 
 I've already recognized some differences.  Is my ES background clouding my 
 grasp of Solr fundamentals?

Hopefully you know what core you are using, so you can go to the admin
UI and find it in the Core Selector dropdown list.  Assuming you can
do that, you will find yourself looking at the Overview tab for that core.

https://cwiki.apache.org/confluence/display/solr/Using+the+Solr+Administration+User+Interface

Once you are looking at the core overview, in the upper right corner of
your browser window is a section called Instance ... which has an
entry that is ALSO called Instance.  Inside the directory indicated by
that field, you should have a conf directory.  The config and schema for
that index are found in that conf directory.

If you're running SolrCloud, then you can forget everything I just said
... the active configs will be found within the zookeeper database, and
you can use the Cloud-Tree tab in the admin UI to find your collections
and see which configName is linked to each one.  You'll want to become
familiar with the zkcli script in server/scripts/cloud-scripts.

https://cwiki.apache.org/confluence/display/solr/Command+Line+Utilities

Whether it is SolrCloud or not, you can always LOOK at your configs
right in the admin UI -- click on the Files tab after you select the
core from the selector.

Thanks,
Shawn



Re: Build index from Oracle, adding fields

2015-03-26 Thread Shawn Heisey
On 3/26/2015 5:19 PM, Julian Perry wrote:
 I have an index with, say, 10 fields.

 I load that index directly from Oracle - data-config.xml using
 JDBC.  I can load 10 million rows very quickly.  This direct
 way of loading from Oracle straight into SOLR is fantastic -
 really efficient and saves writing loads of import/export code
 (e.g. via a CSV file).

 Of those 10 fields - two of them (set to multiValued) come from
 a separate table and there are anything from 1 to 10 rows per
 row from the main table.

 I can use a nested entity to extract the child rows for each of
 the 10m rows in the main table - but then SOLR generates 10m
 separate SQL calls - and the load time goes from a few minutes
 to several days.

 On smaller tables - just a few thousand rows - I can use a
 second nested entity with a JDBC call - but not for very large
 tables.

 Could I load the data in two steps:
 1)  load the main 10m rows
 2)  load into the existing index by adding the data from a
 second SQL call into fields for each existing row (i.e.
 an UPDATE instead of an INSERT).

 I don't know what syntax/option might achieve that.  There
 is incremental loading - but I think that replaces whole rows
 rather than updating individual fields.  Or maybe it does
 do both?

If those child tables do not have a large number of entries, you can
configure caching on the inner entities so that the information doesn't
need to be requested from the database server for every row of the main
query.  If there are a large number of entries, then that may not be
possible due to memory constraints.

https://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor
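
A sketch of what the cached inner entity might look like (untested, and all
table/column names here are invented):

  <entity name="child" processor="CachedSqlEntityProcessor"
          query="SELECT MAIN_ID, TAG FROM CHILD_TABLE"
          cacheKey="MAIN_ID" cacheLookup="main.ID">
    <field column="TAG" name="tags"/>
  </entity>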

If that's not practical, then the only real option you have is to drop
back to one entity, and build a single SELECT statement (using JOIN and
some form of CONCAT) that will gather all the information from all the
tables at the same time, and combine multiple values together into one
SQL result field with some kind of delimiter.  Then you can use the
RegexTransformer's splitBy functionality to turn the concatenated data
back into multiple values for your multi-valued field.  Database servers
tend to be REALLY good at JOIN operations, so the database would be
doing the heavy lifting.

https://wiki.apache.org/solr/DataImportHandler#RegexTransformer
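
As a rough sketch of the single-SELECT approach (untested; Oracle's LISTAGG
needs 11gR2 or later, and the table/column names are invented):

  <entity name="main" transformer="RegexTransformer"
          query="SELECT m.ID, m.TITLE,
                        (SELECT LISTAGG(c.TAG, '|') WITHIN GROUP (ORDER BY c.TAG)
                           FROM CHILD_TABLE c WHERE c.MAIN_ID = m.ID) AS TAGS
                   FROM MAIN_TABLE m">
    <field column="TAGS" splitBy="\|"/>
  </entity>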

Solr does have an equivalent concept to SQL's UPDATE, but there are
enough caveats to using it that it may not be a good option:

https://wiki.apache.org/solr/Atomic_Updates

Thanks,
Shawn



RE: i'm a newb: questions about schema.xml

2015-03-26 Thread Mark Bramer
Hi Shawn,

Definitely helpful to know about the instance and files stuff in Admin.  I'm 
not running cloud, so I looked in the /conf directory but there's no schema.xml:

Here's what's in my core's Files: 
  currency.xml
  elevate.xml
  lang
  params.json
  protwords.txt
  solrconfig.xml
  stopwords.txt
  synonyms.txt

and echoed by ls -l: 

-rw-r--r-- 1 root root  3974 Feb 15 11:38 currency.xml
-rw-r--r-- 1 root root  1348 Feb 15 11:38 elevate.xml
drwxr-xr-x 2 root root  4096 Mar 23 10:46 lang
-rw-r--r-- 1 root root 29733 Mar 23 18:04 managed-schema
-rw-r--r-- 1 root root   308 Feb 15 11:38 params.json
-rw-r--r-- 1 root root   873 Feb 15 11:38 protwords.txt
-rw-r--r-- 1 root root 60591 Feb 15 11:38 solrconfig.xml
-rw-r--r-- 1 root root   781 Feb 15 11:38 stopwords.txt
-rw-r--r-- 1 root root  1119 Feb 15 11:38 synonyms.txt

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Thursday, March 26, 2015 7:28 PM
To: solr-user@lucene.apache.org
Subject: Re: i'm a newb: questions about schema.xml

On 3/26/2015 4:57 PM, Mark Bramer wrote:
 I'm a Solr newb.  I've been poking around for several days on my own test 
 instance, and also online at the info available.  But one thing just isn't 
 jiving and I can't put my finger on why.  I've searched many many times but I 
 don't see what I'm looking for, so I'm thinking perhaps I have a fundamental 
 semantic misunderstanding of something somewhere.  Everywhere I read, 
 everyone talks about schema.xml and how important it is.  I fully get what it's 
 for but I don't get where it is, how it's used (by me), how I edit it, and 
 how I create new indexes once I've edited it.

 I've installed, and am successfully running, solr 5.0.0 on Linux.  I've 
 followed the widely recommended-by-all quick start at: 
 http://lucene.apache.org/solr/quickstart.html.  I get through it fine, I post 
 a bunch of stuff, I use the web UI to query for, and see, data I would expect 
 to see.  Should I now have a schema.xml file somewhere that is somehow 
 connected to my new index?  If so, where is it?  Was it present from install 
 or did it get created when I made my first core (bin/solr create -c ati_docs)?

 [root@machine solr-5.0.0]# find -name schema.xml 
 ./example/example-DIH/solr/tika/conf/schema.xml
 ./example/example-DIH/solr/rss/conf/schema.xml
 ./example/example-DIH/solr/solr/conf/schema.xml
 ./example/example-DIH/solr/db/conf/schema.xml
 ./example/example-DIH/solr/mail/conf/schema.xml
 ./server/solr/configsets/basic_configs/conf/schema.xml
 ./server/solr/configsets/sample_techproducts_configs/conf/schema.xml
 [root@machine solr-5.0.0]#

 Is it the one in /configsets/basic_configs/conf?  Is that the default one?

 If I want to 'modify' schema.xml to do some different indexing/analyzing, how 
 do I start?  Make a copy of that schema.xml, move it somewhere else and 
 modify it?  If so, how do I create a new index using this schema.xml?

 Or am I running in schemaless mode?  I don't think I am because it 
 appears that I would have to specifically state this as a command line 
 parameter, i.e. bin/solr start -e schemaless

 What fundamentals am I missing?  I'm coming to Solr from Elasticsearch, and 
 I've already recognized some differences.  Is my ES background clouding my 
 grasp of Solr fundamentals?

Hopefully you know what core you are using, so you can go to the admin UI and 
find it in the Core Selector dropdown list.  Assuming you can do that, you 
will find yourself looking at the Overview tab for that core.

https://cwiki.apache.org/confluence/display/solr/Using+the+Solr+Administration+User+Interface

Once you are looking at the core overview, in the upper right corner of your 
browser window is a section called Instance ... which has an entry that is 
ALSO called Instance.  Inside the directory indicated by that field, you 
should have a conf directory.  The config and schema for that index are found 
in that conf directory.

If you're running SolrCloud, then you can forget everything I just said ... the 
active configs will be found within the zookeeper database, and you can use the 
Cloud-Tree tab in the admin UI to find your collections and see which 
configName is linked to each one.  You'll want to become familiar with the 
zkcli script in server/scripts/cloud-scripts.

https://cwiki.apache.org/confluence/display/solr/Command+Line+Utilities

Whether it is SolrCloud or not, you can always LOOK at your configs right in 
the admin UI -- click on the Files tab after you select the core from the 
selector.

Thanks,
Shawn




i'm a newb: questions about schema.xml

2015-03-26 Thread Mark Bramer
Hello,

I'm a Solr newb.  I've been poking around for several days on my own test 
instance, and also reading the info available online.  But one thing just isn't 
jiving and I can't put my finger on why.  I've searched many, many times but I 
don't see what I'm looking for, so I'm thinking perhaps I have a fundamental 
semantic misunderstanding of something somewhere.  Everywhere I read, everyone 
talks about schema.xml and how important it is.  I fully get what it's for, but I 
don't get where it is, how it's used (by me), how I edit it, and how I create 
new indexes once I've edited it.

I've installed, and am successfully running, solr 5.0.0 on Linux.  I've 
followed the widely recommended-by-all quick start at: 
http://lucene.apache.org/solr/quickstart.html.  I get through it fine, I post a 
bunch of stuff, I use the web UI to query for, and see, data I would expect to 
see.  Should I now have a schema.xml file somewhere that is somehow connected 
to my new index?  If so, where is it?  Was it present from install or did it 
get created when I made my first core (bin/solr create -c ati_docs)?

[root@machine solr-5.0.0]# find -name schema.xml
./example/example-DIH/solr/tika/conf/schema.xml
./example/example-DIH/solr/rss/conf/schema.xml
./example/example-DIH/solr/solr/conf/schema.xml
./example/example-DIH/solr/db/conf/schema.xml
./example/example-DIH/solr/mail/conf/schema.xml
./server/solr/configsets/basic_configs/conf/schema.xml
./server/solr/configsets/sample_techproducts_configs/conf/schema.xml
[root@machine solr-5.0.0]#

Is it the one in /configsets/basic_configs/conf?  Is that the default one?

If I want to 'modify' schema.xml to do some different indexing/analyzing, how 
do I start?  Make a copy of that schema.xml, move it somewhere else and modify 
it?  If so, how do I create a new index using this schema.xml?

Or am I running in schemaless mode?  I don't think I am because it appears 
that I would have to specifically state this as a command line parameter, i.e. 
bin/solr start -e schemaless

What fundamentals am I missing?  I'm coming to Solr from Elasticsearch, and 
I've already recognized some differences.  Is my ES background clouding my 
grasp of Solr fundamentals?

Thanks for any help.

Mark Bramer | Technical Team Lead, DC Services
Esri | 8615 Westwood Center Dr | Vienna, VA 22182 | USA
T 703 506 9515 x8017 | mbra...@esri.com | esri.com



Re: SolrCloud -- Blocking access to administration commands while keeping the solr internal communication

2015-03-26 Thread Shawn Heisey
On 3/26/2015 3:38 PM, Oded Sofer wrote:
 There are many blogs discussing this issue, but it is hard to tell whether 
  anyone has managed to resolve it. 
  We have many nodes in the SolrCloud; implementing the iptables restriction 
  would fill the iptables rule set with many rules, which will affect performance. 
 We are using 4.3.10, on Tomcat 5. 

Because Solr is a webapp, it relies on software outside itself to
provide network and protocol (HTTP) communication.  In your case, that
software is Tomcat.  For others, it is Jetty, JBoss, Weblogic, or one of
several other possibilities.  This means that there are many things that
are impossible (or extremely difficult) for Solr to handle within its
own code.  Security is one of them.

This is one of the major reasons that Solr will become a true
application at some point in the future.  When Solr can control the
network and the HTTP server, we will be able to restrict access to the
admin UI separately from access to the query interface, the update
interface, replication, etc.

As far as your iptables rule list ... are your Solr servers contained
within discrete IP address blocks that could be added to the rule list
as subnets instead of individual addresses?  Ideally you will handle
complicated access controls on edge firewalls or as ACLs on internal
routing devices, not at the host level.
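
For example, subnet-based rules would look something like this (a sketch
only; the subnet and the Tomcat port are made up, adjust to your setup):

iptables -A INPUT -p tcp --dport 8080 -s 10.1.2.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 8080 -j DROP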

Thanks,
Shawn



Build index from Oracle, adding fields

2015-03-26 Thread Julian Perry

Hi

I have looked and cannot see any clear answers to this on
the Interwebs.


I have an index with, say, 10 fields.

I load that index directly from Oracle - data-config.xml using
JDBC.  I can load 10 million rows very quickly.  This direct
way of loading from Oracle straight into SOLR is fantastic -
really efficient and saves writing loads of import/export code
(e.g. via a CSV file).

Of those 10 fields - two of them (set to multiValued) come from
a separate table and there are anything from 1 to 10 rows per
row from the main table.

I can use a nested entity to extract the child rows for each of
the 10m rows in the main table - but then SOLR generates 10m
separate SQL calls - and the load time goes from a few minutes
to several days.

On smaller tables - just a few thousand rows - I can use a
second nested entity with a JDBC call - but not for very large
tables.

Could I load the data in two steps:
1)  load the main 10m rows
2)  load into the existing index by adding the data from a
second SQL call into fields for each existing row (i.e.
an UPDATE instead of an INSERT).

I don't know what syntax/option might achieve that.  There
is incremental loading - but I think that replaces whole rows
rather than updating individual fields.  Or maybe it does
do both?

Any other techniques that would be fast/efficient?

Help!

--
Cheers
Jules.


Re: Solr Monitoring - Stored Stats?

2015-03-26 Thread Otis Gospodnetic
Matt,

SPM will give you all that out of the box with alerts, anomaly detection etc. 
See http://sematext.com/spm

Otis

 

 On Mar 25, 2015, at 11:26, Matt Kuiper matt.kui...@issinc.com wrote:
 
 Hello,
 
 I am familiar with the JMX points that Solr exposes to allow for monitoring 
 of statistics like QPS, numdocs, Average Query Time...
 
 I am wondering if there is a way to configure Solr to automatically store the 
 value of these stats over time (for a given time interval), and then allow a 
 user to query a stat over a time range.  So for the QPS stat,  the query 
 might return a set that includes the QPS value for each hour in the time 
 range specified.
 
 Thanks,
 Matt
 
 


Re: Data indexing is going too slow on single shard Why?

2015-03-26 Thread Nitin Solanki
Thanks a lot, Shawn.
As you said: **For 204GB of data per server, I recommend at least 128GB
of total RAM, preferably 256GB**. So if I have 204GB of data on a single
server/shard, I should go with 256GB, so that searching will be fast and
never slow down. Is that right?

On Wed, Mar 25, 2015 at 9:50 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 3/25/2015 8:42 AM, Nitin Solanki wrote:
  Server configuration:
  8 CPUs.
  32 GB RAM
  O.S. - Linux

 snip

  are running.  Java heap set to 4096 MB in Solr.  While indexing,

 snip

  *Currently*, I have 1 shard  with 2 replicas using SOLR CLOUD.
  Data Size:
  102Gsolr/node1/solr/wikingram_shard1_replica2
  102Gsolr/node2/solr/wikingram_shard1_replica1

 If both of those are on the same machine, I'm guessing that you're
 running two Solr instances on that machine, so there's 8GB of RAM used
 for Java.  That means you have about 24 GB of RAM left for caching ...
 and 200GB of index data to cache.

 24GB is not enough to cache 200GB of index.  If there is only one Solr
 instance (leaving 28GB for caching) with 102GB of data on the machine,
 it still might not be enough.  See that SolrPerformanceProblems wiki
 page I linked in my earlier email.

 For 102GB of data per server, I recommend at least 64GB of total RAM,
 preferably 128GB.

 For 204GB of data per server, I recommend at least 128GB of total RAM,
 preferably 256GB.

 Thanks,
 Shawn




Re: Data indexing is going too slow on single shard Why?

2015-03-26 Thread Shawn Heisey
On 3/26/2015 12:03 AM, Nitin Solanki wrote:
 Thanks a lot, Shawn.
 As you said: **For 204GB of data per server, I recommend at least 128GB
 of total RAM, preferably 256GB**. So if I have 204GB of data on a single
 server/shard, I should go with 256GB, so that searching will be fast and
 never slow down. Is that right?

Obviously I cannot guarantee it, but I think it's extremely likely that
with that much memory, performance will be very good.

One other possibility, which is discussed on that wiki page I linked, is
that your java heap is being almost exhausted and large amounts of time
are spent in garbage collection.  If you increase the heap from 4GB to
5GB and see performance get better, then that would be confirmed.  There
would be less memory available for caching, but constant garbage
collection would be a much greater problem than the disk cache being too
small.
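
If you happen to be using the 5.x bin/solr script, the heap can be raised
like this (a sketch; older java -jar start.jar setups would pass -Xmx5g
instead):

bin/solr start -m 5g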

Thanks,
Shawn



Re: Applying Tokenizers and Filters to CopyFields

2015-03-26 Thread Martin Wunderlich
Thanks so much, Erick and Michael, for all the additional explanation. 
The crucial information in the end turned out to be the bit about the default 
search field ("df"). In solrconfig.xml this parameter was pointing to the field 
with the original text, which is why the expanded queries didn't work. When I 
set the df parameter to one of the fields with the expanded text, the search 
works fine. I have also removed the copyField declarations. 
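
In case it helps anyone searching the archives later, the change in
solrconfig.xml amounted to something like this (the field name differs in my
actual setup):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="df">text_expanded</str>
  </lst>
</requestHandler>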

It’s all working as expected now. Thanks again for the help. 

Cheers, 

Martin



 On 25.03.2015 at 23:43, Erick Erickson erickerick...@gmail.com wrote:
 
 Martin:
 Perhaps this would help
 
 indexed=true, stored=true
 field can be searched. The raw input (not analyzed in any way) can be
 shown to the user in the results list.
 
 indexed=true, stored=false
 field can be searched. However, the field can't be returned in the
 results list with the document.
 
 indexed=false, stored=true
 The field cannot be searched, but the contents can be returned in the
 results list with the document. There are some use-cases where this is
 desirable behavior.
 
 indexed=false, stored=false
 The entire field is thrown out, it's just as if you didn't send the
 field to be indexed at all.
 
 And one other thing, the copyField gets the _raw_ data not the
 analyzed data. Let's say you have two fields, src and dst.
 copying from src to dst in schema.xml is identical to
 <add>
   <doc>
     <field name="src">original text</field>
     <field name="dst">original text</field>
   </doc>
 </add>
 
 that is, copyfield directives are not chained.
 
 Also, watch out for your query syntax. Michael's comments are spot-on,
 I'd just add this:
 
 http://localhost:8983/solr/windex/select?q=Sprache&fq=original&wt=json&indent=true
 
 is kind of odd. Let's assume you mean qf rather than fq. That
 _only_ matters if your query parser is edismax, it'll be ignored in
 this case I believe.
 
 You'd want something like
 q=src:Sprache
 or
 q=dst:Sprache
 or even
 http://localhost:8983/solr/windex/select?q=Sprache&df=src
 http://localhost:8983/solr/windex/select?q=Sprache&df=dst
 
 where df is default field and the search is applied against that
 field in the absence of a field qualification like my first two
 examples.
 
 Best,
 Erick
 
 On Wed, Mar 25, 2015 at 2:52 PM, Michael Della Bitta
 michael.della.bi...@appinions.com wrote:
 I agree the terminology is possibly a little confusing.
 
 Stored refers to values that are stored verbatim. You can retrieve them
 verbatim. Analysis does not affect stored values.
 Indexed values are tokenized/transformed and stored inverted. You can't
 recover the literal analyzed version (at least, not easily).
 
 If what you really want is to store and retrieve case folded versions of
  your data as well as the original, you need to use something like an
  UpdateRequestProcessor, which I personally am less familiar with.
 
 
 On Wed, Mar 25, 2015 at 5:28 PM, Martin Wunderlich martin...@gmx.net
 wrote:
 
 So, the pre-processing steps are applied under analyzer type=„index“.
 And this point is not quite clear to me: Assuming that I have a simple
 case-folding step applied to the target of the copyField: How or where are
 the lower-case tokens stored, if the text isn’t added to the index? How is
 the query supposed to retrieve the lower-case version?
 (sorry, if this sounds like a naive question, but I have a feeling that I
 am missing something really basic here).
 
 
 
 Michael Della Bitta
 
 Senior Software Engineer
 
 o: +1 646 532 3062
 
 appinions inc.
 
 “The Science of Influence Marketing”
 
 18 East 41st Street
 
 New York, NY 10017
 
 t: @appinions https://twitter.com/Appinions | g+:
 plus.google.com/appinions
 https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
 w: appinions.com http://www.appinions.com/



RE: Solr Monitoring - Stored Stats?

2015-03-26 Thread Matt Kuiper
Erick, Shawn,

Thanks for your responses.  I figured this was the case, just wanted to check 
to be sure.

I have used Zabbix to configure JMX points to monitor over time, but it was a 
bit of work to get configured.  We are looking to create a simple dashboard of 
a few stats over time.  Looks like the easiest approach will be to make an app 
that calls for these stats at a regular interval and then indexes the results 
into Solr, and then we will be able to query over the desired time frames...
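
A rough sketch of the polling side (untested; the URL, core name, and
interval are made up, and the JSON parsing/indexing is left as a TODO):

import java.io.InputStream;
import java.net.URL;
import java.util.Scanner;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class StatsPoller {
  public static void main(String[] args) {
    Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(new Runnable() {
      public void run() {
        try (InputStream in = new URL(
            "http://localhost:8983/solr/mycore/admin/mbeans?stats=true&wt=json")
            .openStream()) {
          // slurp the whole response; a real app would parse out QPS etc.
          String json = new Scanner(in, "UTF-8").useDelimiter("\\A").next();
          // TODO: extract the stats of interest and index them (with a
          // timestamp field) into a separate stats core via SolrJ
          System.out.println("fetched " + json.length() + " bytes of stats");
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    }, 0, 1, TimeUnit.MINUTES);
  }
}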

Thanks,
Matt

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, March 25, 2015 10:30 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Monitoring - Stored Stats?

Matt:

Not really. There's a bunch of third-party log analysis tools that give much of 
this information (not everything exposed by JMX of course is in the log files 
though).

Not quite sure whether things like Nagios, Zabbix and the like have this kind 
of stuff built in; it seems like a natural extension of those kinds of tools, 
though...

Not much help here...
Erick

On Wed, Mar 25, 2015 at 8:26 AM, Matt Kuiper matt.kui...@issinc.com wrote:
 Hello,

 I am familiar with the JMX points that Solr exposes to allow for monitoring 
 of statistics like QPS, numdocs, Average Query Time...

 I am wondering if there is a way to configure Solr to automatically store the 
 value of these stats over time (for a given time interval), and then allow a 
 user to query a stat over a time range.  So for the QPS stat,  the query 
 might return a set that includes the QPS value for each hour in the time 
 range specified.

 Thanks,
 Matt




Re: Applying Tokenizers and Filters to CopyFields

2015-03-26 Thread Michael Della Bitta
Glad you are sorted out!

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions https://twitter.com/Appinions | g+:
plus.google.com/appinions
https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
w: appinions.com http://www.appinions.com/

On Thu, Mar 26, 2015 at 10:09 AM, Martin Wunderlich martin...@gmx.net
wrote:

 Thanks so much, Erick and Michael, for all the additional explanation.
  The crucial information in the end turned out to be the bit about the
  default search field ("df"). In solrconfig.xml this parameter was pointing
  to the field with the original text, which is why the expanded queries
  didn't work. When I set the df parameter to one of the fields with the
  expanded text, the search works fine. I have also removed the copyField
  declarations.

 It’s all working as expected now. Thanks again for the help.

 Cheers,

 Martin




  On 25.03.2015 at 23:43, Erick Erickson erickerick...@gmail.com wrote:
 
  Martin:
  Perhaps this would help
 
  indexed=true, stored=true
  field can be searched. The raw input (not analyzed in any way) can be
  shown to the user in the results list.
 
  indexed=true, stored=false
  field can be searched. However, the field can't be returned in the
  results list with the document.
 
  indexed=false, stored=true
  The field cannot be searched, but the contents can be returned in the
  results list with the document. There are some use-cases where this is
  desirable behavior.
 
  indexed=false, stored=false
  The entire field is thrown out, it's just as if you didn't send the
  field to be indexed at all.
 
  And one other thing, the copyField gets the _raw_ data not the
  analyzed data. Let's say you have two fields, src and dst.
  copying from src to dst in schema.xml is identical to
  <add>
    <doc>
      <field name="src">original text</field>
      <field name="dst">original text</field>
    </doc>
  </add>
 
  that is, copyfield directives are not chained.
 
  Also, watch out for your query syntax. Michael's comments are spot-on,
  I'd just add this:
 
 
http://localhost:8983/solr/windex/select?q=Sprache&fq=original&wt=json&indent=true
 
  is kind of odd. Let's assume you mean qf rather than fq. That
  _only_ matters if your query parser is edismax, it'll be ignored in
  this case I believe.
 
  You'd want something like
  q=src:Sprache
  or
  q=dst:Sprache
  or even
  http://localhost:8983/solr/windex/select?q=Sprache&df=src
  http://localhost:8983/solr/windex/select?q=Sprache&df=dst
 
  where df is default field and the search is applied against that
  field in the absence of a field qualification like my first two
  examples.
 
  Best,
  Erick
 
  On Wed, Mar 25, 2015 at 2:52 PM, Michael Della Bitta
  michael.della.bi...@appinions.com wrote:
  I agree the terminology is possibly a little confusing.
 
  Stored refers to values that are stored verbatim. You can retrieve them
  verbatim. Analysis does not affect stored values.
  Indexed values are tokenized/transformed and stored inverted. You can't
  recover the literal analyzed version (at least, not easily).
 
  If what you really want is to store and retrieve case folded versions of
  your data as well as the original, you need to use something like an
  UpdateRequestProcessor, which I personally am less familiar with.
 
 
  On Wed, Mar 25, 2015 at 5:28 PM, Martin Wunderlich martin...@gmx.net
  wrote:
 
  So, the pre-processing steps are applied under analyzer type=„index“.
  And this point is not quite clear to me: Assuming that I have a simple
  case-folding step applied to the target of the copyField: How or where
 are
  the lower-case tokens stored, if the text isn’t added to the index?
 How is
  the query supposed to retrieve the lower-case version?
  (sorry, if this sounds like a naive question, but I have a feeling
 that I
  am missing something really basic here).
 
 
 
  Michael Della Bitta
 
  Senior Software Engineer
 
  o: +1 646 532 3062
 
  appinions inc.
 
  “The Science of Influence Marketing”
 
  18 East 41st Street
 
  New York, NY 10017
 
  t: @appinions https://twitter.com/Appinions | g+:
  plus.google.com/appinions
  
 https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts
 
  w: appinions.com http://www.appinions.com/




Replacing a group of documents (Delete/Insert) without a query on the index ever showing an empty list (Docs)

2015-03-26 Thread Russell Taylor
Hi,
I have an index which is made up of groups of documents, each group is defined 
by a field called keyField (keyField:A).
I need to delete all the keyField:A documents and replace them with a brand new 
set without the index ever returning
zero documents on a query.

At the moment I deleteByQuery keyField:A and then insert a SolrInputDocument 
list via SolrJ into my index. There is a small window during which somebody 
doing q=keyField:A can be returned an empty list.
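
For reference, the current flow is roughly this (SolrJ sketch; the URL and
core name are made up):

HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");
solr.deleteByQuery("keyField:A");
solr.add(newDocs);   // List<SolrInputDocument> holding the replacement group
solr.commit();       // the empty window shows up if a (soft) commit fires
                     // between the delete and the adds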

FYI: The keyField group might be just 100 documents or up to 10 million.

Any help much appreciated.


Thanks

Russ.


Index example:

docs: [
  { keyField: A, ... lastField: xyz },
  { keyField: A, ... lastField: xyz },
  { keyField: B, ... lastField: xyz },
  { keyField: A, ... lastField: xyz },
  { keyField: B, ... lastField: xyz }
]







***