Solr performance improved under heavy load
I run a small SolrCloud cluster (4.5) of 3 nodes, with 3 collections of 3 shards each. Total index size per node is about 20GB, with about 70M documents. Under regular traffic (27-50 rpm) the performance is OK and response time ranges from 100 to 500ms. But when I start loading (overwriting) the 70M documents again via curl + CSV, performance drastically improves: I see 6ms response times (screenshot attached). So I am curious about this; intuitively Solr should perform better under low traffic and slow down as traffic goes up. What is the reason for this? More efficient memory management with more data? -- Thanks, -Utkarsh
Denormalize or use multivalued field for nested data?
I have to modify a schema so I can attach nested per-store pricing information to a product. For example: 10010137332: { title: iPad 64gb, description: iPad 64gb with retina, pricing: { merchantid64354: { locationid643: USD|600, locationid6436: USD|600 }, merchantid343: { locationid1345: USD|600, locationid4353: USD|600 } } } The suggestion all over the internet is to denormalize it. In my case, I would end up with total number of fields = total locations with a price, which is about 100k. I don't think having 100k fields for 60M products is a good idea. Are there any better ways of handling this? I am trying to figure out multiValued fields, but as far as I understand them, they can only be used as a flag and cannot be used to get a value associated with a key. Based on this answer, Solr 4.5+ supports nested documents: http://stackoverflow.com/a/5585891/231917 but I am currently on 4.4. -- Thanks, -Utkarsh
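For what it's worth, one middle ground between full denormalization (100k fields) and nested documents is to encode each merchant/location price as a delimited string inside a single multiValued field and decode it client-side. A minimal sketch; the field layout, names, and delimiter are assumptions, and note that this only supports key lookup, not server-side range queries or sorting on price:

```python
# Sketch: flatten nested per-store pricing into values for one multiValued
# string field instead of one dynamic field per location. Names and the "|"
# delimiter are hypothetical; each value encodes merchant, location,
# currency, and price.

def encode_pricing(pricing):
    """Flatten {merchant: {location: (currency, price)}} into strings."""
    values = []
    for merchant_id, locations in pricing.items():
        for location_id, (currency, price) in locations.items():
            values.append("%s|%s|%s|%s" % (merchant_id, location_id, currency, price))
    return values

def decode_price(values, merchant_id, location_id):
    """Look up the price for one merchant/location from the stored values."""
    prefix = "%s|%s|" % (merchant_id, location_id)
    for v in values:
        if v.startswith(prefix):
            currency, price = v[len(prefix):].split("|")
            return currency, float(price)
    return None

pricing = {"merchantid64354": {"locationid643": ("USD", 600),
                               "locationid6436": ("USD", 600)}}
doc_field = encode_pricing(pricing)   # would go into a multiValued string field
print(decode_price(doc_field, "merchantid64354", "locationid643"))  # ('USD', 600.0)
```

The trade-off: the index stays small and the schema fixed, but any filtering or sorting on the encoded price has to happen in the application after retrieval.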
Investigating performance issues in solr cloud
I see a sudden drop in throughput once every 3-4 days. The downtime lasts about 2-6 minutes and things stabilize after that. But I am not sure what is causing the problem. I have 3 shards with 20GB of data on each shard. Solr dashboard: http://i.imgur.com/6RWT2Dj.png New Relic graphs for a window of about 4 hours around the downtime: http://i.imgur.com/9vhKiB2.png The JVM memory graph looks normal: http://i.imgur.com/pAycgdC.png I thought it was GC pauses, but then it should show up in the New Relic data. How can I go about investigating this problem? I am running Solr 4.4.0 and don't see a strong reason to upgrade yet. -- Thanks, -Utkarsh
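One cheap way to investigate pauses like this, independent of the APM, is to turn on the JVM's own GC logging. A sketch of the extra flags for the HotSpot Java 6/7 JVMs discussed in this thread; the heap size, log path, and abbreviated start command are placeholders:

```shell
# Sketch: add HotSpot GC logging to the Solr start command so multi-second
# stop-the-world pauses show up with timestamps, even if New Relic misses
# them. -Xmx and the log path are placeholders; merge these flags into the
# full start command from this thread.
java -Xmx9g \
     -verbose:gc \
     -XX:+PrintGCDetails \
     -XX:+PrintGCDateStamps \
     -Xloggc:/var/log/solr/gc.log \
     -jar start.jar
# Then look for long pauses around the outage window, e.g.:
#   grep "Full GC" /var/log/solr/gc.log
```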
Re: Investigating performance issues in solr cloud
Lots of questions indeed :) 1. Total virtual machines: 3 2. Replication factor: 0 (don't have any replicas yet) 3. Each machine has 1 shard with 20GB of data, so the data for a collection is spread across 3 machines, totaling 60GB 4. Start solr: java -Xmx1m -javaagent:newrelic/newrelic.jar -Dsolr.clustering.enabled=true -Dsolr.solr.home=multicore -Djetty.class.path=lib/ext/* -Dbootstrap_conf=true -DnumShards=3 -DzkHost=localhost:2181 -jar start.jar 5. Yes, all machines have 24GB RAM and a 9GB heap. A separate ZK process is running on these machines. 6. top screenshot: http://i.imgur.com/g6w9Bim.png Thanks! On Tue, Apr 8, 2014 at 4:48 PM, Shawn Heisey s...@elyograg.org wrote: On 4/8/2014 5:30 PM, Utkarsh Sengar wrote: I see a sudden drop in throughput once every 3-4 days. The downtime lasts about 2-6 minutes and things stabilize after that. But I am not sure what is causing the problem. I have 3 shards with 20GB of data on each shard. Solr dashboard: http://i.imgur.com/6RWT2Dj.png New Relic graphs for a window of about 4 hours around the downtime: http://i.imgur.com/9vhKiB2.png The JVM memory graph looks normal: http://i.imgur.com/pAycgdC.png I thought it was GC pauses, but then it should show up in the New Relic data. How can I go about investigating this problem? I am running Solr 4.4.0 and don't see a strong reason to upgrade yet. Lots of questions: How many total machines? What is your replicationFactor? Does each machine have one shard replica and therefore 20GB of total index data, or if you add up all the index directories for the cores on each machine, is there more than 20GB of data? What options are you passing to your JVM when you start the servlet container that runs Solr? The dashboard says that this machine has 24GB of RAM and a 9GB heap. Is this the case for all machines? Is there any software other than Solr on the machine? If it's a linux/unix machine, can you run top, press shift-M to sort by memory, and grab a screenshot? 
If it's a Windows machine, a similar list should be available in the task manager, but it must include all processes for all users on the whole machine, and it would be best if it showed virtual memory as well as private. Thanks, Shawn -- Thanks, -Utkarsh
Re: Investigating performance issues in solr cloud
1. I am using the Oracle JVM: user@host:~$ java -version java version "1.6.0_45" Java(TM) SE Runtime Environment (build 1.6.0_45-b06) Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode) 2. I will try out jHiccup and your GC settings. 3. Yes, I am running the ZK instances in an ensemble. I didn't know I need to pass all the ZK instances to a single Solr node; I will try it out right now. Does this mean that in a large cluster, say 50 Solr nodes and 10 ZK nodes, I would need to pass all 10 ZK nodes to -DzkHost of all 50 Solr processes? What is the reasoning behind this? Thanks, -Utkarsh On Tue, Apr 8, 2014 at 5:37 PM, Shawn Heisey s...@elyograg.org wrote: On 4/8/2014 6:00 PM, Utkarsh Sengar wrote: Lots of questions indeed :) 1. Total virtual machines: 3 2. Replication factor: 0 (don't have any replicas yet) 3. Each machine has 1 shard with 20GB of data, so the data for a collection is spread across 3 machines, totaling 60GB 4. Start solr: java -Xmx1m -javaagent:newrelic/newrelic.jar -Dsolr.clustering.enabled=true -Dsolr.solr.home=multicore -Djetty.class.path=lib/ext/* -Dbootstrap_conf=true -DnumShards=3 -DzkHost=localhost:2181 -jar start.jar 5. Yes, all machines have 24GB RAM and a 9GB heap. A separate ZK process is running on these machines. 6. top screenshot: http://i.imgur.com/g6w9Bim.png A followup question: What vendor and version of JVM are you running? Excellent choices include very recent Java 6 releases from Oracle, Oracle Java 7u25, and whatever OpenJDK version corresponds to Oracle 7u25. Good choices include most versions of Oracle Java 7, Oracle Java 6, and OpenJDK 7. The latest versions of Oracle Java 7 (from 7u40 to 7u51) have known bugs that affect Solr. OpenJDK 6 and commercial Java versions from non-Oracle vendors like IBM are very bad choices, because they have known serious bugs. I don't know much about the Zing JVM, but it is probably a good choice. If you are running Zing, then what I'm saying below about GC pauses will not apply. 
Solr 4.8 will require Java 7, so if you plan to upgrade that far, be sure you're not using Java 6 at all. One possible problem that I always investigate first is whether or not there's enough RAM to cache the index effectively. The 14GB of RAM in your disk cache is not a perfect setup for a 20GB index, but it should be plenty. The fact that you still have 4GB of RAM free on your top screenshot is further evidence that you do have plenty of disk cache. No need to pursue that any further. Garbage collection pauses are however a likely problem here. I have some personal experience with this problem. Because you're using the default collector and have 7GB heap allocated, I can almost guarantee that this is a problem, even if New Relic isn't showing it. A program called jHiccup *will* show the problem. http://www.azulsystems.com/jHiccup These are my GC settings. They work very well and are not specific to a certain heap size, although I am sure that the config can be improved: http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning Regarding zookeeper: Are you running all three of your ZK instances in a redundant ensemble, where the config on each of them knows about all of them? You should definitely be doing this. If you are, then your zkHost parameter for Solr needs to reflect that: -DzkHost=host1:2181,host2:2181,host3:2181 Using only localhost:2181 could cause problems, and they could look like the problems you are seeing. Thanks, Shawn -- Thanks, -Utkarsh
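Shawn's zkHost advice, applied to the start command from earlier in the thread, would look roughly like this. The ZK hostnames are placeholders; the remaining flags are taken from the thread:

```shell
# Sketch: same start command as before, but with the full ZooKeeper ensemble
# in -DzkHost. Every Solr node gets the same complete list, so it can fail
# over to the surviving ZK nodes if one goes down; ZK client load is light,
# so even a large Solr cluster can share one ensemble string.
java -javaagent:newrelic/newrelic.jar \
     -Dsolr.clustering.enabled=true \
     -Dsolr.solr.home=multicore \
     -Djetty.class.path=lib/ext/* \
     -Dbootstrap_conf=true \
     -DnumShards=3 \
     -DzkHost=zk1:2181,zk2:2181,zk3:2181 \
     -jar start.jar
```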
Re: implement relevency
Hi Rashmi, Relevancy needs some kind of training data, which can lead to a chicken-and-egg problem. If you don't have that training set, then you need to come up with it or train manually (provide some seed). Our existing search had two years' worth of clickstream data, i.e. we know that when someone searched for ipod, they clicked on a UPC which was an iPod 4th gen or an iPod 5th gen 32GB, etc. So we used that data to build an internal lookup table of millions of queries, which looks something like this: ipod 32gb - music^1000, apple^1000, 32gb^991, 8gb^800 We wrote an algorithm which computes the keyword relevancy score, which is used as the boost value. Now, when a query like ipod 32gb comes in, we look it up in this table, get the boost values, and query Solr with them. We are happy with the results. Our use case was product search (title+description) over about 60M documents; I am not sure how this approach will work with a different use case. Thanks, -Utkarsh On Tue, Jan 28, 2014 at 9:22 AM, tamanjit.bin...@yahoo.co.in tamanjit.bin...@yahoo.co.in wrote: You may also want to look here http://wiki.apache.org/solr/SolrRelevancyFAQ -- View this message in context: http://lucene.472066.n3.nabble.com/implement-relevency-tp4113964p4113983.html Sent from the Solr - User mailing list archive at Nabble.com. -- Thanks, -Utkarsh
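The lookup-table approach described above can be sketched roughly like this. The table contents, field name, and fallback behavior are illustrative; the real system computes the per-keyword scores from clickstream data:

```python
# Sketch of lookup-table boosting: map a user query to per-keyword boosts
# mined offline, then build a boosted Solr q parameter. The table and the
# "title" field are hypothetical stand-ins.

boost_table = {
    "ipod 32gb": {"music": 1000, "apple": 1000, "32gb": 991, "8gb": 800},
}

def boosted_query(user_query, field="title"):
    boosts = boost_table.get(user_query)
    if not boosts:
        # No training data for this query: fall back to a plain field query.
        return "%s:(%s)" % (field, user_query)
    clauses = ["%s:%s^%d" % (field, kw, b) for kw, b in sorted(boosts.items())]
    return " ".join(clauses)

print(boosted_query("ipod 32gb"))
# title:32gb^991 title:8gb^800 title:apple^1000 title:music^1000
```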
Re: Complex nested structure in solr
Bumping this one, with an update: After thinking about this, I think getting rid of lat/lon will simplify things a bit. The new query pattern: Input: keyword=ipod, merchantId=922, locationId=81,82 Output: list of UPCs for ipod which exist inside stores 81 and 82, which are owned by merchant 922. Also, based on some previous answers, flattening this using dynamic fields will create a lot of fields in my case (one field for every merchantId), and then I can use a multiValued field to store locationIds per merchant. But is there a cleaner way of implementing this? Example: upc,merchantid_922,merchantid_800 892828282,[81,82], 922932932,,[22,23] http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201208.mbox/%3CCAEFAe-Hew1CKk=EyqACFUTKqGHExXZLSHtyrgym09aYQVJf=t...@mail.gmail.com%3E Thanks, -Utkarsh On Fri, Jan 24, 2014 at 12:05 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Hi guys, I have to load extra metadata into an existing collection. This is what I am looking for: For a UPC: store availability by merchantId per location (which has lat/lon). My query pattern will be: given a keyword, find all available products for a merchantId around the given lat/lon. Example: Input: keyword=ipod, merchantId=922, lat/lon=28.222,82.333 Output: list of UPCs which match the criteria. So how should I go about doing it? Any suggestions? -- Thanks, -Utkarsh -- Thanks, -Utkarsh
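A sketch of the flattening described above. The merchantid_<id>_ss field-name pattern is an assumption; it just needs to match a multiValued dynamic-field rule (such as *_ss) in the schema:

```python
# Sketch: flatten merchant/location availability into one dynamic multiValued
# field per merchant, as discussed in this thread. Field names and IDs are
# illustrative only.

def to_solr_doc(upc, availability):
    """availability: {merchant_id: [location_id, ...]} -> flat Solr doc."""
    doc = {"upc": upc}
    for merchant_id, locations in availability.items():
        doc["merchantid_%s_ss" % merchant_id] = [str(l) for l in locations]
    return doc

doc = to_solr_doc("892828282", {922: [81, 82]})
print(doc)
# Query pattern would then be something like:
#   q=ipod&fq=merchantid_922_ss:(81 OR 82)
```

The downside stays the same as in the thread: the number of distinct field names grows with the number of merchants, though not with the (much larger) number of locations.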
Complex nested structure in solr
Hi guys, I have to load extra meta data to an existing collection. This is what I am looking for: For a UPC: Store availability by merchantId per location (which has lat/lon) My query pattern will be: Given a keyword, find all available products for a merchantId around the given lat/lon. Example: Input: keyword=ipod, merchantId=922,lat/lon=28.222,82.333 Output: List of UPCs which match the criteria So how should I go about doing it? Any suggestions? -- Thanks, -Utkarsh
shard merged into another shard as replica
I am not sure what happened: I updated the merchant collection and then restarted all the Solr machines. This is what I see right now: http://i.imgur.com/4bYuhaq.png The merchant collection looks fine, but the deals and prodinfo collections should have a total of 3 shards each. Somehow shard1 has become a replica of shard2. This is running in production, so how can I fix it without dumping the whole ZK data? -- Thanks, -Utkarsh
Re: shard merged into another shard as replica
solr 4.4.0 On Wed, Jan 22, 2014 at 3:12 PM, Mark Miller markrmil...@gmail.com wrote: What version of Solr are you running? - Mark On Jan 22, 2014, 5:42:30 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: I am not sure what happened: I updated the merchant collection and then restarted all the Solr machines. This is what I see right now: http://i.imgur.com/4bYuhaq.png The merchant collection looks fine, but the deals and prodinfo collections should have a total of 3 shards each. Somehow shard1 has become a replica of shard2. This is running in production, so how can I fix it without dumping the whole ZK data? -- Thanks, -Utkarsh -- Thanks, -Utkarsh
Re: shard merged into another shard as replica
Thanks Mark. I tried updating clusterstate manually and things went haywire :). So to fix it, I had to take 30s-1min of downtime, where I stopped Solr and ZK, deleted the /zookeeper_data/version-2 directory, and restarted everything again. I have automated these commands via Fabric, so I was able to recover from the downtime easily. Thanks, -Utkarsh On Wed, Jan 22, 2014 at 3:18 PM, Mark Miller markrmil...@gmail.com wrote: Hopefully an issue that has been fixed then. We should look into that. You should be able to fix it by directly modifying the clusterstate.json in ZooKeeper. Remember to back it up first! There are a variety of tools you can use to work with ZooKeeper - I like the eclipse plug-in that you can google for. Many, many SolrCloud bug fixes (we are about to release 4.6.1) since 4.4, so you might consider an upgrade if possible at some point soon. - Mark On Jan 22, 2014, 6:14:10 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: solr 4.4.0 On Wed, Jan 22, 2014 at 3:12 PM, Mark Miller markrmil...@gmail.com wrote: What version of Solr are you running? - Mark On Jan 22, 2014, 5:42:30 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: I am not sure what happened: I updated the merchant collection and then restarted all the Solr machines. This is what I see right now: http://i.imgur.com/4bYuhaq.png The merchant collection looks fine, but the deals and prodinfo collections should have a total of 3 shards each. Somehow shard1 has become a replica of shard2. This is running in production, so how can I fix it without dumping the whole ZK data? -- Thanks, -Utkarsh -- Thanks, -Utkarsh -- Thanks, -Utkarsh
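Mark's "back it up first" advice is cheap to follow; a sketch using the zkCli.sh that ships with ZooKeeper (the install path and ZK address are placeholders):

```shell
# Sketch: dump clusterstate.json from ZooKeeper before hand-editing it.
# Note that 3.4-era zkCli prints node stat metadata after the data, so trim
# the output before restoring from this file.
/opt/zookeeper/bin/zkCli.sh -server localhost:2181 get /clusterstate.json \
    > /tmp/clusterstate.backup.json
```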
Trigger event on change of a field in a document
I am experimenting with implementing a price-drop feature. Can I register certain fields of a document and trigger some sort of event if the values in those fields change? For example: 1. The price of itemX is $10 2. Say the price changes to $17 or $5 (increases or decreases) when the new data loads. 3. Trigger an event to take an action on that change, like sending out an email. I believe this is somewhat similar to, but not the same as, the percolator feature in Elasticsearch. -- Thanks, -Utkarsh
Re: Trigger event on change of a field in a document
Thanks! I think I will explore how to implement it outside Solr. -Utkarsh On Fri, Dec 27, 2013 at 3:20 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: And if you really really really wanted that in Solr then have a look at UpdateRequestProcessors. Otis Solr ElasticSearch Support http://sematext.com/ On Dec 27, 2013 6:19 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, This sounds like it would be best implemented outside the search engine. Otis Solr ElasticSearch Support http://sematext.com/ On Dec 27, 2013 4:29 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: I am experimenting with implementing a price-drop feature. Can I register certain fields of a document and trigger some sort of event if the values in those fields change? For example: 1. The price of itemX is $10 2. Say the price changes to $17 or $5 (increases or decreases) when the new data loads. 3. Trigger an event to take an action on that change, like sending out an email. I believe this is somewhat similar to, but not the same as, the percolator feature in Elasticsearch. -- Thanks, -Utkarsh -- Thanks, -Utkarsh
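A sketch of the "outside Solr" approach Otis suggests: diff the incoming prices against the previously indexed ones before each load, and fire a callback on changes. The storage and notification mechanics below are placeholders:

```python
# Sketch: detect price changes in the indexing pipeline, before documents
# reach Solr. old_prices would come from the previous load (or a key-value
# store); on_change is whatever notification hook you want (email, queue...).

def detect_price_changes(old_prices, new_prices, on_change):
    for item_id, new_price in new_prices.items():
        old_price = old_prices.get(item_id)
        if old_price is not None and old_price != new_price:
            on_change(item_id, old_price, new_price)

events = []
detect_price_changes(
    {"itemX": 10.0},
    {"itemX": 5.0, "itemY": 20.0},           # itemY is new, so no event
    lambda i, o, n: events.append((i, o, n)),  # e.g. send an email here
)
print(events)  # [('itemX', 10.0, 5.0)]
```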
Re: What is the difference between attorney:(Roger Miller) and attorney:Roger Miller
Also, attorney:(Roger Miller) is the same as attorney:Roger Miller, right? Or is the whole term Roger Miller run against attorney? Thanks, -Utkarsh On Tue, Nov 19, 2013 at 12:42 PM, Rafał Kuć r@solr.pl wrote: Hello! In the first one, the two terms 'Roger' and 'Miller' are both run against the attorney field. In the second, the 'Roger' term is run against the attorney field and the 'Miller' term is run against the default search field. -- Regards, Rafał Kuć Performance Monitoring * Log Analytics * Search Analytics Solr Elasticsearch Support * http://sematext.com/ We got different results for these two queries. The first one returned 115 records and the second returned 179 records. Thanks, Fudong -- Thanks, -Utkarsh
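An easy way to see this for yourself is debugQuery. A sketch against a local instance (host and collection name are placeholders):

```shell
# Sketch: debugQuery=true makes Solr echo the parsed query, which shows
# exactly which field each term was run against.
curl 'http://localhost:8983/solr/collection1/select?q=attorney:(Roger+Miller)&debugQuery=true&rows=0'
curl 'http://localhost:8983/solr/collection1/select?q=attorney:Roger+Miller&debugQuery=true&rows=0'
# Compare the "parsedquery" entry of the two responses: in the second one,
# the Miller term is shown under the default search field, not attorney.
```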
Re: High disk IO during UpdateCSV
Bumping this one again, any suggestions? On Tue, Nov 12, 2013 at 3:58 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Hello, I load data from CSV into Solr via UpdateCSV. There are about 50M documents with 10 columns each. The index size is about 15GB and I am using a 3-node distributed Solr cluster. While loading the data, the disk IO goes to 100%. If the load balancer in front of Solr hits the machine which is doing the processing, then the request times out; but in general, requests to all the machines become slow. I have attached a screenshot of the disk I/O and CPU usage. Is there a fix in Solr which can throttle the load, or is it maybe due to the MergePolicy? How can I debug Solr to get the exact cause? -- Thanks, -Utkarsh -- Thanks, -Utkarsh
Re: High disk IO during UpdateCSV
Hi Michael, I am using Solr Cloud 4.5, and UpdateCSV loads data to one of these nodes. Attachment: http://i.imgur.com/1xmoNtt.png Thanks, -Utkarsh On Wed, Nov 13, 2013 at 8:33 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Utkarsh, Your screenshot didn't come through. I don't think this list allows attachments. Maybe put it up on imgur or something? I'm a little unclear on whether you're using Solr in Cloud mode, or with a single master. Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. "The Science of Influence Marketing" 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts w: appinions.com http://www.appinions.com/ On Wed, Nov 13, 2013 at 11:22 AM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Bumping this one again, any suggestions? On Tue, Nov 12, 2013 at 3:58 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Hello, I load data from CSV into Solr via UpdateCSV. There are about 50M documents with 10 columns each. The index size is about 15GB and I am using a 3-node distributed Solr cluster. While loading the data, the disk IO goes to 100%. If the load balancer in front of Solr hits the machine which is doing the processing, then the request times out; but in general, requests to all the machines become slow. I have attached a screenshot of the disk I/O and CPU usage. Is there a fix in Solr which can throttle the load, or is it maybe due to the MergePolicy? How can I debug Solr to get the exact cause? -- Thanks, -Utkarsh -- Thanks, -Utkarsh -- Thanks, -Utkarsh
Re: High disk IO during UpdateCSV
Thanks guys! I will start by splitting the file into chunks of 5M documents (10 chunks), and reduce the chunk size if needed. Thanks, -Utkarsh On Wed, Nov 13, 2013 at 9:08 AM, Walter Underwood wun...@wunderwood.org wrote: Don't load 50M documents in one shot. Break it up into reasonable chunks (100K?) with commits at each point. You will have a bottleneck somewhere, usually disk or CPU. Yours appears to be disk. If you get faster disks, it might become the CPU. wunder On Nov 13, 2013, at 8:22 AM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Bumping this one again, any suggestions? On Tue, Nov 12, 2013 at 3:58 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Hello, I load data from CSV into Solr via UpdateCSV. There are about 50M documents with 10 columns each. The index size is about 15GB and I am using a 3-node distributed Solr cluster. While loading the data, the disk IO goes to 100%. If the load balancer in front of Solr hits the machine which is doing the processing, then the request times out; but in general, requests to all the machines become slow. I have attached a screenshot of the disk I/O and CPU usage. Is there a fix in Solr which can throttle the load, or is it maybe due to the MergePolicy? How can I debug Solr to get the exact cause? -- Thanks, -Utkarsh -- Thanks, -Utkarsh -- Walter Underwood wun...@wunderwood.org -- Thanks, -Utkarsh
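Walter's chunking advice can be sketched like this. The splitting logic below keeps the CSV header on every chunk so each one is a valid /update/csv payload on its own; the chunk size and posting step are placeholders:

```python
# Sketch: split a big CSV into header-carrying chunks so no single
# /update/csv request has to index everything at once. Each chunk would then
# be POSTed with commit=true (or commitWithin) before sending the next.

def chunk_csv(lines, chunk_size):
    """Yield lists of lines, each beginning with the header row."""
    header, rows = lines[0], lines[1:]
    for i in range(0, len(rows), chunk_size):
        yield [header] + rows[i:i + chunk_size]

lines = ["id,name"] + ["%d,doc%d" % (i, i) for i in range(10)]
chunks = list(chunk_csv(lines, 4))
print(len(chunks))  # 3 (chunks of 4, 4, and 2 rows)
# Each chunk would then go to e.g.
#   curl 'http://host/solr/coll1/update/csv?commit=true' -H 'Content-Type: text/csv' --data-binary @chunk.csv
```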
High disk IO during UpdateCSV
Hello, I load data from CSV into Solr via UpdateCSV. There are about 50M documents with 10 columns each. The index size is about 15GB and I am using a 3-node distributed Solr cluster. While loading the data, the disk IO goes to 100%. If the load balancer in front of Solr hits the machine which is doing the processing, then the request times out; but in general, requests to all the machines become slow. I have attached a screenshot of the disk I/O and CPU usage. Is there a fix in Solr which can throttle the load, or is it maybe due to the MergePolicy? How can I debug Solr to get the exact cause? -- Thanks, -Utkarsh
Re: Stop/Restart Solr
We use this to start/stop solr: Start: java -Dsolr.clustering.enabled=true -Dsolr.solr.home=multicore -Djetty.class.path=lib/ext/* -Dbootstrap_conf=true -DnumShards=3 -DSTOP.PORT=8079 -DSTOP.KEY=some_value -jar start.jar Stop: java -Dsolr.solr.home=multicore -Dbootstrap_conf=true -DnumShards=3 -DSTOP.PORT=8079 -DSTOP.KEY=some_value -jar start.jar --stop Thanks, -Utkarsh On Tue, Oct 22, 2013 at 10:09 AM, Raheel Hasan raheelhasan@gmail.comwrote: ok fantastic... thanks a lot guyz On Tue, Oct 22, 2013 at 10:00 PM, François Schiettecatte fschietteca...@gmail.com wrote: Yago has the right command to search for the process, that will get you the process ID specifically the first number on the output line, then do 'kill ###', if that fails 'kill -9 ###'. François On Oct 22, 2013, at 12:56 PM, Raheel Hasan raheelhasan@gmail.com wrote: its CentOS... and using jetty with solr here.. On Tue, Oct 22, 2013 at 9:54 PM, François Schiettecatte fschietteca...@gmail.com wrote: A few more specifics about the environment would help, Windows/Linux/...? Jetty/Tomcat/...? François On Oct 22, 2013, at 12:50 PM, Yago Riveiro yago.rive...@gmail.com wrote: If you are asking about if solr has a way to restart himself, I think that the answer is no. If you lost control of the remote machine someone will need to go and restart the machine ... You can try use a kvm or other remote control system -- Yago Riveiro Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Tuesday, October 22, 2013 at 5:46 PM, François Schiettecatte wrote: If you are on linux/unix, use the kill command. François On Oct 22, 2013, at 12:42 PM, Raheel Hasan raheelhasan@gmail.com (mailto: raheelhasan@gmail.com) wrote: Hi, is there a way to stop/restart java? I lost control over it via SSH and connection was closed. But the Solr (start.jar) is still running. thanks. -- Regards, Raheel Hasan -- Regards, Raheel Hasan -- Regards, Raheel Hasan -- Thanks, -Utkarsh
Re: Check if dynamic columns exists and query else ignore
Bumping this one, any suggestions? It looks like if() and exists() are meant to solve this problem, but I am using them in the wrong way. -Utkarsh On Thu, Oct 17, 2013 at 1:16 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: I am trying to do this: if (US_offers_i exists): fq=US_offers_i:[1 TO *] else: fq=offers_count:[1 TO *] Where: US_offers_i is a dynamic field containing an int, and offers_count is a static field containing an int. I have tried this so far but it doesn't work: http://solr_server/solr/col1/select? q=iphone+5s fq=if(exist(US_offers_i),US_offers_i:[1 TO *], offers_count:[1 TO *]) Also, is there a heavy performance penalty for this condition? I am planning to use this for all my queries. -- Thanks, -Utkarsh -- Thanks, -Utkarsh
Re: Check if dynamic columns exists and query else ignore
Thanks Chris! That worked! I had overengineered my query! Thanks, -Utkarsh On Fri, Oct 18, 2013 at 12:02 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I trying to do this: : : if (US_offers_i exists): :fq=US_offers_i:[1 TO *] : else: :fq=offers_count:[1 TO *] if() and exists() are functions, so you would have to explicitly use them in a function context (ie: {!func} parser, or {!frange} parser), and to use those nested queries inside of functions you'd need to use the query() function. But nothing about your problem description suggests that you really need to worry about this. If a document doesn't contain US_offers_i, then US_offers_i:[1 TO *] won't match that document, and neither will US_offers_i:[* TO *] -- so you can implement the logic you describe with a simple query... fq=(US_offers_i:[1 TO *] (offers_count:[1 TO *] -US_offers_i:[* TO *])) Which you can read as: match docs with 1 or more US offers, or docs that have 1 or more offers but no US offer field at all. : Also, is there a heavy performance penalty for this condition? I am : planning to use this for all my queries. Any logic that you do at query time which can be precomputed into a specific field in your index will *always* make the queries faster (at the expense of a little more time spent indexing and a little more disk used). If you know in advance that you are frequently going to want to restrict on this type of logic, then unless you index docs more often than you search them, you should almost certainly index a has_offers boolean field that captures this logic. -Hoss -- Thanks, -Utkarsh
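Hoss's single filter really does reproduce the if/else intent, and a tiny simulation makes that easy to check. Plain Python dicts stand in for documents here, with a missing key modeling a document that lacks the dynamic field:

```python
# Sketch: simulate the boolean filter
#   fq=(US_offers_i:[1 TO *] (offers_count:[1 TO *] -US_offers_i:[* TO *]))
# against in-memory "documents" to confirm it matches the original
# "if field exists use US_offers_i, else use offers_count" logic.

def matches(doc):
    us = doc.get("US_offers_i")          # None models a doc without the field
    total = doc.get("offers_count", 0)
    return (us is not None and us >= 1) or (total >= 1 and us is None)

docs = [
    {"US_offers_i": 2, "offers_count": 0},  # matches via US offers
    {"US_offers_i": 0, "offers_count": 5},  # has the field but no US offers
    {"offers_count": 3},                    # no field: matches via offers_count
    {"offers_count": 0},                    # matches nothing
]
print([matches(d) for d in docs])  # [True, False, True, False]
```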
Check if dynamic columns exists and query else ignore
I am trying to do this: if (US_offers_i exists): fq=US_offers_i:[1 TO *] else: fq=offers_count:[1 TO *] Where: US_offers_i is a dynamic field containing an int, and offers_count is a static field containing an int. I have tried this so far but it doesn't work: http://solr_server/solr/col1/select? q=iphone+5s fq=if(exist(US_offers_i),US_offers_i:[1 TO *], offers_count:[1 TO *]) Also, is there a heavy performance penalty for this condition? I am planning to use this for all my queries. -- Thanks, -Utkarsh
Re: Using split in updateCSV for SolrCloud 4.4
Interestingly this URL by Jack works: 1. curl 'http://localhost/solr/prodinfo/update/csv?commit=true&f.merchantList.split=true&f.merchantList.separator=%3A&f.merchantList.encapsulator=%22&stream.contentType=text/csv&stream.file=/tmp/test.csv' But this doesn't (i.e. it doesn't split the column): 2. curl 'http://localhost/solr/prodinfo/update/csv?commit=true&f.merchantList.split=true&f.merchantList.separator=%3A&f.merchantList.encapsulator=%22&escape=\&stream.contentType=text/csv&stream.file=/data/dump/catalog.txt' The only difference was escape=\; I added that to Jack's example and it didn't work there either. So the culprit was escape=\, though I am not sure why. Thanks, -Utkarsh On Thu, Oct 10, 2013 at 6:11 PM, Yonik Seeley ysee...@gmail.com wrote: Perhaps try adding echoParams=all to check that all of the input params are being parsed as expected. -Yonik On Thu, Oct 10, 2013 at 8:10 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Didn't help. This is the complete data: https://gist.github.com/utkarsh2012/6927649 (see the merchantList column). I tried this URL: curl 'http://localhost/solr/coll1/update/csv?commit=true&f.merchantList.split=true&f.merchantList.separator=%3A&f.merchantList.encapsulator=%22&escape=\&stream.contentType=text/csv&stream.file=/data/dump/log_20130101' Can this be a bug in the UpdateCSV split function? Thanks, -Utkarsh On Thu, Oct 10, 2013 at 3:11 PM, Jack Krupansky j...@basetechnology.com wrote: Using the standard Solr example for Solr 4.5, the following works, splitting the features CSV field into multiple values: curl "http://localhost:8983/solr/update/csv?commit=true&f.features.split=true&f.features.separator=%3A&f.features.encapsulator=%22" -H "Content-Type: text/csv" -d ' id,name,features doc-1,doc1,feat1:feat2' You may need to add stream.contentType=text/csv to your command. 
-- Jack Krupansky -Original Message- From: Utkarsh Sengar Sent: Thursday, October 10, 2013 4:51 PM To: solr-user@lucene.apache.org Subject: Using split in updateCSV for SolrCloud 4.4 Hello, I am trying to use split (http://wiki.apache.org/solr/UpdateCSV#split) while loading some CSV data via UpdateCSV. This is the field: <field name="merchantList" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="false" termPositions="false" termOffsets="false"/> This is the column in the CSV (merchantList): ...values,16179:10950,...values... This is the URL I call: http://localhost/solr/coll1/update/csv?commit=true&f.merchantList.split=true&f.merchantList.separator=%3A&f.merchantList.encapsulator=&escape=\&stream.file=/data/dump/log_20130101' Currently when I load the data, I see this: merchantList: [16179:10950], But I want this: merchantList: [16179,10950], This example is an int but I have intentionally kept it as a string, since some values can also be strings. Any suggestions where I am going wrong? -- Thanks, -Utkarsh -- Thanks, -Utkarsh
Using split in updateCSV for SolrCloud 4.4
Hello, I am trying to use split (http://wiki.apache.org/solr/UpdateCSV#split) while loading some CSV data via UpdateCSV. This is the field: <field name="merchantList" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="false" termPositions="false" termOffsets="false"/> This is the column in the CSV (merchantList): ...values,16179:10950,...values... This is the URL I call: http://localhost/solr/coll1/update/csv?commit=true&f.merchantList.split=true&f.merchantList.separator=%3A&f.merchantList.encapsulator=&escape=\&stream.file=/data/dump/log_20130101' Currently when I load the data, I see this: merchantList: [16179:10950], But I want this: merchantList: [16179,10950], This example is an int but I have intentionally kept it as a string, since some values can also be strings. Any suggestions where I am going wrong? -- Thanks, -Utkarsh
Re: Using split in updateCSV for SolrCloud 4.4
Didn't help. This is the complete data: https://gist.github.com/utkarsh2012/6927649 (see the merchantList column). I tried this URL: curl 'http://localhost/solr/coll1/update/csv?commit=true&f.merchantList.split=true&f.merchantList.separator=%3A&f.merchantList.encapsulator=%22&escape=\&stream.contentType=text/csv&stream.file=/data/dump/log_20130101' Can this be a bug in the UpdateCSV split function? Thanks, -Utkarsh On Thu, Oct 10, 2013 at 3:11 PM, Jack Krupansky j...@basetechnology.com wrote: Using the standard Solr example for Solr 4.5, the following works, splitting the features CSV field into multiple values: curl "http://localhost:8983/solr/update/csv?commit=true&f.features.split=true&f.features.separator=%3A&f.features.encapsulator=%22" -H "Content-Type: text/csv" -d ' id,name,features doc-1,doc1,feat1:feat2' You may need to add stream.contentType=text/csv to your command. -- Jack Krupansky -Original Message- From: Utkarsh Sengar Sent: Thursday, October 10, 2013 4:51 PM To: solr-user@lucene.apache.org Subject: Using split in updateCSV for SolrCloud 4.4 Hello, I am trying to use split (http://wiki.apache.org/solr/UpdateCSV#split) while loading some CSV data via UpdateCSV. This is the field: <field name="merchantList" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="false" termPositions="false" termOffsets="false"/> This is the column in the CSV (merchantList): ...values,16179:10950,...values... 
This is the URL I call: http://localhost/solr/coll1/update/csv?commit=true&f.merchantList.split=true&f.merchantList.separator=%3A&f.merchantList.encapsulator=&escape=\&stream.file=/data/dump/log_20130101' Currently when I load the data, I see this: merchantList: [16179:10950], But I want this: merchantList: [16179,10950], This example is an int but I have intentionally kept it as a string, since some values can also be strings. Any suggestions where I am going wrong? -- Thanks, -Utkarsh -- Thanks, -Utkarsh
Re: Using split in updateCSV for SolrCloud 4.4
@Jack I just noticed in your example that feat1:feat2 is not in an encapsulator. Was that a typo or intentional? You are passing f.features.encapsulator=%22 but don't have quotes around feat1:feat2. I think the request should look like:

curl "http://localhost:8983/solr/update/csv?commit=true&f.features.split=true&f.features.separator=%3A&f.features.encapsulator=%22" -H 'Content-Type: text/csv' -d '
id,name,features
doc-1,doc1,"feat1:feat2"'

Thanks,
-Utkarsh

On Thu, Oct 10, 2013 at 5:10 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:

> Didn't help. This is the complete data: https://gist.github.com/utkarsh2012/6927649 (see the merchantList column). Can this be a bug in the UpdateCSV split function?

--
Thanks,
-Utkarsh
Re: Some text not indexed in solr4.4
@Furkan Yes, I have run a commit; other text is searchable. Not sure what you mean there about MultiPhraseQuery. It is mentioned in the context of SynonymFilterFactory, RemoveDuplicatesTokenFilterFactory and PositionFilterFactory. Which part are you referring to?

@Jason I get this response (I have a multi-core setup) by hitting this URL:
http://SOLR_SERVER/solr/prodinfo/terms?terms.fl=text&terms.prefix=dc

<response><lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int></lst><lst name="terms"><lst name="text"/></lst></response>

Not sure how to interpret this response; I get the same empty result for any prefix like a, b, iph etc.

My guess is this is happening due to the WordDelimiterFilterFactory here: https://gist.github.com/utkarsh2012/6167128#file-schema-xml-L16. What do you think? Is dc44 somehow delimited at query time? The example here (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory) says:

- Split on letter-number transitions (can be turned off - see splitOnNumerics parameter): "SD500" -> "SD", "500"

I will test it out and update this thread with my findings.

Thanks,
-Utkarsh

On Tue, Sep 17, 2013 at 5:10 PM, Jason Hellman <jhell...@innoventsolutions.com> wrote:

> Utkarsh,
>
> Check to see if the value is actually indexed into the field by using the Terms request handler:
>
> http://localhost:8983/solr/terms?terms.fl=text&terms.prefix=d
>
> (adjust the prefix to whatever you're looking for). This should get you going in the right direction.
>
> Jason
>
> On Sep 17, 2013, at 2:20 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
>
>> I have a copyField called allText with type text_general: https://gist.github.com/utkarsh2012/6167128#file-schema-xml-L68
>> I have ~100 documents which have the text "dyson" and "dc44" or "dc41" etc. But when I search for allText:"dyson dc44" I get no results.
Re: Some text not indexed in solr4.4
WordDelimiterFilterFactory was the culprit. Removing it fixed the problem.

Thanks,
-Utkarsh

On Tue, Sep 24, 2013 at 12:17 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:

> My guess is this is happening due to the WordDelimiterFilterFactory here: https://gist.github.com/utkarsh2012/6167128#file-schema-xml-L16. I will test it out and update this thread with my findings.
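The culprit behavior can be illustrated. Below is a rough Python approximation (not the real Lucene filter) of WordDelimiterFilterFactory's default split on letter-number transitions; with this filter on only one side of the analysis chain, index and query produce different terms for dc44, so the phrase never matches:

```python
import re

def word_delimiter_parts(token):
    # Crude sketch of WordDelimiterFilterFactory's default behavior
    # (splitOnNumerics=1): break on letter<->number transitions.
    return re.findall(r"[A-Za-z]+|[0-9]+", token)

print(word_delimiter_parts("DC44"))   # ['DC', '44']
print(word_delimiter_parts("SD500"))  # ['SD', '500']
```

Either apply the same chain on both the index and query side, or (as done above) drop the filter entirely.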
Re: Dynamic row sizing for documents via UpdateCSV
Yeah, I think the only way to go about it is via SolrJ. The csv file is generated by a pig job which computes the data to be loaded into solr.

I think this is what I will end up doing: load all the possible columns in the csv, with a value of 0 where the value doesn't exist for a specific record. I was just trying to avoid that and find an optimal solution with UpdateCSV.

Thanks,
-Utkarsh

On Tue, Sep 17, 2013 at 5:43 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> Well, it's reasonably easy if you have empty columns, in the same order, for _all_ of the possible dynamic fields, but I really doubt you are that fortunate... It's especially ugly in that you have the different dynamic fields scattered around.
>
> How is the csv file generated? Could you force every row to have _all_ the possible columns in the same order, with spaces or something in the columns that are empty? Otherwise I'd think about parsing them externally and using, say, SolrJ to transmit the individual records to Solr.
>
> Best,
> Erick
>
> On Mon, Sep 16, 2013 at 2:47 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
>
>> I am using UpdateCSV to load data into solr, and some rows need dynamic fields (like ca_count_i or ny_count_i) that other rows don't have. Is there any way to pass a different column set for each row?

--
Thanks,
-Utkarsh
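The "pad every row with all possible columns" approach above can be sketched in a few lines. A hypothetical Python fragment (column names taken from the examples in this thread) that unions the dynamic columns and writes 0 into the gaps, producing a rectangular CSV that UpdateCSV can load:

```python
import csv
import io

# Rows as a pig job might emit them: each with its own dynamic fields.
rows = [
    {"userid": "john8322", "name": "John", "age": "32", "location": "CA",
     "ca_count_i": "7"},
    {"userid": "tom22", "name": "Tom", "age": "30", "location": "NY",
     "ny_count_i": "981", "oh_count_i": "11"},
]

# Union of every column seen across all rows: static fields first,
# then the dynamic ones in a stable (sorted) order.
static = ["userid", "name", "age", "location"]
dynamic = sorted({k for r in rows for k in r} - set(static))

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=static + dynamic, restval="0")
writer.writeheader()
writer.writerows(rows)   # restval fills the missing columns with "0"
print(out.getvalue())
```

`restval="0"` is what implements the "0 where the value doesn't exist" idea; swap in an empty string if the field should simply be absent.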
Some text not indexed in solr4.4
I have a copyField called allText with type text_general: https://gist.github.com/utkarsh2012/6167128#file-schema-xml-L68

I have ~100 documents which have the text "dyson" and "dc44" or "dc41" etc. For example:

title: "Dyson DC44 Animal Digital Slim Cordless Vacuum"
description: "The DC44 Animal is the new Dyson Digital Slim vacuum cleaner, the cordless machine that doesn't lose suction. It has been engineered for floor to ceiling cleaning. DC44 Animal has a detachable long-reach wand which is balanced for floor to ceiling cleaning. The motorized floor tool has twice the power of the DC35 floor tool to drive the bristles deeper into the carpet pile with more force. It attaches to the wand or directly to the machine for cleaning awkward spaces. The brush bar has carbon fiber filaments for removing fine dust from hard floors. DC44 Animal has a run time of 20 minutes or 8 minutes on Boost mode. Powered by the Dyson digital motor, DC44 Animal has a fade-free nickel manganese cobalt battery and Root Cyclone technology for constant powerful suction."
UPC: 0879957006362

The documents are indexed. Analysis says it's indexed: http://i.imgur.com/O52ino1.png

But when I search for allText:"dyson dc44" I get no results; response: http://pastie.org/8334220

Any suggestions about the problem? I am out of ideas about how to debug this.

--
Thanks,
-Utkarsh
Re: Some text not indexed in solr4.4
To add to it, I see the exact same problem with the queries: nikon d7100, nikon d5100, samsung ps-we450 etc.

Thanks,
-Utkarsh

On Tue, Sep 17, 2013 at 2:20 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:

> I have ~100 documents which have the text "dyson" and "dc44" or "dc41" etc. But when I search for allText:"dyson dc44" I get no results. Any suggestions about the problem?
Dynamic row sizing for documents via UpdateCSV
Hello,

I am using UpdateCSV to load data into solr. Currently I load this schema with a static set of columns:

userid,name,age,location
john8322,John,32,CA
tom22,Tom,30,NY

But now I have a usecase where john8322 might have a state-specific dynamic field, for example:

userid,name,age,location,ca_count_i
john8322,John,32,CA,7

And tom22 might have different dynamic fields:

userid,name,age,location,ny_count_i,oh_count_i
tom22,Tom,30,NY,981,11

So, is it possible to pass a different column set for each row, something like this?

john8322,John,32,CA,ca_count_i:7
tom22,Tom,30,NY,ny_count_i:981,oh_count_i:11

I understand that the above syntax is not possible, but is there any other way of solving this problem?

--
Thanks,
-Utkarsh
Re: What does it mean when a shard is down in solr4.4?
Bumping this one, any suggestions? I am sure this is solrcloud 101, but I couldn't find documentation anywhere.

Thanks,
-Utkarsh

On Wed, Aug 28, 2013 at 2:37 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:

> At times when I rebuild the index, say collectionA on nodeA (shard1) via UpdateCSV, the Cloud status page says that collectionA on nodeA (shard1) is down. What does it really mean when a shard goes down, and how can I recover from that state?
What does it mean when a shard is down in solr4.4?
I have a 3 node solrcloud cluster with 3 shards for each collection/core. At times when I rebuild the index, say collectionA on nodeA (shard1) via UpdateCSV, the Cloud status page says that collectionA on nodeA (shard1) is down.

Observations:
1. Other collections on nodeA work.
2. collectionA on nodeB and nodeC works.
3. nodeA's solr admin is accessible too.

So my questions are:
1. What does it really mean when a shard goes down?
2. How can I recover from that state?

Solr cloud screenshot: http://i.imgur.com/2TgKXiC.png

--
Thanks,
-Utkarsh
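One place to look is the cluster state that the Cloud status page renders, stored in ZooKeeper as /clusterstate.json in this Solr generation. A Python sketch, assuming the general shape of a 4.x clusterstate document (fetch the real one via the admin UI's Cloud > Tree view or zkcli.sh); it lists every replica whose state is not "active":

```python
import json

# A fragment shaped like a SolrCloud 4.x /clusterstate.json
# (structure assumed; collection/node names are illustrative).
clusterstate = json.loads("""
{"collectionA": {"shards": {
  "shard1": {"replicas": {"core_node1": {"state": "down"}}},
  "shard2": {"replicas": {"core_node2": {"state": "active"}}}
}}}
""")

def unhealthy_replicas(state):
    """Return (collection, shard, replica, state) for replicas not active."""
    bad = []
    for coll, cdata in state.items():
        for shard, sdata in cdata["shards"].items():
            for name, rdata in sdata["replicas"].items():
                if rdata.get("state") != "active":
                    bad.append((coll, shard, name, rdata.get("state")))
    return bad

print(unhealthy_replicas(clusterstate))
```

A "down" state here means the replica registered in ZooKeeper but is not serving; "recovering" means it is replaying updates or pulling the index from the leader, which commonly follows a heavy reindex.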
Re: No documents found for some queries with special chars like m&m
Thanks for the info.

1. http://SERVER/solr/prodinfo/select?q=o%27reilly&wt=json&indent=true&debugQuery=true returns:

{
  "responseHeader":{
    "status":0,
    "QTime":16,
    "params":{
      "debugQuery":"true",
      "indent":"true",
      "q":"o'reilly",
      "wt":"json"}},
  "response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]
  },
  "debug":{
    "rawquerystring":"o'reilly",
    "querystring":"o'reilly",
    "parsedquery":"MultiPhraseQuery(allText:\"o'reilly (reilly oreilly)\")",
    "parsedquery_toString":"allText:\"o'reilly (reilly oreilly)\"",
    "QParser":"LuceneQParser",
    "explain":{}
  }
}

2. Analysis gives this: http://i.imgur.com/IPEiiEQ.png
I assume this means the tokens are the same for o'reilly.

3. I tried escaping the ', it doesn't help:
http://SERVER/solr/prodinfo/select?q=o\%27reilly&wt=json&indent=true

I will add WordDelimiterFilterFactory to the index chain and see if it fixes the problem.

Thanks,
-Utkarsh

On Mon, Aug 26, 2013 at 3:15 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> First thing to do is attach &debug=query to your queries and look at the parsed output.
>
> Second thing to do is look at the admin/analysis page and see what happens at index and query time to things like o'reilly. You have WordDelimiterFilterFactory configured in your query but not your index analysis chain. My bet on that is that you're getting different tokens at query and index time...
>
> Third thing is that you need to escape the & character. It's probably being interpreted as a delimiter on the URL and Solr ignores params it doesn't understand.
>
> Best
> Erick
>
> On Mon, Aug 26, 2013 at 5:08 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
>
>> Some of the queries (not all) with special chars return no documents, e.g. q=m&m and q=o'reilly. What's wrong with o'reilly, and how can I make the query m&m work?
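On the escaping point: the safe way to see what the URL should carry is to percent-encode the query value programmatically rather than hand-escaping. A small Python sketch (the SERVER host above is a placeholder; this only shows the encoding, not a live request):

```python
from urllib.parse import quote, urlencode

# Percent-encode the query value itself; %27 is the apostrophe,
# and an "&" inside a value would become %26 instead of being
# mistaken for a parameter delimiter.
q = "o'reilly"
print(quote(q))                                          # o%27reilly
print(urlencode({"q": q, "wt": "json", "indent": "true"}))
```

Backslash-escaping (q=o\'reilly) belongs to the Lucene query syntax, not to URL encoding; the two layers are independent, which is why the backslash alone didn't change anything here.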
Re: No documents found for some queries with special chars like m&m
Yup, the query o'reilly worked after adding WDF to the index analyzer. However, m&m or m\&m still doesn't work. Field analysis for m&m says:

ST: m, m
WDF: m, m
ST: m, m
WDF: m, m

So essentially the & is ignored during both index and query. My guess is the standard tokenizer is the problem. As the documentation says (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StandardTokenizerFactory):

Example: "I.B.M. 8.5 can't!!!" ==> ALPHANUM: "I.B.M.", NUM: "8.5", ALPHANUM: "can't"

The & char will be ignored, I guess.

*So, my question is:* Is there a way I can make m&m index as one string AND also keep StandardTokenizerFactory, since I need it for other searches?

Thanks,
-Utkarsh

On Tue, Aug 27, 2013 at 11:44 AM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:

> I will add WordDelimiterFilterFactory to the index chain and see if it fixes the problem.
Re: No documents found for some queries with special chars like m&m
> Use a different tokenizer, possibly one of the regex ones. Fake it with phrase queries. Take a really good look at the various filter combinations. It's possible that WhitespaceTokenizer and WordDelimiterFilterFactory might be able to do good things.

Will try to play with these two options.

> Clearly define whether this is capability that you really need.

Yes, this is a needed feature. Some of our queries are at&t, h&m, m&m; returning an empty response is not the best experience.

I also tried:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1" types="wdfftypes.txt"/>

with wdfftypes.txt:

& => ALPHA
\u0026 => ALPHA
$ => DIGIT
% => DIGIT
. => DIGIT
\u002C => DIGIT

But it didn't work.

Thanks,
-Utkarsh

On Tue, Aug 27, 2013 at 3:07 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> bq: Is there a way I can make m&m index as one string AND also keep StandardTokenizerFactory since I need it for other searches.
>
> In a word, no. You get one and only one tokenizer per field. But there are lots of options:
> - Use a different tokenizer, possibly one of the regex ones.
> - Fake it with phrase queries.
> - Take a really good look at the various filter combinations. It's possible that WhitespaceTokenizer and WordDelimiterFilterFactory might be able to do good things.
> - Clearly define whether this is capability that you really need. This last is my recurring plea to insure that the effort is of real benefit to the user, and not just something someone noticed that's actually only useful 0.001% of the time.
>
> Best
> Erick
No documents found for some queries with special chars like m&m
Some of the queries (not all) with special chars return no documents. Example:

Queries returning no documents:
q=m&m (this can be explained: when I search for "m m", no documents are returned either)
q=o'reilly (when I search for "o reilly", I get documents back)

Queries returning documents:
q=helloworld (document matched is "Hello World: A Life in Ham Radio")

My questions are:
1. What's wrong with o'reilly? What changes do I need in my field type?
2. How can I make the query m&m work? My index has a bunch of M&M's docs like "M&M's Milk Chocolate Candy Coated Peanuts 19.2 oz" and "M and Ms Chocolate Candies - Peanut - 1 Bag (42 oz)".

Field type:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

--
Thanks,
-Utkarsh
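To illustrate why q=m&m behaves exactly like "m m": StandardTokenizer treats & as punctuation, so it never reaches the index. A crude Python stand-in (not the real tokenizer, just an approximation of its output for simple ASCII input):

```python
import re

def standard_tokenizer_sketch(text):
    # Rough stand-in for StandardTokenizerFactory: keep runs of
    # letters/digits, allowing internal apostrophes and periods,
    # and drop everything else (including "&").
    return [t.lower()
            for t in re.findall(r"[A-Za-z0-9]+(?:['.][A-Za-z0-9]+)*", text)]

print(standard_tokenizer_sketch("M&M's"))    # ['m', "m's"]
print(standard_tokenizer_sketch("m m"))      # ['m', 'm']
print(standard_tokenizer_sketch("o'reilly")) # ["o'reilly"]
```

Note the apostrophe survives inside a token (which is why o'reilly is a different problem than m&m), while the & simply disappears on both the index and query sides.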
Re: loading solr from Pig?
That's a good point; we load data from pig to solr every day.

1. What we do: a pig job creates a csv dump, scps it over to a solr node, and the UpdateCSV request handler loads the data into solr. A complete rebuild of the index for about 50M documents (20GB) takes 20 mins (the pig job, which pulls and processes data in cassandra, plus the UpdateCSV load).

2. Alternate way: another way I explored was writing a pig UDF which POSTs to solr. But batched http posts were slower than a CSV load for a full index rebuild (and that was an important usecase for us).

These might not be the best practices; I would like to know how others are handling this problem.

Thanks,
-Utkarsh

On Wed, Aug 21, 2013 at 11:29 AM, geeky2 <gee...@hotmail.com> wrote:

> Hello All,
>
> Is anyone loading Solr from a Pig script / process?
>
> I was talking to another group in our company and they have standardized on MongoDB instead of Solr - apparently there is very good support between MongoDB and Pig, allowing users to stream data directly from a Pig process into MongoDB.
>
> Does solr have anything like this as well?
>
> thx
> mark
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/loading-solr-from-Pig-tp4085933.html
> Sent from the Solr - User mailing list archive at Nabble.com.
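The batching in the UDF approach can be sketched briefly. A hypothetical Python fragment (the collection URL is illustrative) that chunks documents into JSON payloads for the /update handler; this amortizes HTTP overhead, though as noted above it was still slower than one bulk CSV load for a full rebuild:

```python
import json

def batches(docs, size):
    """Yield fixed-size chunks so each POST to /update carries many docs."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

docs = [{"id": str(n), "name": "doc%d" % n} for n in range(10)]
payloads = [json.dumps(chunk) for chunk in batches(docs, 4)]
print(len(payloads))  # 3 batches: 4 + 4 + 2 docs

# Each payload would be POSTed to e.g. http://localhost:8983/solr/coll1/update
# with Content-Type: application/json, followed by a single commit at the end
# rather than one commit per batch.
```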
Re: What filter to use to search with spaces omitted/included between words?
Thanks Tamanjit and Erick. I tried out the filters, most of the usecases work except q=bestbuy. As mentioned by Erick, that is a hard one to crack. I am looking into DictionaryCompoundWordTokenFilterFactory but compound words like these: http://www.manythings.org/vocabulary/lists/a/words.php?f=compound_words and generic english words, it won't cover my need of custom compound words of store names like BestBuy, WalMart or CirtuitCity. Thanks, -Utkarsh On Tue, Aug 20, 2013 at 4:43 AM, Jack Krupansky j...@basetechnology.comwrote: You could either have a synonym filter to replace bestbuy with best buy or use DictionaryCompoundWordTokenFil**terFactory to do the same. See: http://lucene.apache.org/core/**4_4_0/analyzers-common/org/** apache/lucene/analysis/**compound/**DictionaryCompoundWordTokenFil** terFactory.htmlhttp://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilterFactory.html There are some examples in my book, but they are for German compound words since that was the original primary intent for this filter. But it should work for any words since it is a simple dictionary. -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Tuesday, August 20, 2013 7:21 AM To: solr-user@lucene.apache.org Subject: Re: What filter to use to search with spaces omitted/included between words? Also consider WordDelimterFilterFactory, which will break up the tokens on upper/lower case transitions. to get relevance, consider edismax-style query parsers and adding automatic phrase generation (with boosts usually). This one will be a problem: q=bestbuy There's no good generic way to get this to split up. One possibility is to use synonyms if the list is known, but otherwise there's no information here to distinguish it from legitimate words. edgeNgrams work on _tokens_, not words so I doubt they would help in this case either since there is only one token. 
Best, Erick

On Tue, Aug 20, 2013 at 3:16 AM, tamanjit.bin...@yahoo.co.in wrote: Additionally, if you don't want results like q=best and result=bestbuy, you can use <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\W+" replacement=""/> to actually replace whitespaces with nothing. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories -- View this message in context: http://lucene.472066.n3.nabble.com/What-filter-to-use-to-search-with-spaces-omitted-included-between-words-tp4085576p4085601.html Sent from the Solr - User mailing list archive at Nabble.com.

-- Thanks, -Utkarsh
Re: What filter to use to search with spaces omitted/included between words?
Let me take that back, this actually works. q=bestbuy matches Best Buy and documents are returned.

<fieldType name="rl_keywords" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I was using <tokenizer class="solr.StandardTokenizerFactory"/>; replacing it with <tokenizer class="solr.KeywordTokenizerFactory"/> did the trick. Not sure how it worked. The field value I am searching is Best Buy, but when I search for bestbuy, it returns a result. Thanks, -Utkarsh

On Tue, Aug 20, 2013 at 4:48 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Thanks Tamanjit and Erick. I tried out the filters; most of the usecases work except q=bestbuy. As mentioned by Erick, that is a hard one to crack. I am looking into DictionaryCompoundWordTokenFilterFactory, but it covers compound words like these: http://www.manythings.org/vocabulary/lists/a/words.php?f=compound_words and generic English words; it won't cover my need for custom compound words of store names like BestBuy, WalMart or CircuitCity. Thanks, -Utkarsh On Tue, Aug 20, 2013 at 4:43 AM, Jack Krupansky j...@basetechnology.com wrote: You could either have a synonym filter to replace bestbuy with best buy or use DictionaryCompoundWordTokenFilterFactory to do the same.
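In case it helps anyone puzzling over why the tokenizer switch worked: with KeywordTokenizerFactory the whole field value stays a single token, and WordDelimiterFilterFactory with catenateWords=1 then both splits it on case/whitespace boundaries and emits the glued-together form. A rough Python sketch of the net effect (an illustration only, not Solr's actual analysis code; the regex is a simplification of the real delimiter rules):

```python
import re

def analyze(value):
    """Sketch of KeywordTokenizer + WordDelimiterFilter
    (generateWordParts=1, catenateWords=1, preserveOriginal=1)
    followed by LowerCaseFilter."""
    token = value  # KeywordTokenizer: the whole field value is one token
    # WordDelimiterFilter: split on non-alphanumerics and case transitions
    parts = re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+', token)
    terms = set(parts)            # generateWordParts=1: "Best", "Buy"
    terms.add(''.join(parts))     # catenateWords=1: "Best"+"Buy" -> "BestBuy"
    terms.add(token)              # preserveOriginal=1: "Best Buy"
    return {t.lower() for t in terms}  # LowerCaseFilter

# The indexed terms for "Best Buy" include the catenated "bestbuy",
# which is why q=bestbuy now matches:
print(analyze("Best Buy"))
```

The same catenation is what lets q=circuitcity match CircuitCity without wildcards.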
-- Thanks, -Utkarsh
What filter to use to search with spaces omitted/included between words?
I have a field which consists of a store name. How can I make sure that these queries return relevant results when searched against this column:

*Example1: Best Buy*
q=best (tokenizer filter makes this work)
q=bestbuy
q=buy (tokenizer filter makes this work)
q=best buy (lower case filter makes this work)
q=Best Buy (this should work)

*Example2: CircuitCity*
q=circuit (adding * will fix it, but if I append it to every query, it creates a lot of noise too)
q=CircuitCity (this should work)
q=city (adding * will fix it, but if I append it to every query, it creates a lot of noise too)
q=circuit city
q=Circuit City

-- Thanks, -Utkarsh
Load a list of values in a solr field and query over its items
Hello, Is it possible to load a list in a solr field and query for items in that list?

example_core1:
document1: FieldName=user_ids Value=8,6,1,9,3,5,7
           FieldName=allText Value=text to be searched over with title and description
document2: FieldName=user_ids Value=8738,624623,7272,82272,733
           FieldName=allText Value=more text for document2

Query: allText:hello fq=user_ids:8,8738
Result: all documents which have the text hello in allText and userId=8

If this is not possible, what is a better way to solve this problem? -- Thanks, -Utkarsh
Re: Load a list of values in a solr field and query over its items
Never mind, got my answer here: http://stackoverflow.com/a/5800830/231917

<field name="tags">tag1</field>
<field name="tags">tag2</field>
...
<field name="tags">tagn</field>

Once you have all the values indexed, you can search or filter results by any value. E.g. you can find all documents with tag1 using a query like q=tags:tag1, or use the tags to filter out results like q=query&fq=tags:tag1. Thanks! -Utkarsh

On Wed, Aug 14, 2013 at 11:57 AM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Thanks Aloke! So a multivalued field assumes: 1. if data is inserted in this form: 8738,624623,7272,82272,733, there are 5 unique values separated by a comma (or any other separator)? 2. And a filter query can be applied over it? Thanks, -Utkarsh

On Wed, Aug 14, 2013 at 11:45 AM, Aloke Ghoshal alghos...@gmail.com wrote: Should work once you set up both fields as multiValued (http://wiki.apache.org/solr/SchemaXml#Common_field_options).

-- Thanks, -Utkarsh
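A quick sketch of how a query against such a multivalued field could be built. The host and core name are placeholders; the OR inside the fq is my assumption for matching either of the two user ids from the example:

```python
from urllib.parse import urlencode

# Hypothetical host/core; user_ids is the multivalued field from the
# example above. fq ORs the wanted ids so a doc matches if it carries
# any one of them.
base = "http://localhost:8983/solr/example_core1/select"
params = {
    "q": "allText:hello",
    "fq": "user_ids:(8 OR 8738)",  # matches docs tagged with either id
    "wt": "json",
}
url = base + "?" + urlencode(params)
print(url)
```

If the data arrives as comma-separated strings via the CSV handler, the field still has to be split into individual values at index time (e.g. with the CSV update handler's per-field split/separator options) for a filter like this to work.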
Re: Suggest aka autocomplete request handler with solr 4.4
Hi Chris, You were right, appl was matched to application. So, I created a new type without the stemmer. New type:

<fieldType name="text_spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Which has a field:

<field name="spellText" type="text_spell" indexed="true" stored="false" multiValued="true" omitNorms="true" termVectors="false" termPositions="false" termOffsets="false"/>

Which is a copyField target:

<copyField source="title" dest="spellText"/>
<copyField source="description" dest="spellText"/>
<copyField source="category" dest="spellText"/>
<copyField source="brand" dest="spellText"/>
<copyField source="subtitle" dest="spellText"/>

Although this is my problem now: when I run this query:
http://SOLR_SERVER/solr/prodinfo/spell?q=delll&spellcheck=true&spellcheck.collate=true&spellcheck.build=true

I get this response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">9</int>
  </lst>
  <str name="command">build</str>
  <result name="response" numFound="0" start="0" maxScore="0.0"/>
  <lst name="spellcheck">
    <lst name="suggestions">
      <bool name="correctlySpelled">false</bool>
    </lst>
  </lst>
</response>

It knows the term is incorrect, but I don't get any suggestions back. What can be wrong here?
Thanks, -Utkarsh

On Thu, Aug 8, 2013 at 7:19 AM, Vinícius vinicius.remi...@gmail.com wrote: if correctSpelled is true, then appl was found in solr index. In this case, maybe the EnglishMinimalStemFilterFactory filter in text_general fieldType is messing your suggestion.

-- Thanks, -Utkarsh
Re: Suggest aka autocomplete request handler with solr 4.4
Jack/Chris,

1. This is my complete schema.xml: https://gist.github.com/utkarsh2012/6167128/raw/1d5ac6520b666435cd040b5cc6dcb434cdfd7925/schema.xml More specifically, allText is of type text_general, which has a LowerCaseFilterFactory at index time.

2. allText has values: http://solr_server/solr/prodinfo/terms?terms.fl=allText&terms.limit=100&indent=true returns a lot of values. I have never used the /terms request handler before, but it is very slow.

3. When I try this query: http://solr_server/solr/prodinfo/spell?q=appl&spellcheck=true&spellcheck.collate=true&spellcheck.build=true I get documents back which match the query appl. But my expectation is to get the spell-corrected keywords back, like apple, AND the documents with the keyword apple. Response from the above query:

<result>
  <doc>...</doc>
  <doc>...</doc>
  ...
</result>
<lst name="spellcheck">
  <lst name="suggestions">
    <bool name="correctlySpelled">true</bool>
  </lst>
</lst>

Thanks, -Utkarsh

On Mon, Aug 5, 2013 at 4:56 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: Where allText is a copy field which indexes all the content I have in
: document title, description etc.

what does the fieldType of allText look like?

: I have reindexed my data after adding this config (i.e. loading the whole
: dataset again via UpdateCSV), also tried to reload the core via http.

did you note the comments on that page regarding spellcheck.build? NOTE: currently implemented Lookup-s keep their data in memory, so unlike spellchecker data this data is discarded on core reload and not available until you invoke the build command, either explicitly or implicitly via commit. -Hoss

-- Thanks, -Utkarsh
Re: Suggest aka autocomplete request handler with solr 4.4
Bumping this one, is this feature maintained anymore? Thanks, -Utkarsh

On Fri, Aug 2, 2013 at 2:27 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: I am trying to get autocorrect and suggest feature work on my solr 4.4 setup. ...

-- Thanks, -Utkarsh
Suggest aka autocomplete request handler with solr 4.4
I am trying to get the autocorrect and suggest features to work on my solr 4.4 setup. As recommended here: http://wiki.apache.org/solr/Suggester, this is my solrconfig: http://apaste.info/eBPr Where allText is a copy field which indexes all the content I have in a document (title, description etc.). I am trying to use it like this: http://solr_server/solr/core1/suggest?q=appl and I expect to see apple, but I get this response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
</response>

I have reindexed my data after adding this config (i.e. loading the whole dataset again via UpdateCSV), and also tried to reload the core via http. So, I have 2 questions:

1. Is there a better way to reindex from the solr admin panel directly, without actually going through the process of loading the data again?
2. Any suggestions on what I am missing with the suggest request handler?

-- Thanks, -Utkarsh
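For intuition, the Suggester is conceptually a prefix lookup over indexed terms; Solr's actual Lookup implementations (TST/FST based) are far more compact, live in memory, and have to be (re)built via the build command after a core reload. A toy Python sketch of the idea only:

```python
import bisect

class PrefixSuggester:
    """Toy prefix lookup, illustrating what the Suggester does
    conceptually. Not Solr code; no weights, no fuzziness."""
    def __init__(self, terms):
        self.terms = sorted(set(terms))  # "build" step: sorted dictionary

    def suggest(self, prefix, count=5):
        # Binary-search to the first term >= prefix, then scan forward
        # while terms still share the prefix.
        i = bisect.bisect_left(self.terms, prefix)
        out = []
        while i < len(self.terms) and self.terms[i].startswith(prefix):
            out.append(self.terms[i])
            if len(out) == count:
                break
            i += 1
        return out

s = PrefixSuggester(["apple", "apply", "application", "iphone"])
print(s.suggest("appl"))  # -> ['apple', 'application', 'apply']
```

The "dictionary" here corresponds to the terms of the configured field (allText above), which is why an unbuilt or empty lookup returns an empty response.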
Re: Sort top N results in solr after boosting
Thanks guys! Will play around with the function query. Thanks, -Utkarsh

On Tue, Jul 30, 2013 at 10:50 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

: bq: I am also trying to figure out if I can place
: extra dimensions to the solr score which takes other attributes into
: consideration

To re-iterate Erick's point, you should definitely look at using things like the {!boost} qparser combined with function queries that take into account pre-computed numeric data based on your domain knowledge to *augment* the scoring you get from text relevancy -- that is likely to prove far superior to taking some arbitrary cut-off of the top N documents and then sorting based on your domain knowledge...

https://people.apache.org/~hossman/ac2012eu/
https://www.youtube.com/watch?v=AosaVoBk8ok&list=PLsj1Ri57ZE94lISrJuy7W8COc2RNFC1Fl&index=2

-Hoss

-- Thanks, -Utkarsh
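A sketch of what a {!boost} query along the lines Hoss suggests might look like, using the attachment_count field from this thread. The host/core and the exact boost function are placeholders, not a recommendation; log(sum(attachment_count,1)) is just one plausible way to fold the count into the score without zeroing it out:

```python
from urllib.parse import urlencode

# {!boost b=...} multiplies the text-relevancy score of the wrapped
# query by the function value, so attachment-rich docs rise without a
# hard top-N cut. Hypothetical field/host names.
params = {
    "q": "{!boost b=log(sum(attachment_count,1))}iphone 5",
    "wt": "json",
}
query_string = urlencode(params)
print("http://solr_server/solr/core1/select?" + query_string)
```

Unlike the client-side re-sort of the top N, this blends the signal into every document's score.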
Re: monitor jvm heap size for solrcloud
We have been using newrelic (they have a free plan too) and it gives all the needed info, like: jvm heap usage in eden space, survivor space and old gen, garbage collection info, detailed info about the solr requests and their response times, error rates etc. I highly recommend using newrelic to monitor your solr cluster: http://blog.newrelic.com/2010/05/11/got-apache-solr-search-server-use-rpm-to-monitor-troubleshoot-and-tune-solr-operations/ Thanks, -Utkarsh

On Fri, Jul 26, 2013 at 2:38 PM, SolrLover bbar...@gmail.com wrote: I have used JMX with SOLR before.. http://docs.lucidworks.com/display/solr/Using+JMX+with+Solr -- View this message in context: http://lucene.472066.n3.nabble.com/monitor-jvm-heap-size-for-solrcloud-tp4080713p4080725.html Sent from the Solr - User mailing list archive at Nabble.com.

-- Thanks, -Utkarsh
Re: Sort top N results in solr after boosting
I agree with your comment on separating noise from the actual relevant results. My approach to separating relevant results from noise is not algorithmic but an absolute measure, i.e. the top 5 or top 10 results will always be relevant (at least the probability is higher). But again, that kind of simple sort can be done by the client too. The current relevant results are purely based off PMIs calculated using the clickstream data.

I am also trying to figure out if I can add extra dimensions to the solr score which take other attributes into consideration, i.e. extending the way solr computes the score with attachment_count (more attachments, more important), confidence (a stronger source has higher confidence) etc. Is there a way I can have my own custom scoring function which extends (and does not overwrite) solr's scores?

Thanks, -Utkarsh

On Wed, Jul 24, 2013 at 7:35 PM, Erick Erickson erickerick...@gmail.com wrote: You can certainly just include the attachment count in the response and have the app apply the secondary sort. But that doesn't separate the noise as you say. How would you identify noise? If you don't have an algorithmic way to do that, I don't know how you'd manage to separate the signal from the noise. Best, Erick

-- Thanks, -Utkarsh
Sort top N results in solr after boosting
I have a solr query which has a bunch of boost params for relevancy. This search works fine and returns the most relevant documents as per the user query. For example, if a user searches for iphone 5, keywords like apple, wifi etc. are boosted. I get these keywords from external training. The top 10-20 results are iphone 5 phones, followed by iphone cases and other noise. But I also have a field in the schema called attachment_count. I need to sort the top N results I get after the boost based on this field.

Example: I want to sort the top 5 documents based on attachment_count on the boosted result (which are relevant for the user).

1. iphone 5 32gb, attachment_count=0
2. iphone 5 16gb, attachment_count=5
3. iphone 5 32gb, attachment_count=10
4. iphone 4gs, attachment_count=3
5. iphone 4, attachment_count=1
...
11. iphone 5 case, attachment_count=100

Expected result:

1. iphone 5 32gb, attachment_count=10
2. iphone 5 16gb, attachment_count=5
3. iphone 4gs, attachment_count=3
4. iphone 4, attachment_count=1
5. iphone 5 32gb, attachment_count=0
...
11. iphone 5 case, attachment_count=100

Is this possible using a function query? I am not sure how the results will look, but I want to try it out. -- Thanks, -Utkarsh
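Whatever the function-query route turns out to look like, the literal "sort only the top N" behaviour is straightforward to do client-side. A minimal Python sketch using the example above (doc titles/counts taken from the post):

```python
def resort_top_n(docs, n=5, key="attachment_count"):
    """Re-sort only the first n relevance-ranked docs by the given
    field (descending), leaving the tail order untouched."""
    head = sorted(docs[:n], key=lambda d: d[key], reverse=True)
    return head + docs[n:]

docs = [
    {"title": "iphone 5 32gb", "attachment_count": 0},
    {"title": "iphone 5 16gb", "attachment_count": 5},
    {"title": "iphone 5 32gb", "attachment_count": 10},
    {"title": "iphone 4gs", "attachment_count": 3},
    {"title": "iphone 4", "attachment_count": 1},
    {"title": "iphone 5 case", "attachment_count": 100},
]
print([d["attachment_count"] for d in resort_top_n(docs)])
# -> [10, 5, 3, 1, 0, 100]
```

Note the noisy "iphone 5 case" (count 100) stays outside the re-sorted head, which is exactly the hard cut-off Hoss cautions against elsewhere in this thread.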
Re: How to use joins in solr 4.3.1
)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:365)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:662)

84343363 [qtp2012387303-17] ERROR org.apache.solr.core.SolrCore – org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://x:8983/solr/location returned non ok status:500, message:Server Error
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
    at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:156)
    at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:119)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)

84343364 [qtp2012387303-17] INFO org.apache.solr.core.SolrCore – [location] webapp=/solr path=/select params={indent=true&q=*:*&_=1373999505886&wt=xml&fq={!join+from%3Dkey+to%3DmerchantId+fromIndex%3Dmerchant}} status=500 QTime=185

84343365 [qtp2012387303-17] ERROR org.apache.solr.servlet.SolrDispatchFilter – null:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://x:8983/solr/location returned non ok status:500, message:Server Error
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
    at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:156)
    at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:119)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)

Thanks, -Utkarsh

On Tue, Jul 16, 2013 at 5:24 AM, Erick Erickson erickerick...@gmail.com wrote: Not quite sure what's the problem with the second, but the first is: q=: That just isn't legal, try q=*:* As for the second, are there any other errors in the solr log? Sometimes what's returned in the response packet does not include the true source of the problem. Best, Erick

On Mon, Jul 15, 2013 at 7:40 PM, Utkarsh Sengar utkarsh2...@gmail.com
Re: How to use joins in solr 4.3.1
Found this post: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201302.mbox/%3CCAB_8Yd82aqq=oY6dBRmVjG7gvBBewmkZGF9V=fpne4xgkbu...@mail.gmail.com%3E

And based on the answer, I modified my query: localhost:8983/solr/location/select?fq={!join from=key to=merchantId fromIndex=merchant}*:* I don't see any errors, but my original problem still persists: no documents are returned. The two fields on which I am trying to join are:

Merchant: <field name="merchantId" type="string" indexed="true" stored="true" multiValued="false"/>
Location: <field name="merchantId" type="string" indexed="false" stored="true" multiValued="false"/>

Thanks, -Utkarsh

On Tue, Jul 16, 2013 at 11:39 AM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Looks like the JoinQParserPlugin is throwing an NPE. Query: localhost:8983/solr/location/select?q=*:*&fq={!join from=key to=merchantId fromIndex=merchant}

84343345 [qtp2012387303-16] ERROR org.apache.solr.core.SolrCore – java.lang.NullPointerException
    at org.apache.solr.search.JoinQuery.hashCode(JoinQParserPlugin.java:580)
    at org.apache.solr.search.QueryResultKey.<init>(QueryResultKey.java:50)
    at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1274)
    at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:457)
    at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:410)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:365)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:662)

84343350 [qtp2012387303-16] INFO org.apache.solr.core.SolrCore – [location] webapp=/solr path=/select params={distrib=false&wt=javabin&version=2&rows=10&df=allText&fl=key,score&shard.url=x:8983/solr/location/&NOW=1373999694930&start=0&q=*:*&_=1373999505886&isShard=true&fq={!join+from%3Dkey+to%3DmerchantId+fromIndex%3Dmerchant}&fsv=true} status=500 QTime=6

84343351 [qtp2012387303-16] ERROR org.apache.solr.servlet.SolrDispatchFilter – null:java.lang.NullPointerException
    at org.apache.solr.search.JoinQuery.hashCode(JoinQParserPlugin.java:580)
    at org.apache.solr.search.QueryResultKey.<init>(QueryResultKey.java:50)
    at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1274)
    at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:457
How to use joins in solr 4.3.1
Hello, I am trying to join data between two cores: merchant and location. This is my query:

http://_server_.com:8983/solr/location/select?q={!join from=merchantId to=merchantId fromIndex=merchant}walgreens

Ref: http://wiki.apache.org/solr/Join

The merchant core has documents for the query walgreens with a merchantId of 1. A simple query http://_server_.com:8983/solr/location/select?q=walgreens returns documents called walgreens with merchantId=1. The location core has documents with merchantId=1 too. But my join query returns no documents. This is the response I get:

{
  "responseHeader":{
    "status":0,
    "QTime":5,
    "params":{
      "debugQuery":"true",
      "indent":"true",
      "q":"{!join from=merchantId to=merchantId fromIndex=merchant}walgreens",
      "wt":"json"}},
  "response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]},
  "debug":{
    "rawquerystring":"{!join from=merchantId to=merchantId fromIndex=merchant}walgreens",
    "querystring":"{!join from=merchantId to=merchantId fromIndex=merchant}walgreens",
    "parsedquery":"JoinQuery({!join from=merchantId to=merchantId fromIndex=merchant}allText:walgreens)",
    "parsedquery_toString":"{!join from=merchantId to=merchantId fromIndex=merchant}allText:walgreens",
    "QParser":"",
    "explain":{}}}

Any suggestions? -- Thanks, -Utkarsh
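For reference, a sketch of the shape of the query the rest of this thread converges on: put the join in an fq and keep a real main query (q=*:*). The host is a placeholder; key/merchantId are the from/to fields quoted later in the thread, and allText:walgreens is taken from the parsed query above:

```python
from urllib.parse import urlencode

# Cross-core join: select location docs whose merchantId matches the
# key of merchant docs that match allText:walgreens. Host is a
# placeholder.
params = {
    "q": "*:*",
    "fq": "{!join from=key to=merchantId fromIndex=merchant}allText:walgreens",
    "wt": "json",
}
url = "http://localhost:8983/solr/location/select?" + urlencode(params)
print(url)
```

One thing worth checking (general Lucene/Solr join behaviour, not something this thread states): the to field has to be indexed, so a location-side merchantId declared with indexed="false" would make the join match nothing.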
Re: How to use joins in solr 4.3.1
I have also tried these queries (as per this SO answer: http://stackoverflow.com/questions/12665797/is-solr-4-0-capable-of-using-join-for-multiple-core ):

1. http://_server_.com:8983/solr/location/select?q=:&fq={!join from=merchantId to=merchantId fromIndex=merchant}walgreens

And I get this:

{ responseHeader:{ status:400, QTime:1, params:{ indent:true, q::, wt:json, fq:{!join from=merchantId to=merchantId fromIndex=merchant}walgreens}}, error:{ msg:org.apache.solr.search.SyntaxError: Cannot parse ':': Encountered ':' at line 1, column 0. Was expecting one of: NOT ... + ... - ... BAREOPER ... ( ... * ... QUOTED ... TERM ... PREFIXTERM ... WILDTERM ... REGEXPTERM ... [ ... { ... LPARAMS ... NUMBER ... TERM ... * ..., code:400}}

2. http://_server_.com:8983/solr/location/select?q=walgreens&fq={!join from=merchantId to=merchantId fromIndex=merchant}

And I get this:

{ responseHeader:{ status:500, QTime:5, params:{ indent:true, q:walgreens, wt:json, fq:{!join from=merchantId to=merchantId fromIndex=merchant}}}, error:{ msg:Server at http://_SERVER_:8983/solr/location returned non ok status:500, message:Server Error, trace:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://_SERVER_:8983/solr/location returned non ok status:500, message:Server Error
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:156)
at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:119)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662), code:500}}

Thanks, -Utkarsh

On Mon, Jul 15, 2013 at 4:27 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote:

-- Thanks, -Utkarsh
Re: Improving performance to return 2000+ documents
Thanks Erick/Jagdish. Just to give some background on my queries:

1. All my queries are unique. A query can be "ipod" or "ipod 8gb" (but each is unique). There are about 1.2M in total. So I assume setting a high queryResultCache, queryResultWindowSize and queryResultMaxDocsCached won't help.

2. I have these cache settings:

<documentCache class="solr.LRUCache" size="1" initialSize="1" autowarmCount="0" cleanupThread="true"/>
// My understanding is, documentCache will help me the most because solr will cache retrieved documents.
// Stats for documentCache: http://apaste.info/hknh

<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0" cleanupThread="true"/>
// Default, since my queries are unique.

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
// Not sure how I can use filterCache, so I am keeping it at the default.

<enableLazyFieldLoading>true</enableLazyFieldLoading>
<queryResultWindowSize>100</queryResultWindowSize>
<queryResultMaxDocsCached>100</queryResultMaxDocsCached>

I think the question can also be framed as: how can I optimize solr response time for a 50M product catalog, for unique queries which retrieve 2000 documents in one go? I looked at writing a solr search component, but writing a proxy around solr was easier, so I went ahead with that approach.

Thanks, -Utkarsh

On Sun, Jun 30, 2013 at 6:54 PM, Jagdish Nomula jagd...@simplyhired.com wrote:
Solrconfig.xml has got entries which you can tweak for your use case. One of them is queryresultwindowsize. You can try using the value of 2000 and see if it helps improve performance. Please make sure you have enough memory allocated for queryresultcache. A combination of sharding and distribution of workload (requesting 2000/number of shards) with an aggregator would be a good way to maximize performance.
Thanks, Jagdish

On Sun, Jun 30, 2013 at 6:48 PM, Erick Erickson erickerick...@gmail.com wrote:
50M documents, depending on a bunch of things, may not be unreasonable for a single node, only testing will tell. But the question I have is whether you should be using standard Solr queries for this or building a custom component that goes at the base Lucene index and does the right thing. Or even re-indexing your entire corpus periodically to add this kind of data. FWIW, Erick

On Sun, Jun 30, 2013 at 2:00 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote:
Thanks Erick/Peter. This is an offline process, used by a relevancy engine implemented around solr. The engine computes boost scores for related keywords based on clickstream data, i.e. say the clickstream has: ipad=upc1,upc2,upc3. I query solr with the keyword "ipad" (to get 2000 documents) and then make 3 individual queries for upc1, upc2 and upc3 (which are fast). The data is then used to compute keywords related to "ipad" with their boost values. So I cannot really replace that, since I need full-text search over my dataset to retrieve the top 2000 documents. I tried paging: I retrieve 500 solr documents 4 times (0-500, 500-1000, ...), but don't see any improvement.

Some questions:
1. Maybe the JVM size might help? This is what I see in the dashboard: Physical Memory 76.2%, Swap Space NaN% (don't have any swap space, running on AWS EBS), File Descriptor Count 4.7%, JVM-Memory 73.8%. Screenshot: http://i.imgur.com/aegKzP6.png
2. Will reducing the shards from 3 to 1 improve performance? (Maybe increase the RAM from 30 to 60GB.) The problem I will face in that case will be fitting 50M documents on 1 machine.

Thanks, -Utkarsh

On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge peter.stu...@gmail.com wrote:
Hello Utkarsh, This may or may not be relevant for your use-case, but the way we deal with this scenario is to retrieve the top N documents 5, 10, 20 or 100 at a time (user selectable). We can then page the results, changing the start parameter to return the next set. This allows us to 'retrieve' millions of documents - we just do it at the user's leisure, rather than make them wait for the whole lot in one go. This works well because users very rarely want to see ALL 2000 (or whatever number) documents at one time - it's simply too much to take in at one time. If your use-case involves an automated or offline procedure (e.g. running a report or some data-mining op), then presumably it doesn't matter so much if it takes a bit longer (as long as it returns in some reasonable time). Have you looked at doing paging on the client side - this will hugely speed up your search time. HTH Peter

On Sat, Jun 29, 2013 at 6:17 PM, Erick
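Peter's start/rows paging suggestion comes down to simple offset bookkeeping on the client. A small sketch of generating the page windows (plain Python, no Solr client assumed; each pair maps to the start=... and rows=... request parameters):

```python
def page_params(total, page_size):
    """Yield (start, rows) pairs covering `total` documents in page_size chunks."""
    for start in range(0, total, page_size):
        yield start, min(page_size, total - start)

# Fetching 2000 documents as four pages of 500, as tried earlier in the thread:
pages = list(page_params(2000, 500))
print(pages)  # [(0, 500), (500, 500), (1000, 500), (1500, 500)]
```

Note that with plain start/rows, deep offsets still cost each shard work proportional to start + rows, which is consistent with the observation in the thread that paging alone did not improve the total time for 2000 documents.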
Re: Improving performance to return 2000+ documents
Thanks Erick/Peter. This is an offline process, used by a relevancy engine implemented around solr. The engine computes boost scores for related keywords based on clickstream data, i.e. say the clickstream has: ipad=upc1,upc2,upc3. I query solr with the keyword "ipad" (to get 2000 documents) and then make 3 individual queries for upc1, upc2 and upc3 (which are fast). The data is then used to compute keywords related to "ipad" with their boost values. So I cannot really replace that, since I need full-text search over my dataset to retrieve the top 2000 documents. I tried paging: I retrieve 500 solr documents 4 times (0-500, 500-1000, ...), but don't see any improvement.

Some questions:
1. Maybe the JVM size might help? This is what I see in the dashboard: Physical Memory 76.2%, Swap Space NaN% (don't have any swap space, running on AWS EBS), File Descriptor Count 4.7%, JVM-Memory 73.8%. Screenshot: http://i.imgur.com/aegKzP6.png
2. Will reducing the shards from 3 to 1 improve performance? (Maybe increase the RAM from 30 to 60GB.) The problem I will face in that case will be fitting 50M documents on 1 machine.

Thanks, -Utkarsh

On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge peter.stu...@gmail.com wrote:
Hello Utkarsh, This may or may not be relevant for your use-case, but the way we deal with this scenario is to retrieve the top N documents 5, 10, 20 or 100 at a time (user selectable). We can then page the results, changing the start parameter to return the next set. This allows us to 'retrieve' millions of documents - we just do it at the user's leisure, rather than make them wait for the whole lot in one go. This works well because users very rarely want to see ALL 2000 (or whatever number) documents at one time - it's simply too much to take in at one time. If your use-case involves an automated or offline procedure (e.g. running a report or some data-mining op), then presumably it doesn't matter so much if it takes a bit longer (as long as it returns in some reasonable time).
Have you looked at doing paging on the client side - this will hugely speed up your search time. HTH Peter

On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson erickerick...@gmail.com wrote:
Well, depending on how many docs get served from the cache the time will vary. But this is just ugly, if you can avoid this use-case it would be a Good Thing. Problem here is that each and every shard must assemble the list of 2,000 documents (just ID and sort criteria, usually score). Then the node serving the original request merges the sub-lists to pick the top 2,000. Then the node sends another request to each shard to get the full document. Then the node merges this into the full list to return to the user. Solr really isn't built for this use-case, is it actually a compelling situation? And having your document cache set at 1M is kinda high if you have very big documents. FWIW, Erick

On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote:
Also, I don't see a consistent response time from solr, I ran ab again and I get this:

ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 "http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json"
Benchmarking x.amazonaws.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests

Server Software:
Server Hostname: x.amazonaws.com
Server Port: 8983
Document Path: /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
Document Length: 1538537 bytes
Concurrency Level: 10
Time taken for tests: 10.858 seconds
Complete requests: 500
Failed requests: 8 (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
Write errors: 0
Total transferred: 769297992 bytes
HTML transferred: 769268492 bytes
Requests per second: 46.05 [#/sec] (mean)
Time per request: 217.167 [ms] (mean)
Time per request: 21.717 [ms] (mean, across all concurrent requests)
Transfer rate: 69187.90 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.3      0       2
Processing:   110  215  72.0    190     497
Waiting:       91  180  70.5    152     473
Total:        112  216  72.0    191     497

Percentage of the requests served within a certain time (ms)
  50%    191
  66%    225
  75%    252
  80%    272
  90%    319
  95%    364
  98%    420
  99%    453
 100%    497 (longest request)

Sometimes it takes a lot of time, sometimes it's pretty quick. Thanks, -Utkarsh

On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Hello, I have
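Erick's caution about a 1M-entry documentCache can be sanity-checked against the numbers in this very thread. A back-of-envelope sketch - it assumes a cached document costs at least its serialized response size, with the per-doc size derived from the ab run's Document Length; the actual Java heap cost will be higher due to object overhead:

```python
# Per-document size estimated from the ab output above:
response_bytes = 1538537      # "Document Length" for a rows=2000 response
docs_per_response = 2000
avg_doc_bytes = response_bytes / docs_per_response   # ~769 bytes/doc

# Lower bound on heap for a 1M-entry documentCache:
cache_entries = 1_000_000
est_heap_bytes = cache_entries * avg_doc_bytes
print(round(avg_doc_bytes), round(est_heap_bytes / 2**30, 2))  # roughly 0.7 GB, before JVM overhead
```

Against the 7GB heap mentioned in the thread, that lower bound is already significant once per-object overhead (often several times the raw payload) is added.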
Improving performance to return 2000+ documents
Hello, I have a usecase where I need to retrieve the top 2000 documents matching a query. What are the parameters (in the query, solrconfig, schema) I should look at to improve this?

I have 45M documents in a 3-node solrcloud 4.3.1 cluster with 3 shards, with 30GB RAM, 8 vCPU and 7GB JVM heap size. I have documentCache:

<documentCache class="solr.LRUCache" size="100" initialSize="100" autowarmCount="0"/>

allText is a copyField. This is the result I get:

ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 "http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json"
Benchmarking x.amazonaws.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests

Server Software:
Server Hostname: x.amazonaws.com
Server Port: 8983
Document Path: /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
Document Length: 1538537 bytes
Concurrency Level: 10
Time taken for tests: 35.999 seconds
Complete requests: 500
Failed requests: 21 (Connect: 0, Receive: 0, Length: 21, Exceptions: 0)
Write errors: 0
Non-2xx responses: 2
Total transferred: 766221660 bytes
HTML transferred: 766191806 bytes
Requests per second: 13.89 [#/sec] (mean)
Time per request: 719.981 [ms] (mean)
Time per request: 71.998 [ms] (mean, across all concurrent requests)
Transfer rate: 20785.65 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.6      0       8
Processing:     9  717 2339.6    199   12611
Waiting:        9  635 2233.6    164   12580
Total:          9  718 2339.6    199   12611

Percentage of the requests served within a certain time (ms)
  50%    199
  66%    236
  75%    263
  80%    281
  90%    548
  95%    838
  98%  12475
  99%  12545
 100%  12611 (longest request)

-- Thanks, -Utkarsh
Re: Improving performance to return 2000+ documents
Also, I don't see a consistent response time from solr, I ran ab again and I get this:

ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 "http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json"
Benchmarking x.amazonaws.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests

Server Software:
Server Hostname: x.amazonaws.com
Server Port: 8983
Document Path: /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
Document Length: 1538537 bytes
Concurrency Level: 10
Time taken for tests: 10.858 seconds
Complete requests: 500
Failed requests: 8 (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
Write errors: 0
Total transferred: 769297992 bytes
HTML transferred: 769268492 bytes
Requests per second: 46.05 [#/sec] (mean)
Time per request: 217.167 [ms] (mean)
Time per request: 21.717 [ms] (mean, across all concurrent requests)
Transfer rate: 69187.90 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.3      0       2
Processing:   110  215  72.0    190     497
Waiting:       91  180  70.5    152     473
Total:        112  216  72.0    191     497

Percentage of the requests served within a certain time (ms)
  50%    191
  66%    225
  75%    252
  80%    272
  90%    319
  95%    364
  98%    420
  99%    453
 100%    497 (longest request)

Sometimes it takes a lot of time, sometimes it's pretty quick. Thanks, -Utkarsh

On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote:

-- Thanks, -Utkarsh
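For anyone comparing the two ab runs in this thread, the summary figures are related by simple arithmetic. A sketch using the faster run's numbers (tiny differences from ab's printout come from ab rounding the reported total time):

```python
# Numbers from the 10.858 s ab run above:
total_time_s = 10.858
requests = 500
concurrency = 10

rps = requests / total_time_s                                  # "Requests per second"
mean_per_request_ms = total_time_s / requests * concurrency * 1000  # "Time per request (mean)"
mean_across_all_ms = total_time_s / requests * 1000            # "(mean, across all concurrent requests)"

print(round(rps, 2), round(mean_per_request_ms, 2), round(mean_across_all_ms, 2))
```

This arithmetic also shows why the first run's 719.981 ms mean coexists with a 199 ms median: a handful of ~12-second outliers dominate the mean, which points at intermittent stalls (GC pauses, cache misses) rather than uniformly slow queries.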
Updating solrconfig and schema.xml for solrcloud in multicore setup
Hello, I am trying to update schema.xml for a core in a multicore setup, and this is what I do to update it. I have 3 nodes in my solr cluster:

1. Pick node1 and manually update schema.xml
2. Restart node1 with -Dbootstrap_conf=true:
java -Dsolr.solr.home=multicore -DnumShards=3 -Dbootstrap_conf=true -DzkHost=localhost:2181 -DSTOP.PORT=8079 -DSTOP.KEY=mysecret -jar start.jar
3. Restart the other 2 nodes using this command (without -Dbootstrap_conf=true, since these should pull from zk):
java -Dsolr.solr.home=multicore -DnumShards=3 -DzkHost=localhost:2181 -DSTOP.PORT=8079 -DSTOP.KEY=mysecret -jar start.jar

But when I do that, node1 displays all of my cores and the other 2 nodes display just one core. Then I found this: http://mail-archives.apache.org/mod_mbox/lucene-dev/201205.mbox/%3cbb7ad9bf-389b-4b94-8c1b-bbfc4028a...@gmail.com%3E which says bootstrap_conf is used for a multicore setup. But if I use bootstrap_conf for every node, then I will have to manually update schema.xml (or any config file) everywhere. That does not sound like an efficient way of managing configuration, right? -- Thanks, -Utkarsh
Re: Updating solrconfig and schema.xml for solrcloud in multicore setup
But when I launch a solr instance without -Dbootstrap_conf=true, just one core is launched and I cannot see the other core. This behavior is the same as Mark's reply here: http://mail-archives.apache.org/mod_mbox/lucene-dev/201205.mbox/%3cbb7ad9bf-389b-4b94-8c1b-bbfc4028a...@gmail.com%3E

- bootstrap_conf: you pass it true and it reads solr.xml and uploads the conf set for each SolrCore it finds, gives the conf set the name of the collection and associates each collection with the same named config set. So the first just lets you bootstrap one collection easily... but what if you start with a multi-core, multi-collection setup that you want to bootstrap into SolrCloud? And they don't share a common config set? That's what the second command is for. You can set up 30 local SolrCores in solr.xml and then just bootstrap all 30 different config sets up and have them fully linked with each collection just by passing bootstrap_conf=true.

Note: I am using -Dbootstrap_conf=true and not -Dbootstrap_confdir.

Thanks, -Utkarsh

On Tue, Jun 25, 2013 at 2:14 AM, Jan Høydahl jan@cominvent.com wrote:
Hi, The -Dbootstrap_confdir option is really only meant for a first-time bootstrap for your development environment, not for serious use. Once you got your config into ZK you should modify the config directly in ZK. There are many tools (also 3rd party) for this. But your best choice is probably zkCli shipping with Solr. See http://wiki.apache.org/solr/SolrCloud#Command_Line_Util This means you will NOT need to start Solr with -Dboostrap_confdir at all. -- Jan Høydahl, search solution architect, Cominvent AS - www.cominvent.com

25. juni 2013 kl. 10:29 skrev Utkarsh Sengar utkarsh2...@gmail.com:

-- Thanks, -Utkarsh
Re: Updating solrconfig and schema.xml for solrcloud in multicore setup
Yes, I have tried zkCli and it works. But I also need to restart solr after the schema change, right? I tried to reload the core, but I think there is an open bug where a core reload is successful but a shard goes down for that core. I just tried it out, i.e. I tried to reload a core after a config change via zkCli and a shard went down. Since I am not able to reload a core, I am restarting the whole solr process to make the change.

Thanks, -Utkarsh

On Tue, Jun 25, 2013 at 2:46 AM, Jan Høydahl jan@cominvent.com wrote:
Hi, As I understand, your initial bootstrap works ok (boostrap_conf). What you want help with is *changing* the config on a live system. That's when you are encouraged to use zkCli and don't mess with trying to let Solr bootstrap things - after all it's not a bootstrap anymore, it's a change. Did you try updating schema.xml for a specific collection using zkCli? Any issues? -- Jan Høydahl, search solution architect, Cominvent AS - www.cominvent.com

25. juni 2013 kl. 11:24 skrev Utkarsh Sengar utkarsh2...@gmail.com:

-- Thanks, -Utkarsh
Re: Updating solrconfig and schema.xml for solrcloud in multicore setup
I believe I am hitting this bug: https://issues.apache.org/jira/browse/SOLR-4805. I am using solr 4.3.1.

-Utkarsh

On Tue, Jun 25, 2013 at 2:56 AM, Utkarsh Sengar utkarsh2...@gmail.com wrote:

-- Thanks, -Utkarsh
Re: solrcloud 4.3.1 - stability and failure scenario questions
Thanks! 1. shards.tolerant=true works, shouldn't this parameter be default? 2. Regarding zk, yes it should be outside the solr nodes and I am evaluating what difference does it make. 3. Regarding usecase: Daily queries will be about 100k to 200k, not much. The total data to be indexed is about 45M documents with a total size of 20GB. 3 nodes (sharded and RAM of 30GB each) with 3 replicas sounds like an overkill for this? Thanks, -Utkarsh On Sat, Jun 22, 2013 at 8:53 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Use shards.tolerant=true to return documents that are available in the shards that are still alive. Typically people setup ZooKeeper outside of Solr so that solr nodes can be added/removed easily independent of ZooKeeper plus it isolates ZK from large GC pauses due to Solr's garbage. See http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A7 Depending on you use-case, 2-3 replicas might be okay. We don't have enough information to answer that question. On Sat, Jun 22, 2013 at 10:40 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Thanks Anshum. Sure, creating a replica will make it failure resistant, but death of one shard should not make the whole cluster unusable. 1/3rd of the keys hosted in the killed shard should be unavailable but others should be available. Right? Also, any suggestions on the recommended size of zk and solr cluster size and configuration? Example: 3 shards with 3 replicas and 3 zk processes running on the same solr mode sounds acceptable? (Total of 6 VMs) Thanks, -Utkarsh On Jun 22, 2013, at 4:20 AM, Anshum Gupta ans...@anshumgupta.net wrote: You need to have at least 1 replica from each shard for the SolrCloud setup to work for you. When you kill 1 shard, you essentially are taking away 1/3 of the range of shard key. On Sat, Jun 22, 2013 at 4:31 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Hello, I am testing a 3 node solrcloud cluster with 3 shards. 3 zk nodes are running in a different process in the same machines. 
I wanted to know the recommended size of a solrcloud cluster (min zk nodes?) This is the SolrCloud dump: https://gist.github.com/utkarsh2012/5840455 And, I am not sure if I am hitting this frustrating bug or this is just a configuration error from my side. When I kill any *one* of the nodes, the whole cluster stops responding and I get this request when I query any one of the two alive nodes. { responseHeader:{ status:503, QTime:2, params:{ indent:true, q:*:*, wt:json}}, error:{ msg:no servers hosting shard: , code:503}} I see this exception: 952399 [qtp516992923-74] ERROR org.apache.solr.core.SolrCore – org.apache.solr.common.SolrException: no servers hosting shard: at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:149) at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:119) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) -- Thanks, -Utkarsh -- Anshum Gupta http://www.anshumgupta.net -- Regards, Shalin Shekhar Mangar. -- Thanks, -Utkarsh
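As a sketch of what the shards.tolerant=true suggestion looks like in practice, the snippet below builds a select URL carrying the parameter so a SolrCloud query returns results from whichever shards are still alive instead of failing outright. The host and the collection name are placeholders for your own setup.

```python
from urllib.parse import urlencode

def build_query_url(base_url, q, tolerant=True):
    """Build a Solr select URL; shards.tolerant=true asks SolrCloud to
    answer from the shards that are still alive instead of returning a
    503 when any shard has no live server."""
    params = {"q": q, "wt": "json"}
    if tolerant:
        params["shards.tolerant"] = "true"
    return base_url.rstrip("/") + "/select?" + urlencode(params)

# 'collection1' and localhost:8983 are placeholders for your own cluster.
url = build_query_url("http://localhost:8983/solr/collection1", "*:*")
```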
solrcloud 4.3.1 - stability and failure scenario questions
Hello, I am testing a 3 node solrcloud cluster with 3 shards. 3 zk nodes are running as separate processes on the same machines. I wanted to know the recommended size of a solrcloud cluster (min zk nodes?). This is the SolrCloud dump: https://gist.github.com/utkarsh2012/5840455 And I am not sure if I am hitting this frustrating bug or this is just a configuration error on my side. When I kill any *one* of the nodes, the whole cluster stops responding and I get this response when I query either of the two alive nodes. { responseHeader:{ status:503, QTime:2, params:{ indent:true, q:*:*, wt:json}}, error:{ msg:no servers hosting shard: , code:503}} I see this exception: 952399 [qtp516992923-74] ERROR org.apache.solr.core.SolrCore – org.apache.solr.common.SolrException: no servers hosting shard: at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:149) at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:119) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) -- Thanks, -Utkarsh
Re: solrcloud 4.3.1 - stability and failure scenario questions
Just to be clear: when I say I killed a node, I mean I only killed the solr process on that node. zk on all 3 nodes was still running. Thanks, -Utkarsh On Sat, Jun 22, 2013 at 4:01 AM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Hello, I am testing a 3 node solrcloud cluster with 3 shards. 3 zk nodes are running in a different process in the same machines. I wanted to know the recommended size of a solrcloud cluster (min zk nodes?) This is the SolrCloud dump: https://gist.github.com/utkarsh2012/5840455 And, I am not sure if I am hitting this frustrating bug or this is just a configuration error from my side. When I kill any *one* of the nodes, the whole cluster stops responding and I get this response when I query any one of the two alive nodes. { responseHeader:{ status:503, QTime:2, params:{ indent:true, q:*:*, wt:json}}, error:{ msg:no servers hosting shard: , code:503}} I see this exception: 952399 [qtp516992923-74] ERROR org.apache.solr.core.SolrCore – org.apache.solr.common.SolrException: no servers hosting shard: at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:149) at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:119) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) -- Thanks, -Utkarsh -- Thanks, -Utkarsh
Re: solrcloud 4.3.1 - stability and failure scenario questions
Thanks Anshum. Sure, creating a replica will make it failure resistant, but the death of one shard should not make the whole cluster unusable. 1/3rd of the keys hosted on the killed shard should be unavailable but the others should be available. Right? Also, any suggestions on the recommended zk and solr cluster size and configuration? Example: 3 shards with 3 replicas and 3 zk processes running on the same solr nodes sounds acceptable? (Total of 6 VMs) Thanks, -Utkarsh On Jun 22, 2013, at 4:20 AM, Anshum Gupta ans...@anshumgupta.net wrote: You need to have at least 1 replica from each shard for the SolrCloud setup to work for you. When you kill 1 shard, you essentially are taking away 1/3 of the range of the shard key. On Sat, Jun 22, 2013 at 4:31 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Hello, I am testing a 3 node solrcloud cluster with 3 shards. 3 zk nodes are running in a different process in the same machines. I wanted to know the recommended size of a solrcloud cluster (min zk nodes?) This is the SolrCloud dump: https://gist.github.com/utkarsh2012/5840455 And, I am not sure if I am hitting this frustrating bug or this is just a configuration error from my side. When I kill any *one* of the nodes, the whole cluster stops responding and I get this response when I query any one of the two alive nodes. 
{ responseHeader:{ status:503, QTime:2, params:{ indent:true, q:*:*, wt:json}}, error:{ msg:no servers hosting shard: , code:503}} I see this exception: 952399 [qtp516992923-74] ERROR org.apache.solr.core.SolrCore – org.apache.solr.common.SolrException: no servers hosting shard: at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:149) at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:119) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) -- Thanks, -Utkarsh -- Anshum Gupta http://www.anshumgupta.net
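Anshum's point about losing 1/3 of the shard key range can be illustrated with a toy sketch. Solr actually routes documents by MurmurHash3 over per-shard hash ranges; the simple polynomial hash below is only a stand-in, used to show why one dead shard with no replica makes roughly a third of the keys unreachable.

```python
def hash_str(s):
    # deterministic toy hash; real Solr uses MurmurHash3 over hash ranges
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h

def shard_for(doc_id, num_shards=3):
    # simplified routing: each id maps to one of num_shards buckets
    return hash_str(doc_id) % num_shards

# with shard 0 down and no replica, every id that routes there is lost
down_shard = 0
ids = [f"doc{i}" for i in range(900)]
unreachable = sum(1 for i in ids if shard_for(i) == down_shard)
fraction = unreachable / len(ids)   # roughly 1/num_shards
```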
Re: Solr4 cluster setup for high performance reads
Thanks for the update guys, I am working on the suggestions you shared. One last question about the solrcloud setup. What is the recommended cluster size for solrcloud? I have 3 nodes of solr and 3 nodes of ZK (running on the same machines, but in a different JVM). And after 2-3 days I notice that zk reports one node as down, but everything is fine on that machine. And then I get this error when I query any node: no servers hosting shard: solr. This definitely has to do with my setup; even if one node goes down, the whole cluster should not start barfing. Suggestions? Thanks, -Utkarsh On Thu, Jun 13, 2013 at 7:28 PM, Shawn Heisey s...@elyograg.org wrote: On 6/13/2013 7:51 PM, Utkarsh Sengar wrote: Sure, I will reduce the count and see how it goes. The problem I have is, after such a change, I need to reindex everything again, which again is slow and takes time (40-60 hours). There should be no need to reindex after changing most things in solrconfig.xml. Changing cache sizes does not require it. Most of the time, reindexing is only required after changing schema.xml, but there are a few changes you can make to schema that don't require it. Some queries are really bad, like this one: http://explain.solr.pl/explains/bzy034qi How can this be improved? I understand that there is something horribly wrong here, but not sure what points to look at (Been using solr for the last 20 days). You are using a *LOT* of query clauses against your allText field in that boost query. I assume that allText is your largest field. I'm not really sure, but based on what we're seeing here, I bet that a bq parameter doesn't get cached. With some additional RAM available, this might not be such a big problem. The query is simple, although it uses edismax. I have shared an explain query above. Other than the query, this is my performance stats: iostat -m 5 result: http://apaste.info/hjNV top result: http://apaste.info/jlHN You've got a pretty well-sustained iowait around ten percent. 
You are I/O bound. You need more total RAM. With indexing only happening once a day, that doesn't sound like it's a factor. If you are also having problems with garbage collection because your heap is a little bit too small, that makes all the other problems worse. For the initial training, I will hit solr 1.3M times and request 2000 documents in each query. By the current speed (just one machine), it will take me ~20 days to do the initial training. This is really mystifying. There is no need to send a million plus queries to warm your index. A few dozen or a few hundred queries should be all you need, and you don't need 2000 docs returned per query. Go with ten rows, or maybe a few dozen rows at most. Because you're using SSD, I'm not sure you need warming queries at all. Thanks, Shawn -- Thanks, -Utkarsh
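Shawn's warming advice above (a few dozen representative queries with small row counts, not 1.3M requests for 2000 docs each) can be sketched as a small warm-up loop. The host, collection name, and sample terms below are placeholders; in practice the terms should be sampled from real production query logs.

```python
from urllib.parse import urlencode

# placeholder terms -- sample these from production query logs instead
WARMING_TERMS = ["ipod", "ipad 64gb", "retina", "apple", "music player"]

def warming_urls(base_url, terms, rows=10):
    """Build a small set of warm-up request URLs; a few dozen queries
    with ~10 rows each is enough to populate Solr's caches."""
    urls = []
    for term in terms:
        params = {"q": term, "rows": rows, "wt": "json"}
        urls.append(base_url.rstrip("/") + "/select?" + urlencode(params))
    return urls

urls = warming_urls("http://localhost:8983/solr/collection1", WARMING_TERMS)
```

Each URL would then be fetched once at startup (e.g. with urllib); the point is the small count and small `rows`, not the exact terms.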
Re: Running solr cloud
Looks like zk does not contain the configuration called: collection1. You can use zkCli.sh to see what's inside the configs zk node. You can manually push a config via zkCli's upconfig (not very sure how it works). Try adding this arg: -Dbootstrap_conf=true in place of -Dbootstrap_confdir=./solr/collection1/conf and start solr. This might push the config to zk. bootstrap_conf uploads the index configuration files for all the cores to zk. Thanks, -Utkarsh On Tue, Jun 18, 2013 at 4:49 AM, Daniel Mosesson daniel.moses...@ipreo.com wrote: I cannot seem to get the default cloud setup to work properly. What I did: Downloaded the binaries, extracted. Made the pwd example Ran: java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar And got the error message: Caused by: org.apache.solr.common.cloud.ZooKeeperException: Specified config does not exist in ZooKeeper:collection1 Which caused follow up messages, etc. What am I doing wrong here? Windows 7 pro -- Thanks, -Utkarsh
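For reference, a sketch of building the zkCli upconfig invocation mentioned above. The script path and the zkhost value are assumptions that depend on your installation (localhost:9983 is the embedded-zk default when Solr itself listens on 8983); adjust both before running.

```python
# Hedged sketch: construct the zkcli.sh command that pushes a config
# directory into ZooKeeper under a named configset. Both the script
# location and the zkhost are installation-dependent placeholders.
def upconfig_cmd(confdir, confname, zkhost="localhost:9983"):
    return [
        "cloud-scripts/zkcli.sh",
        "-zkhost", zkhost,
        "-cmd", "upconfig",
        "-confdir", confdir,
        "-confname", confname,
    ]

cmd = upconfig_cmd("./solr/collection1/conf", "myconf")
# the list is ready to pass to subprocess.run(cmd) on the Solr host
```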
Solr4 cluster setup for high performance reads
Hello, I am evaluating solr for indexing a product catalog of about 45M items. The catalog mainly contains title and description, which take most of the space (other attributes are brand, category, price, etc.). The data is stored in cassandra and I am using datastax's solr (DSE 3.0.2) which handles incremental updates. The column family I am indexing is about 50GB in size and solr.data's size is about 15GB for now. *Points of interest in solr config/schema:* 1. schema.xml has a copyField called allText which merges title and description. 2. solrconfig has the following config: <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.MMapDirectoryFactory}"/> <indexConfig> <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/> <queryResultCache class="solr.LRUCache" size="100" initialSize="100" autowarmCount="10"/> <documentCache class="solr.LRUCache" size="5000" initialSize="500" autowarmCount="0"/> *Relevancy:* Now, the default text matching does not suit our search needs, so I have implemented a wrapper around the Solr API which adds boost queries to the default solr query. For example: Original query: ipod Final query: allText:ipod^1000, allText:apple^1000, allText:music^950, etc. So as you can see, I construct a new query based on related keywords and assign a score to those keywords based on relevance. This approach looks good and the results look relevant. But I am having issues with *Solr performance*. *Problems:* The initial training pulls 2000 documents from solr to find the most probable matches and calculates a score (PMI/NPMI). This query is extremely slow. Also, a regular query takes 3-4 seconds. I am currently running solr on just one VM with 12GB RAM and 8GB of heap space allocated to solr; the block storage is an SSD. What is the suggested setup for this usecase? My guess is that setting up 4 solr nodes will help, but what is the suggested RAM/heap for this kind of data? 
And what is the recommended configuration (solrconfig.xml) where I *need to speed up reads*? Also, is there a way I can debug what is going on inside solr? As you can see, my queries are not that complex, so I don't need to debug my queries but rather debug solr itself and see the troubled pieces in it. Also, I am new to solr, so is there anything else I missed to share which would help debug the problem? -- Thanks, -Utkarsh
Re: Solr4 cluster setup for high performance reads
Otis, Shawn, thanks for the reply. You can find my schema.xml and solrconfig.xml here: https://gist.github.com/utkarsh2012/5778811 To answer your questions: Those are massive caches. Rethink their size. More specifically, plug in some monitoring tool and see what you are getting out of them. Just today I looked at one of Sematext's clients' caches - 200K entries, 0 evictions == needless waste of JVM heap. So lower those numbers and increase only if you are getting evictions. Sure, I will reduce the count and see how it goes. The problem I have is, after such a change, I need to reindex everything again, which again is slow and takes time (40-60 hours). debugQuery=true output will tell you something about timings, etc. Some queries are really bad, like this one: http://explain.solr.pl/explains/bzy034qi How can this be improved? I understand that there is something horribly wrong here, but not sure what points to look at (Been using solr for the last 20 days). consider edismax and qf param instead of that field copy stuff, info on zee Wiki Related back to my last point, how can such a query be improved? Maybe using qf? back to monitoring - what is your bottleneck? The query looks simplistic. Is it IO? Memory? CPU? Share some graphs and let's look. The query is simple, although it uses edismax. I have shared an explain query above. Other than the query, this is my performance stats: iostat -m 5 result: http://apaste.info/hjNV top result: http://apaste.info/jlHN How often do you index and commit, and how many documents each time? This is done by datastax's dse. I assume it is configurable via solrconfig.xml. The updates to cassandra are daily but not all the documents are updated. What is your query rate? For the initial training, I will hit solr 1.3M times and request 2000 documents in each query. At the current speed (just one machine), it will take me ~20 days to do the initial training. 
Thanks, -Utkarsh On Thu, Jun 13, 2013 at 6:25 PM, Shawn Heisey s...@elyograg.org wrote: On 6/13/2013 5:53 PM, Utkarsh Sengar wrote: *Problems:* The initial training pulls 2000 documents from solr to find the most probable matches and calculates score (PMI/NPMI). This query is extremely slow. Also, a regular query also takes 3-4 seconds. I am running solr currently on just one VM with 12GB RAM and 8GB of Heap space is allocated to solr, the block storage is an SSD. Normally, I would say that you should have as much RAM as your heap size plus your index size, so with your 8GB heap and 15GB index, you'd want 24GB total RAM. With SSD, that requirement should not be quite so high, but you might want to try 16GB or more. Solr works much better on bare metal than it does on virtual machines. I suspect that what might be happening here is that your heap is just a little bit too small for the combination of your index size (both document count and disk space), how you use Solr, and your config, so your JVM is constantly doing garbage collections. What is the suggested setup for this usecase? My guess is, setting up 4 solr nodes will help, but what is the suggested RAM/heap for this kind of data? And what are the recommended configuration (solrconfig.xml) where I *need to speed up reads*? http://wiki.apache.org/solr/SolrPerformanceProblems http://wiki.apache.org/solr/SolrPerformanceFactors Heap size requirements are hard to predict. I can tell you that it's highly unlikely that you will need cache sizes as large as you have configured. Start with the defaults and only increase them (by small amounts) if your hitratio is not high enough. If increasing the size doesn't increase hitratio, there may be another problem. Also, is there a way I can debug what is going on with solr internally? As you can see, my queries are not that complex, so I don't need to debug my queries but just debug solr and see the troubled pieces in it. 
If you add debugQuery=true to your URL, Solr will give you a lot of extra information in the response. One of the things that would be important here is seeing how much time is spent in various components. Also, I am new to solr, so is there anything else I missed to share which would help debug the problem? Sharing the entire config, schema, examples of all fields from your indexed documents, and examples of your full queries would help. http://apaste.info How often do you index and commit, and how many documents each time? What is your query rate? Thanks, Shawn -- Thanks, -Utkarsh
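To act on the debugQuery=true advice programmatically, the timing section of the response can be ranked by component to find where time goes. A sketch follows; the nested dict mirrors the shape of Solr's debug/timing output, but the component names and numbers here are made up for illustration.

```python
# made-up response fragment shaped like Solr's "debug"/"timing" section
sample = {
    "debug": {
        "timing": {
            "time": 320.0,
            "process": {
                "time": 300.0,
                "query": {"time": 250.0},
                "facet": {"time": 40.0},
                "highlight": {"time": 10.0},
            },
        }
    }
}

def slowest_components(resp, top=3):
    """Rank the per-component timings in a debugQuery response,
    slowest first, skipping the aggregate 'time' entry."""
    process = resp["debug"]["timing"]["process"]
    comps = [(name, v["time"]) for name, v in process.items()
             if isinstance(v, dict)]
    return sorted(comps, key=lambda kv: kv[1], reverse=True)[:top]

worst = slowest_components(sample)
```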
Not able to see newly added copyField in the response (indexing is 80% complete)
Hello, I updated my schema to use a copyField and have triggered a reindex; 80% of the reindexing is complete. But when I query the data, I don't see myNewCopyFieldName being returned with the documents. Is there something wrong with my schema or do I need to wait for the indexing to complete to see the new copyField? This is my schema (redacted the actual names):
<fields>
  <field name="key" type="string" indexed="true" stored="true"/>
  <field name="1" type="string" indexed="true" stored="true"/>
  <field name="2" type="string" indexed="true" stored="true"/>
  <field name="3" type="string" indexed="false" stored="true"/>
  <field name="4" type="string" indexed="true" stored="true"/>
  <field name="5" type="string" indexed="true" stored="true"/>
  <field name="6" type="custom_type" indexed="true" stored="true"/>
  <field name="7" type="text_general" indexed="true" stored="true"/>
  <field name="8" type="string" indexed="true" stored="true"/>
  <field name="9" type="text_general" indexed="true" stored="true"/>
  <field name="10" type="text_general" indexed="true" stored="true"/>
  <field name="11" type="string" indexed="true" stored="true"/>
  <field name="12" type="string" indexed="true" stored="true"/>
  <field name="13" type="string" indexed="true" stored="true"/>
  <field name="myNewCopyFieldName" type="text_general" indexed="true" stored="true" multiValued="true"/>
</fields>
<defaultSearchField>4</defaultSearchField>
<uniqueKey>key</uniqueKey>
<copyField source="1" dest="myNewCopyFieldName"/>
<copyField source="2" dest="myNewCopyFieldName"/>
<copyField source="3" dest="myNewCopyFieldName"/>
<copyField source="4" dest="myNewCopyFieldName"/>
<copyField source="6" dest="myNewCopyFieldName"/>
Where:
<fieldType name="custom_type" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
and
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
-- Thanks, -Utkarsh
Re: Not able to see newly added copyField in the response (indexing is 80% complete)
Thanks Shawn. Find my answers below. On Thu, May 2, 2013 at 2:34 PM, Shawn Heisey s...@elyograg.org wrote: On 5/2/2013 3:13 PM, Utkarsh Sengar wrote: Hello, I updated my schema to use a copyField and have triggered a reindex, 80% of the reindexing is complete. Although when I query the data, I don't see myNewCopyFieldName being returned with the documents. Is there something wrong with my schema or do I need to wait for the indexing to complete to see the new copyField? After making sure that you restarted Solr (or reloaded the core) after changing your schema, there are two things to mention: Yes, I restarted solr and also did a reload. 1) Using stored=true with a copyField doesn't make any sense, because you already have the individual values stored with the source fields. I haven't done any testing, but Solr might ignore stored=true on copyField fields. Ah, I see, I didn't know about this. If it's not stored then it makes sense. Need to verify this though. 2) If I'm wrong about how Solr behaves with stored=true on a copyField, then a soft commit (4.x and later) or a hard commit with openSearcher=true would be required to see changes from indexing. Have you committed your updates yet? I am using Solr 4.x and soft commit is enabled. So I assume the commit happened. I see this in my solr admin: - lastModified: less than a minute ago - version: 453962 - numDocs: 26413743 - maxDoc: 28322675 - current: - indexing: yes So, lastModified = less than a minute ago means the change was committed, right? Thanks, Shawn -- Thanks, -Utkarsh
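Rather than inferring visibility from lastModified, an explicit (soft) commit removes any doubt about autocommit timing before re-querying. A minimal sketch of building that update URL; the host and core name are placeholders.

```python
from urllib.parse import urlencode

def commit_url(base_url, soft=True):
    """Build a Solr /update URL that issues an explicit commit; a soft
    commit makes recent updates visible to searchers without the full
    cost of a hard commit (supported in Solr 4.x)."""
    params = {"commit": "true", "wt": "json"}
    if soft:
        params["softCommit"] = "true"
    return base_url.rstrip("/") + "/update?" + urlencode(params)

url = commit_url("http://localhost:8983/solr/collection1")
```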
How to recover from Error opening new searcher when machine crashed while indexing
Solr 4.0 was indexing data and the machine crashed. Any suggestions on how to recover my index since I don't want to delete my data directory? When I try to start it again, I get this error: ERROR 12:01:46,493 Failed to load Solr core: xyz.index1 ERROR 12:01:46,493 Cause: ERROR 12:01:46,494 Error opening new searcher org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.init(SolrCore.java:701) at org.apache.solr.core.SolrCore.init(SolrCore.java:564) at org.apache.solr.core.CassandraCoreContainer.load(CassandraCoreContainer.java:213) at com.datastax.bdp.plugin.SolrCorePlugin.activateImpl(SolrCorePlugin.java:66) at com.datastax.bdp.plugin.PluginManager$PluginInitializer.call(PluginManager.java:161) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) Caused by: org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1290) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1402) at org.apache.solr.core.SolrCore.init(SolrCore.java:675) ... 
9 more Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in NRTCachingDirectory(org.apache.lucene.store.NIOFSDirectory@/media/SSD/data/solr.data/rlcatalogks.prodinfo/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@d7581b; maxCacheMB=48.0 maxMergeSizeMB=4.0): files: [_73ne_nrm.cfs, _73ng_Lucene40_0.tip, _73nh_nrm.cfs, _73ng_Lucene40_0.tim, _73nf.fnm, _73n5_Lucene40_0.frq, _73ne.fdt, _73nh.fdx, _73ne_nrm.cfe, _73ne.fdx, _73ne_Lucene40_0.tim, _73ne.si, _73ni.fnm, _73nh_Lucene40_0.prx, _73ni.fdt, _73n5.si, _73ne_Lucene40_0.tip, _73nf_Lucene40_0.frq, _73nf_Lucene40_0.prx, _73nf_nrm.cfe, _73ne_Lucene40_0.frq, _73ng_Lucene40_0.prx, _73nf_Lucene40_0.tip, _73n5.fdx, _73ng_Lucene40_0.frq, _73ng.fnm, _73ni.fdx, _73n5.fnm, _73nf_Lucene40_0.tim, _73ni.si, _73n5.fdt, _73nf_nrm.cfs, _73nh_nrm.cfe, _73ni_Lucene40_0.frq, _73ng.fdx, _73ne_Lucene40_0.prx, _73nh.fnm, _73nh_Lucene40_0.tip, _73nh_Lucene40_0.tim, _73nh.si, _73n5_Lucene40_0.tip, _73ni_Lucene40_0.prx, _73n5_Lucene40_0.tim, _73nf.si, _73ng_nrm.cfe, _73n5_Lucene40_0.prx, _392j_42f.del, _73ng.fdt, _73ng.si, _73ni_nrm.cfe, _73n5_nrm.cfe, _73ni_nrm.cfs, _73nf.fdx, _73ni_Lucene40_0.tip, _73n5_nrm.cfs, _73ni_Lucene40_0.tim, _73nf.fdt, _73ne.fnm, _73nh.fdt, _73nh_Lucene40_0.frq, _73ng_nrm.cfs] -- Thanks, -Utkarsh
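The root cause in the exception is the missing segments_N file: Lucene cannot open an index directory without one, and the long file listing in the trace contains none. A quick sketch that confirms that diagnosis, either against the directory itself or against a listing like the one above, before resorting to repair tools or restoring from a backup:

```python
import glob
import os

def has_segments_file(index_dir):
    # Lucene cannot open an index directory without a segments_N file
    return bool(glob.glob(os.path.join(index_dir, "segments*")))

def segments_in(filenames):
    # same check against a plain file listing (e.g. copied from a log)
    return [f for f in filenames if f.startswith("segments")]

# a few entries from the exception's file list -- no segments_N present
listed = ["_73ne_nrm.cfs", "_73ng_Lucene40_0.tip", "_392j_42f.del"]
missing = not segments_in(listed)
```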
Solr's physical memory and JVM memory
Hello, I have setup a solr4 instance (just one node) and I see this memory pattern: [image: Inline image 1] Physical memory is nearly full and JVM memory is ok. I have ~40M documents (where 1 document=1KB) indexed and in the production env I am planning to setup 2 solr cloud nodes. So I have 2 questions: 1. What is the recommended memory for those 2 nodes? 2. I am not sure what Physical memory means in the context of solr. My understanding is that physical memory is the actual RAM in my machine, and 'top' says that I have used just 4.6GB of 23.7GB. Why is Solr admin reporting that I have used 22.84GB out of 23.7GB? -- Thanks, -Utkarsh
Re: Solr's physical memory and JVM memory
My bad about the attachment, there you go: http://i.imgur.com/XKtw32K.png Thanks for the detailed answer, that helps a lot. Thanks, -Utkarsh On Tue, Apr 16, 2013 at 9:48 PM, Shawn Heisey s...@elyograg.org wrote: On 4/16/2013 10:01 PM, Otis Gospodnetic wrote: Not sure if it's just me, but I'm not seeing your inlined image. It's not just you. On Tue, Apr 16, 2013 at 7:52 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: So I have 2 questions: 1. What is the recommended memory for those 2 nodes? 2. I am not sure what Physical memory means in the context of solr. My understanding is that physical memory is the actual RAM in my machine, and 'top' says that I have used just 4.6GB of 23.7GB. Why is Solr admin reporting that I have used 22.84GB out of 23.7GB? Attachments don't work well on mailing lists. We can't see your image. Best to put the file on the Internet somewhere (like dropbox or another file sharing site) and include the public link. After you get an answer to your question, you can remove the file. Answers to your two questions: 1) A good rule of thumb is that you want to have enough RAM to equal or exceed the sum of two things: The amount of memory that your programs take (including the max heap setting you give to Solr), and the size of your Solr index(es) stored on that server. You may be able to get away with less memory than this, but you do want to have enough memory for a sizable chunk of your on-disk index. Example: If Solr is the only major program running on the machine, you give Solr a 4GB heap, and your index is 20GB, an ideal setup would have at least 24GB of RAM. 2) You are seeing the result of the way that all modern operating systems work. The extra memory that is not being currently used by programs is borrowed by the operating system to cache data from your disk into RAM, so that frequently accessed data will not have to be read from the disk. Reading from main memory is many orders of magnitude faster than reading from disk. 
The memory that is being used for the disk cache (in top it shows up as 'cached') is instantly made available to programs that request it. http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html 2a) Operating systems like Linux tell you the truth about the OS using excess memory for the disk cache. With the most basic information tools, Windows tells you a semi-lie and will report that memory as free. The newest versions of Windows seem to have gotten the hint and do include tools that will give you the true picture. 2b) For good performance, Solr is extremely reliant on having a big enough disk cache so that reads from disk are rare. This is the case for most other programs too, actually. Thanks, Shawn -- Thanks, -Utkarsh
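Shawn's rule of thumb from this thread is simple enough to write down: ideal RAM is at least the program heaps (including Solr's max heap) plus the on-disk index size, so the OS page cache can hold the whole index. A sketch, using the 4GB heap / 20GB index example from the answer above:

```python
def recommended_ram_gb(solr_heap_gb, index_size_gb, other_programs_gb=0):
    """Rule of thumb from the thread: RAM >= heap + index size (+ any
    other major programs on the box), so the page cache covers the index."""
    return solr_heap_gb + index_size_gb + other_programs_gb

# the example from the thread: 4GB heap + 20GB index -> at least 24GB RAM
ideal = recommended_ram_gb(4, 20)
```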
Getting started with solr 4.2 and cassandra
Hello, I am evaluating solr 4.2 and ElasticSearch (I am new to both) for a search API, where the data sits in cassandra. Getting started with elasticsearch is pretty straightforward and I was able to write an ES river (http://www.elasticsearch.org/guide/reference/river/) which pulls data from cassandra and indexes it in ES within a day. Now, I am trying to implement something similar with solr and compare both of them. Getting started with solr/example (http://lucene.apache.org/solr/4_2_0/tutorial.html) was pretty easy and an example solr instance works. But the example folder contains a whole bunch of stuff which I am not sure I need: http://pastebin.com/Gv660mRT . I am sure I don't need 53 directories and 527 files. So my questions are: 1. How can I get a bare-bones solr app up and running with a minimum set of configuration? (I will build over it when needed, taking reference from /example) 2. What is the best practice for running solr in production? Is an approach like this (jetty+nginx) recommended: http://sacharya.com/nginx-proxy-to-jetty-for-java-apps/ ? Once I am done setting up a simple solr instance: 3. What is the general practice to import data to solr? For now, I am writing a python script which will read data in bulk from cassandra and throw it to solr. -- Thanks, -Utkarsh
Re: Getting started with solr 4.2 and cassandra
Thanks for the reply. So DSE is one of the options and I am looking into that too. Although, before diving into solr+cassandra integration (which comes out of the box with DSE), I am just trying to setup a solr instance on my local machine without the bloat the example solr instance has to offer. Any suggestions about that? Thanks, -Utkarsh On Mon, Apr 1, 2013 at 4:00 PM, Jack Krupansky j...@basetechnology.com wrote: You might want to check out DataStax Enterprise, which actually integrates Cassandra and Solr. You keep the data in Cassandra, but as data is added, updated and deleted, the Solr index is automatically updated in parallel. You can add and update data and query using either the Cassandra API or the Solr API. See: http://www.datastax.com/what-we-offer/products-services/datastax-enterprise -- Jack Krupansky -Original Message- From: Utkarsh Sengar Sent: Monday, April 01, 2013 6:34 PM To: solr-user@lucene.apache.org Subject: Getting started with solr 4.2 and cassandra Hello, I am evaluating solr 4.2 and ElasticSearch (I am new to both) for a search API, where the data sits in cassandra. Getting started with elasticsearch is pretty straightforward and I was able to write an ES river (http://www.elasticsearch.org/guide/reference/river/) which pulls data from cassandra and indexes it in ES within a day. Now, I am trying to implement something similar with solr and compare both of them. Getting started with solr/example (http://lucene.apache.org/solr/4_2_0/tutorial.html) was pretty easy and an example solr instance works. But the example folder contains a whole bunch of stuff which I am not sure I need: http://pastebin.com/Gv660mRT . I am sure I don't need 53 directories and 527 files. So my questions are: 1. 
How can I get a bare-bones solr app up and running with a minimum set of configuration? (I will build over it when needed, taking reference from /example) 2. What is the best practice for running solr in production? Is an approach like this (jetty+nginx) recommended: http://sacharya.com/nginx-proxy-to-jetty-for-java-apps/ ? Once I am done setting up a simple solr instance: 3. What is the general practice to import data to solr? For now, I am writing a python script which will read data in bulk from cassandra and throw it to solr. -- Thanks, -Utkarsh -- Thanks, -Utkarsh
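For question 3, a sketch of the kind of bulk-import script described above: batch the documents read from cassandra and build JSON update requests for Solr's /update handler. The host and collection name are placeholders, the field names are illustrative, and the actual network send (urlopen) is left out so the sketch stays side-effect free.

```python
import json
from urllib.request import Request

def batches(docs, size=500):
    """Yield fixed-size slices so large imports go out in chunks."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def update_request(base_url, batch):
    """Build a POST to Solr's update handler with a JSON array of docs;
    commit=false defers commits to an explicit commit at the end."""
    body = json.dumps(batch).encode("utf-8")
    return Request(
        base_url.rstrip("/") + "/update?commit=false",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# illustrative documents standing in for rows read from cassandra
docs = [{"key": str(i), "title": f"product {i}"} for i in range(1200)]
reqs = [update_request("http://localhost:8983/solr/collection1", b)
        for b in batches(docs)]
# each request would then be sent with urllib.request.urlopen(req),
# followed by one final explicit commit
```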