Solr results in null response

2017-12-26 Thread Kumar, Amit
Hi Team,

I have an application running on Solr 4.7.0, and I am frequently seeing null 
responses for requests to the application. On the Solr console I see the error 
below, related to 'grouping parameters', although I am setting all grouping 
parameters in code. Could you please suggest why it throws this error, the 
scenario in which it throws it, and how I can rectify it?
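
For context, a minimal SolrJ sketch of the grouping parameters involved; the URL and 
field name are placeholders. The exception below is thrown when group=true is sent 
without any of group.field, group.func, or group.query:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
SolrQuery query = new SolrQuery("*:*");
query.set("group", true);             // enables result grouping
query.set("group.field", "category"); // omitting field/func/query triggers the SolrException
QueryResponse rsp = server.query(query);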

Thanks in advance. The full error details are below:

org.apache.solr.common.SolrException: Specify at least one field, function or query to group by.
 at org.apache.solr.search.Grouping.execute(Grouping.java:298)
 at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:433)
 at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:214)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
 at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:780)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
 at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
 at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
 at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
 at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
 at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:503)
 at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
 at com.googlecode.psiprobe.Tomcat70AgentValve.invoke(Tomcat70AgentValve.java:38)
 at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
 at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
 at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
 at org.apache.catalina.ha.session.JvmRouteBinderValve.invoke(JvmRouteBinderValve.java:218)
 at org.apache.catalina.ha.tcp.ReplicationValve.invoke(ReplicationValve.java:333)
 at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:421)
 at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1070)
 at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:611)
 at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.doRun(AprEndpoint.java:2462)
 at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:2451)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
 at java.lang.Thread.run(Thread.java:745)

best,
Amit



SegmentInfo from (SolrIndexSearcher) LeafReader

2016-05-14 Thread Amit Kumar
Hey Guys,

I am writing a SearchComponent for Solr 5.4.0 that does some caching at the
level of segments, and I want to be able to get SegmentInfo from a
LeafReader. I am unable to figure that out: a LeafReader is not an instance
of SegmentReader, which is what exposes the segment information. Is it still
possible to get the SegmentInfo (is there something I am missing?) when I am
in SearchComponent.prepare/process?
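
(One avenue that may be worth trying -- a sketch assuming the leaf is a SegmentReader,
possibly behind FilterLeafReader wrappers; worth verifying against the 5.4.0 API:)

import org.apache.lucene.index.FilterLeafReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SegmentInfo;
import org.apache.lucene.index.SegmentReader;

// Unwrap any FilterLeafReader layers, then check for the concrete SegmentReader.
LeafReader unwrapped = FilterLeafReader.unwrap(leafReader);
if (unwrapped instanceof SegmentReader) {
    SegmentInfo info = ((SegmentReader) unwrapped).getSegmentInfo().info;
    // info.name can then key a per-segment cache
}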

Many thanks,
Amit


Re: How fast indexing?

2016-03-21 Thread Amit Jha
When I run the same SQL on the DB it takes only 1 second, yet only 6-7 documents 
are getting indexed per second. 

As I have a 4-node SolrCloud setup, can I run 4 import handlers to index the same 
data? Will they not overwrite each other? 

10-20KB may be a high estimate; where can I find the actual size of a document?

Rgds
AJ

> On 22-Mar-2016, at 05:32, Shawn Heisey <apa...@elyograg.org> wrote:
> 
>> On 3/20/2016 6:11 PM, Amit Jha wrote:
>> In my case I am using DIH to index the data and Query is having 2 join 
>> statements. To index 70K documents it is taking 3-4Hours. Document size 
>> would be around 10-20KB. DB is MSSQL and using solr4.2.10 in cloud mode.
> 
> My source data is in a MySQL database.  I use DIH for full rebuilds and
> SolrJ for maintenance.
> 
> My index is sharded, but I'm not running SolrCloud.  When using DIH, all
> of my shards build at once, and each one achieves about 750 docs per
> second.  With six large shards, rebuilding a 146 million document index
> takes 9-10 hours.  It produces a total index size in the ballpark of 170GB.
> 
> DIH has a performance limitation -- it's single-threaded.  I obtain the
> speeds that I do because all of my shards import at the same time -- six
> dataimport instances running at the same time, each one with a single
> thread, importing a little more than 24 million documents.  I have
> discovered that Solr is the bottleneck on my setup.  The data retrieval
> from MySQL can proceed much faster than Solr can handle with a single
> indexing thread.  My situation is a little bit unusual -- as Erick
> mentioned, usually the bottleneck is data retrieval, not Solr.
> 
> At this point, if I want to make bulk indexing go faster, I need to
> build a SolrJ application that can index with multiple threads to each
> Solr core at the same time.  This is on my roadmap, but it's not going
> to be a trivial project.
> 
> At 10-20K, your documents are large, but not excessively so.  If 70K
> documents take 3-4 hours, then one of a few problems is happening.
> 
> 1) your database is VERY slow.
> 2) your analysis chain in schema.xml is running SUPER slow analysis
> components.
> 3) Your server or its configuration is not providing enough resources
> (CPU/RAM/IO) so Solr can run efficiently.
> 
> #2 seems rather unlikely, so I would suspect one of the other two.
> 
> 
> 
> I have seen one situation related to the Microsoft side of your setup
> that might cause a problem like this.  If any of your machines are
> running on Windows Server 2012 and you have bridged NICs (usually for
> failover in the event of a switch failure), then you will need to break
> the bridge and just run one NIC.
> 
> The performance improvement on the network when a bridged NIC is removed
> from Server 2012 is enough to blow your mind, especially if the access
> is over a high-latency network link, like a VPN or WAN connection.  The
> same setup on Server 2003 or Server 2008 has very good performance.
> Microsoft seems to have a bug with bridged NICs in Server 2012.  Last
> time I tried to figure out whether it could be fixed, I ran into this
> problem:
> 
> https://xkcd.com/979/
> 
> Thanks,
> Shawn
> 
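
(As a rough illustration of the multi-threaded SolrJ indexing Shawn describes above:
ConcurrentUpdateSolrServer batches documents and sends them on several background
threads. The URL, queue size, thread count, and fields are placeholders.)

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Queue up to 1000 docs and flush them with 4 sender threads.
ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
        "http://localhost:8983/solr/collection1", 1000, 4);
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "1");
doc.addField("title", "example");
server.add(doc);
server.commit();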


Re: How fast indexing?

2016-03-21 Thread Amit Jha
Yes, I do have multiple nodes in my SolrCloud setup.

Rgds
AJ

> On 21-Mar-2016, at 22:20, fabigol <fabien.stou...@vialtis.com> wrote:
> 
> Amit Jha,
> do you have several Solr servers with SolrCloud?
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-fast-indexing-tp4264994p4265122.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: How fast indexing?

2016-03-20 Thread Amit Jha
Hi All,

In my case I am using DIH to index the data, and the query has 2 join 
statements. Indexing 70K documents takes 3-4 hours. Document size is around 
10-20KB. The DB is MSSQL, and I am using Solr 4.2.10 in cloud mode.

Rgds
AJ

> On 21-Mar-2016, at 05:23, Erick Erickson  wrote:
> 
> In my experience, a majority of the time the bottleneck is in
> the data acquisition, not the Solr indexing per-se. Take a look
> at the CPU utilization on Solr, if it's not running very heavy,
> then you need to look upstream.
> 
> You haven't told us anything about _how_ you're indexing.
> SolrJ? DIH? Something from some other party? so it's hard to
> say much useful.
> 
> You might review:
> 
> http://wiki.apache.org/solr/UsingMailingLists
> 
> Best,
> Erick
> 
> On Sun, Mar 20, 2016 at 3:31 PM, Nick Vasilyev 
> wrote:
> 
>> There can be a lot of factors, can you provide a bit of additional
>> information to get started?
>> 
>> - How many items are you indexing per second?
>> - How does the indexing process look like?
>> - How large is each item?
>> - What hardware are you using?
>> - How is your Solr set up? JVM memory, collection layout, etc...
>> - What is your current commit frequency?
>> - What is the query volume while you are indexing?
>> 
>> On Sun, Mar 20, 2016 at 6:25 PM, fabigol 
>> wrote:
>> 
>>> hi,
>>> I have a Solr project where I index from a PostgreSQL database.
>>> The indexing is very slow. How can I speed it up?
>>> Can I modify autocommit in the solrconfig.xml file?
>>> Does anyone have ideas? I looked on Google but found little.
>>> Please help me.
>>> 
>>> 
>>> 
>>> 
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/How-fast-indexing-tp4264994.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>> 
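
(On the autocommit question above: one common lever, sketched here with SolrJ --
server and doc as in any indexing loop -- is to let Solr batch commits via
commitWithin instead of committing per document; the 10-second window is arbitrary.)

// Ask Solr to make the doc searchable within ~10s; the indexer never
// blocks on an explicit commit per document.
server.add(doc, 10000);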


SolrCloud Document Update Problem

2015-06-29 Thread Amit Jha
Hi,

I set up a SolrCloud with 2 shards, each having 2 replicas, with a
3-node ZooKeeper ensemble.

We add and update documents from a web app. While updating, we delete the
document and add the same document back with updated values and the same
unique id.

I am facing a very strange issue: sometimes 2 documents have the same
unique ID, one document with the old values and another with the new values.
It happens only when we update a document.

Please suggest or guide...

Rgds


Re: SolrCloud Document Update Problem

2015-06-29 Thread Amit Jha
It was because of the issues

Rgds
AJ

 On Jun 29, 2015, at 6:52 PM, Shalin Shekhar Mangar shalinman...@gmail.com 
 wrote:
 
 On Mon, Jun 29, 2015 at 4:37 PM, Amit Jha shanuu@gmail.com wrote:
 Hi,
 
 I setup a SolrCloud with 2 shards each is having 2 replicas with 3
 zookeeper ensemble.
 
 We add and update documents from web app. While updating we delete the
 document and add same document with updated values with same unique id.
 
 I am not sure why you delete the document. If you use the same unique
 key and send the whole document again (with some other fields
 changed), Solr will automatically overwrite the old document with the
 new one.
 
 
 I am facing a very strange issue that some time 2 documents have the same
 unique ID. One document with old values and another one with new values.
 It happens only we update the document.
 
 
 
 Please suggest or guide...
 
 Rgds
 
 
 
 -- 
 Regards,
 Shalin Shekhar Mangar.
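
(A sketch of the overwrite-by-uniqueKey behaviour Shalin describes -- SolrJ 4.x,
field names assumed; note that no explicit delete is needed:)

import org.apache.solr.common.SolrInputDocument;

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "doc-42");            // same unique key as the existing document
doc.addField("title", "updated title");  // new field values
server.add(doc);   // Solr replaces the old version; no delete-then-add required
server.commit();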


Real Time indexing and Scalability

2015-06-05 Thread Amit Jha
Hi,

In my use case, I am adding documents to Solr through a Spring application using 
spring-data-solr. This setup works well with a single Solr instance, but in the 
current setup that instance is a single point of failure. So we decided to use Solr 
replication, because we also need centralized search, and set up two instances, both 
in repeater mode. The problem with this setup was that sometimes data did not get 
indexed. So we moved to SolrCloud, with 3 ZooKeepers, 2 shards, and 2 replicas, but 
we still sometimes found that documents were not getting indexed.

I would like to know what is the best way to have highly available setup.

Rgds
AJ

Re: Real Time indexing and Scalability

2015-06-05 Thread Amit Jha
I want to have realtime index and realtime search.

Rgds
AJ

 On Jun 5, 2015, at 10:12 PM, Amit Jha shanuu@gmail.com wrote:
 
 Hi,
 
 In my use case, I am adding a document to Solr through spring application 
 using spring-data-solr. This setup works well with single Solr. In current 
 setup it is single point of failure. So we decided to use solr replication 
 because we also need centralized search. Therefore we setup two instances 
 both in repeater mode. The problem with this setup was, some time data was 
 not get indexed. So we moved to SolrCloud, with 3zk and 2 shards and 2 
 replica setup, but still sometime we found that documents are not getting 
 indexed.
 
 I would like to know what is the best way to have highly available setup.
 
 Rgds
 AJ


Re: Real Time indexing and Scalability

2015-06-05 Thread Amit Jha
Thanks Erick. What about a document that is committed to the master? That document 
should then be visible from the master. Is that correct?

I was using replication in repeater mode because LBHttpSolrServer can send a write 
request to any of the Solr servers, and that Solr should index the document because 
it is a master. We have a polling interval of 2 seconds; after the polling interval 
the slave can poll the data. It is worth mentioning here that the application issues 
the commit command. 

If a document is committed to the master and a search request comes to the same 
master, then the document should be retrieved, irrespective of replication, because 
the master doesn't know who the slaves are. Is that right?

In repeater mode a document can be indexed on both Solr instances. Is that 
understanding correct?

Also, why do you say that the commit is inappropriate? 





Rgds
AJ

 On Jun 5, 2015, at 11:16 PM, Erick Erickson erickerick...@gmail.com wrote:
 
 You have to provide a _lot_ more details. You say:
 The problem... some data was not get indexed... still sometime we
 found that documents are not getting indexed.
 
 Neither of these should be happening, so I suspect
 1 your expectations aren't correct. For instance, in the
 master/slave setup you won't see docs on the slave until after the
 polling interval is expired and the index is replicated.
 2 In SolrCloud you aren't committing appropriately.
 
 You might review: http://wiki.apache.org/solr/UsingMailingLists
 
 Best,
 Erick
 
 
 On Fri, Jun 5, 2015 at 9:45 AM, Amit Jha shanuu@gmail.com wrote:
 I want to have realtime index and realtime search.
 
 Rgds
 AJ
 
 On Jun 5, 2015, at 10:12 PM, Amit Jha shanuu@gmail.com wrote:
 
 Hi,
 
 In my use case, I am adding a document to Solr through spring application 
 using spring-data-solr. This setup works well with single Solr. In current 
 setup it is single point of failure. So we decided to use solr replication 
 because we also need centralized search. Therefore we setup two instances 
 both in repeater mode. The problem with this setup was, some time data was 
 not get indexed. So we moved to SolrCloud, with 3zk and 2 shards and 2 
 replica setup, but still sometime we found that documents are not getting 
 indexed.
 
 I would like to know what is the best way to have highly available setup.
 
 Rgds
 AJ


Re: Real Time indexing and Scalability

2015-06-05 Thread Amit Jha
Thanks Shawn for reminding me of CloudSolrServer; yes, I have moved to SolrCloud. 

I agree that a repeater is a slave that acts as a master for other slaves. But it is 
still a master, and logically it has to obey what a master is supposed to obey. 

If 2 servers are masters, that means writing can be done on both. If I set up 
replication between 2 servers and configure both as repeaters, then both can act as 
master and slave for each other. Therefore writing can be done on both.


Rgds
AJ

 On Jun 6, 2015, at 1:26 AM, Shawn Heisey apa...@elyograg.org wrote:
 
 On 6/5/2015 1:38 PM, Amit Jha wrote:
 Thanks Eric, what about document is committed to master?Then document should 
 be visible from master. Is that correct?
 
 I was using replication with repeater mode because LBHttpSolrServer can send 
 write request to any of the Solr server, and that Solr should index the 
 document because it a master. we have a polling interval of 2 sec. After 
 polling interval slave can poll the data. It is worth to mention here is 
 application request the commit command. 
 
 If document is committed to master and a search request coming to the same 
 master then document should be retrieved. Irrespective of replication 
 because master doesn't know who the slave are?
 
 In repeater mode document can be indexed on both the Solr instance. Is that 
 understanding correct?
 
 Also why you say that commit is inappropriate?
 
 If you are not using SolrCloud, then you must index to the master
 *ONLY*.  A repeater does not enable two-way replication.  A repeater is
 a slave that is also a master for additional slaves.  Master-slave
 replication is *only* one-way - from the master to slaves, and if any of
 those slaves are repeaters, from there to additional slaves.
 
 SolrCloud is probably a far better choice for your setup, especially if
 you are using the SolrJ client.  You mentioned LBHttpSolrServer, which
 is why I am thinking you're using SolrJ.
 
 With a proper configuration on your collection, SolrCloud lets you index
 to any machine in the cloud and the data will end up exactly where it
 needs to go.  If you use CloudSolrServer/CloudSolrClient and a very
 recent Solr/SolrJ version, the data will be sent directly to the correct
 instance for best performance.
 
 Thanks,
 Shawn
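
(A minimal sketch of the CloudSolrServer approach Shawn mentions -- SolrJ 4.x class
names; the ZooKeeper hosts and collection name are placeholders:)

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

// The client watches ZooKeeper and routes each document to the correct shard leader.
CloudSolrServer cloud = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
cloud.setDefaultCollection("collection1");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "doc-1");
cloud.add(doc);
cloud.commit();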
 


Re: Real Time indexing and Scalability

2015-06-05 Thread Amit Jha
Thanks everyone. I got the answer.

Rgds
AJ

 On Jun 6, 2015, at 7:00 AM, Erick Erickson erickerick...@gmail.com wrote:
 
 bq: if 2 servers are master that means writing can be done on both.
 
 If there's a single piece of documentation that supports this contention,
 we'll correct it immediately. But it's simply not true.
 
 As Shawn says, the entire design behind master/slave
 architecture is that there is exactly one (and only one) master that
 _ever_ gets documents indexed to it. Repeaters were introduced
 as a way to fan out the replication process, particularly across data
 centers that had expensive pipes connecting them. You could have
 the repeater in DC2 relay the index from the master in DC1 to all slaves in
 DC2. In that kind of setup, you then replicate the index
 across the expensive pipe once rather than once for each slave in
 DC2.
 
 But even in this situation you are only ever indexing to the master
 on DC1.
 
 Best,
 Erick
 
 On Fri, Jun 5, 2015 at 1:20 PM, Amit Jha shanuu@gmail.com wrote:
 Thanks Shawn, for reminding CloudSolrServer, yes I have moved to SolrCloud.
 
 I agree that repeater is a slave and acts as master for other slaves. But 
 still it's a master and logically it has to obey the what master suppose to 
 obey.
 
 if 2 servers are master that means writing can be done on both. If I setup 
 replication between 2 servers and configure both as repeater, than both can 
 act master and slave for each other. Therefore writing can be done on both.
 
 
 Rgds
 AJ
 
 On Jun 6, 2015, at 1:26 AM, Shawn Heisey apa...@elyograg.org wrote:
 
 On 6/5/2015 1:38 PM, Amit Jha wrote:
 Thanks Eric, what about document is committed to master?Then document 
 should be visible from master. Is that correct?
 
 I was using replication with repeater mode because LBHttpSolrServer can 
 send write request to any of the Solr server, and that Solr should index 
 the document because it a master. we have a polling interval of 2 sec. 
 After polling interval slave can poll the data. It is worth to mention 
 here is application request the commit command.
 
 If document is committed to master and a search request coming to the same 
 master then document should be retrieved. Irrespective of replication 
 because master doesn't know who the slave are?
 
 In repeater mode document can be indexed on both the Solr instance. Is 
 that understanding correct?
 
 Also why you say that commit is inappropriate?
 
 If you are not using SolrCloud, then you must index to the master
 *ONLY*.  A repeater does not enable two-way replication.  A repeater is
 a slave that is also a master for additional slaves.  Master-slave
 replication is *only* one-way - from the master to slaves, and if any of
 those slaves are repeaters, from there to additional slaves.
 
 SolrCloud is probably a far better choice for your setup, especially if
 you are using the SolrJ client.  You mentioned LBHttpSolrServer, which
 is why I am thinking you're using SolrJ.
 
 With a proper configuration on your collection, SolrCloud lets you index
 to any machine in the cloud and the data will end up exactly where it
 needs to go.  If you use CloudSolrServer/CloudSolrClient and a very
 recent Solr/SolrJ version, the data will be sent directly to the correct
 instance for best performance.
 
 Thanks,
 Shawn
 


SolrCloud Replication Issue

2015-04-27 Thread Amit L
Hi,

A few days ago I deployed a Solr 4.9.0 cluster, which consists of 2
collections. Each collection has 1 shard with 3 replicas on 3 different
machines.

On the first day I noticed this error appear on the leader. Full Log -
http://pastebin.com/wcPMZb0s

4/23/2015, 2:34:37 PM SEVERE SolrCmdDistributor
org.apache.solr.client.solrj.SolrServerException: IOException occured when
talking to server at:
http://production-solrcloud-004:8080/solr/bookings_shard1_replica2

4/23/2015, 2:34:37 PM WARNING DistributedUpdateProcessor
Error sending update

4/23/2015, 2:34:37 PM WARNING ZkController
Leader is publishing core=bookings_shard1_replica2 state=down on behalf of
un-reachable replica
http://production-solrcloud-004:8080/solr/bookings_shard1_replica2/;
forcePublishState? false


The other 2 replicas had 0 errors.

I thought it might be a one-off, but the same error occurred on day 2, which
has got me slightly concerned. During these periods I didn't notice any issues
with the cluster, and everything looks healthy in the cloud summary. All of
the instances are hosted on AWS.

Any idea what may be causing this issue and what I can do to mitigate?

Thanks
Amit


Re: SolrCloud Replication Issue

2015-04-27 Thread Amit L
Appreciate the response, to answer your questions.

* Do you see this happen often? How often?
It has happened twice in five days. The first two days after deployment.

* Are there any known network issues?
There are no obvious network issues, but as these instances reside in AWS, I
cannot rule out network blips.

* Do you have any idea about the GC on those replicas?
I have been monitoring the memory usage and all instances are using no more
than 30% of its JVM memory allocation.




On 27 April 2015 at 21:36, Anshum Gupta ans...@anshumgupta.net wrote:

 Looks like LeaderInitiatedRecovery or LIR. When a leader receives a
 document (update) but fails to successfully forward it to a replica, it
 marks that replica as down and asks the replica to recover (hence the name,
 Leader Initiated Recovery). It could be due to multiple reasons e.g.
 network issue/GC. The replica generally comes back up and syncs with the
 leader transparently. As an end-user, you don't have to really worry much
 about this but if you want to dig deeper, here are a few questions that
 might help us in suggesting what to do/look at.
 * Do you see this happen often? How often?
 * Are there any known network issues?
 * Do you have any idea about the GC on those replicas?


 On Mon, Apr 27, 2015 at 1:25 PM, Amit L amitlal...@gmail.com wrote:

  Hi,
 
  A few days ago I deployed a solr 4.9.0 cluster, which consists of 2
  collections. Each collection has 1 shard with 3 replicates on 3 different
  machines.
 
  On the first day I noticed this error appear on the leader. Full Log -
  http://pastebin.com/wcPMZb0s
 
  4/23/2015, 2:34:37 PM SEVERE SolrCmdDistributor
  org.apache.solr.client.solrj.SolrServerException: IOException occured
 when
  talking to server at:
  http://production-solrcloud-004:8080/solr/bookings_shard1_replica2
 
  4/23/2015, 2:34:37 PM WARNING DistributedUpdateProcessor
  Error sending update
 
  4/23/2015, 2:34:37 PM WARNING ZkController
  Leader is publishing core=bookings_shard1_replica2 state=down on behalf
 of
  un-reachable replica
  http://production-solrcloud-004:8080/solr/bookings_shard1_replica2/;
  forcePublishState? false
 
 
  The other 2 replicas had 0 errors.
 
  I thought it may be a one off but the same error occured on day 2 which
 has
  got me slighlty concerned. During these periods I didn't notice any
 issues
  with the cluster and everything looks healthy in the cloud summary. All
 of
  the instances are hosted on AWS.
 
  Any idea what may be causing this issue and what I can do to mitigate?
 
  Thanks
  Amit
 



 --
 Anshum Gupta



Re: Retrieving Phonetic Code as result

2015-01-23 Thread Amit Jha
Can I extend Solr to add phonetic codes at index time, the way the uuid field 
gets added? I want to precompute the Metaphone codes, because calculating them at 
runtime would cause a performance hit.

Rgds
AJ

 On Jan 23, 2015, at 5:37 PM, Jack Krupansky jack.krupan...@gmail.com wrote:
 
 Your app can use the field analysis API (FieldAnalysisRequestHandler) to
 query Solr for what the resulting field values are for each filter in the
 analysis chain for a given input string. This is what the Solr Admin UI
 Analysis web page uses.
 
 See:
 http://lucene.apache.org/solr/4_10_2/solr-core/org/apache/solr/handler/FieldAnalysisRequestHandler.html
 and in solrconfig.xml
 
 
 -- Jack Krupansky
 
 On Thu, Jan 22, 2015 at 8:42 AM, Amit Jha shanuu@gmail.com wrote:
 
 Hi,
 
 I need to know how can I retrieve phonetic codes. Does solr provide it as
 part of result? I need codes for record matching.
 
 *following is schema fragment:*
 
  <fieldtype name="phonetic" stored="true" indexed="true"
  class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.DoubleMetaphoneFilterFactory" inject="true"
  maxCodeLength="4"/>
    </analyzer>
  </fieldtype>
 
  <field name="firstname" type="text_general" indexed="true" stored="true"/>
  <field name="firstname_phonetic" type="phonetic" />
  <field name="lastname_phonetic" type="phonetic" />
  <field name="lastname" type="text_general" indexed="true" stored="true"/>
 
  <copyField source="lastname" dest="lastname_phonetic"/>
  <copyField source="firstname" dest="firstname_phonetic"/>


Retrieving Phonetic Code as result

2015-01-22 Thread Amit Jha
Hi,

I need to know how I can retrieve phonetic codes. Does Solr provide them as
part of the result? I need the codes for record matching.

*following is schema fragment:*

<fieldtype name="phonetic" stored="true" indexed="true"
class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory" inject="true"
maxCodeLength="4"/>
  </analyzer>
</fieldtype>

<field name="firstname" type="text_general" indexed="true" stored="true"/>
<field name="firstname_phonetic" type="phonetic" />
<field name="lastname_phonetic" type="phonetic" />
<field name="lastname" type="text_general" indexed="true" stored="true"/>

<copyField source="lastname" dest="lastname_phonetic"/>
<copyField source="firstname" dest="firstname_phonetic"/>



Re: Retrieving Phonetic Code as result

2015-01-22 Thread Amit Jha
Thanks for the response. I can see the generated Metaphone codes using Luke. I am
using Solr precisely because it creates the phonetic codes at index time;
otherwise, for each record, I would need to call the Metaphone algorithm in real
time to get the codes and compare them. If Luke can read and display the codes,
why can't Solr return them?

On Thu, Jan 22, 2015 at 7:54 PM, Amit Jha shanuu@gmail.com wrote:

 Hi,

 I need to know how can I retrieve phonetic codes. Does solr provide it as
 part of result? I need codes for record matching.

 *following is schema fragment:*

  <fieldtype name="phonetic" stored="true" indexed="true"
  class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.DoubleMetaphoneFilterFactory" inject="true"
  maxCodeLength="4"/>
    </analyzer>
  </fieldtype>
 
  <field name="firstname" type="text_general" indexed="true" stored="true"/>
  <field name="firstname_phonetic" type="phonetic" />
  <field name="lastname_phonetic" type="phonetic" />
  <field name="lastname" type="text_general" indexed="true" stored="true"/>
 
  <copyField source="lastname" dest="lastname_phonetic"/>
  <copyField source="firstname" dest="firstname_phonetic"/>

 Hi,

 Thanks for response, I can see generated MetaPhone codes using Luke. I am
 using solr only because it creates the phonetic code at time of indexing.
 Otherwise for each record I need to call Metaphone algorithm in realtime to
 get the codes and compare them. I think when luke can read and display it,
 why can't solr?




Re: De Duplication using Solr

2015-01-03 Thread Amit Jha
Thanks for the reply... I have already seen the wiki. My problem is more like
record matching than straight de-duplication.

On Sat, Jan 3, 2015 at 7:39 PM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 First, see if you can get your requirements to align to the de-dupe feature
 that Solr already has:
 https://cwiki.apache.org/confluence/display/solr/De-Duplication


 -- Jack Krupansky

 On Sat, Jan 3, 2015 at 2:54 AM, Amit Jha shanuu@gmail.com wrote:

  I am trying to find out duplicate records based on distance and phonetic
  algorithms. Can I utilize solr for that? I have following fields and
  conditions to identify exact or possible duplicates.
 
  1. Fields
  prefix
  suffix
  firstname
  lastname
  email(primary_email1, email2, email3)
  phone(primary_phone1, phone2, phone3)
  2. Conditions:
  Two records said to be exact duplicates if
 
  1. IsExactMatchFunction(record1_prefix, record2_prefix) AND
  IsExactMatchFunction(record1_suffix, record2_suffix) AND
  IsExactMatchFunction(record1_firstname,record2_firstname) AND
  IsExactMatchFunction(record1_lastname,record2_lastname) AND
  IsExactMatchFunction(record1_primary_email,record2_primary_email) OR
  IsExactMatchFunction(record1_primary_phone,record2_primary_primary)
  Two records said to be possible duplicates if
 
  1. IsExactMatchFunction(record1_prefix, record2_prefix) OR
  IsExactMatchFunction(record1_suffix, record2_suffix) OR
  IsExactMatchFunction(record1_firstname,record2_firstname) AND
  IsExactMatchFunction(record1_lastname,record2_lastname) AND
  IsExactMatchFunction(record1_primary_email,record2_primary_email) OR
  IsExactMatchFunction(record1_primary_phone,record2_primary_primary)
   ELSE
   2. IsFuzzyMatchFunction(record1_firstname,record2_firstname) AND
  IsExactMatchFunction(record1_lastname,record2_lastname) AND
  IsExactMatchFunction(record1_primary_email,record2_primary_email) OR
  IsExactMatchFunction(record1_primary_phone,record2_primary_primary)
   ELSE
   3. IsFuzzyMatchFunction(record1_firstname,record2_firstname) AND
  IsExactMatchFunction(record1_lastname,record2_lastname) AND
  IsExactMatchFunction(record1_any_email,record2_any_email) OR
  IsExactMatchFunction(record1_any_phone,record2_any_primary)
 
  IsFuzzyMatchFunction() will perform distance and phonetic algorithms
  calculation and compare it with predefined threshold.
 
  For example:
 
  if threshold defined for firsname is 85 and IsFuzzyMatchFunction()
 function
  only return ture only and only if one of the algorithms(distance or
  phonetic) return the similarity socre = 85.
 
  Can I use solr to perform this job. Or Can you guys suggest how can I
  approach to this problem. I have seen the duke(De duplication API) but I
  can not use duke out of the box.
 



De Duplication using Solr

2015-01-02 Thread Amit Jha
I am trying to find duplicate records based on distance and phonetic
algorithms. Can I utilize Solr for that? I have the following fields and
conditions to identify exact or possible duplicates.

1. Fields
prefix
suffix
firstname
lastname
email(primary_email1, email2, email3)
phone(primary_phone1, phone2, phone3)
2. Conditions:
Two records are said to be exact duplicates if

1. IsExactMatchFunction(record1_prefix, record2_prefix) AND
IsExactMatchFunction(record1_suffix, record2_suffix) AND
IsExactMatchFunction(record1_firstname,record2_firstname) AND
IsExactMatchFunction(record1_lastname,record2_lastname) AND
IsExactMatchFunction(record1_primary_email,record2_primary_email) OR
IsExactMatchFunction(record1_primary_phone,record2_primary_phone)

Two records are said to be possible duplicates if

1. IsExactMatchFunction(record1_prefix, record2_prefix) OR
IsExactMatchFunction(record1_suffix, record2_suffix) OR
IsExactMatchFunction(record1_firstname,record2_firstname) AND
IsExactMatchFunction(record1_lastname,record2_lastname) AND
IsExactMatchFunction(record1_primary_email,record2_primary_email) OR
IsExactMatchFunction(record1_primary_phone,record2_primary_phone)
 ELSE
 2. IsFuzzyMatchFunction(record1_firstname,record2_firstname) AND
IsExactMatchFunction(record1_lastname,record2_lastname) AND
IsExactMatchFunction(record1_primary_email,record2_primary_email) OR
IsExactMatchFunction(record1_primary_phone,record2_primary_phone)
 ELSE
 3. IsFuzzyMatchFunction(record1_firstname,record2_firstname) AND
IsExactMatchFunction(record1_lastname,record2_lastname) AND
IsExactMatchFunction(record1_any_email,record2_any_email) OR
IsExactMatchFunction(record1_any_phone,record2_any_phone)

IsFuzzyMatchFunction() will perform distance and phonetic algorithm
calculations and compare the result with a predefined threshold.

For example:

If the threshold defined for firstname is 85, IsFuzzyMatchFunction() returns
true if and only if one of the algorithms (distance or phonetic) returns a
similarity score >= 85.

Can I use Solr to perform this job? Or can you suggest how I can approach
this problem? I have seen Duke (a de-duplication API), but I cannot use Duke
out of the box.
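
(A hypothetical sketch of IsFuzzyMatchFunction, combining edit distance with Double
Metaphone; commons-lang and commons-codec are libraries Solr already ships with, and
the 85 threshold matches the example above:)

import org.apache.commons.codec.language.DoubleMetaphone;
import org.apache.commons.lang.StringUtils;

boolean isFuzzyMatch(String a, String b, int threshold) {
    // Normalized Levenshtein similarity on a 0-100 scale.
    int dist = StringUtils.getLevenshteinDistance(a, b);
    int similarity = (int) (100.0 * (1.0 - (double) dist / Math.max(a.length(), b.length())));
    // Phonetic equality via Double Metaphone.
    boolean phonetic = new DoubleMetaphone().isDoubleMetaphoneEqual(a, b);
    return similarity >= threshold || phonetic;
}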


Re: different fields for user-supplied phrases in edismax

2014-12-12 Thread Amit Jha
Hi Mike,

What exactly is your use case?
What do you mean by controlling the fields used for phrase queries?


Rgds
AJ

 On 12-Dec-2014, at 20:11, Michael Sokolov msoko...@safaribooksonline.com 
 wrote:
 
 Doug - I believe pf controls the fields that are used for the phrase queries 
 *generated by the parser*.
 
 What I am after is controlling the fields used for the phrase queries 
 *supplied by the user* -- ie surrounded by double-quotes.
 
 -Mike
 
 On 12/12/2014 08:53 AM, Doug Turnbull wrote:
 Michael,
 
 I typically solve this problem by using a copyField and running different
 analysis on the destination field. Then you could use this field as pf
 insteaf of qf. If I recall, fields in pf must also be mentioned in qf for
 this to work.
 
 -Doug
 
 On Fri, Dec 12, 2014 at 8:13 AM, Michael Sokolov 
 msoko...@safaribooksonline.com wrote:
 Yes, I guess it's a common expectation that searches work this way.  It
 was actually almost trivial to add as an extension to the edismax parser,
 and I have what I need now; I opened SOLR-6842; if there's interest I'll
 try to find the time to contribute back to Solr
 
 -Mike
 
 
 On 12/11/14 5:20 PM, Ahmet Arslan wrote:
 
 Hi Mike,
 
 If I am not wrong, you are trying to simulate google behaviour.
 If you use quotes, google return exact matches. I think that makes
 perfectly sense and will be a valuable addition. I remember some folks
 asked/requested this behaviour in the list.
 
 Ahmet
 
 
 
 On Thursday, December 11, 2014 10:50 PM, Michael Sokolov 
 msoko...@safaribooksonline.com wrote:
 I'd like to supply a different set of fields for phrases than for bare
 terms.  Specifically, we'd like to treat phrases as more exact -
 probably turning off stemming and generally having a tighter analysis
 chain.  Note: this is *not* what's done by configuring pf which
 controls fields for the auto-generated phrases.  What we want to do is
 provide our users more precise control by explicit use of  
 
 Is there a way to do this by configuring edismax?  I don't think there
 is, and then if you agree, a followup question - if I want to extend the
 EDismax parser, does anybody have advice as to the best way in?  I'm
 looking at:
 
 Query getFieldQuery(String field, String val, int slop)
 
 and altering getAliasedQuery() to accept an aliases parameter, which
 would be a different set of aliases for phrases ...
 
 does that make sense?
 
 -Mike
 


Re: Fault Tolerant Technique of Solr Cloud

2014-02-18 Thread Amit Jha
Solr will complain only if you bring down both the replica and the leader of the 
same shard. It is difficult to have a highly available environment if you have too 
few physical servers.
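
(On the query side, the usual fault-tolerant pattern is a ZooKeeper-aware SolrJ
client instead of a fixed node URL, so a downed node doesn't break the app -- a
sketch, with hosts and collection name as placeholders:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// Queries are routed only to replicas that ZooKeeper reports as live.
CloudSolrServer client = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
client.setDefaultCollection("collection1");
QueryResponse rsp = client.query(new SolrQuery("*:*"));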

Rgds
AJ

 On 18-Feb-2014, at 18:35, Vineet Mishra clearmido...@gmail.com wrote:
 
 Hi All,
 
 I want to have clear idea about the Fault Tolerant Capability of SolrCloud
 
 Considering I have setup the SolrCloud with a external Zookeeper, 2 shards,
 each having a replica with single collection as given in the official Solr
 Documentation.
 
 https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud
 
                 *Collection1*
                 /           \
          *Shard 1*         *Shard 2*
      localhost:8983     localhost:7574
      localhost:8900     localhost:7500
 
 
  I indexed some documents, and then if I shut down any replica or leader,
  say for example *localhost:8900*, I can't query the collection on that
  particular port:
 
  http://*localhost:8900*/solr/collection1/select?q=*:*
 
  Then how is this fault tolerant, and how should the query be made?
 
 Regards


Solr Deduplication use of overWriteDupes flag

2014-02-04 Thread Amit Agrawal
Hello,

I had a configuration where I had overwriteDupes=false. I added a few
duplicate documents. Result: I got duplicate documents in the index.

When I changed to overwriteDupes=true, the duplicate documents started
overwriting the older documents.

Question 1: How do I achieve [add if not there, fail if a duplicate is
found], i.e. mimic the behaviour of a DB, which fails when trying to insert a
record that violates a unique constraint? I thought that
overwriteDupes=false would do that, but apparently not.

Question 2: Is there any documentation around overwriteDupes? I have
checked the existing wiki; there is very little explanation of the flag
there.

Thanks,

-Amit


Re: Boosting documents by categorical preferences

2014-01-30 Thread Amit Nithian
Chris,

Sounds good! Thanks for the tips.. I'll be glad to submit my talk to this
as I have a writeup pretty much ready to go.

Cheers
Amit


On Tue, Jan 28, 2014 at 11:24 AM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : The initial results seem to be kinda promising... of course there are
 many
 : more optimizations I could do like decay user ratings over time to
 indicate
 : that preferences decay over time so a 5 rating a year ago doesn't count
 as
 : much as a 5 rating today.
 :
 : Hope this helps others. I'll open source what I have soon and post back.
 If
 : there is feedback or other thoughts let me know!

 Hey Amit,

 Glad to hear your user based boosting experiments are paying off.  I would
 definitely love to see a more detailed writeup down the road showing off
 how it affects your final user metrics -- or perhaps even give a session
 on your technique at ApacheCon?


 http://events.linuxfoundation.org/events/apachecon-north-america/program/cfp


 -Hoss
 http://www.lucidworks.com/



Re: Boosting documents by categorical preferences

2014-01-27 Thread Amit Nithian
Hi Chris (and others interested in this),

Sorry for dropping off.. I got sidetracked with other work and came back to
this and finally got a V1 of this implemented.

The final process is as follows:
1) Pre-compute the global categorical num_ratings/average/std-dev (so for
Action the average rating may be 3.49 with stdDev of .99)
2) For a given user, retrieve the last X (X for me is 10) ratings and
compute the user's categorical affinities: take the average rating for
all movies in that particular category (Action), subtract the global category
average, and divide by the category std-dev. Furthermore, multiply this by the
fraction of total user ratings in that category.
   - For example, if a user's last 10 ratings consisted of 9/10 Drama and
1/10 Thriller, the z-score of the Thriller should be discounted relative to
that of the Drama, so that the user's preference (either positive or
negative) for Drama is more prominent.
3) Sort by the absolute value of the z-score (Thanks Hossman.. great
thought).
4) Return the top 3 (arbitrary number)
5) Modify the query to look like the following:

qq=tom hanks&q={!boost b=$b defType=edismax
v=$qq}&cat1=category:Children&cat2=category:Fantasy&cat3=category:Animation&b=sum(1,sum(product(query($cat1),0.22267872),product(query($cat2),0.21630952),product(query($cat3),0.21120241)))

basically b = 1+(pref1*query(category:something1) +
pref2*query(category:something2) + pref3*query(category:something3))
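
(For what it's worth, the same request expressed through SolrJ -- a sketch, with the
parameter values copied from the URL above:)

import org.apache.solr.client.solrj.SolrQuery;

SolrQuery q = new SolrQuery();
q.set("qq", "tom hanks");
q.set("q", "{!boost b=$b defType=edismax v=$qq}");
q.set("cat1", "category:Children");
q.set("cat2", "category:Fantasy");
q.set("cat3", "category:Animation");
q.set("b", "sum(1,sum(product(query($cat1),0.22267872),"
         + "product(query($cat2),0.21630952),product(query($cat3),0.21120241)))");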

The initial results seem to be kinda promising... of course there are many
more optimizations I could do like decay user ratings over time to indicate
that preferences decay over time so a 5 rating a year ago doesn't count as
much as a 5 rating today.

Hope this helps others. I'll open source what I have soon and post back. If
there is feedback or other thoughts let me know!

Cheers
Amit


On Fri, Nov 22, 2013 at 11:38 AM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : I thought about that but my concern/question was how. If I used the pow
 : function then I'm still boosting the bad categories by a small
 : amount..alternatively I could multiply by a negative number but does that
 : work as expected?

 I'm not sure i understand your concern: negative powers would give you
 values less then 1, positive powers would give you values greater then 1,
 and then you'd use those values as multiplicitive boosts -- so the values
 less then 1 would penalize the scores of existing matching docs in the
 categories the user dislikes.

 Oh wait ... i see, in your original email (and in my subsequent suggested
 tweak to use pow()) you were talking about sum()ing up these 3 category
 boosts (and i cut/pasted sum() in my example as well) ... yeah,
 using multiplcation there would make more sense if you wanted to do the
 negative prefrences as well, because then then score of any matching doc
 will be reduced if it matches on an undesired category -- and the
 amount it will be reduced will be determined by how strongly it
 matches on that category (ie: the base score returned by the nested
 query() func) and how negative the undesired prefrence value (ie:
 the pow() exponent) is


 qq=...
 q={!boost b=$b v=$qq}

 b=prod(pow(query($cat1),$cat1z),pow(query($cat2),$cat2z),pow(query($cat3),$cat3z))
 cat1=...action...
 cat1z=1.48
 cat2=...comedy...
 cat2z=1.33
 cat3=...kids...
 cat3z=-1.7


 -Hoss



SolrCloud Cluster Setup - Shard Replica

2014-01-18 Thread Amit Jha
Hi,

I tried to create a 2-shard cluster with shard replicas for a collection. For
this setup I used two physical machines: I installed 1 shard and 1 replica on
Machine A, and another shard and 1 replica on Machine B.
When I stopped both the shard and the replica on Machine B, I was not able to
perform searches. I would like to know how I can set up a fail-safe cluster
using two machines.
I would like to achieve the use case where, if a machine goes down, I can still
serve search requests. I have a constraint that I cannot add more machines. Is
there any alternative to achieve this use case?

Regards
Amit


Index size - to determine storage

2014-01-09 Thread Amit Jha
Hi,

I would like to know: if I index a file, e.g. a PDF of 100KB, what would be the 
size of the index? What factors should be considered to determine the disk size?

Rgds
AJ

Re: DateField - Invalid JSON String Exception - converting Query Response to JSON Object

2014-01-07 Thread Amit Jha
I am using it, but the ':' characters inside the timestamp cause the issue.
Please help.


On Tue, Jan 7, 2014 at 11:46 AM, Ahmet Arslan iori...@yahoo.com wrote:

 Hi Amit,

 If you want json response, Why don't you use wt=json?

 Ahmet


 On Tuesday, January 7, 2014 7:34 AM, Amit Jha shanuu@gmail.com
 wrote:
 Hi,


 We have index where date field have default value as 'NOW'. We are using
 solrj to query solr and when we try to convert query
 response(response.getResponse) to JSON object in java. The JSON
 API(org.json) throws 'invalid json string' exception. API say so because
 date field value i.e. -mm-ddThh:mm:ssZ  is not surrounded by double
 inverted commas(  ). So It says required , or } character when API see the
 colon.

 Could you please help me to retrieve the date field value as string in JSON
 response. Or any pointers.

 Any help would be highly appreciable.



 On Tue, Jan 7, 2014 at 12:28 AM, Amit Jha shanuu@gmail.com wrote:

  Hi,
 
  Wish You All a Very Happy New Year.
 
  We have index where date field have default value as 'NOW'. We are using
  solrj to query solr and when we try to convert query
  response(response.getResponse) to JSON object in java. The JSON
  API(org.json) throws 'invalid json string' exception. API say so because
  date field value i.e. -mm-ddThh:mm:ssZ  is not surrounded by double
  inverted commas(  ). So It says required , or } character when API see
 the
  colon.
 
  Could you please help me to retrieve the date field value as string in
  JSON response. Or any pointers.
 
  Any help would be highly appreciable.
 
 
 
 




Re: DateField - Invalid JSON String Exception - converting Query Response to JSON Object

2014-01-07 Thread Amit Jha
Hey Hoss,

Thanks for replying back..Here is the response generated by solrj.




*SolrJ Response* (ignore the braces; I have copied it from a bigger chunk):

Response:
{responseHeader={status=0,QTime=0,params={lowercaseOperators=true,sort=score
desc,cache=false,qf=content,wt=javabin,rows=100,defType=edismax,version=2,fl=*,score,start=0,q=White+Paper,stopwords=true,fq=type:White
Paper}},response={numFound=9,start=0,maxScore=0.61586785,docs=[SolrDocument{id=007,
type=White Paper, source=Documents, title=White Paper 003, body=White Paper
004 Body, author=[Author 3], keywords=[Keyword 3], description=Vivamus
turpis eros, mime_type=pdf, _version_=1456609602022932480,
*publication_date=Wed
Jan 08 03:16:06 IST 2014*, score=0.61586785}]},

Please see the publication_date value. Whenever I enable stored=true for this
field, I get the error

*org.json.JSONException: Expected a ',' or '}' at 853 [character 854 line
1]*

*Solr Query String*
q=%22White%2BPaper%22qf=contentstart=0rows=100sort=score+descdefType=edismaxstopwords=truelowercaseOperators=truewt=jsoncache=falsefl=*%2Cscorefq=type%3A%22White+Paper%22

Hope this may help you to answer.




On Tue, Jan 7, 2014 at 10:29 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : We have index where date field have default value as 'NOW'. We are using
 : solrj to query solr and when we try to convert query
 : response(response.getResponse) to JSON object in java. The JSON

 You're going to have to show us some real code, some real data, and a real
 error exception that you are getting -- because it's not at all clear what
 you are trying to do, or why you would get an error about invalid JSON.

 If you generate a JSON response from Solr, you'll get properly quoted
 strings for the dates...

 $ curl 'http://localhost:8983/solr/collection1/query?q=SOLRfl=*_dt;'
 {
   responseHeader:{
 status:0,
 QTime:8,
 params:{
   fl:*_dt,
   q:SOLR}},
   response:{numFound:1,start:0,docs:[
   {
 incubationdate_dt:2006-01-17T00:00:00Z}]
   }}


 ...but it appears you are trying to *generate* JSON yourself, using the
 Java objects you get back from a parsed SolrJ response -- so i'm not sure
 where you would be getting an error about invalid JSON, unless you were
 doing something invalid in the code you are writing to create that JSON.


 -Hoss
 http://www.lucidworks.com/



DateField - Invalid JSON String Exception - converting Query Response to JSON Object

2014-01-06 Thread Amit Jha
Hi,

Wish You All a Very Happy New Year.

We have an index where a date field has the default value 'NOW'. We are using
SolrJ to query Solr, and when we try to convert the query
response (response.getResponse) to a JSON object in Java, the JSON
API (org.json) throws an 'invalid json string' exception. The API says so because
the date field value, i.e. yyyy-mm-ddThh:mm:ssZ, is not surrounded by double
quotes (" "). So it says a ',' or '}' character is required when the API sees the
colon.

Could you please help me retrieve the date field value as a string in the JSON
response? Any pointers?

Any help would be highly appreciated.
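
(One workaround sketch, assuming the response is being walked by hand: format
java.util.Date values as ISO-8601 strings before handing them to org.json. Here doc
is a SolrDocument from the SolrJ response:)

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;
import org.json.JSONObject;

SimpleDateFormat iso = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
iso.setTimeZone(TimeZone.getTimeZone("UTC"));

JSONObject json = new JSONObject();
for (String name : doc.getFieldNames()) {
    Object v = doc.getFieldValue(name);
    // A bare Date stringifies without quotes; format it explicitly instead.
    json.put(name, v instanceof Date ? iso.format((Date) v) : v);
}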



Re: /select with 'q' parameter does not work

2013-12-11 Thread Amit Aggarwal
Because in your solrconfig.xml, /select is mapped to DirectUpdateHandler2. It
should be solr.SearchHandler.
On 11-Dec-2013 3:11 PM, Nutan nutanshinde1...@gmail.com wrote:

 I have indexed 9 docs.
 this is my *schema.xml*:

 <schema name="documents">
 <fields>
 
 <field name="doc_id" type="uuid" indexed="true" stored="true" default="NEW"
 multiValued="false"/>
 <field name="id" type="integer" indexed="true" stored="true"
 required="true" multiValued="false"/>
 <field name="contents" type="text" indexed="true" stored="true"
 multiValued="false"/>
 <field name="author" type="title_text" indexed="true" stored="true"
 multiValued="true"/>
 <field name="title" type="title_text" indexed="true" stored="true"/>
 <field name="_version_" type="long" indexed="true" stored="true"
 multiValued="false"/>
 <copyfield source="id" dest="text" />
 <dynamicField name="ignored_*" type="text" indexed="false" stored="false"
 multiValued="true"/>
 
 <field name="description_ngram" type="text_ngram" indexed="true"
 stored="false" />
 <copyField source="contents" dest="description_ngram" />
 </fields>
 
 <types>
 
 <fieldType name="text_ngram" class="solr.TextField"
 positionIncrementGap="100">
 <analyzer>
 <tokenizer class="solr.StandardTokenizerFactory"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2" />
 </analyzer>
 </fieldType>
 
 
 <fieldType name="uuid" class="solr.UUIDField" indexed="true" />
 <fieldtype name="ignored" stored="false" indexed="false"
 class="solr.StrField" />
 <fieldType name="integer" class="solr.IntField" omitNorms="true"
 positionIncrementGap="0"/>
 <fieldType name="long" class="solr.LongField" />
 <fieldType name="string" class="solr.StrField" />
 <fieldType name="title_text" class="solr.TextField">
 <analyzer>
 <tokenizer class="solr.KeywordTokenizerFactory"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
 </fieldType>
 
 
 <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
 <filter class="solr.LowerCaseFilterFactory" />
 <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
 splitOnCaseChange="1" generateNumberParts="1" splitOnNumerics="1" />
 <filter class="solr.StemmerOverrideFilterFactory"
 dictionary="my_stemmer.txt" />
 <filter class="solr.SnowballPorterFilterFactory" />
 <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
 ignoreCase="true" expand="false" />
 <filter class="solr.EnglishMinimalStemFilterFactory" />
 <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2" />
 </analyzer>
 <analyzer type="query">
 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
 splitOnCaseChange="1" generateNumberParts="1" splitOnNumerics="1" />
 <filter class="solr.StemmerOverrideFilterFactory"
 dictionary="my_stemmer.txt" />
 <filter class="solr.SnowballPorterFilterFactory" />
 <filter class="solr.EnglishMinimalStemFilterFactory" />
 </analyzer>
 </fieldType>
 </types>
 <defaultSearchField>contents</defaultSearchField>
 <uniqueKey>id</uniqueKey>
 </schema>

 *solrconfig.xml* is:

 <?xml version="1.0" encoding="UTF-8" ?>
 
 <config>
 
   <luceneMatchVersion>LUCENE_42</luceneMatchVersion>
 
   <dataDir>${solr.document.data.dir:}</dataDir>
 
   <requestDispatcher handleSelect="false">
     <requestParsers enableRemoteStreaming="true"
       multipartUploadLimitInKB="8500" />
   </requestDispatcher>
 
   <lib dir="../lib" regex=".*\.jar" />
 
 
   <requestHandler name="standard" class="solr.StandardRequestHandler"
     default="true">
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <int name="rows">20</int>
       <str name="fl">*</str>
       <str name="df">id</str>
       <str name="version">2.1</str>
     </lst>
   </requestHandler>
 
   <updateHandler name="/select" class="solr.DirectUpdateHandler2">
     <updateLog>
       <str name="dir">${solr.document.data.dir:}</str>
     </updateLog>
   </updateHandler>
 
   <requestHandler name="/analysis/field" startup="lazy"
     class="solr.FieldAnalysisRequestHandler" />
   <requestHandler name="/admin/" class="solr.admin.AdminHandlers" />
   <requestHandler name="/update" class="solr.UpdateRequestHandler"/>
   <requestHandler name="/select" class="solr.SearchHandler">
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <int name="rows">10</int>
       <str name="df">contents</str>
     </lst>
   </requestHandler>
 </config>
 (I have also added extract, analysis, elevator, promotion, spell, and suggester
 components in solrconfig, but I guess those won't affect the select query.)
 When I run this:
 http://localhost:8080/solr/document/select?q=*:*   -- all 9 docs are
 returned
 
 but when I run this:
 http://localhost:8080/solr/document/select?q=programmer (or anything in
 place of programmer) -- the output shows numFound=0, even though
 "programmer" appears about 34 times in the docs.
 
 Initially it worked fine, but not now.
 Why is it so?



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/select-with-q-parameter-does-not-work-tp4106099.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: /select with 'q' parameter does not work

2013-12-11 Thread Amit Aggarwal
When you start Solr (java -jar start.jar), do you see any error or exception?
Check whether there is any problem there. Otherwise, take the stock example
solrconfig.xml and try to run with it; it should work.
On 11-Dec-2013 5:41 PM, Nutan nutanshinde1...@gmail.com wrote:

 <requestHandler name="standard" class="solr.StandardRequestHandler"
   default="true">
 
   <lst name="defaults">
     <str name="echoParams">explicit</str>
     <int name="rows">20</int>
     <str name="fl">*</str>
     <str name="df">id</str>
     <str name="version">2.1</str>
   </lst>
 </requestHandler>
 
 
 <updateHandler class="solr.DirectUpdateHandler2">
   <updateLog>
     <str name="dir">${solr.document.data.dir:}</str>
   </updateLog>
 </updateHandler>
 
 <requestHandler name="/update" class="solr.UpdateRequestHandler">
 </requestHandler>
 
 <requestHandler name="/analysis/field" startup="lazy"
   class="solr.FieldAnalysisRequestHandler" />
 <requestHandler name="/admin/" class="solr.admin.AdminHandlers" />
 <requestHandler name="/update" class="solr.UpdateRequestHandler"/>
 
 <requestHandler name="/select" class="solr.SearchHandler">
   <lst name="defaults">
     <str name="echoParams">explicit</str>
     <int name="rows">10</int>
     <str name="df">contents</str>
   </lst>
 </requestHandler>

 I made changes in this new solrconfig.xml, but the query still does not
 work.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/select-with-q-parameter-does-not-work-tp4106099p4106133.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Please help me to understand debugQuery output

2013-11-25 Thread Amit Aggarwal

Hello All,

Can anyone help me understand debugQuery output like this?


<lst name="explain">
<str>
0.6276088 = (MATCH) sum of:
  0.6276088 = (MATCH) max of:
    0.18323982 = (MATCH) sum of:
      0.18323982 = (MATCH) weight(state_search:a in 327) [DefaultSimilarity], result of:
        0.18323982 = score(doc=327,freq=2.0 = termFreq=2.0), product of:
          0.3188151 = queryWeight, product of:
            3.2512918 = idf(docFreq=35, maxDocs=342)
            0.098057985 = queryNorm
          0.5747526 = fieldWeight in 327, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            3.2512918 = idf(docFreq=35, maxDocs=342)
            0.125 = fieldNorm(doc=327)
    0.2505932 = (MATCH) sum of:
      0.2505932 = (MATCH) weight(country_search:a in 327) [DefaultSimilarity], result of:
        0.2505932 = score(doc=327,freq=1.0 = termFreq=1.0), product of:
          0.3135134 = queryWeight, product of:
            3.1972246 = idf(docFreq=37, maxDocs=342)
            0.098057985 = queryNorm
          0.79930615 = fieldWeight in 327, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            3.1972246 = idf(docFreq=37, maxDocs=342)
            0.25 = fieldNorm(doc=327)
    0.25283098 = (MATCH) sum of:
      0.25283098 = (MATCH) weight(area_search:a in 327) [DefaultSimilarity], result of:
        0.25283098 = score(doc=327,freq=1.0 = termFreq=1.0), product of:
          0.398 = queryWeight, product of:
            4.06 = idf(docFreq=15, maxDocs=342)
            0.098057985 = queryNorm
          0.6347222 = fieldWeight in 327, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            4.06 = idf(docFreq=15, maxDocs=342)
            0.15625 = fieldNorm(doc=327)
    0.6276088 = (MATCH) sum of:
      0.12957011 = (MATCH) weight(city_search:a in 327) [DefaultSimilarity], result of:
        0.12957011 = score(doc=327,freq=1.0 = termFreq=1.0), product of:
          0.3188151 = queryWeight, product of:
            3.2512918 = idf(docFreq=35, maxDocs=342)
            0.098057985 = queryNorm
          0.40641147 = fieldWeight in 327, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            3.2512918 = idf(docFreq=35, maxDocs=342)
            0.125 = fieldNorm(doc=327)
      0.3638727 = (MATCH) weight(city_search:ab in 327) [DefaultSimilarity], result of:
        0.3638727 = score(doc=327,freq=1.0 = termFreq=1.0), product of:
          0.5342705 = queryWeight, product of:
            5.4485164 = idf(docFreq=3, maxDocs=342)
            0.098057985 = queryNorm
          0.68106455 = fieldWeight in 327, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            5.4485164 = idf(docFreq=3, maxDocs=342)
            0.125 = fieldNorm(doc=327)
      0.13416591 = (MATCH) weight(city_search:b in 327) [DefaultSimilarity], result of:
        0.13416591 = score(doc=327,freq=1.0 = termFreq=1.0), product of:
          0.32441998 = queryWeight, product of:
            3.3084502 = idf(docFreq=33, maxDocs=342)
            0.098057985 = queryNorm
          0.41355628 = fieldWeight in 327, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            3.3084502 = idf(docFreq=33, maxDocs=342)
            0.125 = fieldNorm(doc=327)
</str>
</lst>

Any links where this explain output is explained?

Thanks

--
Amit Aggarwal
8095552012
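
For reference, each leaf clause above follows Lucene's DefaultSimilarity
formula: score = queryWeight * fieldWeight, with queryWeight = idf * queryNorm
and fieldWeight = tf * idf * fieldNorm. A minimal sketch (not from the original
thread) that recomputes the first clause (state_search:a, doc 327) from the
numbers in the debug output:

public class ExplainDemo {
    public static void main(String[] args) {
        double idf = 3.2512918;         // idf(docFreq=35, maxDocs=342)
        double queryNorm = 0.098057985;
        double tf = Math.sqrt(2.0);     // tf(freq=2.0) = sqrt(termFreq) = 1.4142135
        double fieldNorm = 0.125;       // fieldNorm(doc=327)

        double queryWeight = idf * queryNorm;      // 0.3188151
        double fieldWeight = tf * idf * fieldNorm; // 0.5747526
        double score = queryWeight * fieldWeight;  // 0.18323982

        System.out.printf("queryWeight=%.7f fieldWeight=%.7f score=%.8f%n",
                queryWeight, fieldWeight, score);
    }
}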



Re: Can I use boosting fields with edismax ?

2013-11-24 Thread Amit Aggarwal
Ok Erick.. I will try thanks
On 25-Nov-2013 2:46 AM, Erick Erickson erickerick...@gmail.com wrote:

 This should work. Try adding debug=all to your URL, and examine
 the output both with and without your boosting. I believe you'll see
 the difference in the score calculations. From there it's a matter
 of adjusting the boosts to get the results you want.
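
 For example (hypothetical host, core, and query term), the debug output can
 be requested like this, using the qf from the original mail below:

 http://localhost:8983/solr/select?defType=edismax&q=programmer&qf=value_search^2.0+city_search^2.5&debug=all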


 Best,
 Erick


 On Sat, Nov 23, 2013 at 9:17 AM, Amit Aggarwal amit.aggarwa...@gmail.com
 wrote:

  Hello All ,
 
  I am using defType=edismax
  So will boosting work like this in solrConfig.xml?
 
  <str name="qf">value_search^2.0 desc_search country_search^1.5
  state_search^2.0 city_search^2.5 area_search^3.0</str>
 
  I think it is not working.
 
  If it should work, then what should I do?
 



Can I use boosting fields with edismax ?

2013-11-23 Thread Amit Aggarwal
Hello All ,

I am using defType=edismax
So will boosting work like this in solrConfig.xml?

<str name="qf">value_search^2.0 desc_search country_search^1.5
state_search^2.0 city_search^2.5 area_search^3.0</str>

I think it is not working.

If it should work, then what should I do?


Re: Boosting documents by categorical preferences

2013-11-20 Thread Amit Nithian
I thought about that, but my concern/question was how. If I used the pow
function then I'm still boosting the bad categories by a small
amount... alternatively, I could multiply by a negative number, but does that
work as expected?

I haven't done much with negative boosting except for the sledgehammer
approach of category exclusion through filters.

Thanks
Amit
On Nov 19, 2013 8:51 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

 : My approach was something like:
 : 1) Look at the categories that the user has preferred and compute the
 : z-score
 : 2) Pick the top 3 among those
 : 3) Use those to boost search results.

 I think that totally makes sense ... the additional bit i was suggesting
 that you consider is that instead of picking the highest 3 z-scores,
 pick the z-scores with the greatest absolute value ... that way if someone
 is a very boring person and their positive interests are all basically
 exactly the same as the mean for everyone else, but they have some very
 strong dis-interests, you don't bother boosting on those minuscule
 interests and instead you negatively boost on the things they are
 antagonistic against.


 -Hoss



Re: How to get score with getDocList method Solr API

2013-11-19 Thread Amit Aggarwal
Hello Shekhar,
Thanks for answering. Do I have to set the GET_SCORES flag as the last parameter
of the getDocList method?

Thanks
On 19-Nov-2013 1:43 PM, Shalin Shekhar Mangar shalinman...@gmail.com
wrote:

 A few flags are supported:
 public static final int GET_DOCSET      = 0x4000;
 public static final int TERMINATE_EARLY = 0x04;
 public static final int GET_DOCLIST     = 0x02; // get the documents actually returned in a response
 public static final int GET_SCORES      = 0x01;

 Use the GET_SCORES flag to get the score with each document.

 On Tue, Nov 19, 2013 at 8:08 AM, Amit Aggarwal
 amit.aggarwa...@gmail.com wrote:
  Hello All,
 
  I am trying to develop a custom request handler.
  Here is the snippet :
 
  // returnMe is nothing but a list of Document going to return
 
  try {

      // FLAG ???
      DocList docList = searcher.getDocList(parsedQuery,
          parsedFilterQueryList, Sort.RELEVANCE, 1, maxDocs, FLAG);

      // Now get DocIterator
      DocIterator it = docList.iterator();

      // Now for each id get the doc and put it in List<Document>
      int i = 0;
      while (it.hasNext()) {
          returnMe.add(searcher.doc(it.next()));
      }
 
 
  Ques 1 -  My question is , what does FLAG represent in getDocList
 method ?
  Ques 2 -  How can I ensure that searcher.getDocList method give me score
  also with each document.
 
 
  --
  Amit Aggarwal
  8095552012
 



 --
 Regards,
 Shalin Shekhar Mangar.



How to get score with getDocList method Solr API

2013-11-18 Thread Amit Aggarwal

Hello All,

I am trying to develop a custom request handler.
Here is the snippet :

// returnMe is nothing but a list of Document going to return

try {

    // FLAG ???
    DocList docList = searcher.getDocList(parsedQuery,
        parsedFilterQueryList, Sort.RELEVANCE, 1, maxDocs, FLAG);

    // Now get DocIterator
    DocIterator it = docList.iterator();

    // Now for each id get the doc and put it in List<Document>
    int i = 0;
    while (it.hasNext()) {
        returnMe.add(searcher.doc(it.next()));
    }


Ques 1 - What does FLAG represent in the getDocList method?
Ques 2 - How can I ensure that the searcher.getDocList method gives me the
score along with each document?



--
Amit Aggarwal
8095552012
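
A minimal sketch (assuming the Solr 4.x SolrIndexSearcher API, and reusing the
variable names from the snippet above) of how the GET_SCORES flag answers both
questions: the flags are OR-ed into the last parameter, and the iterator then
carries a score per document.

DocList docList = searcher.getDocList(parsedQuery, parsedFilterQueryList,
        Sort.RELEVANCE, 0, maxDocs,
        SolrIndexSearcher.GET_DOCLIST | SolrIndexSearcher.GET_SCORES);

DocIterator it = docList.iterator();
while (it.hasNext()) {
    int docId = it.nextDoc();  // internal Lucene doc id
    float score = it.score();  // valid because GET_SCORES was set
    returnMe.add(searcher.doc(docId));
}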



Re: Boosting documents by categorical preferences

2013-11-18 Thread Amit Nithian
Hey Chris,

Sorry for the delay and thanks for your response. This was inspired by your
talk on boosting and biasing that you presented way back when at a meetup.
I'm glad that my general approach seems to make sense.

My approach was something like:
1) Look at the categories that the user has preferred and compute the
z-score
2) Pick the top 3 among those
3) Use those to boost search results.

I'll look at using the boosts as an exponent instead of a multiplier as I
think that would make sense.. also as it handles the 0 case.

This is for a prototype I am doing but I'll share the results one day in a
meetup as I think it'll be kinda interesting.

Thanks again
Amit


On Thu, Nov 14, 2013 at 11:11 AM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : I have a question around boosting. I wanted to use the boost= to write a
 : nested query that will boost a document based on categorical preferences.

 You have no idea how stoked I am to see you working on this in a real
 world application.

 : Currently I have the weights set to the z-score equivalent of a user's
 : preference for that category which is simply how many standard deviations
 : above the global average is this user's preference for that movie
 category.
 :
 : My question though is basically whether or not semantically the equation
 : query(category:Drama)*some weight + query(category:Comedy)*some
 weight
 : + query(category:Action)*some weight makes sense?

 My gut says that your approach makes sense -- but if I'm
 understanding you correctly, I think that you need to add 1 to
 all your weights: the boost is a multiplier, so if someone's rating for
 every category is 0 std devs above the average rating (ie: the most
 average person imaginable), you don't want to give every movie in every
 category a score of 0.

 Are you picking the top 3 categories the user prefers as a cut off, or
 are you arbitrarily using N category boosts for however many N categories
 the user is above the global average in their pref for that category?

 Are your preferences coming from explicit user feedback on the categories
 (ie: rate how much you like comedies on a scale of 1-5) or are you
 inferring it from user ratings of the movies themselves? (ie: rate this
 movie, which happens to be a scifi,action,comedy, on a scale of 1-5) ...
 because if it's the latter you probably want to be careful to also
 normalize based on how many categories the movie is in.

 the other thing to consider is whether you want to include negative
 preferences (ie: weights less than 1) based on how many std dev the user's
 average is *below* the global average for a category ... in this case i
 *think* you'd want to divide the raw value from -1 to get a useful
 multiplier.

 Alternatively: you could experiment with using the weights as exponents
 instead of multipliers...

 b=sum(pow(query($cat1),1.482),pow(query($cat2),0.1199),pow(query($cat3),1.448))

 ...that would simplify the math you'd have to worry about, both for the
 totally boring average user (x**0 = 1) and for the categories users hate
 (x**-5 = some positive fraction that will act as a penalty) ... but you'd
 definitely need to run some tests to see if it over-boosts as the std
 dev variations get really high (might want to take a root first before
 using them as the exponent)



 -Hoss
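
Plugged into the same request shape as the original query, the exponent
variant would look something like this (same hypothetical weights and
category fields, with qq carrying the user query):

q={!boost b=$b defType=edismax v=$qq}&qq=...&b=sum(pow(query($cat1),1.482),pow(query($cat2),0.1199),pow(query($cat3),1.448))&cat1=category:Drama&cat2=category:Comedy&cat3=category:Action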



Re: Why do people want to deploy to Tomcat?

2013-11-12 Thread Amit Aggarwal
Agreed with Doug
On 12-Nov-2013 6:46 PM, Doug Turnbull dturnb...@opensourceconnections.com
wrote:

 As an aside, I think one reason people feel compelled to deviate from the
 distributed jetty distribution is because the folder is named example.
 I've had to explain to a few clients that this is a bit of a misnomer. The
 IT dept especially sees example and feels uncomfortable using that as a
 starting point for a jetty install. I wish it was called default or bin
 or something where its more obviously the default jetty distribution of
 Solr.


 On Tue, Nov 12, 2013 at 7:06 AM, Roland Everaert reveatw...@gmail.com
 wrote:

  In my case, the first time I had to deploy and configure solr on tomcat
  (and jboss) it was a requirement to reuse as much as possible the
  application/web servers already in place. For the next deployment I also used
  tomcat, because I was used to deploying on tomcat and I don't know jetty at
  all.
 
  I could ask the same question with regard to jetty. Why use/bundle (if not
  recommend) jetty with solr over other webserver solutions?
 
  Regards,
 
 
  Roland Everaert.
 
 
 
  On Tue, Nov 12, 2013 at 12:33 PM, Alvaro Cabrerizo topor...@gmail.com
  wrote:
 
   In my case, the selection of the servlet container has never been a
 hard
   requirement. I mean, some customers provide us a virtual machine
  configured
   with java/tomcat , others have a tomcat installed and want to share it
  with
   solr, others prefer jetty because their sysadmins are used to configuring
   it...  At least in the projects I've been working in, the selection of
  the
   servlet engine has not been a key factor in the project success.
  
   Regards.
  
  
   On Tue, Nov 12, 2013 at 12:11 PM, Andre Bois-Crettez
   andre.b...@kelkoo.comwrote:
  
We are using Solr running on Tomcat.
   
 I think the top reasons for us are:
  - we already have nagios monitoring plugins for tomcat that trace
    queries ok/error, http codes / response time etc in access logs,
    number of threads, jvm memory usage etc
  - start, stop, watchdogs, logs: we also use our standard tools for that
  - what about security filters? Is that possible with jetty?
   
André
   
   
On 11/12/2013 04:54 AM, Alexandre Rafalovitch wrote:
   
Hello,
   
I keep seeing here and on Stack Overflow people trying to deploy
 Solr
  to
Tomcat. We don't usually ask why, just help when where we can.
   
But the question happens often enough that I am curious. What is the
actual
business case. Is that because Tomcat is well known? Is it because
  other
apps are running under Tomcat and it is ops' requirement? Is it
  because
Tomcat gives something - to Solr - that Jetty does not?
   
It might be useful to know. Especially, since Solr team is
 considering
making the server part into a black box component. What use cases
 will
that
break?
   
So, if somebody runs Solr under Tomcat (or needed to and gave up),
  let's
use this thread to collect this knowledge.
   
Regards,
Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
  at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
   book)
   
--
André Bois-Crettez
   
Software Architect
Search Developer
http://www.kelkoo.com/
   
   
Kelkoo SAS
Société par Actions Simplifiée (a French simplified joint-stock company)
Share capital: €4,168,964.30
Registered office: 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris
   
This message and its attachments are confidential and intended
exclusively for their addressees. If you are not the addressee of this
message, please destroy it and notify the sender.
   
  
 



 --
 Doug Turnbull
 Search  Big Data Architect
 OpenSource Connections http://o19s.com



Boosting documents by categorical preferences

2013-11-12 Thread Amit Nithian
Hi all,

I have a question around boosting. I wanted to use the boost= to write a
nested query that will boost a document based on categorical preferences.

For a movie search for example, say that a user likes drama, comedy, and
action. I could use things like

qq=q={!boost%20b=$b%20defType=edismax%20v=$qq}&b=sum(product(query($cat1),1.482),product(query($cat2),0.1199),product(query($cat3),1.448))&cat1=category:Drama&cat2=category:Comedy&cat3=category:Action

where cat1=Drama cat2=Comedy cat3=Action

Currently I have the weights set to the z-score equivalent of a user's
preference for that category which is simply how many standard deviations
above the global average is this user's preference for that movie category.

My question though is basically whether or not semantically the equation
query(category:Drama)*some weight + query(category:Comedy)*some weight
+ query(category:Action)*some weight makes sense?

What are some techniques people use to boost documents based on discrete
things like category, manufacturer, genre etc?

Thanks!
Amit


return value from SolrJ client to php

2013-10-28 Thread Amit Aggarwal
Hello All,

I have a requirement where I have to conect to Solr using SolrJ client and
documents return by solr to SolrJ client have to returned to PHP.

I know its simple to get document from Solr to SolrJ
But how do I return documents from SolrJ to PHP ?


Thanks
Amit Aggarwal


Re: When is/should qf different from pf?

2013-10-28 Thread Amit Nithian
Thanks Erick. Numeric fields make sense as I guess would strictly string
fields too since its one  term? In the normal text searching case though
does it make sense to have qf and pf differ?

Thanks
Amit
On Oct 28, 2013 3:36 AM, Erick Erickson erickerick...@gmail.com wrote:

 The facetious answer is when phrases aren't important in the fields.
 If you're doing a simple boolean match, adding phrase fields will add
 expense, to no good purpose etc. Phrases on numeric
 fields seems wrong.

 FWIW,
 Erick


 On Mon, Oct 28, 2013 at 1:03 AM, Amit Nithian anith...@gmail.com wrote:

  Hi all,
 
  I have been using Solr for years but never really stopped to wonder:
 
  When using the dismax/edismax handler, when do you have the qf different
  from the pf?
 
  I have always set them to be the same (maybe different weights) but I was
  wondering if there is a situation where you would have a field in the qf
  not in the pf or vice versa.
 
  My understanding from the docs is that qf is a term-wise hard filter
 while
  pf is a phrase-wise boost of documents who made it past the qf filter.
 
  Thanks!
  Amit
 



Re: How to configure solr to our java project in eclipse

2013-10-27 Thread Amit Aggarwal
How do you start your other project? If it is maven or ant, then you can
use the antrun plugin to start solr. Otherwise you can write a small shell
script to start solr.
 On 27-Oct-2013 9:15 PM, giridhar girimc...@gmail.com wrote:

 Hi friends, I am giridhar. Please clarify my doubt.

 We are using solr for our project. The problem is that solr is outside of our
 project (in another folder).

 We have to manually type java -jar start.jar to start solr and use its
 services.

 But what we need is: when we run the project, solr should
 start automatically.

 Our project is a java project with tomcat in eclipse.

 How can I achieve this?

 Please help me.

 Thank you.
 Giridhar






Re: Solr For

2013-10-27 Thread Amit Aggarwal
Depends. One core, one schema file, one solrconfig.xml.

So if you want only one core, then put all the required fields of both searches
in one schema file and carry out your searches. Otherwise, make two cores
with two schema files and perform the searches accordingly.
On 27-Oct-2013 7:22 AM, Baskar Sikkayan baskar@gmail.com wrote:

 Hi,
Looking for solr config for Job Site. In a job site there are 2 main
 searches.

 1) Employee can search for job ( based on skill set, job location, title,
 salary )
 2) Employer can search for employees ( based on skill set, exp, location,
  )

 Should i have a separate config xml for both searches?

 Thanks,
 Baskar



Re: Stop solr service

2013-10-27 Thread Amit Aggarwal
Lol ... Unsubscribe from this mailing list .
On 27-Oct-2013 5:02 PM, veena rani veenara...@gmail.com wrote:

 I want to stop the mail


 On Sun, Oct 27, 2013 at 4:37 PM, Rafał Kuć r@solr.pl wrote:

  Hello!
 
  Could you please write more about what you want to do? Do you need to
  stop running Solr process. If yes what you need to do is stop the
  container (Jetty/Tomcat) that Solr runs in. You can also kill JVM
  running Solr, however it will be usually enough to just stop the
  container.
 
  --
  Regards,
   Rafał Kuć
  Performance Monitoring * Log Analytics * Search Analytics
  Solr  Elasticsearch Support * http://sematext.com/
 
 
   Hi Team,
 
   Pls stop the solr service.
 
 


 --
 Regards,
 Veena Rani P N
 Banglore.
 9538440458



Re: How to configure solr to our java project in eclipse

2013-10-27 Thread Amit Nithian
Try this:
http://hokiesuns.blogspot.com/2010/01/setting-up-apache-solr-in-eclipse.html

I use this today and it still works. If anything is outdated (as it's a
relatively old post) let me know.
I wrote this so ping me if you have any questions.

Thanks
Amit


On Sun, Oct 27, 2013 at 7:33 PM, Amit Aggarwal amit.aggarwa...@gmail.comwrote:

 How so you start your another project ? If it is maven or ant then you can
 use anturn plugin to start solr . Otherwise you can write a small shell
 script to start solr ..
  On 27-Oct-2013 9:15 PM, giridhar girimc...@gmail.com wrote:

  Hi friends,Iam giridhar.please clarify my doubt.
 
  we are using solr for our project.the problem the solr is outside of our
  project( in another folder)
 
  we have to manually type java -start.jar to start the solr and use that
  services.
 
  But what we need is,when we run the project,the solr should be
  automatically
  start.
 
  our project is a java project with tomcat in eclipse.
 
  How can i achieve this.
 
  Please help me.
 
  Thankyou.
  Giridhar
 
 
 
 



When is/should qf different from pf?

2013-10-27 Thread Amit Nithian
Hi all,

I have been using Solr for years but never really stopped to wonder:

When using the dismax/edismax handler, when do you have the qf different
from the pf?

I have always set them to be the same (maybe different weights) but I was
wondering if there is a situation where you would have a field in the qf
not in the pf or vice versa.

My understanding from the docs is that qf is a term-wise hard filter while
pf is a phrase-wise boost of documents who made it past the qf filter.

Thanks!
Amit


Please explain SolConfig.xml in terms of SolrAPIs (Java Psuedo Code)

2013-10-25 Thread Amit Aggarwal
Hello All,

Can some one explain me following snippet of SolrConfig.xml in terms of
Solr API (Java Psuedo Code) for better understanding.

like
<updateHandler class="solr.DirectUpdateHandler2">

  <UpdateLog>
    <str dir="BLAH" />
  </UpdateLog>

</UpdateHandler>


Here I want to know:

1. What is updateHandler? Is it some package, class, or interface?
2. What is solr.DirectUpdateHandler2? Is it a class?
3. What is updateLog? Is it a package?
4. How do we know that UpdateLog has a sub-element dir?
5. How do we know that updateLog would be a sub-element of updateHandler?
Is updateLog some kind of subclass of something else?


I KNOW that all these things are given in SolrConfig.xml, but I do not want to
cram those things.

One example is jetty.xml: whatever we write there can be translated to
Java pseudo code.


Re: Please explain SolConfig.xml in terms of SolrAPIs (Java Psuedo Code)

2013-10-25 Thread Amit Aggarwal
Yeah, you caught it right. Yes, it was a kind of DTD.
Anyway, thanks a lot for clearing my doubt.

SOLVED .
On 25-Oct-2013 6:34 PM, Daniel Collins danwcoll...@gmail.com wrote:

 I think what you are looking for is some kind of DTD/schema you can use to
 see all the possible parameters in SolrConfig.xml, short answer, there
 isn't one (currently) :(

 jetty.xml has a DTD schema, and its XMLConfiguration format is inherently
 designed to convert to code, so the list of possible options can be
 generated by Java Reflection, but Solr's isn't quite that advanced.

 Generally speaking the config is described in
 http://wiki.apache.org/solr/SolrConfigXml.

 However, that is (by the nature of manually generated documentation) a bit
 out of date, so things like the updateLog aren't referenced there.  There
 is no Schema or DTD for SolrConfig, the best place to look for what the
 various options are is either the sample config which is generally quite
 good or the code (org.apache.solr.core.SolrConfig.java).

 At the end of the day updateLog is just the name of a config parameter; it
 is grouped under updateHandler since it relates to that.  How we know
 such a parameter exists:

 1) it was in the sample config (and commented to indicate what it means)
 2) it is referenced in the code if you look through that
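
 Along those lines, the smallest possible solrconfig.xml that Alexandre
 mentions below can be sketched roughly like this (a hypothetical, untested
 minimal file for Solr 4.x, not from the original thread):

 <?xml version="1.0" encoding="UTF-8"?>
 <config>
   <luceneMatchVersion>LUCENE_45</luceneMatchVersion>
   <requestHandler name="/select" class="solr.SearchHandler"/>
   <requestHandler name="/update" class="solr.UpdateRequestHandler"/>
 </config>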




 On 25 October 2013 13:06, Alexandre Rafalovitch arafa...@gmail.com
 wrote:

  I think better understanding is a bit too vague. Is there a specific
  problem you have? Your Jetty example would make sense if, for example,
 your
  goal was to automatically generate solrconfig.xml from some other
  configuration. But even then, you would probably use fillable templates
 and
  don't need fully corresponding JAVA api.
 
  For example, you are unlikely to edit the very line you are asking about,
  it's a little too esoteric:
  <updateHandler class="solr.DirectUpdateHandler2">
 
  Perhaps, what you want to do is to look at the smallest possible
  solrconfig.xml and then expand from there by looking at additional
 options.
 
  Regarding specific options available, most are documented on the Wiki and
  in the comments of the sample file.
 
  Regards,
 Alex.
 
  Personal website: http://www.outerthoughts.com/
  LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
  - Time is the quality of nature that keeps events from happening all at
  once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
 
 
  On Fri, Oct 25, 2013 at 5:19 PM, Amit Aggarwal 
 amit.aggarwa...@gmail.com
  wrote:
 
   Hello All,
  
   Can some one explain me following snippet of SolrConfig.xml in terms of
   Solr API (Java Psuedo Code) for better understanding.
  
   like
    <updateHandler class="solr.DirectUpdateHandler2">

      <UpdateLog>
        <str dir="BLAH" />
      </UpdateLog>

    </UpdateHandler>
  
  
   Here I want to know .
  
   1. What is updateHandler ? Is it some Package or class of interface ?
   2. Whats is solr.DirectUpdateHandler2 ? Is it class
   3. What is updateLog ? is it package ?
   4. How do we know that UpdateLog have sub-element dir ?
   5. how do we know that updateLog would be sub-element of
  updateHandler
   ?? Is updateLog some kind of subClass of something else ?
  
  
   I KNOW that all these things are given in SolConfig.xml but I donot
 want
  to
   cram those things .
  
   One example of jetty.xml whatever we write there , it can be translated
  to
   JAVA psuedo code
  
 



Re: Committing when indexing in parallel

2013-09-14 Thread Amit Jha
Hi,

As per my knowledge, any number of requests can be issued in parallel to index
documents. Any commit request will write them to the index.

So if P1 issues a commit, then all documents of P2 that are eligible get
committed, and the remaining documents will get committed on another commit request.


Rgds
AJ

On 14-Sep-2013, at 2:51, Phani Chaitanya pvempaty@gmail.com wrote:

 
 I'm wondering what happens to commit while we are indexing in parallel in
 Solr. Are the indexing update requests blocked until the commit finishes ?
 
 Lets say I've a process P1 which issued a commit request and there is
 another process P2 which is still indexing to the same index. What happens
 to the index in that scenario. Are the P2 indexing requests blocked until P1
 commit request finishes ?
 
 I'm just wondering about what is the behavior of Solr in the above case.
 
 
 
 -
 Phani Chaitanya


Re: MySQL Data import handler

2013-09-14 Thread Amit Jha
Hi Baskar,

Just create a single schema.xml which should contain the required fields from the 3
tables.

Add a status column to the child table, i.e.
1 = add
2 = update
3 = delete
4 = indexed
etc.

Write a program using SolrJ which will read the status and act
accordingly.
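
A minimal SolrJ sketch of that status-driven sync loop (hypothetical table,
column, and field names; HttpSolrServer is the SolrJ 4.x client class):

import java.sql.*;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SyncJob {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "pass");

        // status 1 = add, 2 = update -> (re)index the row
        ResultSet rs = db.createStatement().executeQuery(
                "SELECT id, name FROM child_table WHERE status IN (1, 2)");
        while (rs.next()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", rs.getString("id"));
            doc.addField("name", rs.getString("name"));
            solr.add(doc);
        }

        // status 3 = delete -> remove from the index
        rs = db.createStatement().executeQuery(
                "SELECT id FROM child_table WHERE status = 3");
        while (rs.next()) {
            solr.deleteById(rs.getString("id"));
        }

        solr.commit();

        // status 4 = indexed
        db.createStatement().executeUpdate(
                "UPDATE child_table SET status = 4 WHERE status IN (1, 2, 3)");
    }
}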
 

Rgds
AJ

On 15-Sep-2013, at 5:46, Baskar Sikkayan baskar@gmail.com wrote:

 Hi,
  If i am supposed to go with Java client, should i still do any
 configurations in solrconfig.xml or schema.xml.
 
 Thanks,
 Baskar.S
 
 
 On Sat, Sep 14, 2013 at 8:46 PM, Gora Mohanty g...@mimirtech.com wrote:
 
 On 14 September 2013 20:07, Baskar Sikkayan baskar@gmail.com wrote:
 Hi Gora,
Thanks a lot for your reply.
 My requirement is to combine 3 tables in mysql for search operations, and I am
 planning to sync these 3 tables (not all the columns) into Apache Solr.
 Whenever there is any change (adding a new row, deleting a row, modifying
 the column data in any of the 3 tables), the same has to be updated in
 solr. I guess, for this requirement, instead of going with delta-import, the
 Apache Solr java client will be more useful.
 [...]
 
 Yes, if you are comfortable with programming in Java,
 the Solr client would be a good alternative, though the
 DataImportHandler can also do what you want.
 
 Regards,
 Gora
 


Re: Solr Java Client

2013-09-14 Thread Amit Jha
Add a field called "source" in schema.xml; its value would be your table name.
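
For example, when building each document (a sketch with hypothetical values):

doc.addField("source", "jobs");   // or "employers", etc., per source table

Each search can then be restricted with a filter such as fq=source:jobs.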



Rgds
AJ

On 15-Sep-2013, at 5:38, Baskar Sikkayan baskar@gmail.com wrote:

 Hi,
  I am new to Solr and trying to use Solr java client instead of using the
 Data handler.
  Is there any configuration i need to do for this?
 
 I got the following sample code.
 
 SolrInputDocument doc = new SolrInputDocument();

  doc.addField("cat", "book");
  doc.addField("id", "book-" + i);
  doc.addField("name", "The Legend of the Hobbit part " + i);
  server.add(doc);
  server.commit();  // periodically flush
 
 I am confused here. I am going to index 3 different tables for 3 different
 kinds of searches, but here I don't have any option to differentiate the 3
 kinds of indexes.
 Am I missing anything here? Could anyone please shed some light on this?
 
 Thanks,
 Baskar.S


Re: Solr Java Client

2013-09-14 Thread Amit Jha
The question is not clear to me. Please be more elaborate in your query. Why do
you want to store the index per DB table?

Rgds
AJ

On 15-Sep-2013, at 7:20, Baskar Sikkayan baskar@gmail.com wrote:

 How do I add indexes for 3 different tables from java?
 
 
 On Sun, Sep 15, 2013 at 6:49 AM, Amit Jha shanuu@gmail.com wrote:
 
 Add a field called source in schema.xml and value would be your table
 names.
 
 
 
 Rgds
 AJ
 
 On 15-Sep-2013, at 5:38, Baskar Sikkayan baskar@gmail.com wrote:
 
 Hi,
 I am new to Solr and trying to use Solr java client instead of using the
 Data handler.
 Is there any configuration i need to do for this?
 
 I got the following sample code.
 
 SolrInputDocument doc = new SolrInputDocument();

 doc.addField("cat", "book");
 doc.addField("id", "book-" + i);
 doc.addField("name", "The Legend of the Hobbit part " + i);
 server.add(doc);
 server.commit();  // periodically flush
 
 I am confused here. I am going to index 3 different tables for 3
 different
 kind of searches. Here i dont have any option to differentiate 3 kind of
 indexes.
 Am i missing anything here. Could anyone please shed some light here?
 
 Thanks,
 Baskar.S
 


Re: Combining Solr score with customized user ratings for a document

2013-09-10 Thread Amit Jha
You can use the DB for storing user preferences, and later, if you want, you can flush
them to solr as an update along with the userid.

Or you may add a result pipeline filter 
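
A minimal sketch (hypothetical names, using the SolrJ client) of that second
option, re-scoring in the application layer after Solr responds; "ratings"
and "combined" are stand-ins for your own DB/cache lookup and result map:

for (SolrDocument d : response.getResults()) {
    float solrScore = (Float) d.getFieldValue("score"); // requires fl=*,score
    double pref = ratings.lookup(userId, (String) d.getFieldValue("id"));
    combined.put(d, solrScore * pref);                  // e.g. simple product
}
// ...then sort the documents by the combined value before rendering.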



Rgds
AJ

On 13-Feb-2013, at 17:50, Á_o chachime...@yahoo.es wrote:

 Hi:
 
 I am working on a project where we want to recommend our users products
 based on their previous 'likes', purchases and so on (typical stuff of a
 recommender system), while we want to let them browse freely the catalogue
 by search queries, making use of facets, more-like-this and so on (typical
 stuff of a Solr index).
 
 After reading here and there, I have reached the conclusion that's it's
 better to keep Solr Index apart from the database. Solr is for products
 (which can be reindexed from the DB as a nightly batch) while the DB is for
 everything else, including -the products and- user profiles. 
 
 So, given an user and a particular search (which can be as simple as q=*),
 on one hand we have Solr results (i.e. docs + scores) for the query, while
 on the other we have user predicted ratings (i.e. recommender scores) coming
 from the DB (though they could be cached elsewhere) for each of the products
 returned by Solr.
 
 And what I want is clear -to state-: combine both scores (e.g. by a simple
 product) so the user receives a sorted list of relevant products biased by
 his/her preferences.
 
 I have been googling for the last few days without finding which is the best
 way to achieve this.
 
 I think it's not a matter of boosting, or at least I can't see which
 boosting method could be useful as the boost should be user-based. I think
 that I need to extend -somewhere- Solr so I can alter the result scores by
 providing the user ID and connecting to the DB at query time, doing the
 necessary maths and returning the final score in a -quite- transparent way
 for the Web app.
 
 A less elegant solution could be letting Solr do its work as usual, and then
 navigate through the XML modifying the scores and reordering the whole list
 of products (or maybe just the first N results) by the new combined score.
 
 What do you think?
 A big THANKS in advance
 
 Álvaro
 
 
 


Re: More on topic of Meta-search/Federated Search with Solr

2013-08-26 Thread Amit Jha
Hi,

I would suggest for the following. 

1. Create custom search connectors for each individual sources.
2. Connector will responsible to query the source of any type web, gateways 
etc. and get the results  write the top N results to a solr.
3. Query the same keyword to solr and display the result. 

Would you like to create something like
http://knimbus.com


Rgds
AJ

On 27-Aug-2013, at 2:28, Dan Davis dansm...@gmail.com wrote:

 One more question here - is this topic more appropriate to a different list?
 
 
 On Mon, Aug 26, 2013 at 4:38 PM, Dan Davis dansm...@gmail.com wrote:
 
 I have now come to the task of estimating man-days to add Blended Search
 Results to Apache Solr.   The argument has been made that this is not
 desirable (see Jonathan Rochkind's blog entries on Bento search with
 blacklight).   But the estimate remains.No estimate is worth much
 without a design.   So, I am come to the difficult of estimating this
 without having an in-depth knowledge of the Apache core.   Here is my
 design, likely imperfect, as it stands.
 
   - Configure a core specific to each search source (local or remote)
   - On cores that index remote content, implement a periodic delete
   query that deletes documents whose timestamp is too old
   - Implement a custom requestHandler for the remote cores that goes
   out and queries the remote source.   For each result in the top N
   (configurable), it computes an id that is stable (e.g. it is based on the
   remote resource URL, doi, or hash of data returned).   It uses that id to
   look-up the document in the lucene database.   If the data is not there, it
   updates the lucene core and sets a flag that commit is required.   Once it
   is done, it commits if needed.
   - Configure a core that uses a custom SearchComponent to call the
   requestHandler that goes and gets new documents and commits them.   Since
   the cores for remote content are different cores, they can restart their
   searcher at this point if any commit is needed.   The custom
   SearchComponent will wait for commit and reload to be completed.   Then,
   search continues uses the other cores as shards.
   - Auto-warming on this will assure that the most recently requested
   data is present.
 
 It will, of course, be very slow a good part of the time.
 
 Erik and others, I need to know whether this design has legs and what
 other alternatives I might consider.
 
 
 
 On Sun, Aug 18, 2013 at 3:14 PM, Erick Erickson 
 erickerick...@gmail.comwrote:
 
 The lack of global TF/IDF has been answered in the past,
 in the sharded case, by usually you have similar enough
 stats that it doesn't matter. This pre-supposes a fairly
 evenly distributed set of documents.
 
 But if you're talking about federated search across different
 types of documents, then what would you rescore with?
 How would you even consider scoring docs that are somewhat/
 totally different? Think magazine articles and meta-data associated
 with pictures.
 
 What I've usually found is that one can use grouping to show
 the top N of a variety of results. Or show tabs with different
 types. Or have the app intelligently combine the different types
 of documents in a way that makes sense. But I don't know
 how you'd just get the right thing to happen with some kind
 of scoring magic.
 
 Best
 Erick
 
 
 On Fri, Aug 16, 2013 at 4:07 PM, Dan Davis dansm...@gmail.com wrote:
 
 I've thought about it, and I have no time to really do a meta-search
 during
 evaluation.  What I need to do is to create a single core that contains
 both of my data sets, and then describe the architecture that would be
 required to do blended results, with liberal estimates.
 
 From the perspective of evaluation, I need to understand whether any of
 the
 solutions to better ranking in the absence of global IDF have been
 explored?I suspect that one could retrieve a much larger than N set
 of
 results from a set of shards, re-score in some way that doesn't require
 IDF, e.g. storing both results in the same priority queue and
 *re-scoring*
 before *re-ranking*.
 
 The other way to do this would be to have a custom SearchHandler that
 works
 differently - it performs the query, retrieves all results deemed relevant
 by
 another engine, adds them to the Lucene index, and then performs the
 query
 again in the standard way.   This would be quite slow, but perhaps useful
 as a way to evaluate my method.
 
 I still welcome any suggestions on how such a SearchHandler could be
 implemented.
 


Solr admin search with wildcard

2013-06-27 Thread Amit Sela
I'm looking to search (in the solr admin search screen) a certain field
for:

*youtube*

I know that leading wildcards take a lot of resources, but I'm not worried
about that.

My only question is about the syntax, would this work:

field:*youtube* ?

Thanks,

I'm using Solr 3.6.2


Re: Solr admin search with wildcard

2013-06-27 Thread Amit Sela
The stored and indexed string is actually a url like
"http://www.youtube.com/somethingsomething".
It looks like removing the quotes does the job: iframe:*youtube*, or am I
wrong? For now, performance is not an issue, but accuracy is, and I would
like to know, for example, how many URLs have an iframe source leading to
YouTube. So a query like iframe:*youtube* with max rows 10 or
something will return in the response's numFound field the total number of
pages that have an iframe tag with a source matching *youtube*, no?


On Thu, Jun 27, 2013 at 3:24 PM, Jack Krupansky j...@basetechnology.comwrote:

 No, you cannot use wildcards within a quoted term.

 Tell us a little more about what your strings look like. You might want to
 consider tokenizing or using ngrams to avoid the need for wildcards.

 -- Jack Krupansky

 -Original Message- From: Amit Sela
 Sent: Thursday, June 27, 2013 3:33 AM
 To: solr-user@lucene.apache.org
 Subject: Solr admin search with wildcard


 I'm looking to search (in the solr admin search screen) a certain field
 for:

 *youtube*

 I know that leading wildcards takes a lot of resources but I'm not worried
 with that

 My only question is about the syntax, would this work:

 field:*youtube* ?

 Thanks,

 I'm using Solr 3.6.2



Re: Solr admin search with wildcard

2013-06-27 Thread Amit Sela
Forgive my ignorance, but I want to be sure: do I add
<copyField source="iframe" dest="text"/> to solrindex-mapping.xml,
so that my solrindex-mapping.xml looks like this?
<fields>
    <field dest="content" source="content"/>
    <field dest="title" source="title"/>
    <field dest="iframe" source="iframe"/>
    <field dest="host" source="host"/>
    <field dest="segment" source="segment"/>
    <field dest="boost" source="boost"/>
    <field dest="digest" source="digest"/>
    <field dest="tstamp" source="tstamp"/>
    <field dest="id" source="url"/>
    <copyField source="url" dest="url"/>
    <copyField source="iframe" dest="text"/>
</fields>
<uniqueKey>url</uniqueKey>

And what do you mean by standard tokenization ?

Thanks!
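
Jack's suggestion refers to Solr's own copyField, which lives in Solr's
schema.xml. A minimal sketch, assuming a tokenized field type such as
text_general exists in the schema and using a hypothetical destination field:

<field name="iframe_text" type="text_general" indexed="true" stored="false"/>
<copyField source="iframe" dest="iframe_text"/>

Whether iframe_text:youtube then matches without wildcards depends on the
analyzer: StandardTokenizer keeps host names like www.youtube.com together as
a single token, so a field type that also splits on punctuation (for example
one using WordDelimiterFilterFactory) may be what is actually needed.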


On Thu, Jun 27, 2013 at 3:43 PM, Jack Krupansky j...@basetechnology.comwrote:

 Just copyField from the string field to a text field and use standard
 tokenization, then you can search the text field for youtube or even
 something that is a component of the URL path. No wildcard required.


 -- Jack Krupansky

 -Original Message- From: Amit Sela
 Sent: Thursday, June 27, 2013 8:37 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr admin search with wildcard


 The stored and indexed string is actually a url like 
 "http://www.youtube.com/somethingsomething".
 It looks like removing the quotes does the job: iframe:*youtube* or am I
 wrong ? For now, performance is not an issue, but accuracy is and I would
 like to know for example how many URLS have iframe source leading to
 YouTube for example. So query like: iframe:*youtube* with max rows 10 or
 something will return in the response numFound field the total number of
 pages that have a tag ifarme with a source matching *youtube, No ?


 On Thu, Jun 27, 2013 at 3:24 PM, Jack Krupansky j...@basetechnology.com wrote:

  No, you cannot use wildcards within a quoted term.

 Tell us a little more about what your strings look like. You might want to
 consider tokenizing or using ngrams to avoid the need for wildcards.

 -- Jack Krupansky

 -Original Message- From: Amit Sela
 Sent: Thursday, June 27, 2013 3:33 AM
 To: solr-user@lucene.apache.org
 Subject: Solr admin search with wildcard


 I'm looking to search (in the solr admin search screen) a certain field
 for:

 *youtube*

 I know that leading wildcards takes a lot of resources but I'm not worried
 with that

 My only question is about the syntax, would this work:

 field:*youtube* ?

 Thanks,

 I'm using Solr 3.6.2





Re: Restaurant availability from database

2013-05-23 Thread Amit Nithian
Hossman did a presentation on something similar to this using spatial data
at a Solr meetup some months ago.

http://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/

May be helpful to you.


On Thu, May 23, 2013 at 9:40 AM, rajh ron...@trimm.nl wrote:

 Thank you for your answer.

 Do you mean I should index the availability data as a document in Solr?
 Because the availability data in our databases is around 6,509,972 records
 and contains the availability per number of seats and per 15 minutes. I
 also
 tried this method, and as far as I know it's only possible to join the
 availability documents and not to include that information per result
 document.

 An example API response (created from the Solr response):
 {
   "restaurants": [
     {
       "id": 13906,
       "name": "Allerlei",
       "zipcode": "6511DP",
       "house_number": 59,
       "available": true
     },
     {
       "id": 13907,
       "name": "Voorbeeld",
       "zipcode": "6512DP",
       "house_number": 39,
       "available": false
     }
   ],
   "resultCount": 12156,
   "resultCountAvailable": 55
 }

 I'm currently hacking around the problem by executing the search again with
 a very high value for the rows parameter and counting the number of
 available restaurants on the backend, but this causes a big performance
 impact (as expected).







solr doesn't start on tomcat on aws

2013-05-15 Thread amit
I am installing solr on tomcat7 in aws using the bitnami tomcat stack. My solr
server is not starting; below is the error:

INFO: Starting service Catalina
May 15, 2013 7:01:51 AM org.apache.catalina.core.StandardEngine startInternal
INFO: Starting Servlet Engine: Apache Tomcat/7.0.39
May 15, 2013 7:01:51 AM org.apache.catalina.startup.HostConfig deployDescriptor
INFO: Deploying configuration descriptor
/opt/bitnami/apache-tomcat/conf/Catalina/localhost/solr.xml
May 15, 2013 7:01:52 AM org.apache.catalina.startup.HostConfig deployDescriptor
SEVERE: Error deploying configuration descriptor
/opt/bitnami/apache-tomcat/conf/Catalina/localhost/solr.xml
java.lang.NullPointerException
  at org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:625)
  at org.apache.catalina.startup.HostConfig$DeployDescriptor.run(HostConfig.java:1637)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
  at java.util.concurrent.FutureTask.run(FutureTask.java:166)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:722)
May 15, 2013 7:01:52 AM org.apache.catalina.startup.HostConfig deployDescriptors
SEVERE: Error waiting for multi-thread deployment of context descriptors to complete
java.util.concurrent.ExecutionException: java.lang.NullPointerException
  at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
  at java.util.concurrent.FutureTask.get(FutureTask.java:111)
  at org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:579)
  at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:475)
  at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1402)
  at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:318)
  at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119)
  at org.apache.catalina.util.LifecycleBase.fireLifecycleEvent(LifecycleBase.java:90)
  at org.apache.catalina.util.LifecycleBase.setStateInternal(LifecycleBase.java:402)
  at org.apache.catalina.util.LifecycleBase.setState(LifecycleBase.java:347)

The /opt/bitnami/apache-tomcat/conf/Catalina/localhost/solr.xml looks like
this:

<?xml version="1.0" encoding="utf-8"?>

The contents of /usr/share/solr/ also look fine:

bitnami@ip-10-144-66-148:/usr/share/solr$ ls -l
total 11384
drwxr-xr-x 2 tomcat tomcat     4096 Jul 17  2012 bin
drwxr-xr-x 5 tomcat tomcat     4096 May 13 13:11 conf
drwxr-xr-x 9 tomcat tomcat     4096 Jul 17  2012 contrib
drwxr-xr-x 2 tomcat tomcat     4096 May 13 13:20 data
drwxr-xr-x 2 tomcat tomcat     4096 May 13 13:21 lib
-rw-r--r-- 1 tomcat tomcat     2259 Jul 17  2012 README.txt
-rw-r--r-- 1 tomcat tomcat 11628199 May 14 12:58 solr.war
-rw-r--r-- 1 tomcat tomcat     1676 Jul 17  2012 solr.xml

Not sure what is wrong, but this is killing me :-(




Re: writing a custom Filter plugin?

2013-05-14 Thread Amit Nithian
At first I thought you were referring to Filters in Lucene at query time
(i.e. bitset filters) but I think you are referring to token filters at
indexing/text analysis time?

I have had success writing my own Filter as the link presents. The key is
that you should write a custom class that extends TokenFilter (
http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/analysis/TokenFilter.html)
and write the implementation in your incrementToken() method.

My recollection of this is that instead of returning a Token object
like you would have in earlier versions of Lucene, you set attribute values
on a notional current token. One obvious attribute is the term text
itself, and perhaps any positional information. The best place to start is
to pick a fairly simple example from the Solr source (maybe
LowerCaseFilter) and try to mimic that.

Cheers!
Amit
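
A minimal sketch (assuming the Lucene/Solr 4.x analysis API, with a made-up
filter name) of the pattern described above: mutate the current token's term
text inside incrementToken().

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class LowerCaseLikeFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public LowerCaseLikeFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;                           // no more tokens upstream
        }
        char[] buf = termAtt.buffer();
        for (int i = 0; i < termAtt.length(); i++) {
            buf[i] = Character.toLowerCase(buf[i]); // in-place substitution
        }
        return true;                                // current token now carries the new text
    }
}

To use it from a schema, Solr also needs a small TokenFilterFactory that
constructs the filter, registered in the field type's analyzer chain.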


On Mon, May 13, 2013 at 1:33 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 Does anyone know of any tutorials, basic examples, and/or documentation on
 writing your own Filter plugin for Solr? For Solr 4.x/4.3?

 I would like a Solr 4.3 version of the normalization filters found here
 for Solr 1.4: 
 https://github.com/billdueber/**lib.umich.edu-solr-stuffhttps://github.com/billdueber/lib.umich.edu-solr-stuff

 But those are old, for Solr 1.4.

 Does anyone have any hints for writing a simple substitution Filter for
 Solr 4.x?  Or, does a simple sourcecode example exist anywhere?



Re: Need solr query help

2013-05-14 Thread Amit Nithian
Is it possible instead to store in your solr index a bounding box of store
location + delivery radius, and do a bounding box intersection between your
user's point + radius (as a bounding box) and the shop's delivery bounding
box? If you want further precision, the frange may work, assuming it's a
post-filter implementation, so that you are doing the heavy computation on a
presumably small set of data, only to filter out the corner cases around the
edge of the resulting radius circle.

I haven't looked at Solr's spatial querying in a while to know if this is
possible or not.

Cheers
Amit


On Sat, May 11, 2013 at 10:42 AM, smsolr sms...@hotmail.com wrote:

 Hi Abhishek,

 I forgot to explain why it works.  It uses the frange filter which is
 mentioned here:-

 http://wiki.apache.org/solr/CommonQueryParameters

 and it works because it filters in results where the geodist minus the
 shopMaxDeliveryDistance is less than zero (that's what the u=0 means, upper
 limit=0), i.e.:-

 geodist - shopMaxDeliveryDistance < 0
 =>
 geodist < shopMaxDeliveryDistance

 i.e. the geodist is less than the shopMaxDeliveryDistance and so the shop
 is
 within delivery range of the location specified.
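
 In query form that is something like (hypothetical field and parameter
 names, with geodist() driven by sfield/pt as usual):

 fq={!frange u=0}sub(geodist(),shopMaxDeliveryDistance)&sfield=location&pt=52.37,4.89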

 smsolr






edismax returns very less matches than regular

2013-04-08 Thread amit
I have a simple system. I put the title of webpages into the name field and
content of the web pages into the Description field.
I want to search both fields and give the name a little more boost.
A search on name field or description field returns records cloase to
hundreds.

http://localhost:8983/solr/select/?q=name:%28coldfusion^2%20cache^1%29fq=author:[*%20TO%20*]%20AND%20-author:chinmoypstart=0rows=10fl=author,score,%20id

But search on both fields using boost just gives 5 matches.

http://localhost:8983/solr/mindfire/?q=%28%20coldfusion^2%20cache^1%29*defType=edismaxqf=name^1.5%20description^1.0*fq=author:[*%20TO%20*]%20AND%20-author:chinmoypstart=0rows=10fl=author,score,%20id

I am wondering what is wrong, because there are valid results returned in
1st query which is ignored by edismax. I am on solr3.6





using edismax without velocity

2013-04-06 Thread amit
I am using solr3.6 and trying to use the edismax handler.
The config has a /browse requestHandler, but it doesn't work because of a
missing class definition (VelocityResponseWriter) error:
<queryResponseWriter name="velocity" class="solr.VelocityResponseWriter"
startup="lazy"/>
I have copied the jars to solr/lib following the steps here, but no luck:
http://wiki.apache.org/solr/VelocityResponseWriter#Using_the_VelocityResponseWriter_in_Solr_Core

I just want to search on multiple fields with different boosts. *Can I use
edismax with the /select requestHandler?* If I write a query like below,
does it search in both the fields name and description?
Does the query below solve my purpose?
http://localhost:8080/solr/select/?q=(coldfusion^2
cache^1)&defType=edismax&qf=name^2 description^1&fq=author:[* TO *] AND
-author:chinmoyp&start=0&rows=10&fl=author,score, id







Re: Sharing index amongst multiple nodes

2013-04-06 Thread Amit Nithian
I don't understand why this would be more performant.. seems like it'd be
more memory and resource intensive as you'd have multiple class-loaders and
multiple cache spaces for no good reason. Just have a single core with
sufficiently large caches to handle your response needs.

If you want to load balance reads consider having multiple physical nodes
with a master/slaves or SolrCloud.


On Sat, Apr 6, 2013 at 9:21 AM, Daire Mac Mathúna daire...@gmail.comwrote:

 Hi. What are the thoughts on having multiple SOLR instances, i.e. multiple
 SOLR war files, sharing the same index (i.e. sharing the same solr_home),
 where only one SOLR instance is used for writing and the others for
 reading?

 Is this possible?

 Is it beneficial - is it more performant than having just one solr
 instance?

 How does it affect auto-commits, i.e. how would the read nodes know the
 index has been changed and re-populate caches etc.?

 Solr 3.6.1

 Thanks.



Re: how to skip test while building

2013-04-06 Thread Amit Nithian
If you generate the maven pom files you can do this, I think, by doing
"mvn <whatever goal here> -DskipTests=true".


On Sat, Apr 6, 2013 at 7:25 AM, Erick Erickson erickerick...@gmail.comwrote:

 Don't know a good way to skip compiling the tests, but there isn't
 any harm in compiling them...

 changing to the solr directory and just issuing
 ant example dist builds pretty much everything. You don't execute
 tests unless you specify ant test.

 ant -p shows you all the targets. Note that you have different
 targets depending on whether you're executing it in solr_home or
 solr_home/solr or solr_home/lucene.

 Since you mention Solr, you probably want to work in solr_home/solr to
 start.

 Best
 Erick

 On Sat, Apr 6, 2013 at 5:36 AM, parnab kumar parnab.2...@gmail.com
 wrote:
  Hi All,
 
    I am new to Solr. I am using solr 3.4. I want to build without
  building the lucene test files and skip firing the tests. Can
  anyone please point out where to make the necessary changes?
 
  Thanks,
  Pom



Re: Solr 4.2 single server limitations

2013-04-05 Thread Amit Nithian
There's a whole heap of information that is missing like what you plan on
storing vs indexing and yes QPS too. My short answer is try with one server
until it falls over then start adding more.

When you say multiple-server setup do you mean multiple servers where each
server acts as a slave storing the entire index so you have load balancing
across multiple servers OR do you mean multiple servers where each server
stores a portion of the data? If it's the former, sometimes a simple
master/slave setup in Solr 4.x works but the latter may mean SolrCloud.
Master/Slave is easy but I don't know much about SolrCloud.

Questions to think about (this is not exhaustive by any means)
1) When you say 5-10 pages per website (300+ websites) that you are
crawling 2x per hour, are you *replacing* the old copy of the web page in
your index or storing some form of history for some reason.
2) What are you planning on storing vs indexing which would dictate your
memory requirements.
3) You mentioned you don't know QPS but having some guess would help.. is
it mostly for storage and occasional lookup (where slow responses is
probably tolerable) or is this powering a real user-facing website (where
low latency is prob desired).

Again, I like to start simple and use one server until it dies then expand
from there.

Cheers
Amit


On Thu, Apr 4, 2013 at 7:58 AM, imehesz imeh...@gmail.com wrote:

 hello,

 I'm using a single server setup with Nutch (1.6) and Solr (4.2)

 I plan to trigger the Nutch crawling process every 30 minutes or so and add
 about 300+ websites a month with (~5-10 pages each). At this point I'm not
 sure about the query requests/sec.

 Can I run this on a single server (how long)?
 If not, what would be the best and most efficient way to have multiple
 server setup?

 thanks,
 --iM






unknown field error when indexing with nutch

2013-04-05 Thread Amit Sela
Hi all,

I'm trying to run a nutch crawler and index to Solr.
I'm running Nutch 1.6 and Solr 4.2.

I managed to crawl and index with that Nutch version into Solr 3.6.2 but I
can't seem to manage to run it with Solr 4.2

I re-built Nutch with the schema-solr4.xml and copied that file to
SOLR_HOME/example/solr/collection1/conf/schema.xml but the job fails when
trying to index:

SolrException: ERROR: [doc=
http://0movies.com/watchversion.php?id=3818&link=1364879137] unknown field
'host'

It looks like Solr is not aware of the schema... Did I miss something ?

Thanks.


Re: unknown field error when indexing with nutch

2013-04-05 Thread Amit Sela
I'm using the solrconfig supplied with Sole 4.2 and I added the nutch
request handler. But I keep getting the same errors.
 On Apr 5, 2013 8:11 PM, Jack Krupansky j...@basetechnology.com wrote:

 Check your solrconfig.xml file for references to a "host" field.

 But maybe more importantly, make sure you use a Solr 4.1 solrconfig and
 merge in any of your application-specific changes.

 -- Jack Krupansky

 -Original Message- From: Amit Sela
 Sent: Friday, April 05, 2013 12:57 PM
 To: solr-user@lucene.apache.org
 Subject: unknown field error when indexing with nutch

 Hi all,

 I'm trying to run a nutch crawler and index to Solr.
 I'm running Nutch 1.6 and Solr 4.2.

 I managed to crawl and index with that Nutch version into Solr 3.6.2, but I
 can't seem to get it working with Solr 4.2.

 I re-built Nutch with the schema-solr4.xml and copied that file to
 SOLR_HOME/example/solr/collection1/conf/schema.xml but the job fails
 when trying to index:

 SolrException: ERROR: [doc=
 http://0movies.com/watchversion.php?id=3818&link=1364879137]
 unknown field
 'host'

 It looks like Solr is not aware of the schema... Did I miss something?

 Thanks.



Re: solre scores remains same for exact match and nearly exact match

2013-04-04 Thread amit
Thanks Jack and Andre.
I am trying to use edismax, but am stuck with the NoClassDefFoundError:
org/apache/solr/response/QueryResponseWriter
I am using Solr 3.6.
I have followed the steps here:
http://wiki.apache.org/solr/VelocityResponseWriter#Using_the_VelocityResponseWriter_in_Solr_Core

Just the jars were copied; the rest was already there in solrconfig.xml.








Re: do SearchComponents have access to response contents

2013-04-04 Thread Amit Nithian
We need to also track the size of the response (as the size in bytes of the
whole xml response that is streamed, with stored fields and all). I was a
bit worried because I am wondering if a searchcomponent will actually have
access to the response bytes...

== Can't you get this from your container access logs after the fact? I
may be misunderstanding something but why wouldn't mining the Jetty/Tomcat
logs for the response size here suffice?

Thanks!
Amit
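
For reference, the wrapper approach Jack describes in the quoted thread below
can be sketched roughly like this against the Solr 4.x APIs. This is a minimal
sketch, not the actual implementation discussed: the class name and the metrics
hook are hypothetical, and note that a java.io.Writer counts characters, not
raw bytes.

    import java.io.FilterWriter;
    import java.io.IOException;
    import java.io.Writer;
    import org.apache.solr.common.util.NamedList;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.QueryResponseWriter;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.response.XMLResponseWriter;

    // Delegates to the stock XML writer and counts everything it streams out.
    public class SizeTrackingXmlWriter implements QueryResponseWriter {

        private final XMLResponseWriter delegate = new XMLResponseWriter();

        @SuppressWarnings("rawtypes")
        public void init(NamedList args) { /* no configuration needed for this sketch */ }

        public String getContentType(SolrQueryRequest req, SolrQueryResponse rsp) {
            return delegate.getContentType(req, rsp);
        }

        public void write(Writer out, SolrQueryRequest req, SolrQueryResponse rsp)
                throws IOException {
            final long[] chars = {0};
            Writer counting = new FilterWriter(out) {
                public void write(int c) throws IOException { chars[0]++; super.write(c); }
                public void write(char[] b, int off, int len) throws IOException {
                    chars[0] += len; super.write(b, off, len);
                }
                public void write(String s, int off, int len) throws IOException {
                    chars[0] += len; super.write(s, off, len);
                }
            };
            delegate.write(counting, req, rsp);
            counting.flush();
            // chars[0] now holds the response size; record it here (hypothetical hook)
        }
    }

It would be registered through the queryResponseWriter element in solrconfig.xml
that Jack mentions; for true byte counts, the container access logs suggested
above remain the simpler option.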


On Thu, Apr 4, 2013 at 1:34 AM, xavier jmlucjav jmluc...@gmail.com wrote:

 A custom QueryResponseWriter...this makes sense, thanks Jack


 On Wed, Apr 3, 2013 at 11:21 PM, Jack Krupansky j...@basetechnology.com
 wrote:

   The search components can see the response as a NamedList, but it is
   only when SolrDispatchFilter calls the QueryResponseWriter that XML or
   JSON or whatever other format (Javabin as well) is generated from the
   named list for final output in an HTTP response.
  
   You probably want a custom query response writer that wraps the XML
   response writer. Then you can generate the XML and then do whatever you
   want with it.
  
   The relevant pieces are the QueryResponseWriter class and the
   <queryResponseWriter> element in solrconfig.xml.
 
  -- Jack Krupansky
 
  -Original Message- From: xavier jmlucjav
  Sent: Wednesday, April 03, 2013 4:22 PM
  To: solr-user@lucene.apache.org
  Subject: do SearchComponents have access to response contents
 
 
  I need to implement some SearchComponent that will deal with metrics on
 the
  response. Some things I see will be easy to get, like number of hits for
  instance, but I am more worried with this:
 
  We need to also track the size of the response (as the size in bytes of
 the
  whole xml response that is streamed, with stored fields and all). I was a
  bit worried because I am wondering if a searchcomponent will actually have
  access to the response bytes...
 
  Can someone confirm one way or the other? We are targeting Sorl4.0
 
  thanks
  xavier
 



Solr ZooKeeper ensemble with HBase

2013-04-03 Thread Amit Sela
Hi all,

I have a running Hadoop + HBase cluster and the HBase cluster is running
it's own zookeeper (HBase manages zookeeper).
I would like to deploy my SolrCloud cluster on a portion of the machines on
that cluster.

My question is: should I have any trouble/issues deploying an additional
ZooKeeper ensemble? I don't want to use the HBase ZooKeeper because, well,
first of all HBase manages it so I'm not sure it's possible, and second I
have HBase working pretty hard at times and I don't want to create any
connection issues by overloading ZooKeeper.

Thanks,

Amit.


Re: solre scores remains same for exact match and nearly exact match

2013-04-03 Thread amit
Thanks. I added a copy field and that fixed the issue.


On Wed, Apr 3, 2013 at 12:29 PM, Gora Mohanty-3 [via Lucene] 
ml-node+s472066n4053412...@n3.nabble.com wrote:

  On 3 April 2013 10:52, amit [hidden email] wrote:
 
  Below is my query
  http://localhost:8983/solr/select/?q=subject:session management in
  php&fq=category:[*%20TO%20*]&fl=category,score,subject
 [...]

 Add debugQuery=on to your Solr URL, and you will get an
 explanation of the score. Your subject field is tokenised, so
 that there is no a priori reason that an exact match should
 score higher. Several strategies are available if you want that
 behaviour. Try searching Google, e.g., for "solr exact match
 higher score".

 Regards,
 Gora







Re: Solr ZooKeeper ensemble with HBase

2013-04-03 Thread Amit Sela
Trouble in what way? If I have enough memory - HBase RegionServer 10GB and
maybe 2GB for Solr - or do you mean CPU/disk?


On Wed, Apr 3, 2013 at 5:54 PM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 Hello, Amit:

 My guess is that, if HBase is working hard, you're going to have more
 trouble with HBase and Solr on the same nodes than HBase and Solr
 sharing a Zookeeper. Solr's usage of Zookeeper is very minimal.

 Michael Della Bitta

 
 Appinions
 18 East 41st Street, 2nd Floor
 New York, NY 10017-6271

 www.appinions.com

 Where Influence Isn’t a Game


 On Wed, Apr 3, 2013 at 8:06 AM, Amit Sela am...@infolinks.com wrote:
  Hi all,
 
  I have a running Hadoop + HBase cluster and the HBase cluster is running
  it's own zookeeper (HBase manages zookeeper).
  I would like to deploy my SolrCloud cluster on a portion of the machines
 on
  that cluster.
 
  My question is: Should I have any trouble / issues deploying an
 additional
  ZooKeeper ensemble ? I don't want to use the HBase ZooKeeper because,
 well
  first of all HBase manages it so I'm not sure it's possible and second I
  have HBase working pretty hard at times and I don't want to create any
  connection issues by overloading ZooKeeper.
 
  Thanks,
 
  Amit.



Re: solre scores remains same for exact match and nearly exact match

2013-04-03 Thread amit
When I use the copyField destination as "text" it works fine:
I get a boost for the exact match.
But if I use some other field, the score is not boosted for the exact match.

<field name="keywords" type="text_general" indexed="true" stored="false"
multiValued="true"/>
<copyField source="subject" dest="keywords"/>

Not sure if I am going in the right direction... I am new to Solr, please bear
with me.
I checked this link: http://wiki.apache.org/solr/SolrRelevancyCookbook
and am trying to index the same field multiple times to get an exact match.







solre scores remains same for exact match and nearly exact match

2013-04-02 Thread amit

Below is my query
http://localhost:8983/solr/select/?q=subject:session management in
php&fq=category:[*%20TO%20*]&fl=category,score,subject

The result is like below

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">983</int>
  <lst name="params">
    <str name="fq">category:[* TO *]</str>
    <str name="q">subject:session management in php</str>
    <str name="fl">category,score,subject</str>
  </lst>
</lst>
<result name="response" maxScore="0.8770298" start="0" numFound="2">
  <doc>
    <float name="score">0.8770298</float>
    <str name="category">Annapurnap</str>
    <str name="subject">session management in asp.net</str>
  </doc>
  <doc>
    <float name="score">0.8770298</float>
    <str name="category">Annapurnap</str>
    <str name="subject">session management in PHP</str>
  </doc>
</result>
</response>

The question is how come both have the same score when one is an exact match
and the other isn't.
This is the schema:

<field name="subject" type="text_en_splitting" indexed="true" stored="true"/>
<field name="category" type="text_general" indexed="true" stored="true"/>







Re: SOLR on hdfs

2013-03-06 Thread Amit Nithian
Why wouldn't SolrCloud help you here? You can set up shards and replicas etc.
to have redundancy, b/c HDFS isn't designed to serve real-time queries as
far as I understand. If you are using HDFS as a backup mechanism, to me
you'd be better served having multiple slaves tethered to a master (in a
non-cloud environment) or setting up SolrCloud; either option would give you
more redundancy than copying an index to HDFS.

- Amit


On Wed, Mar 6, 2013 at 12:23 PM, Joseph Lim ysli...@gmail.com wrote:

 Hi Upayavira,

 Sure, let me explain. I am setting up Nutch and SOLR in a Hadoop environment.
 Since I am using HDFS, in the event of any crash on the
 localhost (running Solr), I will still have the shards of data being stored
 in HDFS.

 Thank you so much =)

 On Thu, Mar 7, 2013 at 1:19 AM, Upayavira u...@odoko.co.uk wrote:

  What are you actually trying to achieve? If you can share what you are
  trying to achieve maybe folks can help you find the right way to do it.
 
  Upayavira
 
  On Wed, Mar 6, 2013, at 02:54 PM, Joseph Lim wrote:
    Hello Otis,

    Is there any configuration where it will index into HDFS instead?

    I tried Crawlzilla and Lily, but I want to be able to update specific
    packages, such as Hadoop only or Nutch only, when there are updates.

    That's why I would prefer to install separately.

    Thanks so much. Looking forward to your reply.
  
   On Wednesday, March 6, 2013, Otis Gospodnetic wrote:
  
Hello Joseph,
   
You can certainly put them there, as in:
  hadoop fs -copyFromLocal <localsrc> <URI>
   
But searching such an index will be slow.
See also: http://katta.sourceforge.net/
   
Otis
--
 Solr & ElasticSearch Support
http://sematext.com/
   
   
   
   
   
    On Wed, Mar 6, 2013 at 7:50 AM, Joseph Lim ysli...@gmail.com wrote:
   
 Hi,
 Would like to know how I can put the indexed Solr shards into HDFS?

 Thanks..

 Joseph
  On Mar 6, 2013 7:28 PM, Otis Gospodnetic otis.gospodne...@gmail.com
  wrote:

  Hi Joseph,
 
  What exactly are you looking to do?
  See http://incubator.apache.org/blur/
 
  Otis
  --
  Solr & ElasticSearch Support
  http://sematext.com/
 
 
 
 
 
   On Wed, Mar 6, 2013 at 2:39 AM, Joseph Lim ysli...@gmail.com wrote:
 
   Hi, I am running the Hadoop distributed file system; how do I put my
   output of the Solr dir into HDFS automatically?
  
   Thanks so much..
  
   --
   Best Regards,
   *Joseph*
  
 

   
  
  
   --
   Best Regards,
   *Joseph*
 



 --
 Best Regards,
 *Joseph*



Re: SOLR on hdfs

2013-03-06 Thread Amit Nithian
Joseph,

Doing what Otis said will do literally what you want which is copying the
index to HDFS. It's no different than copying it to a different machine
which btw is what Solr's master/slave replication scheme does.
Alternatively, I think people are starting to set up new Solr instances with
SolrCloud, which doesn't have the concept of master/slave but rather a
series of nodes with the option of having replicas (what I believe to be
backup nodes) so that you have the redundancy you want.

Honestly, HDFS in the way that you are looking for is probably no different
than storing your Solr index in a RAIDed storage format, but I don't
pretend to know much about RAID arrays.

What exactly are you trying to achieve from a systems perspective? Why do
you want Hadoop in the mix here, and how does copying the index to HDFS help
you? If SolrCloud seems complicated, try just setting up a simple
master/slave replication scheme; that's really easy.

Cheers
Amit


On Wed, Mar 6, 2013 at 9:55 PM, Joseph Lim ysli...@gmail.com wrote:

 Hi Amit,

 So you mean that if I just want to get redundancy for Solr in HDFS, the
 best way to do it is, as per what Otis suggested, to use the following
 command:

 hadoop fs -copyFromLocal <localsrc> <URI>

 OK, let me try out SolrCloud, as I will need to make sure it works well with
 Nutch too.

 Thanks for the help..


 On Thu, Mar 7, 2013 at 5:47 AM, Amit Nithian anith...@gmail.com wrote:

  Why wouldn't SolrCloud help you here? You can set up shards and replicas
  etc. to have redundancy, b/c HDFS isn't designed to serve real-time
  queries as far as I understand. If you are using HDFS as a backup
  mechanism, to me you'd be better served having multiple slaves tethered
  to a master (in a non-cloud environment) or setting up SolrCloud; either
  option would give you more redundancy than copying an index to HDFS.
 
  - Amit
 
 
  On Wed, Mar 6, 2013 at 12:23 PM, Joseph Lim ysli...@gmail.com wrote:
 
   Hi Upayavira,
  
   Sure, let me explain. I am setting up Nutch and SOLR in a Hadoop
   environment.
   Since I am using HDFS, in the event of any crash on the
   localhost (running Solr), I will still have the shards of data being
   stored in HDFS.

   Thank you so much =)
  
   On Thu, Mar 7, 2013 at 1:19 AM, Upayavira u...@odoko.co.uk wrote:
  
What are you actually trying to achieve? If you can share what you
 are
trying to achieve maybe folks can help you find the right way to do
 it.
   
Upayavira
   
On Wed, Mar 6, 2013, at 02:54 PM, Joseph Lim wrote:
 Hello Otis ,

 Is there any configuration where it will index into hdfs instead?

 I tried Crawlzilla and Lily, but I want to be able to update specific
 packages, such as Hadoop only or Nutch only, when there are updates.

 That's why I would prefer to install separately.

 Thanks so much. Looking forward to your reply.

 On Wednesday, March 6, 2013, Otis Gospodnetic wrote:

  Hello Joseph,
 
  You can certainly put them there, as in:
  hadoop fs -copyFromLocal <localsrc> <URI>
 
  But searching such an index will be slow.
  See also: http://katta.sourceforge.net/
 
  Otis
  --
   Solr & ElasticSearch Support
  http://sematext.com/
 
 
 
 
 
   On Wed, Mar 6, 2013 at 7:50 AM, Joseph Lim ysli...@gmail.com wrote:
 
   Hi,
   Would like to know how I can put the indexed Solr shards into HDFS?
  
   Thanks..
  
   Joseph
    On Mar 6, 2013 7:28 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:
  
Hi Joseph,
   
    What exactly are you looking to do?
See http://incubator.apache.org/blur/
   
Otis
--
    Solr & ElasticSearch Support
http://sematext.com/
   
   
   
   
   
 On Wed, Mar 6, 2013 at 2:39 AM, Joseph Lim ysli...@gmail.com wrote:
   
 Hi I am running hadoop distributed file system, how do I
 put
  my
  output
   of
 the solr dir into hdfs automatically?

 Thanks so much..

 --
 Best Regards,
 *Joseph*

   
  
 


 --
 Best Regards,
 *Joseph*
   
  
  
  
   --
   Best Regards,
   *Joseph*
  
 



 --
 Best Regards,
 *Joseph*



Re: ping query frequency

2013-03-03 Thread Amit Nithian
We too run a ping every 5 seconds, and I think the concurrent Mark/Sweep
collector helps keep the LB from taking a box out of rotation due to long
pauses. Either that or I don't see large enough pauses for my LB to take it
out (it'd have to fail 3 times in a row or 15 seconds total before it's gone).

The ping query does execute an actual query, so of course you want to make
this as simple as possible (i.e. q=primary_key:value) so that there's
limited to no scanning of the index. I think our query does an id:0, which
would always return 0 docs, but any stupid-simple query is fine so long
as it hits the caches on subsequent hits. The goal, to me at least, is not
that the ping query yields actual docs but that it's a mechanism to take
a Solr server out of rotation without having to log in to an
ops-controlled device directly.
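
For illustration, the same kind of ping can be fired from SolrJ 4.x; this is a
minimal sketch with a placeholder core URL, and it simply exercises whatever
query the /admin/ping handler is configured to run:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.SolrPingResponse;

    public class PingCheck {
        public static void main(String[] args) throws Exception {
            // placeholder URL; point it at the core your load balancer probes
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            try {
                SolrPingResponse rsp = solr.ping(); // issues a request to /admin/ping
                System.out.println("status=" + rsp.getStatus() + " QTime=" + rsp.getQTime() + "ms");
            } finally {
                solr.shutdown();
            }
        }
    }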

I'd definitely remove the ping per request (wouldn't the fact that you are
doing /select serve as the ping, and hence defeat the purpose of the ping
query?) and definitely do the frequent ping as we are describing if you want
to have your Solr boxes behind some load balancer.


On Sun, Mar 3, 2013 at 8:21 AM, Shawn Heisey s...@elyograg.org wrote:

 On 3/3/2013 2:15 AM, adm1n wrote:

 I'm wonderring how frequent this query should be made. Currently it is
 done
 before each select request (some very old legacy). I googled a little and
 found out that it is bad practice and has performance impact. So the
 question is should I completely remove it or just do it once in some
 period
 of time.


 Can you point me at the place where it says that it's bad practice to do
 frequent pings?  I use the ping functionality in my haproxy load balancer
 that sits in front of Solr.  It executes a ping request against all my Solr
 instances every five seconds.  Most of the time, the ping request (which is
 distributed) finishes in single-digit milliseconds. If that is considered
 bad practice, I want to figure out why and submit issues to get the problem
 fixed.

 I can imagine that sending a ping before every query would be a bad idea,
 but I am hoping that the way I'm using it is OK.

 The only problem with ping requests that I have ever noticed was caused by
 long garbage collection pauses on my 8GB Solr heap.  Those pauses caused
 the load balancer to incorrectly mark the active Solr instance(s) as down
 and send requests to a backup.

 Through experimentation with -XX memory tuning options, I have now
 eliminated the GC pause problem.  For machines running Solr 4.2-SNAPSHOT, I
 have reduced the heap to 6GB; the 3.5.0 machines are still running with 8GB.

 Thanks,
 Shawn




Re: Poll: SolrCloud vs. Master-Slave usage

2013-03-01 Thread Amit Nithian
But does that mean that in SolrCloud, slave nodes are busy indexing
documents?


On Fri, Mar 1, 2013 at 5:37 AM, Michael Della Bitta 
michael.della.bi...@appinions.com wrote:

 Amit,

 NRT is not possible in a master-slave setup because of the necessity
 of a hard commit and replication, both of which add considerable
 delay.

 Solr Cloud sends each document for a given shard to each node hosting
 that shard, so there's no need for the hard commit and replication for
 visibility.

 You could conceivably get NRT on a single node without Solr Cloud, but
 there would be no redundancy.

 Michael Della Bitta

 
 Appinions
 18 East 41st Street, 2nd Floor
 New York, NY 10017-6271

 www.appinions.com

 Where Influence Isn’t a Game


 On Fri, Mar 1, 2013 at 1:22 AM, Amit Nithian anith...@gmail.com wrote:
  Erick,
 
  Well put and thanks for the clarification. One question:
  And if you need NRT, you just can't get it with traditional M/S setups.
  == Can you explain how that works with SolrCloud?
 
  I agree with what you said too because there was an article or
 discussion I
  read that said having high-availability masters requires some fairly
  complicated setups and I guess I am under-estimating how
  expensive/complicated our setup is relative to what you can get out of
 the
  box with SolrCloud.
 
  Thanks!
  Amit
 
 
  On Thu, Feb 28, 2013 at 6:29 PM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Amit:
 
  It's a balancing act. If I was starting fresh, even with one shard, I'd
  probably use SolrCloud rather than deal with the issues around the "how do
  I recover if my master goes down" question. Additionally, SolrCloud allows
  one to monitor the health of the entire system by monitoring the state
  information kept in Zookeeper rather than build a monitoring system that
  understands the changing topology of your network.
 
  And if you need NRT, you just can't get it with traditional M/S setups.
 
  In a mature production system where all the operational issues are
 figured
  out and you don't need NRT, it's easier just to plop 4.x in traditional
 M/S
  setups and not go to SolrCloud. And you're right, you have to understand
  Zookeeper which isn't all that difficult, but is another moving part and
  I'm a big fan of keeping the number of moving parts down if possible.
 
  It's not a one-size-fits-all situation. From what you've described, I
 can't
  say there's a compelling reason to do the SolrCloud thing. If you find
  yourself spending lots of time building monitoring or High
  Availability/Disaster Recovery tools, then you might find the
 cost/benefit
  analysis changing.
 
  Personally, I think it's ironic that the memory improvements that came
  along _with_ SolrCloud make it less necessary to shard. Which means that
  traditional M/S setups will suit more people longer <g>
 
  Best
  Erick
 
 
  On Thu, Feb 28, 2013 at 8:22 PM, Amit Nithian anith...@gmail.com
 wrote:
 
   I don't know a ton about SolrCloud but for our setup and my limited
   understanding of it is that you start to bleed operational and
   non-operational aspects together which I am not comfortable doing
 (i.e.
   software load balancing). Also adding ZooKeeper to the mix is yet
 another
   thing to install, setup, monitor, maintain etc which doesn't add any
  value
   above and beyond what we have setup already.
  
   For example, we have a hardware load balancer that can do the actual
 load
   balancing of requests among the slaves and taking slaves in and out of
   rotation either on demand or if it's down. We've placed a virtual IP
 on
  top
   of our multiple masters so that we have redundancy there. While we
 have
   multiple cores, the data volume is large enough to fit on one node so
 we
   aren't at the data volume necessary for sharding our indices. I
 suspect
   that if we had a sufficiently large dataset that couldn't fit on one
 box
   SolrCloud is perfect but when you can fit on one box, why add more
   complexity?
  
   Please correct me if I'm wrong for I'd like to better understand this!
  
  
  
  
   On Thu, Feb 28, 2013 at 12:53 AM, rulinma ruli...@gmail.com wrote:
  
I am doing research on SolrCloud.
   
   
   
   
  
 



Re: Poll: SolrCloud vs. Master-Slave usage

2013-02-28 Thread Amit Nithian
I don't know a ton about SolrCloud but for our setup and my limited
understanding of it is that you start to bleed operational and
non-operational aspects together which I am not comfortable doing (i.e.
software load balancing). Also adding ZooKeeper to the mix is yet another
thing to install, setup, monitor, maintain etc which doesn't add any value
above and beyond what we have setup already.

For example, we have a hardware load balancer that can do the actual load
balancing of requests among the slaves and taking slaves in and out of
rotation either on demand or if it's down. We've placed a virtual IP on top
of our multiple masters so that we have redundancy there. While we have
multiple cores, the data volume is small enough to fit on one node, so we
aren't at the data volume necessary for sharding our indices. I suspect
that if we had a sufficiently large dataset that couldn't fit on one box
SolrCloud is perfect but when you can fit on one box, why add more
complexity?

Please correct me if I'm wrong for I'd like to better understand this!




On Thu, Feb 28, 2013 at 12:53 AM, rulinma ruli...@gmail.com wrote:

 I am doing research on SolrCloud.






Re: Poll: SolrCloud vs. Master-Slave usage

2013-02-28 Thread Amit Nithian
Erick,

Well put and thanks for the clarification. One question:
And if you need NRT, you just can't get it with traditional M/S setups.
== Can you explain how that works with SolrCloud?

I agree with what you said too because there was an article or discussion I
read that said having high-availability masters requires some fairly
complicated setups and I guess I am under-estimating how
expensive/complicated our setup is relative to what you can get out of the
box with SolrCloud.

Thanks!
Amit


On Thu, Feb 28, 2013 at 6:29 PM, Erick Erickson erickerick...@gmail.com wrote:

 Amit:

 It's a balancing act. If I was starting fresh, even with one shard, I'd
 probably use SolrCloud rather than deal with the issues around the "how do
 I recover if my master goes down" question. Additionally, SolrCloud allows
 one to monitor the health of the entire system by monitoring the state
 information kept in Zookeeper rather than build a monitoring system that
 understands the changing topology of your network.

 And if you need NRT, you just can't get it with traditional M/S setups.

 In a mature production system where all the operational issues are figured
 out and you don't need NRT, it's easier just to plop 4.x in traditional M/S
 setups and not go to SolrCloud. And you're right, you have to understand
 Zookeeper which isn't all that difficult, but is another moving part and
 I'm a big fan of keeping the number of moving parts down if possible.

 It's not a one-size-fits-all situation. From what you've described, I can't
 say there's a compelling reason to do the SolrCloud thing. If you find
 yourself spending lots of time building monitoring or High
 Availability/Disaster Recovery tools, then you might find the cost/benefit
 analysis changing.

 Personally, I think it's ironic that the memory improvements that came
 along _with_ SolrCloud make it less necessary to shard. Which means that
 traditional M/S setups will suit more people longer <g>

 Best
 Erick


 On Thu, Feb 28, 2013 at 8:22 PM, Amit Nithian anith...@gmail.com wrote:

  I don't know a ton about SolrCloud but for our setup and my limited
  understanding of it is that you start to bleed operational and
  non-operational aspects together which I am not comfortable doing (i.e.
  software load balancing). Also adding ZooKeeper to the mix is yet another
  thing to install, setup, monitor, maintain etc which doesn't add any
 value
  above and beyond what we have setup already.
 
  For example, we have a hardware load balancer that can do the actual load
  balancing of requests among the slaves and taking slaves in and out of
  rotation either on demand or if it's down. We've placed a virtual IP on
 top
  of our multiple masters so that we have redundancy there. While we have
  multiple cores, the data volume is large enough to fit on one node so we
  aren't at the data volume necessary for sharding our indices. I suspect
  that if we had a sufficiently large dataset that couldn't fit on one box
  SolrCloud is perfect but when you can fit on one box, why add more
  complexity?
 
  Please correct me if I'm wrong for I'd like to better understand this!
 
 
 
 
  On Thu, Feb 28, 2013 at 12:53 AM, rulinma ruli...@gmail.com wrote:
 
   I am doing research on SolrCloud.
  
  
  
  
 



Re: numFound is not correct while using Result Grouping

2013-02-26 Thread Amit Nithian
I need to write some tests, which I hope to do tonight, and then I think
it'll get into 4.2.


On Tue, Feb 26, 2013 at 6:24 AM, Nicholas Ding nicholas...@gmail.com wrote:

 Thanks Amit, that's cool! So it will also be fixed on Solr 4.2, right?

 On Mon, Feb 25, 2013 at 6:04 PM, Amit Nithian anith...@gmail.com wrote:

  Yeah I had a similar problem. I filed and submitted this patch:
  https://issues.apache.org/jira/browse/SOLR-4310
 
  Let me know if this is what you are looking for!
  Amit
 
 
  On Mon, Feb 25, 2013 at 1:50 PM, Teun Duynstee t...@duynstee.com
 wrote:
 
   Ah, I see. The docs say "Although this result format does not have as
   much information, it may be easier for existing solr clients to parse."
   I guess the ngroups value could be added to this format, but apparently
   it isn't. I do agree with you that to be useful (as in possible to read
   for a client that doesn't know of the grouped format), the number should
   be that of the groups, not of the documents.
  
   A quick glance at the code shows that it is indeed not calculated in
   this case. But it's not completely trivial to fix. Could you use
   group.format=simple instead? That will work with ngroups.
  
   Teun
  
  
   2013/2/25 Nicholas Ding nicholas...@gmail.com
  
Thanks Teun and Carlos, I set group.ngroups=true, but I don't have
 this
ngroup number when I was using group.main = true.
   
On Mon, Feb 25, 2013 at 12:02 PM, Carlos Maroto 
cmar...@searchtechnologies.com wrote:
   
 Use group.ngroups, check it in the Solr wiki for FieldCollapsing

 Carlos Maroto
 Search Architect at Search Technologies (
 www.searchtechnologies.com)



 Nicholas Ding nicholas...@gmail.com wrote:


 Hello,

 I grouped the result, and set group.main=true. I was expecting the
numFound
 equals to the number of groups, but actually it was not.

 How do I get the number of groups?

 Thanks
 Nicholas

   
  
 



Re: numFound is not correct while using Result Grouping

2013-02-25 Thread Amit Nithian
Yeah I had a similar problem. I filed and submitted this patch:
https://issues.apache.org/jira/browse/SOLR-4310

Let me know if this is what you are looking for!
Amit


On Mon, Feb 25, 2013 at 1:50 PM, Teun Duynstee t...@duynstee.com wrote:

 Ah, I see. The docs say "Although this result format does not have as much
 information, it may be easier for existing solr clients to parse." I guess
 the ngroups value could be added to this format, but apparently it isn't. I
 do agree with you that to be useful (as in possible to read for a client
 that doesn't know of the grouped format), the number should be that of the
 groups, not of the documents.

 A quick glance at the code shows that it is indeed not calculated in this
 case. But it's not completely trivial to fix. Could you use
 group.format=simple instead? That will work with ngroups.

 Teun


 2013/2/25 Nicholas Ding nicholas...@gmail.com

  Thanks Teun and Carlos, I set group.ngroups=true, but I don't have this
  ngroup number when I was using group.main = true.
 
  On Mon, Feb 25, 2013 at 12:02 PM, Carlos Maroto 
  cmar...@searchtechnologies.com wrote:
 
   Use group.ngroups, check it in the Solr wiki for FieldCollapsing
  
   Carlos Maroto
   Search Architect at Search Technologies (www.searchtechnologies.com)
  
  
  
   Nicholas Ding nicholas...@gmail.com wrote:
  
  
   Hello,
  
   I grouped the result, and set group.main=true. I was expecting the
  numFound
   equals to the number of groups, but actually it was not.
  
   How do I get the number of groups?
  
   Thanks
   Nicholas
  
 



Re: [ANN] vifun: tool to help visually tweak Solr boosting

2013-02-25 Thread Amit Nithian
This is cool! I had done something similar except changing via JConsole/JMX:
https://issues.apache.org/jira/browse/SOLR-2306

We had something not as nice at Zvents but I wanted to expose these as
MBean properties so you could change them via any JMX UI like JVisualVM

Cheers!
Amit


On Mon, Feb 25, 2013 at 2:36 PM, jmlucjav jmluc...@gmail.com wrote:

 Apologies... the instructions were wrong about the cd; these commands are to
 be run at the top level of the project. I fixed the doc to read:

 cd vifun
 griffon run-app



 On Mon, Feb 25, 2013 at 10:45 PM, Jan Høydahl jan@cominvent.com
 wrote:

  Hi,
 
  I actually tried ../griffonw run-app but it says "griffon-app does not
  appear to be part of a Griffon application."

  I installed griffon and tried griffon run-app again inside of
  griffon-app, but got the same error.
 
  --
  Jan Høydahl, search solution architect
  Cominvent AS - www.cominvent.com
  Solr Training - www.solrtraining.com
 
  25. feb. 2013 kl. 19:51 skrev jmlucjav jmluc...@gmail.com:
 
   Jan, thanks for looking at this!
  
   - Running from source: would you care to send me the error you get (if
  any)
   when running from source? I assume you have griffon1.1.0 installed
 right?
  
   - Binary dist: the distrib is created by griffon, so I'll check if the
   permission issue (I develop on windows, and tested on a clean windows
  too,
   so I don't face the issue you mention) is known or can be fixed
 somehow.
   I'll update the doc anyway.
  
   - wt param: I am already overriding wt param (in order to use javabin).
   What I didn't allow is to choose the handler to be used when submitting
  the
   query. I guess any handler that does not have appends/invariants
 that
   would interfere would work fine, I just thought /select is mostly
  available
   in most installations and that is one thing less to configure. But
 yes, I
   could let the user configure it, I'll open an issue.
  
   xavier
  
   On Mon, Feb 25, 2013 at 3:10 PM, Jan Høydahl jan@cominvent.com
  wrote:
  
   Cool. I tried running from source (using the bundled griffonw), but I
   think the instructions may be wrong; I had to download the binary dist.
   The file permissions for bin/vifun in the binary dist should have +x so
   you can execute it with ./vifun
  
   What about the ability to override the wt param, so that you can
 point
   it to the /browse handler directly?
  
   --
   Jan Høydahl, search solution architect
   Cominvent AS - www.cominvent.com
   Solr Training - www.solrtraining.com
  
   23. feb. 2013 kl. 15:12 skrev jmlucjav jmluc...@gmail.com:
  
   Hi,
  
   I have built a small tool to help me tweak some params in Solr (typically
   qf, bf in edismax). As others may find it useful, I am open-sourcing it
   on GitHub: https://github.com/jmlucjav/vifun
  
   Check github for some more info and screenshots. I include part of
 the
   github page below.
   regards
  
   Description
  
   Did you ever spend lots of time trying to tweak all the numbers in an
   *edismax* handler's *qf*, *bf*, etc. params so docs get scored to your
   liking? Imagine you have the params below: is 20 the right boost for
   *name*, or is it too much? Is *population* being boosted too much versus
   distance? What about new documents?
  
  <!-- fields, boost some -->
  <str name="qf">name^20 textsuggest^10 edge^5 ngram^2 phonetic^1</str>
  <str name="mm">33%</str>
  <!-- boost closest hits -->
  <str name="bf">recip(geodist(),1,500,0)</str>
  <!-- boost by population -->
  <str name="bf">product(log(sum(population,1)),100)</str>
  <!-- boost newest docs -->
  <str name="bf">recip(rord(moddate),1,1000,1000)</str>
  
   This tool was developed in order to help me tweak the values of boosting
   functions etc. in Solr, typically when using the edismax handler. If you
   are fed up with: change a number a bit, restart Solr, run the same query
   to see how documents are scored now... then this tool is for you.
   Features: https://github.com/jmlucjav/vifun#features
  
- Can tweak numeric values in the following params: *qf, pf, bf, bq,
boost, mm* (others can be easily added) even in *appends or
invariants*
- View side by side a Baseline query result and how it changes when
  you
gradually change each value in the params
  - Colorized values; the color depends on how the document does relative
  to the baseline query
- Tooltips give you Explain info
- Works on remote Solr installations
- Tested with Solr 3.6, 4.0 and 4.1 (other versions would work too,
 as
long as wt=javabin format is compatible)
- Developed using Groovy/Griffon
  
   Requirements: https://github.com/jmlucjav/vifun#requirements
  
- */select* handler should be available, and not have any *appends
  or
invariants*, as it could interfere with how vifun works.
- Java6 is needed (maybe it runs on Java5 too). A JRE should be
  enough.
  
   Getting started: https://github.com/jmlucjav/vifun#getting-started

Re: Slaves always replicate entire index & Index versions

2013-02-21 Thread Amit Nithian
A few others have posted about this too apparently and SOLR-4413 is the
root problem. Basically what I am seeing is that if your index directory is
not index/ but rather index.timestamp set in the index.properties a new
index will be downloaded all the time because the download is expecting
your index to be in solr_data_dir/index. Sounds like a quick solution
might be to rename your index directory to just index and see if the
problem goes away.

To confirm, look at line 728 in the SnapPuller.java file (in
downloadIndexFiles)

I am hoping that the patch and a more unified getIndexDir can be added to
the next release of Solr as this is a fairly significant bug to me.

Cheers
Amit

On Thu, Feb 21, 2013 at 12:56 AM, Amit Nithian anith...@gmail.com wrote:

 So the diff in generation numbers is due to the commits, I believe, that
 Solr does when it has the new index files, but the fact that it's
 downloading a new index each time is baffling and I just noticed that too
 (hit the replicate button and noticed a full index download). I'm going to
 pop into the source and see what's going on, unless there's a known bug
 filed about this?


 On Tue, Feb 19, 2013 at 1:48 AM, Raúl Grande Durán 
 raulgrand...@hotmail.com wrote:


 Hello.
 We have recently updated our Solr from 3.5 to 4.1 and everything is
 running perfectly except the replication between nodes. We have a
 master-repeater-2 slaves architecture and we have seen some things that
 weren't happening before:
 When a slave (repeater or slaves) starts to replicate, it needs to
 download the entire index, even when only some little changes have been
 made to the index at the master. This takes a long time since our index is
 more than 20 GB. After a replication cycle we have different index
 generations in master, repeater and slaves. For example:
   Master: gen. 64590
   Repeater: gen. 64591
   Both slaves: gen. 64592
 My replicationHandler configuration is like this:

 <requestHandler name="/replication" class="solr.ReplicationHandler">
   <lst name="master">
     <str name="enable">${enable.master:false}</str>
     <str name="replicateAfter">commit</str>
     <str name="replicateAfter">startup</str>
     <str name="confFiles">schema.xml,stopwords.txt</str>
   </lst>
   <lst name="slave">
     <str name="enable">${enable.slave:false}</str>
     <str name="masterUrl">${solr.master.url:http://localhost/solr}</str>
     <str name="pollInterval">00:03:00</str>
   </lst>
 </requestHandler>
 Our problems are very similar to those explained here:
 http://lucene.472066.n3.nabble.com/Problem-with-replication-td2294313.html
 Any ideas?? Thanks





Re: Slaves always replicate entire index & Index versions

2013-02-21 Thread Amit Nithian
Thanks for the links... I have updated SOLR-4471 with a proposed solution
that I hope can be incorporated or amended so we can get a clean fix into
the next version so our operations and network staff will be happier with
not having gigs of data flying around the network :-)


On Thu, Feb 21, 2013 at 1:24 AM, raulgrande83 raulgrand...@hotmail.com wrote:

 Hi Amit,

 I have came across some JIRAs that may be useful in this issue:
 https://issues.apache.org/jira/browse/SOLR-4471
 https://issues.apache.org/jira/browse/SOLR-4354
 https://issues.apache.org/jira/browse/SOLR-4303
 https://issues.apache.org/jira/browse/SOLR-4413
 https://issues.apache.org/jira/browse/SOLR-2326

 Please, let us know if you find any solution.

 Regards.






Re: Slaves always replicate entire index & Index versions

2013-02-21 Thread Amit Nithian
Sounds good. I am trying the combination of my patch and SOLR-4413 now to see
how it works, and will have to see if I can put unit tests around them, as
some of what I thought may not be true with respect to the commit generation
numbers.

For your issue above in your last post, is it possible that there was a
commit on the master in that slight window after solr checks for the latest
generation of the master but before it downloads the actual files? How
frequent are the commits on your master?


On Thu, Feb 21, 2013 at 2:00 AM, raulgrande83 raulgrand...@hotmail.com wrote:

 Thanks for the patch, we'll try to install these fixes and post if
 replication works or not.

 I renamed 'index.timestamp' folders to just 'index' but it didn't work.
 These lines appeared in the log:
 INFO: Master's generation: 64594
 21-feb-2013 10:42:00 org.apache.solr.handler.SnapPuller fetchLatestIndex
 INFO: Slave's generation: 64593
 21-feb-2013 10:42:00 org.apache.solr.handler.SnapPuller fetchLatestIndex
 INFO: Starting replication process
 21-feb-2013 10:42:00 org.apache.solr.handler.SnapPuller fetchFileList
 SEVERE: No files to download for index generation: 64594






Re: Anyone else see this error when running unit tests?

2013-02-14 Thread Amit Nithian
Okay, so I think I found a solution. If you are a Maven user and don't
mind forcing the test codec to Lucene40, then do the following:

Add this to your pom.xml under the <build><pluginManagement><plugins> section:

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-surefire-plugin</artifactId>
    <version>2.13</version>
    <configuration>
      <argLine>-Dtests.codec=Lucene40</argLine>
    </configuration>
  </plugin>


If you are running in Eclipse, simply add this as a VM argument. The
default test codec is set to "random", and this means that there is a
possibility of picking Lucene3x if some random variable is < 2 and other
conditions are met. For me, my test-framework jar must not have been ahead of
the lucene one (b/c I don't control the classpath order, and honestly this
shouldn't be a requirement to run a test), so it periodically bombed. This
little fix seems to have helped, provided that you don't care about Lucene3x
vs Lucene40 for your tests (I am on Lucene40 so it's fine for me).

HTH!

Amit


On Mon, Feb 4, 2013 at 6:18 PM, Roman Chyla roman.ch...@gmail.com wrote:

 Me too, it fails randomly with test classes. We use Solr4.0 for testing, no
 maven, only ant.
 --roman
 On 4 Feb 2013 20:48, Mike Schultz mike.schu...@gmail.com wrote:

  Yes. Just today actually. I had some unit tests based on
  AbstractSolrTestCase which worked in 4.0 but in 4.1 they would fail
  intermittently with that error message.  The key to this behavior is
 found
  by looking at the code in the lucene class:
  TestRuleSetupAndRestoreClassEnv.
  I don't understand it completely but there are a number of random code
  paths
  through there.  The following helped me get around the problem, at least
 in
  the short term.
 
 
 
 @org.apache.lucene.util.LuceneTestCase.SuppressCodecs({"Lucene3x","Lucene40"})
  public class CoreLevelTest extends AbstractSolrTestCase {

  I also need to call this inside my setUp() method; in 4.0 this wasn't
  required.
  initCore("solrconfig.xml", "schema.xml", "/tmp/my-solr-home");
 
 
 
 



Re: replication problems with solr4.1

2013-02-14 Thread Amit Nithian
I may be missing something but let me go back to your original statements:
1) You build the index once per week from scratch
2) You replicate this from master to slave.

My understanding of the way replication works is that it's meant to only
send along files that are new; if any files with the same name on the
master and slave have different sizes, then this is treated as a corruption
of sorts, and it does the index.timestamp thing and sends the full index
down. This, I think, explains your index.timestamp issue, although why the
old index/ directory isn't being deleted I'm not sure about. This is why I
was asking about OS details, file system details etc. (perhaps something
else is locking that directory, preventing Java from deleting it?).

The second issue is the index generation which is governed by commits and
is represented by looking at the last few characters in the segments_XX
file. When the slave downloads the index and does the copy of the new
files, it does a commit to force a new searcher hence why the slave
generation will be +1 from the master.

The index version is a timestamp, and it may be the case that the version
represents the point in time when the index was downloaded to the slave. In
general, these details shouldn't matter, because replication is
only triggered if the master's version > the slave's version, and the clocks
that all servers use are synched to some common clock.

Caveat however in my answer is that I have yet to try 4.1 as this is next
on my TODO list so maybe I'll run into the same problem :-) but I wanted to
provide some info as I just recently dug through the replication code to
understand it better myself.

Cheers
Amit


On Wed, Feb 13, 2013 at 11:57 PM, Bernd Fehling 
bernd.fehl...@uni-bielefeld.de wrote:

 OK, then index generation and index version are out of the question when it
 comes to verifying that the master and slave index are in sync.

 What else is possible?

 The strange thing is if master is 2 or more generations ahead of slave
 then it works!
 With your logic the slave must _always_ be one generation ahead of the
 master,
 because the slave replicates from master and then does an additional commit
 to recognize the changes on the slave.
 This implies that the slave acts as follows:
 - if the master is one generation ahead then do an additional commit
 - if the master is 2 or more generations ahead then do _no_ commit
 OR
 - if the master is 2 or more generations ahead then do a commit but don't
   change generation and version of index

 Can this be true?

 I would say not really.

 Regards
 Bernd


 Am 13.02.2013 20:38, schrieb Amit Nithian:
  Okay so then that should explain the generation difference of 1 between
 the
  master and slave
 
 
  On Wed, Feb 13, 2013 at 10:26 AM, Mark Miller markrmil...@gmail.com
 wrote:
 
 
  On Feb 13, 2013, at 1:17 PM, Amit Nithian anith...@gmail.com wrote:
 
  doesn't it do a commit to force solr to recognize the changes?
 
  yes.
 
  - Mark
 
 



Re: Boost Specific Phrase

2013-02-13 Thread Amit Nithian
Have you looked at the pf parameter for the dismax handlers? pf does, I
think, what you are looking for, which is to boost documents with the query
terms matching exactly (as a phrase, with some slop) in the various fields.
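
A minimal SolrJ sketch of that (the field names and boost values are
hypothetical, just to show where pf and ps go):

    import org.apache.solr.client.solrj.SolrQuery;

    public class PhraseBoostQuery {
        public static void main(String[] args) {
            SolrQuery q = new SolrQuery("project manager in India with 2 yrs experience");
            q.set("defType", "edismax");
            q.set("qf", "title^2 body"); // fields the individual terms must match
            q.set("pf", "title^10");     // boost docs where the terms appear together as a phrase
            q.set("ps", "2");            // allow a little phrase slop
            System.out.println(q);       // prints the encoded parameter string
        }
    }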


On Wed, Feb 13, 2013 at 2:59 AM, Hemant Verma hemantverm...@gmail.com wrote:

 Hi All

 I have a use case with phrase search.

 Let's say I have a list of phrases in a file/dictionary which are important
 for our search content.
 One entry in the dictionary is, let's say, 'project manager'.
 If the user's query contains any entry specified in the dictionary, then I
 want to boost the score of documents which have an exact match of that entry.

 Let's take one example:

 Now suppose the user searches for (project manager in India with 2 yrs
 experience).
 The words 'project manager' appear in the query in the exact order specified
 in the dictionary, so I want to boost the score of documents having 'project
 manager' as an exact match.

 This can be done at the web application level, after processing the user
 query against the dictionary, by creating a query as below:
 q=project manager in India with 2 yrs experience&qf=title&bq=title:"project
 manager"^5

 I want to know whether there is any better solution available for this use
 case at the Solr level.

 AFAIK there is something very similar available in FAST ESP, known as Phrase
 Recognition.

 Thanks
 Hemant






Re: what do you use for testing relevance?

2013-02-13 Thread Amit Nithian
Ultimately this is dependent on what your metrics for success are. For some
places it may be just raw CTR (did my click through rate increase) but for
other places it may be a function of money (either it may be gross revenue,
profits, # items sold etc). I don't know if there is a generic answer for
this question which is leading those to write their own frameworks b/c it's
very specific to your needs. A scoring change that leads to an increase in
CTR may not necessarily lead to an increase in the metric that makes your
business go.


On Tue, Feb 12, 2013 at 10:31 PM, Steffen Elberg Godskesen 
steffen.godske...@gmail.com wrote:


 Hi Roman,

 If you're looking for regression testing then
 https://github.com/sul-dlss/rspec-solr might be worth looking at. If
 you're not a ruby shop, doing something similar in another language
 shouldn't be too hard.


 The basic idea is that you setup a set of tests like

 If the query is X, then the document with id Y should be in the first 10
 results
 If the query is S, then a document with title T should be the first
 result
 If the query is P, then a document with author Q should not be in the
 first 10 results

 and that you run these whenever you tune your scoring formula to ensure
 that you haven't introduced unintended effects. New ideas/requirements for
 your relevance ranking should always result in writing new tests - that
 will probably fail until you tune your scoring formula. This is certainly
 no magic bullet, but it will give you some confidence that you didn't make
 things worse. And - in my humble opinion - it also gives you the benefit of
 discouraging you from tuning your scoring just for fun. To put it bluntly:
 if you cannot write up a requirement in form of a test, you probably have
 no need to tune your scoring.
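
In Java, one of those checks might look like this minimal JUnit/SolrJ sketch;
the URL, query, and expected id are hypothetical:

    import static org.junit.Assert.assertTrue;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;
    import org.junit.Test;

    public class RelevancyRegressionTest {
        private final HttpSolrServer solr =
                new HttpSolrServer("http://localhost:8983/solr/collection1");

        @Test
        public void queryXReturnsDocYInTop10() throws Exception {
            SolrQuery q = new SolrQuery("session management in php").setRows(10);
            boolean found = false;
            for (SolrDocument doc : solr.query(q).getResults()) {
                if ("Y".equals(doc.getFieldValue("id"))) { found = true; break; }
            }
            assertTrue("doc Y should be in the first 10 results", found);
        }
    }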


 Regards,

 --
 Steffen



 On Tuesday, February 12, 2013 at 23:03 , Roman Chyla wrote:

  Hi,
  I do realize this is a very broad question, but still I need to ask it.
  Suppose you make a change to the scoring formula. How do you
  test/know/see what impact it had? Any framework out there?
 
  It seems like people are writing their own tools to measure relevancy.
 
  Thanks for any pointers,
 
  roman





Re: replication problems with solr4.1

2013-02-13 Thread Amit Nithian
So just a hunch... but when the slave downloads the data from the master,
doesn't it do a commit to force solr to recognize the changes? In so doing,
wouldn't that increase the generation number? In theory it shouldn't matter
because the replication looks for files that are different to determine
whether or not to do a full download or a partial replication. In the event
of a full replication (an optimize would cause this), I think the
replication handler considers this a corruption and forces a full
download into this index.timestamp folder with the index.properties
pointing at this folder to tell solr this is the new index directory. Since
you mentioned you rebuild the index from scratch once per week I'd expect
to see this behavior you are mentioning.

I remember debugging the code to find out how replication works in 4.0
because of a bug that was fixed in 4.1 but I haven't read through the 4.1
code to see how much (if any) has changed from this logic.

In short, I don't know why you'd have the old index/ directory there...
that seems like either a bug or something locking that directory in the
filesystem preventing it from being removed. What OS are you using and is
the index/ directory stored on a local file system vs NFS?

HTH
Amit


On Tue, Feb 12, 2013 at 2:26 AM, Bernd Fehling 
bernd.fehl...@uni-bielefeld.de wrote:


 Now this is strange: the index generation and index version
 are changing with replication.

 e.g. master has index generation 118, index version 136059533234,
 and slave has index generation 118, index version 136059533234;
 both are the same.

 Now add one doc to master with commit.
 master has index generation 119 index version 1360595446556

 Next replicate master to slave. The result is:
 master has index generation 119 index version 1360595446556
 slave  has index generation 120 index version 1360595564333

 I have not seen this before.
 I thought replication is just taking over the index from master to slave,
 more like a sync?




 Am 11.02.2013 09:29, schrieb Bernd Fehling:
  Hi list,
 
  after upgrading from solr4.0 to solr4.1 and running it for two weeks now
  it turns out that replication has problems and unpredictable results.
  My installation is a single index, 41 million docs / 115 GB index size / 1
 master / 3 slaves.
  - the master builds a new index from scratch once a week
  - a replication is started manually with Solr admin GUI
 
  What I see is one of these cases:
  - after a replication a new searcher is opened on index.xxx
 directory and
the old data/index/ directory is never deleted and besides the file
replication.properties there is also a file index.properties
  OR
  - the replication takes place everything looks fine but when opening the
 admin GUI
the statistics report
  Last Modified: a day ago
  Num Docs: 42262349
  Max Doc:  42262349
  Deleted Docs:  0
  Version:  45174
  Segment Count: 1
 
            Version        Gen  Size
   Master:  1360483635404  112  116.5 GB
   Slave:   1360483806741  113  116.5 GB
 
 
  In the first case, why is the replication doing that???
  It is an offline slave, no search activity, just there fore backup!
 
 
  In the second case, why are the version and generation different right
  after a full replication?
 
 
  Any thoughts on this?
 
 
  - Bernd
 

 --
 *
 Bernd FehlingBielefeld University Library
 Dipl.-Inform. (FH)LibTec - Library Technology
 Universitätsstr. 25  and Knowledge Management
 33615 Bielefeld
 Tel. +49 521 106-4060   bernd.fehling(at)uni-bielefeld.de

 BASE - Bielefeld Academic Search Engine - www.base-search.net
 *



Re: replication problems with solr4.1

2013-02-13 Thread Amit Nithian
Okay, so then that should explain the generation difference of 1 between the
master and slave.


On Wed, Feb 13, 2013 at 10:26 AM, Mark Miller markrmil...@gmail.com wrote:


 On Feb 13, 2013, at 1:17 PM, Amit Nithian anith...@gmail.com wrote:

  doesn't it do a commit to force solr to recognize the changes?

 yes.

 - Mark



Re: Boost Specific Phrase

2013-02-13 Thread Amit Nithian
Ah yes, sorry, misunderstood. Another option is to use word n-grams
(shingles) so that "projectmanager" is a term; any query involving "project
manager in India with 2 years experience" would then match higher, because
the query would contain "projectmanager" as a term.
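
A quick sketch of that shingling idea using the Lucene 4.x analysis classes
(this is the filter that Solr's ShingleFilterFactory wraps); the input text is
just an example:

    import java.io.StringReader;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class ShingleDemo {
        public static void main(String[] args) throws Exception {
            WhitespaceTokenizer tok = new WhitespaceTokenizer(
                    Version.LUCENE_41, new StringReader("project manager in india"));
            ShingleFilter shingles = new ShingleFilter(tok, 2, 2); // word bigrams
            shingles.setTokenSeparator(""); // "project"+"manager" -> "projectmanager"
            CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
            shingles.reset();
            while (shingles.incrementToken()) {
                System.out.println(term.toString()); // unigrams are emitted too, by default
            }
            shingles.end();
            shingles.close();
        }
    }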


On Wed, Feb 13, 2013 at 9:56 PM, Hemant Verma hemantverm...@gmail.com wrote:

 Thanks for the response.

 The pf parameter actually boosts the documents considering all search
 keywords mentioned in the main query, but I am looking for something which
 boosts the documents considering only a few search keywords from the user
 query. Like, as per the example, the user query is (project manager in India
 with 2 yrs experience) and my dictionary contains one entry, 'project
 manager', which specifies that if the user's query has 'project manager' in
 it, then boost those documents which contain 'project manager' as an exact
 match.






Re: Benefits of Solr over Lucene?

2013-02-12 Thread Amit Jha
To add to Jack's reply, Solr can also be embedded into the application and run
in the same process: Solr is the server-ization of Lucene. The line is very
blurred, and Solr is not just a very thin wrapper around the Lucene library.
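
A minimal sketch of that embedded mode via SolrJ's EmbeddedSolrServer; the Solr
home path and core name are placeholders, and the CoreContainer bootstrap shown
here is the 4.3+ style (earlier 4.x releases differ slightly):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.core.CoreContainer;

    public class EmbeddedExample {
        public static void main(String[] args) throws Exception {
            CoreContainer container = new CoreContainer("/path/to/solr/home"); // placeholder
            container.load();
            EmbeddedSolrServer solr = new EmbeddedSolrServer(container, "collection1");
            // same SolrJ API as over HTTP, but in-process, no web server involved
            System.out.println(solr.query(new SolrQuery("*:*")).getResults().getNumFound());
            solr.shutdown(); // also shuts down the core container
        }
    }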

Most Solr features are distinct from Lucene, for example:

- a detailed breakdown of the scoring mathematics
- text analysis phases
- Solr adds to Lucene's text analysis library and makes it configurable
  through XML
- the notion of field types
- runtime performance stats, including cache hit/miss rates
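
A minimal sketch of the embedding point above (Solr 4.x API; the solr home
path and core name are hypothetical, and the home directory must contain a
valid solr.xml plus core config):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.core.CoreContainer;

    public class EmbeddedDemo {
        public static void main(String[] args) throws Exception {
            // Solr runs inside this JVM; no HTTP, no servlet container.
            CoreContainer container = new CoreContainer("/path/to/solr/home");
            container.load();
            SolrServer server = new EmbeddedSolrServer(container, "collection1");
            QueryResponse rsp = server.query(new SolrQuery("*:*"));
            System.out.println("numFound=" + rsp.getResults().getNumFound());
            server.shutdown();
        }
    }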


Rgds
AJ

On 12-Feb-2013, at 22:17, Jack Krupansky j...@basetechnology.com wrote:

 Here's yet another short list of benefits of Solr over Lucene (not that any 
 of them take away from Lucene since Solr is based on Lucene):
 
 - Multiple core index - go beyond the limits of a single Lucene index
 - Support for multi-core or named collections
 - richer query parsers (e.g., schema-aware, edismax)
 - schema language, including configurable field types and configurable 
 analyzers
 - easier to do per-field/type analysis
 - plugin architecture, easily configured and customized
 - Generally, develop a search engine without writing any code, and what code 
 you may write is mostly easily configured plugins
 - Editable configuration file rather than hard-coded or app-specific 
 properties
 - Tomcat/Jetty container support enable system administration as corporate IT 
 ops teams already know it
 - Web-based Admin UI, including debugging features such as field/type analysis
 - Solr search features are available to any app written in any language, not 
 just Java. All you need is HTTP access. (Granted, there is SOME support for 
 Lucene in SOME other languages.)
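 
 A quick sketch of that point, querying Solr with nothing but an HTTP GET
 (hypothetical host and core name; assuming the default /select handler):
 
     import java.io.BufferedReader;
     import java.io.InputStreamReader;
     import java.net.URL;
     import java.nio.charset.StandardCharsets;
 
     public class PlainHttpQuery {
         public static void main(String[] args) throws Exception {
             // Any language that can issue an HTTP GET can do this; Java shown here.
             URL url = new URL("http://localhost:8983/solr/collection1/select"
                     + "?q=*:*&rows=1&wt=json");
             try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                 String line;
                 while ((line = in.readLine()) != null) {
                     System.out.println(line); // raw JSON response body
                 }
             }
         }
     }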
 
 In short, if you want to embed search engine capabilities in your Java app, 
 Lucene is the way to go, but if you want a web architecture, with the 
 search engine in a separate process from the app in a multi-tier 
 architecture, Solr is the way to go. Granted, you could also use 
 ElasticSearch or roll your own, but Solr basically runs right out of the 
 box with no code development needed to get started and no Java knowledge 
 needed.
 
 And to be clear, Solr is not simply an extension of Lucene - Solr is a 
 distinct architectural component that is based on Lucene. In OOP terms, think 
 of composition rather than derivation.
 
 -- Jack Krupansky
 
 -Original Message- From: JohnRodey
 Sent: Tuesday, February 12, 2013 10:40 AM
 To: solr-user@lucene.apache.org
 Subject: Benefits of Solr over Lucene?
 
 I know that Solr web-enables a Lucene index, but I'm trying to figure out
 what other things Solr offers over Lucene.  On the Solr features list it
 says "Solr uses the Lucene search library and extends it!", but what exactly
 are the extensions on that list and what does Lucene itself give you?  Also,
 if I have an index built through Solr, is there a non-HTTP way to search
 that index?  Because SolrJ essentially just makes HTTP requests, correct?
 
 Some features I'm particularly interested in are:
 Geospatial Search
 Highlighting
 Dynamic Fields
 Near Real-Time Indexing
 Multiple Search Indices
 
 Thanks!
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Benefits-of-Solr-over-Lucene-tp4039964.html
 Sent from the Solr - User mailing list archive at Nabble.com. 


Re: Solr HTTP Replication Question

2013-01-25 Thread Amit Nithian
Okay, one last note... just for closure... it looks like this was addressed in
Solr 4.1+ (I was looking at 4.0).


On Thu, Jan 24, 2013 at 11:14 PM, Amit Nithian anith...@gmail.com wrote:

 Okay, so after some debugging I found the problem. The replication piece
 will download the index from the master server and move the files to the
 index directory, but during the commit phase these older-generation files
 are deleted and the index is essentially left intact.

 I noticed that a full copy is triggered if the index is stale (meaning that
 files in common between the master and slave have different sizes), but I
 think a full copy should also be triggered if the slave's generation is
 higher than the master's. In short, it's not sufficient to simply say a
 full copy is needed if the slave's index version is >= the master's index
 version. I'll create a patch and file a bug along with a more thorough
 writeup of how I got into this state.

 Thanks!
 Amit



 On Thu, Jan 24, 2013 at 2:33 PM, Amit Nithian anith...@gmail.com wrote:

 Does Solr's replication look at the generation difference between master
 and slave when determining whether or not to replicate?

 To be more clear:
  What happens if a slave's generation is higher than the master's, yet the
  slave's index version is less than the master's index version?

 I looked at the source and didn't seem to see any reason why the
 generation matters other than fetching the file list from the master for a
 given generation. It's too wordy to explain how this happened so I'll go
 into details on that if anyone cares.

 Thanks!
 Amit





Re: Solr HTTP Replication Question

2013-01-24 Thread Amit Nithian
Okay, so after some debugging I found the problem. The replication piece
will download the index from the master server and move the files to the
index directory, but during the commit phase these older-generation files
are deleted and the index is essentially left intact.

I noticed that a full copy is triggered if the index is stale (meaning that
files in common between the master and slave have different sizes), but I
think a full copy should also be triggered if the slave's generation is
higher than the master's. In short, it's not sufficient to simply say a
full copy is needed if the slave's index version is >= the master's index
version. I'll create a patch and file a bug along with a more thorough
writeup of how I got into this state.
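
In rough Java, the decision I have in mind would look like this (a sketch
only, not Solr's actual SnapPuller code; the parameter names are invented):

    // Sketch of the full-copy decision; not Solr's real implementation.
    public class FullCopyDecision {
        static boolean isFullCopyNeeded(boolean commonFilesDifferInSize,
                                        long slaveVersion, long masterVersion,
                                        long slaveGeneration, long masterGeneration) {
            boolean stale = commonFilesDifferInSize;                       // existing check
            boolean versionAheadOrEqual = slaveVersion >= masterVersion;   // existing check
            boolean generationAhead = slaveGeneration > masterGeneration;  // proposed addition
            return stale || versionAheadOrEqual || generationAhead;
        }
    }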

Thanks!
Amit



On Thu, Jan 24, 2013 at 2:33 PM, Amit Nithian anith...@gmail.com wrote:

 Does Solr's replication look at the generation difference between master
 and slave when determining whether or not to replicate?

 To be more clear:
 What happens if a slave's generation is higher than the master's, yet the
 slave's index version is less than the master's index version?

 I looked at the source and didn't seem to see any reason why the
 generation matters other than fetching the file list from the master for a
 given generation. It's too wordy to explain how this happened so I'll go
 into details on that if anyone cares.

 Thanks!
 Amit


