Re: Getting page number of result with tika

2013-04-13 Thread Erick Erickson
You can't assume that Fix Version/s: 4.3 means anybody is actively working on it,
and the age of the patches suggests nobody is. The Fix Version/s gets updated
when releases are made; otherwise you'd have open JIRAs for, say, Solr 1.4.1.

As near as I can tell, that JIRA is dead; don't look for it unless
someone picks it up
again.

Best
Erick

On Thu, Apr 11, 2013 at 11:55 AM, Gian Maria Ricci
alkamp...@nablasoft.com wrote:
 As far as I know, SOLR-380 (https://issues.apache.org/jira/browse/SOLR-380)
 deals with the problem of knowing the page number when indexing with Tika.
 The issue contains a patch, but it is really old, and I'm curious what the
 status of this issue is (I see Fix Version/s: 4.3, so it seems that it will
 be implemented in the next version).



 Does anyone have a good workaround/patch/solution for searching
 Tika-indexed documents and getting the list of pages where the match was
 found?



 Thanks in advance.



 Gian Maria.





Re: Approximately needed RAM for 5000 query/second at a Solr machine?

2013-04-13 Thread Erick Erickson
bq: disk space is three times

True, I keep forgetting about compound since I never use it...

On Wed, Apr 10, 2013 at 11:05 AM, Walter Underwood
wun...@wunderwood.org wrote:
 Correct, except the worst case maximum for disk space is three times. --wunder

 On Apr 10, 2013, at 6:04 AM, Erick Erickson wrote:

 You're mixing up disk and RAM requirements when you talk
 about having twice the disk size. Solr does _NOT_ require
 twice the index size of RAM to optimize; it requires twice
 the size on _DISK_.

 In terms of RAM requirements, you need to create an index,
 run realistic queries at the installation and measure.

 Best
 Erick

 On Tue, Apr 9, 2013 at 10:32 PM, bigjust bigj...@lambdaphil.es wrote:



 On 4/9/2013 7:03 PM, Furkan KAMACI wrote:
 These are really good metrics for me:
 You say that RAM size should be at least the index size, and that it
 is better to have RAM twice the index size (because of the worst-case
 scenario).
 On the other hand, let's assume that I have more RAM than twice the
 index size on the machine. Can Solr use that extra RAM, or is twice
 the index size an approximate maximum?
 What we have been discussing is the OS cache, which is memory that
 is not used by programs.  The OS uses that memory to make everything
 run faster.  The OS will instantly give that memory up if a program
 requests it.
 Solr is a Java program, and Java uses memory a little differently,
 so Solr most likely will NOT use more memory when it is available.
 In a normal directly executable program, memory can be allocated
 at any time, and given back to the system at any time.
 With Java, you tell it the maximum amount of memory the program is
 ever allowed to use.  Because of how memory is used inside Java,
 most long-running Java programs (like Solr) will allocate up to the
 configured maximum even if they don't really need that much memory.
 Most Java virtual machines will never give the memory back to the
 system even if it is not required.
 Thanks, Shawn
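
 (As a concrete illustration of that cap - a minimal sketch, where the
 4g figure is an arbitrary assumption and start.jar is the stock Solr
 4.x example launcher:

   java -Xms4g -Xmx4g -jar start.jar

 -Xmx is the maximum heap Java will ever grow to, -Xms the initial size.)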


 Furkan KAMACI furkankam...@gmail.com writes:

 I am sorry but you said:

 *you need enough free RAM for the OS to cache the maximum amount of
 disk space all your indexes will ever use*

 I have made an assumption about the indexes on my machine. Let's
 assume they total 5 GB. So it is better to have at least 5 GB of RAM?
 OK, Solr will use RAM up to however much I allot to the Java process.
 When we think about the indexes on storage and the OS caching them in
 RAM, is that what you are talking about: having more than 5 GB - or
 10 GB - of RAM for my machine?

 2013/4/10 Shawn Heisey s...@elyograg.org


 10 GB.  Because when Solr shuffles the data around, it could use up to
 twice the size of the index in order to optimize the index on disk.

 -- Justin

 --
 Walter Underwood
 wun...@wunderwood.org





Re: Use of SolrJettyTestBase

2013-04-13 Thread Erick Erickson
I don't see anything obvious. Can you set a breakpoint in any other
test and hit it? It's always worked for me if I set a breakpoint and
execute in debug mode...

Not much help,
Erick

On Thu, Apr 11, 2013 at 5:01 PM, Upayavira u...@odoko.co.uk wrote:
 On Tue, Apr 2, 2013, at 12:21 AM, Chris Hostetter wrote:
 : I've subclassed SolrJettyTestBase, and added a test method (annotated
 : with @Test). However, my test method is never called. I see the

 You got an immediate failure from the test setup, because you don't have
 assertions enabled in your JVM (the Lucene & Solr test frameworks both
 require assertions to be enabled to run tests, because so many important
 things can't be sanity checked w/o them)...

 : Test class requires enabled assertions, enable globally (-ea) or for
 : Solr/Lucene subpackages only: com.odoko.ArgPostActionTest

 FYI: in addition to that text being written to System.err, it would have
 immediately been thrown as an Exception as well (see
 TestRuleAssertionsRequired.java).
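
 For example, as JVM arguments in the run configuration (an illustrative
 sketch; the trailing "..." is literal JVM syntax meaning "and
 subpackages", not an elision):

   -ea                                               (enable globally)
   -ea:org.apache.lucene... -ea:org.apache.solr...   (Lucene/Solr only)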

 So, I've finally found time to get past the enable assertions thingie.
 I've got that sorted. But my test still doesn't stop at breakpoints.

 I've got this:

 public class ArgPostActionTest extends SolrJettyTestBase {

   @BeforeClass
   public static void beforeTest() throws Exception {
     createJetty(ExternalPaths.EXAMPLE_HOME, null, null);
   }

   @Test
   public void testArgPostAction() throws SolrServerException {
     blah.blah.blah
     assertEquals(response.getResults().getNumFound(), 1);
   }
 }

 Neither of these methods gets called when I execute the test. Any ideas
 what's up?

 Upayavira


Re: Not able to replicate the solr 3.5 indexes to solr 4.2 indexes

2013-04-13 Thread Erick Erickson
Please make a JIRA and attach as a patch if there aren't any JIRAs
for this yet.

Best
Erick

On Fri, Apr 12, 2013 at 1:58 AM, Montu v Boda
montu.b...@highqsolutions.com wrote:
 hi

 thanks for your reply.

 is anyone going to fix this issue in a new Solr version? There are so
 many people facing the same problem while upgrading the Solr index from
 3.5.0 to 4.2.

 Thanks & Regards
 Montu v Boda





Re: solr 3.4: memory leak?

2013-04-13 Thread Dmitry Kan
Hi André,

Thanks a lot for your response and the relevant information.

Indeed, we have noticed similar behavior when hot-reloading a web app
with Solr after changing some of the classes. The only bad consequence,
which luckily does not happen too often, is that the web app becomes
stale. So we actually prefer (re)deploying via a tomcat restart.

Thanks,

Dmitry

On Thu, Apr 11, 2013 at 6:01 PM, Andre Bois-Crettez
andre.b...@kelkoo.com wrote:

 On 04/11/2013 08:49 AM, Dmitry Kan wrote:

 SEVERE: The web application [/solr] appears to have started a thread named
 [MultiThreadedHttpConnectionManager cleanup] but has failed to stop it.
 This is very likely to create a memory leak.
 Apr 11, 2013 6:38:14 AM org.apache.catalina.loader.WebappClassLoader
 clearThreadLocalMap


 To my understanding, this kind of leak is only a problem if the Java code
 is *reloaded* while the tomcat JVM is not stopped.
 For example, when reloadable=true in the Context of the web application
 and you change files in WEB-INF or the .war: what would happen is that
 each existing threadlocal would continue to live (potentially holding
 references to other stuff and preventing GC) while new threadlocals are
 created.

 http://wiki.apache.org/tomcat/MemoryLeakProtection

 If you stop tomcat entirely each time, you should be safe.
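
 For illustration, a contrived sketch of that pattern - the filter below
 is a made-up example (not code from Solr or Tomcat) showing how a
 ThreadLocal that is set on a container thread and never removed pins the
 old webapp classloader across a reload:

   import java.io.IOException;
   import javax.servlet.Filter;
   import javax.servlet.FilterChain;
   import javax.servlet.FilterConfig;
   import javax.servlet.ServletException;
   import javax.servlet.ServletRequest;
   import javax.servlet.ServletResponse;

   public class LeakyFilter implements Filter {
     // Token is loaded by the webapp's classloader; a value left in a
     // ThreadLocal on a pool thread keeps it (and the classloader) reachable.
     static class Token { final long created = System.currentTimeMillis(); }

     private static final ThreadLocal<Token> HELD = new ThreadLocal<Token>();

     public void init(FilterConfig config) {}

     public void doFilter(ServletRequest req, ServletResponse res,
                          FilterChain chain) throws IOException, ServletException {
       HELD.set(new Token());    // set on Tomcat's worker thread...
       chain.doFilter(req, res); // ...and never HELD.remove(): the leak
     }

     public void destroy() {}
   }

 On a reload the worker threads survive, so the stale value keeps the old
 classloader alive - which is exactly what the WebappClassLoader message
 warns about.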


 --
 André Bois-Crettez

 Search technology, Kelkoo
 http://www.kelkoo.com/





Re: Not able to replicate the solr 3.5 indexes to solr 4.2 indexes

2013-04-13 Thread Umesh Prasad
Hi Erick,
I have already created a JIRA and also attached a patch, but no unit
tests. My local build is failing (building from the solr 4.2.1 source
jar). Please see
https://issues.apache.org/jira/browse/SOLR-4703
--
Umesh


On Sat, Apr 13, 2013 at 7:24 PM, Erick Erickson erickerick...@gmail.com wrote:

 Please make a JIRA and attach as a patch if there aren't any JIRAs
 for this yet.

 Best
 Erick

 On Fri, Apr 12, 2013 at 1:58 AM, Montu v Boda
 montu.b...@highqsolutions.com wrote:
  hi
 
  thanks for your reply.
 
  is anyone going to fix this issue in a new Solr version? There are so
  many people facing the same problem while upgrading the Solr index from
  3.5.0 to 4.2.
 
  Thanks & Regards
  Montu v Boda
 
 
 




-- 
---
Thanks & Regards
Umesh Prasad


CloudSolrServer vs ConcurrentUpdateSolrServer for indexing

2013-04-13 Thread J Mohamed Zahoor
Hi

This question has come up many times on the list with lots of variations
(which confuses me a lot).

I am using Solr 4.1: one collection, 6 shards, 6 machines.
I am using CloudSolrServer inside each mapper to index my documents….
While it is working fine, I am trying to improve the indexing performance.


My questions are:

1) Is CloudSolrServer multithreaded?

2) Will using ConcurrentUpdateSolrServer increase indexing performance?

./Zahoor
 

Re: CloudSolrServer vs ConcurrentUpdateSolrServer for indexing

2013-04-13 Thread Mark Miller

On Apr 13, 2013, at 11:07 AM, J Mohamed Zahoor zah...@indix.com wrote:

 Hi
 
 This question has come up many times on the list with lots of variations
 (which confuses me a lot).

 I am using Solr 4.1: one collection, 6 shards, 6 machines.
 I am using CloudSolrServer inside each mapper to index my documents….
 While it is working fine, I am trying to improve the indexing
 performance.

 My questions are:
 
 1) Is CloudSolrServer multithreaded?

No. The proper fast way to use it is to start many threads that all add docs to
the same CloudSolrServer instance. In other words, currently, you must do the
multithreading yourself. CloudSolrServer is thread safe.

 
 2) Will using ConcurrentUpdateSolrServer increase indexing performance?

Yes, but at the cost of having to specify a server to talk to - if it goes
down, so does your indexing. It's also not great at reporting errors.
Finally, using multiple threads and CloudSolrServer, you can approach the
performance of ConcurrentUpdateSolrServer.
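
For what it's worth, a minimal sketch of the multithreaded pattern above
(the zkHost string, collection name, thread count, and field names are
illustrative assumptions, not from this thread):

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
      // One shared instance for the whole JVM - CloudSolrServer is thread safe.
      final CloudSolrServer server = new CloudSolrServer("zkhost1:2181,zkhost2:2181");
      server.setDefaultCollection("collection1");

      ExecutorService pool = Executors.newFixedThreadPool(8);
      for (int t = 0; t < 8; t++) {
        final int offset = t * 10000;
        pool.submit(new Runnable() {
          public void run() {
            try {
              for (int i = 0; i < 10000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", String.valueOf(offset + i));
                doc.addField("title_t", "document " + (offset + i));
                server.add(doc); // every thread adds to the same client
              }
            } catch (Exception e) {
              e.printStackTrace();
            }
          }
        });
      }
      pool.shutdown();
      pool.awaitTermination(1, TimeUnit.HOURS);
      server.commit();
      server.shutdown();
    }
  }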

- Mark

 
 ./Zahoor



Re: Easier way to do this?

2013-04-13 Thread William Bell
OK, is d in degrees or miles?


On Fri, Apr 12, 2013 at 10:20 PM, David Smiley (@MITRE.org) 
dsmi...@mitre.org wrote:

 Bill,

 I responded to the issue you created about this:
 https://issues.apache.org/jira/browse/SOLR-4704

 In summary, use {!geofilt}.

 ~ David


 Billnbell wrote
  I would love for the Solr 4 spatial support to accept pt so that I can
  get the # of results around a central point easily, like in 3.6. How
  can I pass parameters to a Circle()? I would love to send pt to this
  query, since the pt is the same across multiple areas.

  For example:
 
 
  http://localhost:8983/solr/core/select?rows=0&q=*:*&facet=true
    &facet.query={!key=.5}store_geohash:"Intersects(Circle(26.012156,-80.311943 d=.0072369))"
    &facet.query={!key=1}store_geohash:"Intersects(Circle(26.012156,-80.311943 d=.01447))"
    &facet.query={!key=5}store_geohash:"Intersects(Circle(26.012156,-80.311943 d=.0723))"
    &facet.query={!key=10}store_geohash:"Intersects(Circle(26.012156,-80.311943 d=.1447))"
    &facet.query={!key=25}store_geohash:"Intersects(Circle(26.012156,-80.311943 d=.361846))"
    &facet.query={!key=50}store_geohash:"Intersects(Circle(26.012156,-80.311943 d=.72369))"
    &facet.query={!key=100}store_geohash:"Intersects(Circle(26.012156,-80.311943 d=1.447))"





 -
  Author:
 http://www.packtpub.com/apache-solr-3-enterprise-search-server/book




-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Easier way to do this?

2013-04-13 Thread David Smiley (@MITRE.org)
Good question.  With geofilt it's kilometers.
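
Since the query earlier in this thread uses degrees, here is a sketch of
how those facet queries could be rewritten with {!geofilt}, sharing a
single pt/sfield as Bill wanted. The d values below are my approximate
mile-to-kilometer conversions of the original radii, so treat them as
untested assumptions:

  http://localhost:8983/solr/core/select?rows=0&q=*:*&facet=true
    &pt=26.012156,-80.311943&sfield=store_geohash
    &facet.query={!geofilt d=0.80 key=.5}
    &facet.query={!geofilt d=1.61 key=1}
    &facet.query={!geofilt d=8.05 key=5}
    &facet.query={!geofilt d=16.09 key=10}
    &facet.query={!geofilt d=40.23 key=25}
    &facet.query={!geofilt d=80.47 key=50}
    &facet.query={!geofilt d=160.93 key=100}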



-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book


Re: Basic auth on SolrCloud /admin/* calls

2013-04-13 Thread Tim Vaillancourt

This JIRA covers a lot of what you're asking:

https://issues.apache.org/jira/browse/SOLR-4470

I am also trying to get this sort of solution in place, but it seems to 
be dying off a bit. Hopefully we can get some interest in this again; 
this question comes up every few weeks, it seems.


I can confirm the latest patch from this JIRA works as expected, 
although my primary concern is that the credentials appear in the JVM 
command line; I'd like to move them to a file.


Cheers,

Tim

On 11/04/13 10:41 AM, Michael Della Bitta wrote:

It's fairly easy to lock down Solr behind basic auth using just the
servlet container it's running in, but the problem becomes letting
services that *should* be able to access Solr in. I've rolled with
basic auth in some setups, but certain deployments such as Solr Cloud
or sharded setups don't play well with auth because there's no good
way to configure them to use it.

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Thu, Apr 11, 2013 at 1:19 PM, Raymond Wiker rwi...@gmail.com wrote:

On Apr 11, 2013, at 17:12, adfel70 adfe...@gmail.com wrote:

Hi
I need to implement security in solr as follows:
1. prevent unauthorized users from accessing solr admin pages.
2. prevent unauthorized users from performing solr operations - both /admin
and /update.


Is the conclusion of this thread that this is not possible at the moment?


The obvious solution (to me, at least) would be to (1) restrict access to
solr to localhost, and (2) use a reverse proxy (e.g., apache) on the same
node to provide authenticated & restricted access to solr. I think I've
seen recipes for (1) somewhere, and I've used (2) fairly extensively for
similar purposes.
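
A minimal sketch of (2), assuming Apache httpd with mod_proxy and the
basic-auth modules enabled and a password file already created with
htpasswd; the paths and realm name are illustrative, not a tested config:

  <Location /solr>
    ProxyPass http://127.0.0.1:8983/solr
    ProxyPassReverse http://127.0.0.1:8983/solr
    AuthType Basic
    AuthName "Solr"
    AuthUserFile /etc/apache2/solr.htpasswd
    Require valid-user
  </Location>

With (1) in place - the servlet container bound to 127.0.0.1 only - all
external access then has to come through the authenticated proxy.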


Re: Approximately needed RAM for 5000 query/second at a Solr machine?

2013-04-13 Thread Furkan KAMACI
Hi Jack;

Since I am new to Solr, can you explain these two things that you said:

1) when most people say "index size" they are referring to all fields,
collectively, not individual fields (what do you mean by "segments are on
a per-field basis" versus all fields / individual fields?)

2) more cores might make the worst case scenario worse since it will
maximize the amount of data processed at a given moment


2013/4/13 Erick Erickson erickerick...@gmail.com

 bq: disk space is three times

 True, I keep forgetting about compound since I never use it...

 On Wed, Apr 10, 2013 at 11:05 AM, Walter Underwood
 wun...@wunderwood.org wrote:
  Correct, except the worst case maximum for disk space is three times.
 --wunder
 
  On Apr 10, 2013, at 6:04 AM, Erick Erickson wrote:
 
  You're mixing up disk and RAM requirements when you talk
  about having twice the disk size. Solr does _NOT_ require
  twice the index size of RAM to optimize; it requires twice
  the size on _DISK_.
 
  In terms of RAM requirements, you need to create an index,
  run realistic queries at the installation and measure.
 
  Best
  Erick
 
  On Tue, Apr 9, 2013 at 10:32 PM, bigjust bigj...@lambdaphil.es wrote:
 
 
 
  On 4/9/2013 7:03 PM, Furkan KAMACI wrote:
  These are really good metrics for me:
  You say that RAM size should be at least the index size, and that it
  is better to have RAM twice the index size (because of the worst-case
  scenario).
  On the other hand, let's assume that I have more RAM than twice the
  index size on the machine. Can Solr use that extra RAM, or is twice
  the index size an approximate maximum?
  What we have been discussing is the OS cache, which is memory that
  is not used by programs.  The OS uses that memory to make everything
  run faster.  The OS will instantly give that memory up if a program
  requests it.
  Solr is a Java program, and Java uses memory a little differently,
  so Solr most likely will NOT use more memory when it is available.
  In a normal directly executable program, memory can be allocated
  at any time, and given back to the system at any time.
  With Java, you tell it the maximum amount of memory the program is
  ever allowed to use.  Because of how memory is used inside Java,
  most long-running Java programs (like Solr) will allocate up to the
  configured maximum even if they don't really need that much memory.
  Most Java virtual machines will never give the memory back to the
  system even if it is not required.
  Thanks, Shawn
 
 
  Furkan KAMACI furkankam...@gmail.com writes:
 
  I am sorry but you said:
 
  *you need enough free RAM for the OS to cache the maximum amount of
  disk space all your indexes will ever use*
 
  I have made an assumption about the indexes on my machine. Let's
  assume they total 5 GB. So it is better to have at least 5 GB of RAM?
  OK, Solr will use RAM up to however much I allot to the Java process.
  When we think about the indexes on storage and the OS caching them in
  RAM, is that what you are talking about: having more than 5 GB - or
  10 GB - of RAM for my machine?
 
  2013/4/10 Shawn Heisey s...@elyograg.org
 
 
  10 GB.  Because when Solr shuffles the data around, it could use up to
  twice the size of the index in order to optimize the index on disk.
 
  -- Justin
 
  --
  Walter Underwood
  wun...@wunderwood.org
 
 
 



Re: Which tokenizer or analizer should use and field type

2013-04-13 Thread anurag.jain
I tried both ways:

(project AND assistant) OR manager

"project assistant"~5 OR manager

Both are working properly, but I have a problem.

If I give the query projec assistant, then it is not able to find it.

And what is the meaning of ~5?

If I write *projec assistant* then it is able to find it, but it gives
project or assistant.

My objective is to search like the MySQL LIKE operator: %search word%.

How do I write a query that behaves exactly like the MySQL LIKE operator?

Thanks

Need help as soon as possible








Re: Which tokenizer or analizer should use and field type

2013-04-13 Thread anurag.jain
Hi, if you can help me with this, it will solve my problem.

keyword:(*assistant AND coach*) gives me 1 result.

keyword:(*iit AND kanpur*) gives me 2 results.

But the query:

keyword:(*assistant AND coach* OR (*iit AND kanpur*)) gives me only 1
result.

I also tried keyword:(*assistant AND coach* OR (*:* *iit AND kanpur*)),
which gives me only 1 result. I don't know why.

How should the query look? Please help me find a solution.

Thanks in advance.







Is any way to return the number of indexed tokens in a field?

2013-04-13 Thread Alexandre Rafalovitch
Hello,

We seem to have all sorts of functions around tokenized field content, but
I am looking for a simple count/length that can be returned as a
pseudo-field. Does anyone know of one out of the box?

The specific situation is that I am indexing a field for specific regular
expressions that become tokens (in a copyField). Not every field has the
same number of those.

I now want to find the documents that have the maximum number of tokens in
that field (for testing and review). But I can't figure out how. Any help
would be appreciated.

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous - via GTD book)